09:00 - 10:00
"Sorry I didn’t hear you" maybe the first empathetic utterance by a commercial machine. Since the late 1990s when speech companies began providing their customer-service software to other numerous companies, which was programmed to use different phrases, people have gotten used to speaking to virtual agents. As people interact more often with voice controlled “personal assistants”, they expect machines to be more natural and easier to communicate wtih than traditional dialog systems. It has been shown that humans need to establish rapport with each other and with machines to achieve effective communication. Establishing rapport includes the ability to recognize different emotions, and understand other high level communication features such as humor, sarcasm and intent. In order to make such communication possible, the machines need to extract emotions from human speech and behavior and can accordingly decide the emotionally correct response of the agent. We propose Zara the Supergirl as a prototype system. It is a virtual agent with an animated cartoon character to present itself on the screen. Along the way it will get ’smarter’ and more empathetic, by gathering more user data and learning from it. In this talk, I will give an overview of multi-channel recognition and expression of emotion and intent. I will present our work so far in the areas of deep learning of emotion and sentiment recognition, as well as humor recognition. I hope to explore the future direction of virtual agents and robot development and how it can help improve people's lives.
Pascale Fung is a Professor in the Department of Electronic & Computer Engineering at The Hong Kong University of Science & Technology. She is an elected Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and a Fellow of the International Speech Communication Association. She co-founded the Human Language Technology Center (HLTC) at HKUST and is affiliated faculty with both the Robotics Institute and the Big Data Institute. She is the founding chair of the Women Faculty Association at HKUST.
Prof. Fung's research interests lie primarily in building intelligent systems that can understand and empathize with humans, enabled by spoken language understanding and by speech, facial expression, and sentiment recognition. She blogs for the World Economic Forum on the societal impacts of spoken language processing and machine learning. She is a member of the World Economic Forum's Global Future Council on AI and Robotics.
09:00 - 10:00
Neural networks have played an increasingly large part in the various components of speech recognizers over the last few years. However, the most popular models remain Artificial Neural Network - Hidden Markov Model (ANN-HMM) hybrids, which rely on HMMs to provide a generative story for the data. Recent work on attention-based models has side-stepped the use of HMMs: these systems are trained with discriminative objectives that make no assumptions about the underlying probability distributions, other than that neural networks can model them. While these models achieve state-of-the-art results without pronunciation dictionaries or external language models, they currently lag slightly behind carefully tuned DNN-HMM systems with all the bells and whistles. In this talk, we summarize some of this recent work and identify systematic issues that these models currently face. We will also describe methodological improvements that should help address them.
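For readers unfamiliar with the attention-based alternative to HMM-based recognizers mentioned above, the following is a minimal illustrative sketch (not the speaker's system): an encoder turns acoustic frames into hidden states, and a decoder with content-based attention predicts output tokens one at a time under a purely discriminative cross-entropy objective. It is written in Python with PyTorch; all names, dimensions, and the toy data are assumptions for illustration only.

```python
# Minimal sketch of an attention-based encoder-decoder recognizer.
# Hyperparameters, shapes, and the toy data below are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab=30):
        super().__init__()
        # Encoder: turns acoustic frames into a sequence of hidden states.
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        # Decoder: predicts one output token at a time.
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.attn_query = nn.Linear(hidden, 2 * hidden)
        self.out = nn.Linear(hidden + 2 * hidden, vocab)
        self.hidden = hidden

    def forward(self, feats, targets):
        enc, _ = self.encoder(feats)                        # (B, T, 2H)
        B = feats.size(0)
        h = feats.new_zeros(B, self.hidden)
        c = feats.new_zeros(B, self.hidden)
        logits = []
        for t in range(targets.size(1)):
            # Content-based attention: score encoder states against the current
            # decoder state, then take their weighted sum as the context vector.
            query = self.attn_query(h).unsqueeze(1)         # (B, 1, 2H)
            scores = (enc * query).sum(-1)                  # (B, T)
            weights = torch.softmax(scores, dim=-1)
            context = (weights.unsqueeze(-1) * enc).sum(1)  # (B, 2H)
            # Teacher forcing: feed the previous ground-truth token.
            inp = torch.cat([self.embed(targets[:, t]), context], dim=-1)
            h, c = self.decoder(inp, (h, c))
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)                   # (B, U, vocab)


# Toy usage: random "log-mel" features and token targets, cross-entropy loss.
model = AttentionASR()
feats = torch.randn(2, 100, 80)          # 2 utterances, 100 frames each
targets = torch.randint(0, 30, (2, 12))  # 12 output tokens each
logits = model(feats, targets)
loss = nn.functional.cross_entropy(logits.reshape(-1, 30), targets.reshape(-1))
loss.backward()
```

The point of the sketch is that no HMM alignment or pronunciation dictionary appears anywhere: the attention weights implicitly align output tokens to input frames, and the whole model is trained end-to-end from the discriminative loss.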
Navdeep Jaitly is a Research Scientist at Google. He received his PhD from the University of Toronto under the supervision of Geoffrey Hinton. His interests lie in Deep Learning, Speech Recognition, and Computational Biology. At the University of Toronto he worked on deep generative and discriminative models, focusing mostly on speech data. During an internship at Google in 2011, he showed that Deep Neural Networks improved the accuracy of state-of-the-art speech recognizers built on thousands of hours of data. He has since been working on end-to-end models for sequential data, applying these techniques to speech recognition with the goal of developing purely neural models. Prior to his PhD he was a Senior Research Scientist at the Pacific Northwest National Laboratory and at Caprion Pharmaceuticals, working on algorithms for the analysis of Proteomics and Mass Spectrometry data.
09:00 - 10:00
Jibo is a robot that understands speech and has a moving body that helps him communicate more effectively and express emotions. Jibo has cameras and microphones to make sense of the world around him, including recognizing and tracking people from both an audio and a visual standpoint. He uses speech recognition, natural language processing, and dialog management to hold spoken conversations, and he can recognize people by both their voice and their face. He has a display to show text and images, an animated eye that can morph into different shapes, and a touch interface that can be used as a complementary input modality. Jibo talks with his own unique Jibo voice, uses robotic sounds to complement his speech, and his body is animated in sync with his speech or to express emotions and perform other activities. With this complete set of sensory inputs and outputs, Jibo embodies one of the richest commercial human-machine interaction devices, with an SDK that third-party developers can use to build many applications.
In this talk I will take the audience through the journey on which we embarked when we started building such a complex device, and through the overt and hidden challenges of this endeavor.
Roberto Pieraccini, a scientist, technologist, and the author of “The Voice in the Machine” (MIT Press, 2012), has been at the forefront of speech, language, and machine learning innovation for more than 30 years. He is widely known as a pioneer in statistical natural language understanding and machine learning for automatic dialog systems, and in their practical application to industrial solutions. As a researcher he worked at CSELT (Italy), Bell Laboratories, AT&T Labs, and IBM T.J. Watson. He led the dialog technology team at SpeechWorks International, was the CTO of SpeechCycle, and served as CEO of the International Computer Science Institute (ICSI) in Berkeley. He now leads the Advanced Conversational Technologies team at Jibo. http://robertopieraccini.com