IEEE ICASSP 2022

2022 IEEE International Conference on Acoustics, Speech and Signal Processing

7-13 May 2022
  • Virtual (all paper presentations)
22-27 May 2022
  • Main Venue: Marina Bay Sands Expo & Convention Center, Singapore
27-28 October 2022
  • Satellite Venue: Crowne Plaza Shenzhen Longgang City Centre, Shenzhen, China

IEP-15: Multi-modal speech interaction system applied in vehicle
Thu, 12 May, 22:00 - 22:45 China Time (UTC +8)
Thu, 12 May, 14:00 - 14:45 UTC
Location: Gather Area P (Virtual, Gather.Town)
Session Type: Expert
Presented by: Gao Jianqing, iFlytek

Speech interaction systems are widely used in smart devices, but because of the low accuracy of speech recognition in noisy scenes and of natural language understanding in complicated scenes, human-machine interaction is still far from natural compared with human-human interaction. For example, speech interaction systems still rely on a wake-up word to initiate interaction. Moreover, when two or more people speak simultaneously, or when people interact with the machine in a noisy scene, speech recognition performance remains poor, resulting in a poor user experience.

This proposal presents a multimodal speech interaction solution for automotive scenarios. The main speaker is detected using lip movement to assist the speech signal, and an end-to-end speech-to-intent model distinguishes whether the main speaker is talking to the machine or to a passenger in the car (a sketch follows below). With this solution, natural interaction that does not depend on a wake-up word is achieved: the interaction success rate is comparable to that of traditional wake-up-based speech recognition at a false wake-up rate of 0.6 times per 24 hours.

In addition, a multi-modal, multi-channel speech separation technology is proposed. The speech signal, the spatial information provided by the microphone array, and the lip movement of the main speaker are fed into a CLDNN network, which learns the mask of the main speaker relative to the mixed speech (also sketched below). In this way, the main speaker's speech is enhanced while background noise and interfering speakers' speech are suppressed. In scenes where multiple people talk simultaneously over background music, the gain in speech recognition accuracy exceeds 30%.

This proposal implements a new mode of speech interaction: by combining speech and vision, a more reliable, natural, and robust human-computer interaction is achieved. The solution has been widely adopted by Chinese car manufacturers and can be extended to smart homes, robots, and other fields.
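As a concrete illustration of the wake-up-free interaction step, the following is a minimal PyTorch sketch of an end-to-end speech-to-intent classifier that decides whether an utterance is directed at the machine or at a passenger. The class name, the log-mel input features, the LSTM encoder, and all layer sizes are illustrative assumptions, not the presented system.

import torch
import torch.nn as nn

class SpeechToIntent(nn.Module):
    """Hypothetical end-to-end classifier: audio frames in, addressee decision out."""
    def __init__(self, n_feat=80, hidden=256):
        super().__init__()
        # Recurrent encoder over per-frame acoustic features (e.g. log-mel)
        self.encoder = nn.LSTM(n_feat, hidden, num_layers=3, batch_first=True)
        # Two classes: utterance directed at the machine vs. at a passenger
        self.head = nn.Linear(hidden, 2)

    def forward(self, feats):          # feats: (batch, time, n_feat)
        out, _ = self.encoder(feats)
        pooled = out.mean(dim=1)       # average over time for an utterance-level decision
        return self.head(pooled)       # (batch, 2) logits: machine- vs. passenger-directed

Because this decision runs continuously on the detected main speaker's audio, only machine-directed utterances need to be forwarded to recognition, which is what removes the dependence on a wake-up word.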
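Along the same lines, here is a minimal sketch of the multi-modal CLDNN mask estimator: the mixed-speech magnitude spectrogram, spatial features from the microphone array (inter-channel phase differences are an assumption here), and a per-frame lip-movement embedding are fused, and the network predicts a time-frequency mask for the main speaker. All module names, feature choices, and dimensions are hypothetical.

import torch
import torch.nn as nn

class MultiModalCLDNN(nn.Module):
    """Hypothetical CLDNN: conv front-end + LSTM + linear mask head."""
    def __init__(self, n_freq=257, spatial_dim=257, lip_dim=128, hidden=512):
        super().__init__()
        # Convolutional front-end over the stacked audio + spatial features
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        conv_out = 32 * (n_freq + spatial_dim)
        # LSTM models temporal context; the lip embedding is fused per frame
        self.lstm = nn.LSTM(conv_out + lip_dim, hidden, num_layers=2, batch_first=True)
        # Output head predicts a [0, 1] time-frequency mask for the main speaker
        self.fc = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mixture_mag, spatial_feat, lip_emb):
        # mixture_mag:  (B, T, F) magnitude spectrogram of the mixed speech
        # spatial_feat: (B, T, F) e.g. inter-channel phase differences
        # lip_emb:      (B, T, D) per-frame lip-movement embedding of the main speaker
        x = torch.cat([mixture_mag, spatial_feat], dim=-1).unsqueeze(1)  # (B, 1, T, 2F)
        x = self.conv(x)                                                 # (B, 32, T, 2F)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = torch.cat([x, lip_emb], dim=-1)        # fuse the visual stream per frame
        x, _ = self.lstm(x)
        mask = self.fc(x)                          # (B, T, F) mask of the main speaker
        return mask * mixture_mag                  # enhanced main-speaker magnitude

Multiplying the predicted mask with the mixture magnitude enhances the main speaker and suppresses background noise and interfering speakers, the effect the abstract reports as a more than 30% gain in recognition accuracy.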

Biography

The author received his Ph.D. from the University of Science and Technology of China and is a senior engineer. He is currently the deputy dean of the iFLYTEK AI Research Institute, responsible for the research and development of intelligent voice technology. He led the team that developed the second- and third-generation speech recognition systems of iFLYTEK, which greatly improved speech recognition performance in complicated scenes. He also developed a series of speech transcription products, represented by iFLYREC, which provide an online speech transcription service. The speech interaction systems he developed are widely used in automotive, smart home, and other fields. In recent years, he has had more than 40 patents granted or published.