Mon, 23 May, 06:00 - 09:30 UTC
Summary
Speech recognition technology is completing a dramatic change: the move to an all-neural architecture that replaces the conventional stack of independently trained neural and non-neural subsystems. The neural architecture improves accuracy over a wide range of use cases, challenges the boundary between speech recognition and language understanding by allowing for jointly trained models, and enables multi-task learning that simultaneously solves transcription, segmentation, confidence estimation, and potentially more tasks. The neural architecture also achieves superior memory and compute compression, enabling streaming low-latency speech recognition at the edge, where resources are constrained. When the architecture is applied as end-to-end all-neural SLU (ASR + NLU), the tradeoff between compression and accuracy is even more favorable. The neural architecture enables truly multilingual systems that support within-sentence code switching. It also helps to reduce reliance on human labeling thanks to unsupervised pre-training, teacher/student semi-supervised training, and the ability to learn to incorporate user feedback signals and to learn from other modalities.
While the neural architecture has shown great results and provides leeway for significant future improvements, it also presents new challenges. Personalization and adaptation are much easier in the conventional factored stack, where the finite-state language models can be adapted; this property is lost with end-to-end all-neural models. Making adaptation effective and practical for all-neural systems remains a challenge, one that requires focused innovation and investment in building new, sophisticated neural architecture solutions. Rare-word modeling is also a challenge for neural architectures, which learn acoustics and language jointly from audio/text pairs, whereas conventional architectures can use much larger text-only data sets for training the language models.
In this workshop, we will provide an overview of the all-neural architecture developed by the Alexa ASR group, dive deep into some of the challenges and future opportunities, and conduct a panel discussion and Q&A session on the impact and future of the all-neural approach to speech recognition.
Workshop Co-chairs
- Jennifer Shumway
- Ariya Rastrow
- Björn Hoffmeister
- Chris Ho
Panel Members
- Ariya Rastrow (Sr Principal Scientist)
- Andreas Stolcke (Sr Principal Scientist)
- Shalini Ghosh (Principal Scientist)
- Björn Hoffmeister (Director of Science)
Main Presentations
- The new area of All-Neural ASR: An Overview
  Presenters: Björn Hoffmeister, Ariya Rastrow
- Pre-Training and Multi-Modal Training
  Presenter: Shalini Ghosh
Deep Dive Presentations
- RescoreBERT: Discriminative speech recognition rescoring with BERT
  Presenter: Yi Gu
- Bi/Multilingual ASR and LID using RNN-T
  Presenter: Harish Arsikere
- Multi-turn RNN-T for streaming recognition of multi-party speech
  Presenters: Anna Piunova and Ilya Sklyar
- Being greedy does not hurt: Sampling strategies for end-to-end speech recognition
  Presenter: Jahn Heymann
- Lattice-attention in ASR rescoring
  Presenters: Prabhat Pandey and Sergio Duarte Torres
- Multi-task RNN-T with semantic decoder for streamable spoken language understanding
  Presenter: Feng-Ju (Claire) Chang