IEEE ICASSP 2022

2022 IEEE International Conference on Acoustics, Speech and Signal Processing

7-13 May 2022
  • Virtual (all paper presentations)
22-27 May 2022
  • Main Venue: Marina Bay Sands Expo & Convention Center, Singapore
27-28 October 2022
  • Satellite Venue: Crowne Plaza Shenzhen Longgang City Centre, Shenzhen, China

WS-2: The new era of all-neural SLU: opportunities and challenges ahead
Mon, 23 May, 14:00 - 17:30 China Time (UTC +8)
Mon, 23 May, 06:00 - 09:30 UTC
Location: Roselle Junior Ballroom 4612-3
In-Person
Live-Stream
Workshop
Summary

Speech recognition technology is completing a dramatic change: the move to an all-neural architecture that replaces the conventional stack of independently trained neural and non-neural subsystems. The neural architecture improves accuracy over a wide range of use cases, challenges the boundary between speech recognition and language understanding by allowing for jointly trained models, and enables multi-task learning that simultaneously solves transcription, segmentation, confidence estimation, and potentially more tasks. The neural architecture also achieves superior memory and compute compression, enabling streaming low-latency speech recognition at the edge, where resources are constrained. When applied as end-to-end all-neural SLU (ASR + NLU), the tradeoff between compression and accuracy is even more favorable. The neural architecture enables truly multilingual systems that support within-sentence code switching. It also helps reduce reliance on human labeling thanks to unsupervised pre-training, teacher/student semi-supervised training, and the ability to learn to incorporate user feedback signals and to learn from other modalities.

While the neural architecture has shown great results and provides leeway for significant future improvements, it also presents new challenges. Personalization and adaptation are much easier in the conventional factored stack, where the finite-state language models can be adapted directly, a property that is lost with end-to-end all-neural models. Making adaptation effective and practical for all-neural systems remains a challenge, one that requires focused innovation and investment in building sophisticated new neural architecture solutions. Rare-word modeling is also a challenge for neural architectures, which learn acoustics and language jointly from audio/text pairs, whereas conventional architectures can use much larger text-only data sets to train the language models.

In this workshop, we will provide an overview of the all-neural architecture developed by the Alexa ASR group, dive deep into some of the challenges and future opportunities, and conduct a panel discussion and Q&A session on the impact and future of the all-neural approach to speech recognition.

Workshop Co-chairs
  • Jennifer Shumway
  • Ariya Rastrow
  • Björn Hoffmeister
  • Chris Ho
Panel Members
  • Ariya Rastrow (Sr Principal Scientist)
  • Andreas Stolcke (Sr Principal Scientist)
  • Shalini Ghosh (Principal Scientist)
  • Björn Hoffmeister (Director of Science)
Main Presentations
  • The new era of All-Neural ASR: An Overview
    Presenters: Björn Hoffmeister, Ariya Rastrow
  • Pre-Training and Multi-Modal Training
    Presenter: Shalini Ghosh
Deep Dive Presentations