Paper ID | SPE-22.4
Paper Title | MULTI-DIALECT SPEECH RECOGNITION IN ENGLISH USING ATTENTION ON ENSEMBLE OF EXPERTS
Authors | Amit Das, Kshitiz Kumar, Jian Wu, Microsoft, United States
Session | SPE-22: Speech Recognition 8: Multilingual Speech Recognition
Location | Gather.Town
Session Time | Wednesday, 09 June, 15:30 - 16:15
Presentation Time | Wednesday, 09 June, 15:30 - 16:15
Presentation | Poster
Topic | Speech Processing: [SPE-MULT] Multilingual Recognition and Identification
Abstract | In the presence of a wide variety of dialects, training a dialect-specific model for each dialect is a demanding task. Previous studies have explored training a single model that is robust across multiple dialects, using either multi-condition training, multi-task learning, end-to-end modeling, or ensemble modeling. In this study, we further explore multi-dialect speech recognition with a single model based on ensemble modeling. First, we build an ensemble of dialect-specific models (or experts). Then we linearly combine the outputs of the experts using attention weights generated by a long short-term memory (LSTM) network. For comparison, we train one model that jointly learns to recognize and classify dialects using multi-task learning, and a second model using multi-condition training. We train all of these models with about 60,000 hours of speech data collected in American English, Canadian English, British English, and Australian English. Experimental results show that our best proposed model achieves an average 4.74% word error rate reduction (WERR) compared to the strong baseline model.
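The attention-on-ensemble-of-experts mechanism described in the abstract can be illustrated with a minimal sketch below. It is not the authors' implementation: PyTorch, the class and parameter names, the use of simple linear layers as stand-ins for full dialect-specific acoustic models, and the feature/senone dimensions are all assumptions for illustration only.

import torch
import torch.nn as nn

class AttentionOverExperts(nn.Module):
    """Linearly combines expert outputs using LSTM-generated attention weights."""

    def __init__(self, num_experts, feat_dim, num_senones, hidden=256):
        super().__init__()
        # Hypothetical dialect-specific experts; in the paper these are full
        # dialect-specific models, simplified here to single linear layers.
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, num_senones) for _ in range(num_experts)]
        )
        # LSTM that generates per-frame attention weights over the experts.
        self.attn_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn_proj = nn.Linear(hidden, num_experts)

    def forward(self, x):                                              # x: (B, T, feat_dim)
        # Each expert scores every frame, stacked along an expert axis.
        expert_out = torch.stack([e(x) for e in self.experts], dim=2)  # (B, T, E, S)
        h, _ = self.attn_lstm(x)                                       # (B, T, H)
        alpha = torch.softmax(self.attn_proj(h), dim=-1)               # (B, T, E)
        # Attention-weighted linear combination of the experts' outputs.
        return (alpha.unsqueeze(-1) * expert_out).sum(dim=2)           # (B, T, S)

# Example: four dialect experts (en-US, en-CA, en-GB, en-AU) on 80-dim features.
model = AttentionOverExperts(num_experts=4, feat_dim=80, num_senones=9000)
logits = model(torch.randn(2, 100, 80))                                # -> (2, 100, 9000)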