2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	SPE-55.6
Paper Title	How Phonotactics Affect Multilingual and Zero-shot ASR Performance
Authors	Siyuan Feng, Delft University of Technology, Netherlands; Piotr Żelasko, Laureano Moro-Velázquez, Johns Hopkins University, United States; Ali Abavisani, Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign, United States; Odette Scharenborg, Delft University of Technology, Netherlands; Najim Dehak, Johns Hopkins University, United States
Session	SPE-55: Language Identification and Low Resource Speech Recognition
Location	Gather.Town
Session Time	Friday, 11 June, 14:00 - 14:45
Presentation Time	Friday, 11 June, 14:00 - 14:45
Presentation	Poster
Topic	Speech Processing: [SPE-MULT] Multilingual Recognition and Identification
Abstract	Combining recordings from multiple languages to train a single automatic speech recognition (ASR) model promises the emergence of a universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or from the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and that retaining only the target language's phonotactic data in LM training is preferable.
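
To make the notion of a phonotactic LM concrete, the following is a minimal sketch of a phone-level bigram model, assuming add-one smoothing and two invented toy "languages". It is not the system described in the abstract: the phone sequences, the function names `train_bigram_lm` and `perplexity`, and the choice of bigrams are all illustrative assumptions. The sketch only shows how an LM trained on mismatched or pooled phone sequences can assign higher perplexity to a target-language sequence than an LM trained on the target language alone.

```python
import math
from collections import Counter


def train_bigram_lm(sequences, vocab):
    """Count phone bigrams and return an add-one smoothed log P(cur | prev)."""
    bigrams = Counter()
    contexts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    # Possible next symbols: every phone in the vocabulary plus the end marker.
    num_next = len(vocab) + 1

    def logprob(prev, cur):
        return math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + num_next))

    return logprob


def perplexity(logprob, seq):
    """Per-transition perplexity of one phone sequence under a bigram LM."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    total = sum(logprob(p, c) for p, c in zip(padded, padded[1:]))
    return math.exp(-total / (len(padded) - 1))


# Invented toy data (assumption): language A allows only CV syllables,
# language B allows consonant clusters.
lang_a = [["k", "a", "t", "a"], ["t", "a", "k", "a"], ["k", "a", "k", "a"]]
lang_b = [["s", "t", "r", "a"], ["s", "k", "a", "t"], ["t", "r", "a", "s"]]
vocab = sorted({p for seq in lang_a + lang_b for p in seq})

lm_a = train_bigram_lm(lang_a, vocab)            # target-language LM
lm_ab = train_bigram_lm(lang_a + lang_b, vocab)  # pooled "multilingual" LM
lm_b = train_bigram_lm(lang_b, vocab)            # crosslingual LM

held_out = ["t", "a", "t", "a"]  # a language-A-like sequence not seen in training
ppl_a = perplexity(lm_a, held_out)
ppl_ab = perplexity(lm_ab, held_out)
ppl_b = perplexity(lm_b, held_out)
print(f"target-only: {ppl_a:.2f}  pooled: {ppl_ab:.2f}  crosslingual: {ppl_b:.2f}")
```

On this toy data the target-only LM gives the lowest perplexity and the crosslingual LM the highest, with the pooled LM in between. This reflects only the invented sequences above and is not evidence for or against the paper's results.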