Paper ID | SPE-9.1 | ||
Paper Title | Transformer-Transducers for Code-Switched Speech Recognition | ||
Authors | Siddharth Dalmia, Carnegie Mellon University, United States; Yuzong Liu, Srikanth Ronanki, Katrin Kirchhoff, Amazon, United States | ||
Session | SPE-9: Speech Recognition 3: Transformer Models 1 | ||
Location | Gather.Town | ||
Session Time: | Tuesday, 08 June, 16:30 - 17:15 | ||
Presentation Time: | Tuesday, 08 June, 16:30 - 17:15 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-MULT] Multilingual Recognition and Identification | ||
Abstract | We live in a world where 60% of the population can speak two or more languages fluently. Members of these communities constantly switch between languages when having a conversation. As automatic speech recognition (ASR) systems are deployed in the real world, there is a need for practical systems that can handle multiple languages both within an utterance and across utterances. In this paper, we present an end-to-end ASR system using a transformer-transducer model architecture for code-switched speech recognition. We propose three modifications over the vanilla model in order to handle various aspects of code-switching. First, we introduce two auxiliary loss functions to handle the low-resource scenario of code-switching. Second, we propose a novel mask-based training strategy with language ID information to improve the label encoder training towards intra-sentential code-switching. Finally, we propose a multi-label/multi-audio encoder structure to leverage the vast monolingual speech corpora towards code-switching. We demonstrate the efficacy of our proposed approaches on the SEAME dataset, a public Mandarin-English code-switching corpus, achieving mixed error rates of 18.5% and 26.3% on the test_man and test_sge sets, respectively. |
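
The mixed error rate reported in the abstract is commonly computed for Mandarin-English speech by scoring Mandarin at the character level and English at the word level, then taking edit distance over the combined token sequence. The following is a minimal sketch under that assumption; it is not the authors' scoring script, and the names `tokenize_mixed`, `edit_distance`, and `mixed_error_rate` are hypothetical helpers introduced only for illustration.

```python
import re

# Assumption: each CJK ideograph is one token, and any other run of
# non-space characters (e.g. an English word) is one token.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")


def tokenize_mixed(text):
    """Split text into Mandarin characters and lowercased English words."""
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def mixed_error_rate(references, hypotheses):
    """Total edit errors divided by total reference tokens over a corpus."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = tokenize_mixed(ref)
        errors += edit_distance(ref_tokens, tokenize_mixed(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("references contain no tokens")
    return errors / total
```

For instance, a reference of `我想 go shopping` against a hypothesis of `我要 go shop` gives four reference tokens with two substitutions, so the sketch returns 0.5 for that single pair. Text normalization choices (casing, punctuation, hesitation markers) are not specified in the abstract and would affect the reported numbers.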