Paper ID | HLT-1.3 |
Paper Title | SPEECH RECOGNITION BY SIMPLY FINE-TUNING BERT |
Authors | Wen-Chin Huang, Nagoya University, Japan; Chia-Hua Wu, Shang-Bao Luo, Academia Sinica, Taiwan; Kuan-Yu Chen, National Taiwan University of Science and Technology, Taiwan; Hsin-Min Wang, Academia Sinica, Taiwan; Tomoki Toda, Nagoya University, Japan |
Session | HLT-1: Language Modeling 1: Fusion and Training for End-to-End ASR |
Location | Gather.Town |
Session Time | Tuesday, 08 June, 13:00 - 13:45 |
Presentation Time | Tuesday, 08 June, 13:00 - 13:45 |
Presentation | Poster |
Topic | Speech Processing: [SPE-GASR] General Topics in Speech Recognition |
Abstract | We propose a simple method for automatic speech recognition (ASR) that fine-tunes BERT, a language model (LM) trained on large-scale unlabeled text data that can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, so the speech signal needs to serve only as a simple clue. Hence, compared with conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness of the proposed idea on the AISHELL dataset and show that stacking a very simple AM on top of BERT can yield reasonable performance. |
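To make the abstract's idea concrete, below is a minimal PyTorch sketch of the kind of system it describes: a small acoustic front-end stacked on a pre-trained BERT, with the whole stack fine-tuned to predict the next character given the history text and the speech features. The class names, the fusion strategy (appending projected speech frames to the token embeddings), and the prediction head are illustrative assumptions, not the authors' exact architecture.

    # Hypothetical sketch of "a very simple AM stacked on top of BERT".
    # SimpleAM, the embedding-level fusion, and the [CLS]-position head
    # are assumptions for illustration, not the paper's exact design.
    import torch
    import torch.nn as nn
    from transformers import BertModel

    class SimpleAM(nn.Module):
        """Projects frame-level speech features into BERT's embedding space."""
        def __init__(self, feat_dim: int = 80, hidden_dim: int = 768):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, frames, feat_dim) -> (batch, frames, hidden_dim)
            return self.proj(feats)

    class BertASR(nn.Module):
        """Fine-tunes BERT to predict the next token from history text + speech."""
        def __init__(self, bert_name: str = "bert-base-chinese"):
            super().__init__()
            self.am = SimpleAM()
            self.bert = BertModel.from_pretrained(bert_name)  # fine-tuned, not frozen
            self.head = nn.Linear(self.bert.config.hidden_size,
                                  self.bert.config.vocab_size)

        def forward(self, history_ids, history_mask, feats):
            # Embed the history tokens with BERT's own embedding table, then
            # append the projected acoustic frames as extra input positions.
            text_emb = self.bert.embeddings.word_embeddings(history_ids)
            speech_emb = self.am(feats)
            inputs = torch.cat([text_emb, speech_emb], dim=1)
            speech_mask = torch.ones(speech_emb.shape[:2], device=feats.device)
            mask = torch.cat([history_mask, speech_mask], dim=1)
            out = self.bert(inputs_embeds=inputs, attention_mask=mask)
            # Predict the next character from position 0, assuming history_ids
            # begins with [CLS]; other readout choices are equally plausible.
            return self.head(out.last_hidden_state[:, 0])

A model like this would be trained with an ordinary cross-entropy loss over the vocabulary and decoded autoregressively, feeding each predicted character back into the history; since BERT carries most of the modeling burden, the acoustic front-end can stay very small, which is the abstract's central claim.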