Presentation # | 10 |
Session: | ASR I |
Location: | Kallirhoe Hall |
Session Time: | Wednesday, December 19, 10:00 - 12:00 |
Presentation Time: | Wednesday, December 19, 10:00 - 12:00 |
Presentation: | Poster |
Topic: | Speech recognition and synthesis |
Paper Title: | An Exploration of Directly Using Word as Acoustic Modeling Unit for Speech Recognition |
Authors: | Chunlei Zhang, The University of Texas at Dallas, United States; Chengzhu Yu, Chao Weng, Jia Cui, Dong Yu, Tencent AI Lab, United States |
Abstract: | Conventional acoustic models for automatic speech recognition (ASR) are usually constructed from sub-word units (e.g., context-dependent phonemes, graphemes, or wordpieces). Recent studies demonstrate that connectionist temporal classification (CTC) based acoustic-to-word (A2W) models are also promising for ASR. Such structures have drawn increasing attention because they can directly target words as output units, which simplifies the ASR pipeline by avoiding an additional pronunciation lexicon, or even a language model. In this study, we systematically explore using words as the acoustic modeling unit for conversational speech recognition. By replacing senone alignment with word alignment in a convolutional bidirectional LSTM architecture and employing lexicon-free weighted finite-state transducer (WFST) based decoding, we greatly simplify the conventional hybrid speech recognition system. On the Hub5-2000 Switchboard/CallHome test sets with 300 hours of training data, we achieve a WER close to that of senone-based hybrid systems with WFST-based decoding. |