Presentation # | 8 |
Session: | ASR III (End-to-End) |
Location: | Kallirhoe Hall |
Session Time: | Friday, December 21, 10:00 - 12:00 |
Presentation Time: | Friday, December 21, 10:00 - 12:00 |
Presentation: | Poster |
Topic: | Speech recognition and synthesis |
Paper Title: | COMBINING DE-NOISING AUTO-ENCODER AND RECURRENT NEURAL NETWORKS IN END-TO-END AUTOMATIC SPEECH RECOGNITION FOR NOISE ROBUSTNESS |
Authors: | Tzu-Hsuan Ting, Chia-Ping Chen, National Sun Yat-sen University, Taiwan |
Abstract: | In this paper, we propose an end-to-end noise-robust automatic speech recognition system that combines deep-learning implementations of de-noising auto-encoders and recurrent neural networks. For the front end, we use batch normalization and a novel de-noising auto-encoder design that performs a two-stage prediction of a single-frame clean feature vector from multi-frame noisy feature vectors. For the back-end word recognition, we use an end-to-end system based on a bidirectional recurrent neural network with long short-term memory cells (LSTM-BiRNN), trained with the connectionist temporal classification (CTC) criterion. Its performance is compared to a baseline back end based on hidden Markov models and Gaussian mixture models (HMM-GMM). Our experimental results show that the proposed front-end de-noising auto-encoder outperforms the best previously reported result we could find on the Aurora 2.0 clean-condition training task by 1.2% absolute (6.0% vs. 7.2%). In addition, the proposed end-to-end back end performs as well as the traditional HMM-GMM recognizer. |
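The abstract gives only a high-level description of the two components. As an illustration, here is a minimal PyTorch sketch of that pipeline: a two-stage de-noising auto-encoder front end with batch normalization that maps a multi-frame noisy window to a single clean frame, and a bidirectional LSTM back end trained with the CTC criterion. All layer sizes, the context-window width, the label inventory, and the names (`DenoisingFrontEnd`, `CTCBackEnd`, `FEAT_DIM`, `CONTEXT`, `NUM_LABELS`) are assumptions for the sketch, not details taken from the paper.

```python
# Minimal sketch of the abstract's two components; sizes and names are
# illustrative assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn

FEAT_DIM = 39      # per-frame feature dimension (assumed, e.g. MFCC + deltas)
CONTEXT = 11       # noisy input window, in frames (assumed)
NUM_LABELS = 12    # CTC blank + label inventory (assumed)


class DenoisingFrontEnd(nn.Module):
    """Two-stage DAE: multi-frame noisy input -> single-frame clean estimate."""

    def __init__(self):
        super().__init__()
        # Stage 1: coarse estimate of the clean center frame.
        self.stage1 = nn.Sequential(
            nn.Linear(FEAT_DIM * CONTEXT, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, FEAT_DIM),
        )
        # Stage 2: refine the coarse estimate given the noisy context again.
        self.stage2 = nn.Sequential(
            nn.Linear(FEAT_DIM * CONTEXT + FEAT_DIM, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, FEAT_DIM),
        )

    def forward(self, noisy_window):               # (batch, CONTEXT * FEAT_DIM)
        coarse = self.stage1(noisy_window)
        refined = self.stage2(torch.cat([noisy_window, coarse], dim=1))
        return refined                             # (batch, FEAT_DIM)


class CTCBackEnd(nn.Module):
    """Bidirectional LSTM over cleaned frames, emitting per-frame log-probs."""

    def __init__(self):
        super().__init__()
        self.birnn = nn.LSTM(FEAT_DIM, 256, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * 256, NUM_LABELS)

    def forward(self, frames):                     # (batch, time, FEAT_DIM)
        hidden, _ = self.birnn(frames)
        return self.proj(hidden).log_softmax(dim=-1)


# Shape check: denoise frame by frame, then score the utterance with CTC.
batch, time, tgt_len = 4, 200, 7
noisy = torch.randn(batch * time, CONTEXT * FEAT_DIM)
cleaned = DenoisingFrontEnd()(noisy).view(batch, time, FEAT_DIM)
log_probs = CTCBackEnd()(cleaned).transpose(0, 1)  # CTC wants (time, batch, C)
targets = torch.randint(1, NUM_LABELS, (batch, tgt_len))
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    torch.full((batch,), time, dtype=torch.long),
    torch.full((batch,), tgt_len, dtype=torch.long),
)
```

In this sketch the front end runs independently on each frame's context window, so the utterance is flattened to `(batch * time, CONTEXT * FEAT_DIM)` before denoising and reshaped back into a sequence for the recurrent back end; whether the paper trains the two stages jointly or separately is not stated in the abstract.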