Presentation #: 12
Session: ASR I
Location: Kallirhoe Hall
Session Time: Wednesday, December 19, 10:00 - 12:00
Presentation Time: Wednesday, December 19, 10:00 - 12:00
Presentation: Poster
Topic: Speech recognition and synthesis
Paper Title: IMPROVING VERY DEEP TIME-DELAY NEURAL NETWORK WITH VERTICAL-ATTENTION FOR EFFECTIVELY TRAINING CTC-BASED ASR SYSTEMS
Authors: Sheng Li, Xugang Lu, Ryoichi Takashima, Peng Shen, National Institute of Information and Communications Technology (NICT), Japan; Tatsuya Kawahara, NICT / Kyoto University, Japan; Hisashi Kawai, NICT, Japan
Abstract: Very deep neural networks have recently been proposed for speech recognition and achieve significant performance gains, and they have excellent potential for integration with end-to-end (E2E) training. Connectionist temporal classification (CTC), in turn, has shown great promise for E2E acoustic modeling. In this study, we investigate deep architectures and techniques suitable for CTC-based acoustic modeling and propose a very deep residual time-delay CTC neural network (VResTD-CTC). Selecting a deep architecture that optimizes well under the CTC objective function is crucial for obtaining state-of-the-art performance: architectures tuned for non-E2E ASR systems that model tied-triphone states perform excellently, but they are not guaranteed to achieve better, or even comparable, performance on E2E (e.g., CTC-based) systems that model dynamic acoustic units. To solve this problem and further improve system performance, we introduce a vertical-attention mechanism that reweights the outputs of the residual blocks at each time step. Speech recognition experiments show that the proposed model significantly outperforms DNN-based and LSTM-based (both bidirectional and unidirectional) CTC baseline models.
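The abstract does not detail how the reweighting is computed. The following is a minimal PyTorch sketch of one plausible reading of the vertical-attention step, in which the per-time-step outputs of the K residual blocks are scored by a shared linear layer (a hypothetical choice) and combined by a softmax-weighted sum over the block (vertical) axis. All names and shapes are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class VerticalAttention(nn.Module):
        """Sketch: reweight the outputs of K residual blocks at every time step.

        Assumes each block emits a hidden vector per frame; a shared linear
        layer (a hypothetical scoring function) produces one scalar score per
        block, normalized with a softmax across blocks.
        """

        def __init__(self, hidden_dim: int):
            super().__init__()
            self.score = nn.Linear(hidden_dim, 1)  # per-block scalar score

        def forward(self, block_outputs: torch.Tensor) -> torch.Tensor:
            # block_outputs: (batch, time, num_blocks, hidden_dim)
            scores = self.score(block_outputs)           # (B, T, K, 1)
            weights = torch.softmax(scores, dim=2)       # normalize over blocks
            return (weights * block_outputs).sum(dim=2)  # (B, T, hidden_dim)

    if __name__ == "__main__":
        # Toy shapes: 4 residual blocks, 128-dim hidden vectors, 50 frames.
        attn = VerticalAttention(hidden_dim=128)
        x = torch.randn(2, 50, 4, 128)   # (batch, time, blocks, hidden)
        print(attn(x).shape)             # torch.Size([2, 50, 128])

Because the softmax is taken over the block axis independently at every frame, the weighting can shift between shallow and deep residual blocks from one time step to the next, which matches the abstract's description of reweighting "at each time step".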