Technical Program

Paper Detail

Presentation #6
Session: ASR IV
Location: Kallirhoe Hall
Session Time: Friday, December 21, 13:30 - 15:30
Presentation Time: Friday, December 21, 13:30 - 15:30
Presentation: Poster
Topic: Speech recognition and synthesis
Paper Title: LEVERAGING SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR ENHANCING ACOUSTIC-TO-WORD SPEECH RECOGNITION
Authors: Masato Mimura, Sei Ueno, Hirofumi Inaguma, Shinsuke Sakai, Tatsuya Kawahara, Kyoto University, Japan
Abstract: Encoder-decoder models for acoustic-to-word (A2W) automatic speech recognition (ASR) are attractive for their architectural simplicity and low run-time latency while achieving state-of-the-art performance. However, word-based models commonly suffer from the out-of-vocabulary (OOV) word problem, and they cannot leverage text-only data to improve their language modeling capability. Recently, sequence-to-sequence neural speech synthesis models trainable from corpora have been developed and shown to achieve naturalness comparable to recorded human speech. In this paper, we explore how current speech synthesis technology can be leveraged to tailor an ASR system to a target domain by preparing only a relevant text corpus. From a set of target-domain texts, we generate speech features using a sequence-to-sequence speech synthesizer. These artificial speech features, together with real speech features from conventional speech corpora, are used to train an attention-based A2W model. Experimental results show that the proposed approach significantly improves word accuracy over the baseline trained only with real speech, even though the synthetic part of the training data comes from only a single female speaker's voice.
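The data-augmentation pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: `synthesize_features` is a hypothetical stand-in for a trained sequence-to-sequence TTS model, and the feature shapes (80-dimensional frames, suggestive of log-mel filterbanks) are assumptions.

```python
import random

def synthesize_features(text):
    # Hypothetical stub for a seq2seq TTS model. In practice this would
    # return acoustic features (e.g. log-mel filterbanks) predicted for
    # the input text; here we fabricate one 80-dim frame per character.
    return [[0.0] * 80 for _ in range(len(text))]

def build_training_set(real_corpus, target_domain_texts, seed=0):
    """Mix real speech features with TTS-generated features.

    real_corpus: list of (features, transcript) pairs from speech corpora.
    target_domain_texts: transcripts only; their features are synthesized.
    The combined set would then train an attention-based A2W model.
    """
    synthetic = [(synthesize_features(t), t) for t in target_domain_texts]
    mixed = list(real_corpus) + synthetic
    random.Random(seed).shuffle(mixed)  # interleave real and synthetic data
    return mixed

# Usage: two real utterances plus two target-domain texts
real = [([[0.1] * 80] * 50, "hello world"),
        ([[0.2] * 80] * 60, "good morning")]
texts = ["stock prices rose", "the meeting is at noon"]
data = build_training_set(real, texts)
```

The key design point is that only text is needed for the target domain; acoustic features for that text come entirely from the synthesizer (here, a single voice), while acoustic variability is supplied by the conventional speech corpora.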