Paper Detail

Presentation #8
Session: Deep Learning for Speech Synthesis
Location: Kallirhoe Hall
Session Time: Tuesday, December 18, 14:00 - 17:00
Presentation Time: Tuesday, December 18, 14:00 - 17:00
Presentation: Invited talk, Discussion, Oral presentation, Poster session
Topic: Speech recognition and synthesis
Paper Title: MULTI-SCALE ALIGNMENT AND CONTEXTUAL HISTORY FOR ATTENTION MECHANISM IN SEQUENCE-TO-SEQUENCE MODEL
Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, Nara Institute of Science and Technology, Japan
Abstract: A sequence-to-sequence model is a neural network module for mapping between two sequences of different lengths. It has three core modules: encoder, decoder, and attention. Attention is the bridge that connects the encoder and decoder modules and improves model performance in many tasks. In this paper, we propose two ideas to improve sequence-to-sequence model performance by enhancing the attention module. First, we maintain a history of the alignment locations and the expected contexts from several previous time-steps. Second, we apply multi-scale convolution to several previous attention vectors and combine the result with the current decoder state. We applied our proposed framework to sequence-to-sequence speech recognition and text-to-speech systems. The results reveal that our proposed extension significantly improves performance compared with a standard attention baseline.
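
The abstract gives no implementation details, so the PyTorch sketch below is only one plausible reading of the two ideas: a rolling history of past alignments and context vectors, plus multi-scale 1-D convolutions over the stacked past alignments, folded into an additive attention score. Every name, hyperparameter (history length, kernel sizes, channel count), and the exact way the history enters the score function is an assumption for illustration, not the authors' published formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleHistoryAttention(nn.Module):
    """Additive attention extended with (1) a history of previous
    alignments and context vectors and (2) multi-scale 1-D convolutions
    over the past attention weights. Illustrative sketch only."""

    def __init__(self, enc_dim, dec_dim, attn_dim,
                 history=3, kernel_sizes=(3, 5, 7), conv_ch=8):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        # One conv branch per kernel size: the "multi-scale" view of
        # the stacked previous alignments (assumed kernel sizes).
        self.convs = nn.ModuleList(
            [nn.Conv1d(history, conv_ch, k, padding=k // 2)
             for k in kernel_sizes]
        )
        self.W_loc = nn.Linear(conv_ch * len(kernel_sizes), attn_dim, bias=False)
        # Projects an average of the past context vectors (assumption:
        # the paper may combine the context history differently).
        self.W_ctx = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_out, dec_state, prev_align, prev_ctx):
        # enc_out:    (B, T, enc_dim)        encoder states
        # dec_state:  (B, dec_dim)           current decoder state
        # prev_align: (B, history, T)        stacked past attention weights
        # prev_ctx:   (B, history, enc_dim)  stacked past context vectors
        loc = torch.cat([conv(prev_align) for conv in self.convs], dim=1)
        loc = loc.transpose(1, 2)            # (B, T, conv_ch * n_scales)
        ctx_hist = prev_ctx.mean(dim=1)      # (B, enc_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_out)
            + self.W_dec(dec_state).unsqueeze(1)
            + self.W_loc(loc)
            + self.W_ctx(ctx_hist).unsqueeze(1)
        )).squeeze(-1)                       # (B, T)
        align = F.softmax(scores, dim=-1)
        context = torch.bmm(align.unsqueeze(1), enc_out).squeeze(1)
        return context, align

# Toy usage with random tensors (B=2, T=50):
B, T = 2, 50
attn = MultiScaleHistoryAttention(enc_dim=256, dec_dim=512, attn_dim=128)
enc_out = torch.randn(B, T, 256)
dec_state = torch.randn(B, 512)
prev_align = torch.zeros(B, 3, T)    # rolling buffer of past alignments
prev_ctx = torch.zeros(B, 3, 256)    # rolling buffer of past contexts
context, align = attn(enc_out, dec_state, prev_align, prev_ctx)
```

At decoding time, the `prev_align` and `prev_ctx` buffers would be updated in a rolling fashion after each step, zero-initialized for the first few steps.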