Paper ID | SPE-4.4 | ||
Paper Title | IMPROVING NATURALNESS AND CONTROLLABILITY OF SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS BY LEARNING LOCAL PROSODY REPRESENTATIONS | ||
Authors | Cheng Gong, Longbiao Wang, Tianjin University, China; Zhenhua Ling, University of Science and Technology of China, China; Shaotong Guo, Tianjin University, China; Ju Zhang, Huiyan Technology (Tianjin) Co., Ltd, China; Jianwu Dang, Japan Advanced Institute of Science and Technology, Japan | ||
Session | SPE-4: Speech Synthesis 2: Controllability | ||
Location | Gather.Town | ||
Session Time | Tuesday, 08 June, 13:00 - 13:45 | ||
Presentation Time | Tuesday, 08 June, 13:00 - 13:45 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation | ||
Abstract | State-of-the-art neural text-to-speech (TTS) networks are trained with a large amount of speech data, which significantly improves the quality of synthetic speech compared with traditional approaches. However, the prosody and controllability of the generated speech are still insufficient, especially in tonal languages. Moreover, the generated prosody is defined solely by the input text, which does not allow for different styles for the same sentence or words. In this study, we extended Tacotron2 with a pitch prediction task to capture discrete pitch-related representations. Specifically, the learned pitch-related suprasegmental information is fed, together with traditional character features, into the decoder to generate the final Mel spectrogram. Experiments show that the proposed method can improve the quality of the generated speech (mean opinion score of 4.37 vs. 4.22). Moreover, we demonstrated that word-level pitch control can easily be achieved during generation by changing the local pitch-related representations before passing them to the decoder network. |
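
The word-level pitch control described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the pitch-related representations are combined with the character features, so concatenation is assumed here. The number of pitch classes, the feature sizes, the random lookup tables, and the helper names (`make_table`, `shift_word_pitch`, `build_decoder_inputs`) are all hypothetical and stand in for learned network components.

```python
import random

# Hypothetical sizes; the abstract does not specify any of these values.
NUM_PITCH_CLASSES = 8   # number of discrete pitch classes
CHAR_DIM = 4            # size of each character feature vector
PITCH_DIM = 2           # size of each pitch embedding


def make_table(rows, dim, seed):
    """Build a random lookup table standing in for learned parameters."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


# Stand-in for a learned embedding of the discrete pitch classes.
PITCH_TABLE = make_table(NUM_PITCH_CLASSES, PITCH_DIM, seed=0)


def shift_word_pitch(pitch_ids, start, end, delta):
    """Return a copy of pitch_ids with positions [start, end) shifted by
    delta and clamped to the valid range of pitch classes."""
    edited = list(pitch_ids)
    for i in range(start, end):
        edited[i] = max(0, min(NUM_PITCH_CLASSES - 1, edited[i] + delta))
    return edited


def build_decoder_inputs(char_features, pitch_ids):
    """Concatenate each character feature vector with the embedding of its
    pitch class (concatenation is an assumption, not taken from the paper)."""
    if len(char_features) != len(pitch_ids):
        raise ValueError("one pitch class is required per character")
    return [list(c) + PITCH_TABLE[p] for c, p in zip(char_features, pitch_ids)]


# Usage: raise the pitch of the second word in "so good".
text = "so good"
char_features = make_table(len(text), CHAR_DIM, seed=1)  # stand-in encoder output
predicted_ids = [3, 3, 0, 4, 5, 5, 6]                    # stand-in pitch predictions
edited_ids = shift_word_pitch(predicted_ids, start=3, end=7, delta=2)
decoder_inputs = build_decoder_inputs(char_features, edited_ids)
```

In this sketch only the pitch classes of the selected word change; the character features and the pitch classes of the other word are passed through untouched, which is the sense in which the control is local.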