Technical Program

Paper Detail

Presentation #4
Session: Deep Learning for Speech Synthesis
Location: Kallirhoe Hall
Session Time: Tuesday, December 18, 14:00 - 17:00
Presentation Time: Tuesday, December 18, 14:00 - 17:00
Presentation: Invited talk, Discussion, Oral presentation, Poster session
Topic: Speech recognition and synthesis
Paper Title: HIERARCHICAL RNNS FOR WAVEFORM-LEVEL SPEECH SYNTHESIS
Authors: Qingyun Dou, Moquan Wan, Gilles Degottex, Zhiyi Ma, Mark Gales, University of Cambridge, United Kingdom
Abstract: Speech synthesis technology has a wide range of applications, such as voice assistants. In recent years, waveform-level synthesis systems have achieved state-of-the-art performance, as they overcome the limitations of vocoder-based synthesis systems. A range of waveform-level synthesis systems have been proposed; this paper investigates the performance of hierarchical Recurrent Neural Networks (RNNs) for speech synthesis. First, the form of network conditioning is discussed, comparing linguistic features and vocoder features from a vocoder-based synthesis system. It is found that, compared with linguistic features, conditioning on vocoder features requires less data and modeling power, and yields better performance when data is limited. By conditioning the hierarchical RNN on vocoder features, this paper develops a neural vocoder, which is capable of high-quality synthesis when there is sufficient data. Furthermore, this neural vocoder is flexible, as conceptually it can map any sequence of vocoder features to speech, enabling efficient synthesizer porting to a target speaker. Subjective listening tests demonstrate that the neural vocoder outperforms a high-quality baseline, and that it can change its voice to that of a very different speaker, given less than 15 minutes of data for fine-tuning.