Presentation # | 4 |
Session: | Deep Learning for Speech Synthesis |
Session Time: | Tuesday, December 18, 14:00 - 17:00 |
Presentation Time: | Tuesday, December 18, 14:00 - 17:00 |
Presentation: | Invited talk, Discussion, Oral presentation, Poster session |
Topic: | Speech recognition and synthesis |
Paper Title: | HIERARCHICAL RNNS FOR WAVEFORM-LEVEL SPEECH SYNTHESIS |
Authors: | Qingyun Dou; University of Cambridge |
| Moquan Wan; University of Cambridge |
| Gilles Degottex; University of Cambridge |
| Zhiyi Ma; University of Cambridge |
| Mark Gales; University of Cambridge |
Abstract: |
Speech synthesis technology has a wide range of applications, such as voice assistants. In recent years, waveform-level synthesis systems have achieved state-of-the-art performance, as they overcome the limitations of vocoder-based synthesis systems. A range of waveform-level synthesis systems have been proposed; this paper investigates the performance of hierarchical Recurrent Neural Networks (RNNs) for speech synthesis. First, the form of network conditioning is discussed, comparing linguistic features with vocoder features from a vocoder-based synthesis system. It is found that, compared with linguistic features, conditioning on vocoder features requires less data and modeling power, and yields better performance when data is limited. By conditioning the hierarchical RNN on vocoder features, this paper develops a neural vocoder that is capable of high-quality synthesis when there is sufficient data. Furthermore, this neural vocoder is flexible: conceptually, it can map any sequence of vocoder features to speech, enabling efficient synthesizer porting to a target speaker. Subjective listening tests demonstrate that the neural vocoder outperforms a high-quality baseline, and that it can change its voice to that of a very different speaker, given less than 15 minutes of data for fine-tuning. |
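
The abstract does not specify the network architecture, so the following is only a minimal sketch of how a hierarchical RNN might be conditioned on vocoder features, assuming a two-tier layout: a frame-level recurrent tier that consumes one vocoder feature vector per frame, and a sample-level tier that predicts a distribution over quantized waveform samples. All sizes, weight names, and the `synthesize` function are hypothetical, and the weights are random and untrained, so the output is noise rather than speech.

```python
import math
import random

# All sizes below are illustrative assumptions, not values from the paper.
FRAME = 4     # waveform samples generated per frame-level step
LEVELS = 16   # number of quantization levels for a waveform sample
HIDDEN = 8    # size of the frame-level recurrent state
FEAT = 3      # dimension of one vocoder feature vector


def rand_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize(vocoder_frames, seed=0):
    """Map a sequence of vocoder feature vectors to quantized samples."""
    rng = random.Random(seed)
    w_in = rand_matrix(HIDDEN, FEAT + FRAME, rng)   # frame tier, input
    w_rec = rand_matrix(HIDDEN, HIDDEN, rng)        # frame tier, recurrent
    w_out = rand_matrix(LEVELS, HIDDEN + 1, rng)    # sample tier, output

    hidden = [0.0] * HIDDEN
    prev_frame = [0.0] * FRAME  # previous samples, rescaled to [-1, 1]
    waveform = []
    for feats in vocoder_frames:
        # Frame tier: one recurrent update per frame, conditioned on the
        # vocoder features and on the previously generated samples.
        drive = matvec(w_in, list(feats) + prev_frame)
        recur = matvec(w_rec, hidden)
        hidden = [math.tanh(d + r) for d, r in zip(drive, recur)]

        # Sample tier: draw each sample from a categorical distribution
        # that depends on the frame state and the previous sample.
        prev_sample = prev_frame[-1]
        new_frame = []
        for _ in range(FRAME):
            probs = softmax(matvec(w_out, hidden + [prev_sample]))
            level = rng.choices(range(LEVELS), weights=probs)[0]
            waveform.append(level)
            prev_sample = 2.0 * level / (LEVELS - 1) - 1.0
            new_frame.append(prev_sample)
        prev_frame = new_frame
    return waveform
```

In this sketch, replacing `vocoder_frames` with features from a different speaker changes only the conditioning input, which is the sense in which such a model can, conceptually, map any sequence of vocoder features to a waveform.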