Presentation #: 4
Session: Voice Conversion and TTS
Session Time: Friday, December 21, 10:00 - 12:00
Presentation Time: Friday, December 21, 10:00 - 12:00
Presentation: Poster
Topic: Special Session on Speech Synthesis
Paper Title: NEURAL TTS VOICE CONVERSION
Authors:
Zvi Kons, IBM Research
Slava Shechtman, IBM Research
Alex Sorin, IBM Research
Ron Hoory, IBM Research
Carmel Rabinovitz, IBM Research
Edmilson Da Silva Morais, IBM Research
Abstract:
Recently, speaker adaptation of neural TTS models has received significant interest, and several studies focusing on this topic have been published. All of them explore adaptation of an initial multi-speaker model trained on a corpus containing tens to hundreds of individual speaker voices. In this work we focus on the challenging task of TTS voice conversion, where an initial system is trained on single-speaker data and then needs to be adapted to a variety of external speaker voices. The TTS voice conversion setup represents an important use case: transcribed multi-speaker datasets may be unavailable for many languages, while any TTS technology provider is expected to have at least one suitable single-speaker dataset per supported language. We present a neural TTS system comprising separate prosody generator and synthesizer DNN models. The system is trained on a high-quality proprietary male-speaker dataset. We show that the system models can be converted to a variety of ordinary external male and female voices as well as to an extremely expressive artist's voice, and we present crowd-based subjective evaluation results.
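To make the two-model decomposition mentioned in the abstract concrete, the following is a minimal PyTorch sketch of a separate prosody generator and synthesizer, fine-tuned on a small amount of target-speaker data. All module names, layer sizes, feature dimensions, loss terms, and the dummy data are illustrative assumptions, not the architecture or training procedure described in the paper.

```python
import torch
import torch.nn as nn

class ProsodyGenerator(nn.Module):
    """Predicts prosodic features (e.g. duration/pitch targets) from phonetic input."""
    def __init__(self, phone_dim=64, hidden=128, prosody_dim=4):
        super().__init__()
        self.rnn = nn.GRU(phone_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, prosody_dim)

    def forward(self, phone_feats):           # (batch, time, phone_dim)
        h, _ = self.rnn(phone_feats)
        return self.out(h)                    # (batch, time, prosody_dim)

class Synthesizer(nn.Module):
    """Maps phonetic plus prosodic features to acoustic frames (e.g. mel spectra)."""
    def __init__(self, phone_dim=64, prosody_dim=4, hidden=256, mel_dim=80):
        super().__init__()
        self.rnn = nn.GRU(phone_dim + prosody_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, mel_dim)

    def forward(self, phone_feats, prosody):
        h, _ = self.rnn(torch.cat([phone_feats, prosody], dim=-1))
        return self.out(h)

# Voice conversion as model adaptation: start from models trained on a single speaker
# and fine-tune both on target-speaker data. Random tensors stand in for that data here.
prosody_gen, synth = ProsodyGenerator(), Synthesizer()
opt = torch.optim.Adam(
    list(prosody_gen.parameters()) + list(synth.parameters()), lr=1e-4
)

phone_feats = torch.randn(2, 50, 64)      # dummy phonetic features
target_prosody = torch.randn(2, 50, 4)    # dummy target prosody
target_mel = torch.randn(2, 50, 80)       # dummy target mel frames

pred_prosody = prosody_gen(phone_feats)
pred_mel = synth(phone_feats, pred_prosody)
loss = (nn.functional.l1_loss(pred_prosody, target_prosody)
        + nn.functional.l1_loss(pred_mel, target_mel))
opt.zero_grad()
loss.backward()
opt.step()
```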