Paper ID | SPE-33.5
Paper Title | UNSUPERVISED LEARNING FOR MULTI-STYLE SPEECH SYNTHESIS WITH LIMITED DATA
Authors | Shuang Liang, Chenfeng Miao, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao, Ping An Technology, China
Session | SPE-33: Speech Synthesis 5: Prosody & Style |
Location | Gather.Town |
Session Time | Thursday, 10 June, 13:00 - 13:45
Presentation Time | Thursday, 10 June, 13:00 - 13:45
Presentation | Poster
Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
Abstract |
Existing multi-style speech synthesis methods require either style labels or large amounts of unlabeled training data, making data acquisition difficult. In this paper, we present an unsupervised multi-style speech synthesis method that can be trained with limited data. We leverage an instance discriminator to guide a style encoder to learn meaningful style representations from a multi-style dataset. Furthermore, we employ an information bottleneck to filter out style-irrelevant information in the representations, which improves speech quality and style similarity. Our method is able to produce desirable speech using a fairly small dataset, on which the baseline GST-Tacotron fails. ABX tests show that our model significantly outperforms GST-Tacotron on both the emotional speech synthesis task and the multi-speaker speech synthesis task. In addition, we demonstrate that our method is able to learn meaningful style features with only 50 training samples per style.
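The abstract gives no implementation details, so the sketch below is only a rough illustration of the two ideas it names, not the authors' code: a style encoder trained with an instance-discrimination loss (each training utterance treated as its own class) combined with a variational information bottleneck penalty on the style embedding. All module choices, dimensions, names, and the VIB formulation are assumptions made for this sketch.

# Minimal PyTorch sketch (illustrative assumptions only, not the paper's implementation):
# a style encoder with an instance-discrimination head and a variational
# information bottleneck (VIB) regularizer on the style embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    def __init__(self, n_mels=80, style_dim=128, num_instances=1000):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        # VIB: predict mean and log-variance of the style embedding
        self.to_mu = nn.Linear(256, style_dim)
        self.to_logvar = nn.Linear(256, style_dim)
        # Instance discriminator: one "class" per training utterance
        self.instance_head = nn.Linear(style_dim, num_instances)

    def forward(self, mels):                      # mels: (B, T, n_mels)
        _, h = self.rnn(mels)                     # h: (1, B, 256)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization: sample the bottlenecked style embedding
        style = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return style, mu, logvar

def training_losses(encoder, mels, instance_ids, beta=1e-3):
    style, mu, logvar = encoder(mels)
    # Instance discrimination: classify which utterance the embedding came from
    logits = encoder.instance_head(style)
    disc_loss = F.cross_entropy(logits, instance_ids)
    # Information bottleneck: KL(q(z|x) || N(0, I)) discourages the embedding
    # from carrying style-irrelevant content
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return disc_loss + beta * kl

In a full system, such a loss would be added to the synthesizer's reconstruction objective, with the sampled style embedding conditioning the decoder; the weight beta trades off how aggressively the bottleneck filters the representation.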