SLT 2018 Paper Detail

Presentation #6
Session: Deep Learning for Speech Synthesis
Session Time: Tuesday, December 18, 14:00 - 17:00
Presentation Time: Tuesday, December 18, 14:00 - 17:00
Presentation: Invited talk, Discussion, Oral presentation, Poster session
Topic: Speech recognition and synthesis
Paper Title: SYNTHETIC-TO-NATURAL SPEECH WAVEFORM CONVERSION USING CYCLE-CONSISTENT ADVERSARIAL NETWORKS
Authors: Kou Tanaka; NTT Corporation
 Takuhiro Kaneko; NTT Corporation
 Nobukatsu Hojo; NTT Corporation
 Hirokazu Kameoka; NTT Corporation
Abstract: We propose a learning-based filter that directly converts the waveform of synthetic speech into that of natural speech. Speech-processing systems built on a vocoder framework, such as statistical parametric speech synthesis and voice conversion, are convenient, especially with limited data, because they represent and process interpretable acoustic features, such as the fundamental frequency and mel-cepstrum, over a compact space. However, a well-known cause of quality degradation in the generated speech is an over-smoothing effect, in which the generated/converted acoustic features lack detailed structure. To address this issue, we propose synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks, which requires no explicit assumption about the speech waveform during adversarial learning. In contrast to existing techniques, our modification is performed at the waveform level, so we expect the proposed method to produce "vocoder-less"-sounding speech even when the input speech is synthesized within the vocoder framework. Experimental results demonstrate that the proposed method 1) alleviates the over-smoothing effect of the acoustic features even though the modification is applied directly to the waveform, and 2) dramatically improves the naturalness of the generated speech.
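
As a rough illustration of the cycle-consistent adversarial training the abstract refers to, the sketch below shows how an unpaired synthetic-to-natural waveform mapping might combine a least-squares adversarial loss with a cycle-consistency loss. This is a minimal sketch in PyTorch: the toy architectures, names (G_s2n, D_nat, lambda_cyc), and hyperparameters are all assumptions for illustration, not the authors' actual models or objective, which the abstract does not specify.

# Minimal sketch of a cycle-consistent adversarial objective for
# synthetic-to-natural waveform conversion. Architectures and weights
# below are illustrative assumptions, not the paper's networks.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Toy 1-D convolutional waveform-to-waveform mapper (assumed)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=15, padding=7),
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Toy critic scoring waveform segments as natural vs. generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(16, 1, kernel_size=15, stride=4),
        )

    def forward(self, x):
        return self.net(x)

# G_s2n maps synthetic -> natural waveforms; G_n2s maps the reverse.
G_s2n, G_n2s = Generator(), Generator()
D_nat, D_syn = Discriminator(), Discriminator()  # one critic per domain

mse, l1 = nn.MSELoss(), nn.L1Loss()
lambda_cyc = 10.0  # cycle-consistency weight (assumed value)

def generator_loss(x_syn, x_nat):
    """Adversarial + cycle-consistency loss for one unpaired batch."""
    fake_nat = G_s2n(x_syn)  # synthetic pushed toward the natural domain
    fake_syn = G_n2s(x_nat)  # natural pushed toward the synthetic domain
    # Least-squares adversarial terms: fool each domain's discriminator.
    score_nat, score_syn = D_nat(fake_nat), D_syn(fake_syn)
    adv = mse(score_nat, torch.ones_like(score_nat)) + \
          mse(score_syn, torch.ones_like(score_syn))
    # Cycle consistency: mapping there and back should recover the input,
    # which is what removes the need for paired synthetic/natural data.
    cyc = l1(G_n2s(fake_nat), x_syn) + l1(G_s2n(fake_syn), x_nat)
    return adv + lambda_cyc * cyc

# Example: one loss evaluation on random 1-second waveform batches.
x_syn = torch.randn(4, 1, 16000)  # batch of synthetic waveforms
x_nat = torch.randn(4, 1, 16000)  # unpaired batch of natural waveforms
loss = generator_loss(x_syn, x_nat)
loss.backward()

In a full training loop the discriminators would be updated in an alternating step with real and generated waveforms; operating on raw samples rather than vocoder parameters is what lets such a filter restore the fine structure that over-smoothed acoustic features lack.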