Technical Program

Paper Detail

Presentation #2
Session: Deep Learning for Speech Synthesis
Location: Kallirhoe Hall
Session Time: Tuesday, December 18, 14:00 - 17:00
Presentation Time: Tuesday, December 18, 14:00 - 17:00
Presentation: Invited talk, Discussion, Oral presentation, Poster session
Topic: Special session on Speech Synthesis
Paper Title: A SPECTRALLY WEIGHTED MIXTURE OF LEAST SQUARE ERROR AND WASSERSTEIN DISCRIMINATOR LOSS FOR GENERATIVE SPSS
Authors: Gilles Degottex, ObEN, Inc. - University of Cambridge, United Kingdom; Mark Gales, University of Cambridge, United Kingdom
Abstract: Generative networks can create an artificial spectrum based on an estimate of its conditional distribution, instead of predicting only the mean value as the Least Square (LS) solution does. This is promising, since the LS predictor is known to oversmooth features, which leads to muffling effects. However, modeling a whole distribution instead of a single mean value requires more data and thus more computational resources. With only one hour of recordings, as often used with LS approaches, the resulting spectrum is noisy and the synthesized speech is full of artifacts. In this paper, we suggest a new loss function that mixes the LS error with the loss of a discriminator trained with Wasserstein GAN, and that weights this mixture differently across the frequency domain. Using listening tests, we show that, with this mixed loss, the generated spectrum is smooth enough to obtain a decent perceived quality. By making our source code available online, we also hope to make generative networks more accessible by lowering the necessary resources.
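The abstract does not give the exact form of the mixed loss, so the following is only a minimal sketch of one way a frequency-weighted mixture of an LS error and a Wasserstein generator loss could be written. The function names (`frequency_weights`, `mixed_loss`), the sigmoid-shaped weighting, the `cutoff_bin` and `transition_bins` parameters, and the assumption that the critic returns one score per frequency bin are all hypothetical choices made for illustration. They are not taken from the paper or its released source code.

```python
import numpy as np


def frequency_weights(n_bins, cutoff_bin, transition_bins):
    """Hypothetical per-bin mixing weights in (0, 1).

    Values near 1 favour the LS term and values near 0 favour the critic term.
    A sigmoid ramp is assumed here; the paper may use a different shape.
    """
    bins = np.arange(n_bins)
    return 1.0 / (1.0 + np.exp((bins - cutoff_bin) / transition_bins))


def mixed_loss(generated, target, critic_scores, weights):
    """Frequency-weighted mix of an LS error and a Wasserstein generator loss.

    generated, target: arrays of shape (frames, n_bins).
    critic_scores: critic outputs for the generated spectrum, assumed here
        to have shape (frames, n_bins), i.e. one score per bin.
    weights: array of shape (n_bins,), the LS share for each bin.
    """
    # Per-bin least-squares error, averaged over frames.
    ls_term = np.mean((generated - target) ** 2, axis=0)
    # Per-bin Wasserstein generator loss: the generator tries to raise
    # the critic score, so the loss is its negative.
    critic_term = -np.mean(critic_scores, axis=0)
    # Blend the two terms bin by bin, then average over frequency.
    return float(np.mean(weights * ls_term + (1.0 - weights) * critic_term))
```

With all weights set to 1 this sketch reduces to the plain mean squared error, and with all weights set to 0 it reduces to the usual Wasserstein generator loss, so the weights control where in the spectrum each behaviour dominates.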