Paper Detail

Presentation #1
Session: Deep Learning for Speech Synthesis
Session Time: Tuesday, December 18, 14:00 - 17:00
Presentation Time: Tuesday, December 18, 14:00 - 17:00
Presentation: Invited talk, Discussion, Oral presentation, Poster session
Topic: Special session on Speech Synthesis
Paper Title: PREDICTING EXPRESSIVE SPEAKING STYLE FROM TEXT IN END-TO-END SPEECH SYNTHESIS
Authors: Daisy Stanton, Google; Yuxuan Wang, Google; RJ Ryan, Google
Abstract: Global Style Tokens (GSTs) are a recently proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as "virtual" speaking style labels within Tacotron. TP-GST learns to predict stylistic renderings from text alone, requiring neither explicit labels during training nor auxiliary inputs for inference. We show that, when trained on an expressive speech dataset, our system can render text with more pitch and energy variation than two state-of-the-art baseline models. We further demonstrate that TP-GSTs can synthesize speech with background noise removed, and corroborate these analyses with positive results on human-rated listener preference tasks on audiobooks. Finally, we demonstrate that multi-speaker TP-GST models successfully factorize speaker identity and speaking style. We provide a website with audio samples for each of our findings.
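
The abstract describes the core mechanism: a text-side prediction head that outputs GST combination weights (or a style embedding directly), so that speaking style can be inferred from text alone, with no reference audio or style labels at inference time. The sketch below is a minimal illustration of that idea, not the authors' implementation; all names (StylePredictor, text_dim, num_tokens, token_dim) are hypothetical, and it assumes a pooled Tacotron text-encoder state as input.

    # Minimal sketch (assumed, not the paper's code) of the TP-GST idea:
    # predict GST combination weights from text features, then form a style
    # embedding as the weighted sum of learned style tokens.
    import torch
    import torch.nn as nn

    class StylePredictor(nn.Module):
        def __init__(self, text_dim=256, num_tokens=10, token_dim=256):
            super().__init__()
            # Learned global style tokens, as in GST-Tacotron.
            self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
            # Text-side head that predicts "virtual" style-label weights.
            self.head = nn.Linear(text_dim, num_tokens)

        def forward(self, text_encoding):
            # text_encoding: (batch, text_dim), e.g. a pooled encoder state.
            weights = torch.softmax(self.head(text_encoding), dim=-1)
            # Style embedding = weighted combination of tokens; no reference
            # audio or explicit style labels are needed at inference time.
            return weights @ self.tokens

    # Usage: condition the Tacotron decoder on the predicted style embedding.
    predictor = StylePredictor()
    style = predictor(torch.randn(4, 256))  # shape: (4, 256)

In the abstract's terms, the weighted sum above corresponds to the combination-weights variant; the alternative it mentions would instead regress the style embedding directly from the text encoding.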