IEEE ICASSP 2021 || Toronto, Ontario, Canada || 6-11 June 2021

My ICASSP 2021 Schedule

Note: Your custom schedule will not be saved unless you create a new account or login to an existing account.

Create a login based on your email (takes less than one minute)
Perform 'Paper Search'
Select papers that you desire to save in your personalized schedule
Click on 'My Schedule' to see the current list of selected papers
Click on 'Printable Version' to create a separate window suitable for printing (the header and menu will appear, but will not actually print)

Paper Detail

Paper ID

SPE-4.3

Paper Title

PROSODIC CLUSTERING FOR PHONEME-LEVEL PROSODY CONTROL IN END-TO-END SPEECH SYNTHESIS

Authors

Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Innoetics, Samsung Electronics, Greece; Taehoon Kim, June Sig Sung, Hyoungmin Park, Mobile Communications Business, Samsung Electronics, South Korea; Aimilios Chalamandaris, Pirros Tsiakoulis, Innoetics, Samsung Electronics, Greece

Session

SPE-4: Speech Synthesis 2: Controllability

Location

Gather.Town

Session Time:

Tuesday, 08 June, 13:00 - 13:45

Presentation Time:

Tuesday, 08 June, 13:00 - 13:45

Presentation

Poster

Topic

Speech Processing: [SPE-SYNT] Speech Synthesis and Generation

IEEE Xplore Open Preview

Click here to view in IEEE Xplore

Abstract

This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

My ICASSP 2021 Schedule

Paper Detail