2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Paper Detail

Paper ID SPE-11.2
Paper Title NON-PARALLEL MANY-TO-MANY VOICE CONVERSION BY KNOWLEDGE TRANSFER FROM A TEXT-TO-SPEECH MODEL
Authors Xinyuan Yu, Brian Mak, The Hong Kong University of Science and Technology, Hong Kong SAR China
Session SPE-11: Voice Conversion 1: Non-parallel Conversion
Location Gather.Town
Session Time: Tuesday, 08 June, 16:30 - 17:15
Presentation Time: Tuesday, 08 June, 16:30 - 17:15
Presentation Poster
Topic Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
Abstract In this paper, we present a simple but novel framework to train a non-parallel many-to-many voice conversion (VC) model based on the encoder-decoder architecture. It is observed that an encoder-decoder text-to-speech (TTS) model and an encoder-decoder VC model have the same structure. Thus, we propose to pre-train a multi-speaker encoder-decoder TTS model and transfer knowledge from the TTS model to a VC model by (1) adopting the TTS acoustic decoder as the VC acoustic decoder, and (2) forcing the VC speech encoder to learn the same speaker-agnostic linguistic features from the TTS text encoder so as to achieve speaker disentanglement in the VC encoder output. We further control the conversion of the pitch contour from source speech to target speech, and condition the VC decoder on the converted pitch contour during inference. Subjective evaluation shows that our proposed model is able to handle VC between any speaker pairs in the training speech corpus of over 200 speakers with high naturalness and speaker similarity.
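
The two ideas in the abstract that lend themselves to a small illustration are the encoder-level knowledge transfer and the pitch-contour conversion. The sketch below is illustrative only and is not the authors' implementation. It assumes (a) that the TTS text-encoder features and the VC speech-encoder features have already been aligned to the same length, and that a mean squared error is used to pull the VC features toward the TTS features, and (b) that pitch is converted with a log-domain mean/variance transformation, a common technique in voice conversion that the abstract does not confirm. The function names `feature_matching_loss` and `convert_f0` are hypothetical.

```python
import math


def feature_matching_loss(teacher_feats, student_feats):
    """Mean squared error between two aligned sequences of feature vectors.

    teacher_feats: stand-in for TTS text-encoder outputs (assumed fixed).
    student_feats: stand-in for VC speech-encoder outputs (being trained).
    Assumption: both sequences are already aligned frame by frame.
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("sequences must be aligned to the same length")
    total, count = 0.0, 0
    for t_vec, s_vec in zip(teacher_feats, student_feats):
        if len(t_vec) != len(s_vec):
            raise ValueError("feature vectors must have the same dimension")
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty feature sequences")
    return total / count


def convert_f0(f0_hz, src_mean, src_std, tgt_mean, tgt_std):
    """Map a source F0 contour to a target speaker's log-F0 statistics.

    The means and standard deviations are statistics of log-F0 over
    voiced frames. Unvoiced frames (F0 <= 0) are kept as 0.0.
    Assumption: a log-domain mean/variance transform, not necessarily
    the pitch conversion used in the paper.
    """
    converted = []
    for f in f0_hz:
        if f <= 0:
            converted.append(0.0)
        else:
            z = (math.log(f) - src_mean) / src_std
            converted.append(math.exp(z * tgt_std + tgt_mean))
    return converted
```

In a setup like the one the abstract describes, a loss of this kind would be minimized with respect to the VC speech encoder only, while the acoustic decoder is taken over from the pre-trained TTS model. The converted contour from `convert_f0` would then be supplied to the decoder as conditioning at inference time.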