2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information


Technical Program

Paper Detail

Paper ID: SPE-11.3
Paper Title: NON-PARALLEL MANY-TO-MANY VOICE CONVERSION USING LOCAL LINGUISTIC TOKENS
Authors: Chao Wang, Yibiao Yu, Soochow University, China
Session: SPE-11: Voice Conversion 1: Non-parallel Conversion
Location: Gather.Town
Session Time: Tuesday, 08 June, 16:30 - 17:15
Presentation Time: Tuesday, 08 June, 16:30 - 17:15
Presentation: Poster
Topic: Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
Abstract: VQ-VAE-based voice conversion models have lately received increasing attention for non-parallel many-to-many voice conversion, where the encoder extracts speaker-invariant linguistic content from the input speech using vector quantization and the decoder produces the target speech from the encoder output, conditioned on the target speaker representation. However, it is challenging for the encoder to strike a proper balance between removing speaker information and preserving linguistic content, which degrades the quality of the converted speech. To address this issue, we propose the Local Linguistic Tokens (LLTs) model, which learns high-quality speaker-invariant linguistic embeddings with a multi-head attention module of the kind that has proven successful at extracting speaking-style embeddings in Global Style Tokens (GSTs). Unlike vector quantization, the multi-head attention module lets the encoder preserve more linguistic content, enhancing the quality of the converted speech. Both objective and subjective experimental results showed that, compared with the state-of-the-art VQ-VAE model, the proposed LLTs model achieved significantly better speech quality and comparable speaker similarity. The converted samples are available online for listening.
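The core mechanism described in the abstract — replacing the VQ bottleneck with soft multi-head attention over a learnable token bank, applied per encoder frame — can be sketched as follows. This is a minimal illustrative implementation, not the authors' exact architecture; the token count, dimensions, and head count are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class LocalLinguisticTokens(nn.Module):
    """Sketch of a GST-style token layer applied frame-locally:
    each encoder frame attends over a learnable bank of linguistic
    tokens, producing a soft (non-quantized) linguistic embedding
    in place of a hard VQ codebook lookup. All sizes are assumed."""

    def __init__(self, num_tokens=32, token_dim=128, query_dim=128, num_heads=4):
        super().__init__()
        # Learnable token bank, analogous to a VQ codebook but used softly.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.5)
        # Project encoder frames into the token space to form queries.
        self.query_proj = nn.Linear(query_dim, token_dim)
        self.attn = nn.MultiheadAttention(embed_dim=token_dim,
                                          num_heads=num_heads,
                                          batch_first=True)

    def forward(self, enc_out):
        # enc_out: (batch, frames, query_dim) frame-level encoder outputs.
        batch = enc_out.size(0)
        queries = self.query_proj(enc_out)                      # (B, T, D)
        bank = self.tokens.unsqueeze(0).expand(batch, -1, -1)   # (B, N, D)
        # Soft attention over the token bank replaces hard quantization,
        # so gradients flow and more linguistic detail can be retained.
        embeddings, weights = self.attn(queries, bank, bank)
        return embeddings, weights  # (B, T, D), (B, T, N)
```

In a full pipeline, the returned embeddings would be concatenated with a target-speaker representation and fed to the decoder, as in the VQ-VAE setup the abstract describes.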