IEEE ICASSP 2022

2022 IEEE International Conference on Acoustics, Speech and Signal Processing

7-13 May 2022
  • Virtual (all paper presentations)
22-27 May 2022
  • Main Venue: Marina Bay Sands Expo & Convention Center, Singapore
27-28 October 2022
  • Satellite Venue: Crowne Plaza Shenzhen Longgang City Centre, Shenzhen, China

ST-10: A latinamerican multi-accent TTS with voice cloning capabilities
Wed, 11 May, 23:00 - 23:45 China Time (UTC +8)
Wed, 11 May, 15:00 - 15:45 UTC
Location: Gather Area P
Virtual
Gather.Town
Show & Tell
Presented by:
  • Leonardo Daniel Pepino, Instituto de Investigación en Ciencias de la Computación (ICC), CONICET-UBA, Argentina
  • Pablo Ernesto Riera, Instituto de Investigación en Ciencias de la Computación (ICC), CONICET-UBA, Argentina
  • Ricardo Germán Barchi, Instituto de Investigación en Ciencias de la Computación (ICC), CONICET-UBA, Argentina
  • Joaquín Giaccio Rittel (no affiliation)

In this demo we present our text-to-speech (TTS) system, based on FastSpeech, which synthesizes high-quality Spanish voices. We also show a custom algorithm that makes it possible to use our model for voice cloning at a low computational cost.

The presentation will be structured as follows:
1. General overview of our end-to-end machine learning text-to-speech algorithm.
2. Audio demos.
   2.1. Multimedia content generation.
   2.2. Voice cloning of famous voices and voice reconstruction for patients suffering from phonation pathologies.
3. Q&A.

Very few Spanish TTS systems currently exist on the market, and most of them have a neutral, Spanish (Spain) or Mexican accent. We focused our efforts on providing an Argentinian Spanish TTS, which feels more natural to Argentinian users and constitutes the main novelty of our project. We also experimented with accents from other Latin American countries, such as Colombia and Chile, by curating a multi-accent Spanish dataset.

The innovations we hope to show to the signal processing community at ICASSP are:
- Multi-accent model: our model supports a wide variety of Spanish accents (Argentina, Chile, Colombia, Spain, Mexico, Peru, Puerto Rico, Uruguay and Venezuela).
- Voice cloning: our model can be adapted to a particular speaker with less than 10 minutes of speech from the target speaker.
- Prosody control: the duration, pitch and energy of the synthetic voice can be controlled, allowing for voice conversion.

This project has a high impact on the signal processing community because speech technologies focused on Latin American countries are scarce, and we think that it can promote interest in developing high-quality speech synthesis technologies for countries in that region.

We plan to allow interaction with our TTS through a website during the conference, so that attendees can listen to the synthesized voices in real time and experiment with the available parameters.
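
To illustrate the duration part of the prosody control mentioned above, here is a minimal sketch of a FastSpeech-style length regulator, which repeats each phoneme's hidden state for a number of output frames. This is not the presenters' implementation: the function name `length_regulate`, the `speed` parameter and the toy values are assumptions made only for this example.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_states: Sequence[Sequence[float]],
    durations: Sequence[int],
    speed: float = 1.0,
) -> List[List[float]]:
    """Expand per-phoneme states into per-frame states.

    Each phoneme state is repeated round(duration / speed) times, so
    speed > 1 gives shorter (faster) speech and speed < 1 gives longer
    (slower) speech. All names here are hypothetical.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    if speed <= 0:
        raise ValueError("speed must be positive")

    frames: List[List[float]] = []
    for state, duration in zip(phoneme_states, durations):
        n_frames = max(0, round(duration / speed))
        # Copy the state for every frame so frames do not share one list.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Toy example: three phonemes with one-dimensional hidden states.
states = [[0.1], [0.2], [0.3]]
durations = [2, 4, 6]  # frames per phoneme, e.g. from a duration predictor

print(len(length_regulate(states, durations)))             # 12 frames
print(len(length_regulate(states, durations, speed=2.0)))  # 6 frames
print(len(length_regulate(states, durations, speed=0.5)))  # 24 frames
```

In FastSpeech-style models, pitch and energy are typically handled in a similar spirit, by scaling or replacing the predicted per-frame values before decoding. Whether the demonstrated system exposes its controls this way is not stated in the abstract.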