Paper ID | SPE-4.2 | ||
Paper Title | FCL-TACO2: TOWARDS FAST, CONTROLLABLE AND LIGHTWEIGHT TEXT-TO-SPEECH SYNTHESIS | ||
Authors | Disong Wang, The Chinese University of Hong Kong, Hong Kong SAR China; Liqun Deng, Yang Zhang, Nianzu Zheng, Yu Ting Yeung, Xiao Chen, Huawei Noah's Ark Lab, China; Xunying Liu, Helen Meng, The Chinese University of Hong Kong, Hong Kong SAR China | ||
Session | SPE-4: Speech Synthesis 2: Controllability | ||
Location | Gather.Town | ||
Session Time | Tuesday, 08 June, 13:00 - 13:45 | ||
Presentation Time | Tuesday, 08 June, 13:00 - 13:45 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation | ||
Abstract | Sequence-to-sequence (seq2seq) learning has greatly improved text-to-speech (TTS) synthesis performance, but effective implementation on resource-restricted devices remains challenging because seq2seq models are usually computationally expensive and memory intensive. To achieve fast inference and a small model size while maintaining high-quality speech, we propose FCL-taco2, a Fast, Controllable and Lightweight (FCL) TTS model based on Tacotron2. FCL-taco2 adopts a novel semi-autoregressive (SAR) mode for phoneme-level parallel mel-spectrogram generation conditioned on prosody features, leading to faster inference and higher prosody controllability than Tacotron2. In addition, knowledge distillation (KD) is used to compress a relatively large FCL-taco2 model into a small version with minor loss of speech quality. Experimental results on English (EN) and Chinese (CN) datasets show that the small version of FCL-taco2 achieves speech quality comparable to Tacotron2, with a 4.8x smaller footprint and inference that is on average 17.7x and 18.5x faster for the EN and CN experiments, respectively. Furthermore, execution on mobile devices shows that the proposed model can achieve faster than real-time speech synthesis. Our code and audio samples are released. |
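The abstract's claim of "faster than real-time" synthesis is commonly expressed through the real-time factor (RTF), the ratio of synthesis time to the duration of the generated audio, where a value below 1 means the audio is produced faster than it plays back. The sketch below is only an illustration of that arithmetic: the function names and the timings are hypothetical and are not taken from the paper or its released code.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_faster_than_real_time(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when the audio is generated in less time than it takes to play."""
    return real_time_factor(synthesis_seconds, audio_seconds) < 1.0


# Hypothetical timings, chosen only for illustration (not results from the paper).
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=2.0)
print(f"RTF = {rtf}")
print("faster than real time:", is_faster_than_real_time(0.5, 2.0))
```

Under this definition, a reported speedup over a baseline corresponds to dividing the baseline's RTF by that factor, provided both systems are measured on the same utterances and hardware.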