Technical Program

Paper Detail

Presentation #	2
Session:	ASR III (End-to-End)
Location:	Kallirhoe Hall
Session Time:	Friday, December 21, 10:00 - 12:00
Presentation Time:	Friday, December 21, 10:00 - 12:00
Presentation:	Poster
Topic:	Speech recognition and synthesis
Paper Title:	COMBINING END-TO-END AND ADVERSARIAL TRAINING FOR LOW-RESOURCE SPEECH RECOGNITION
Authors:	Jennifer Drexler, James Glass, Massachusetts Institute of Technology, United States
Abstract:	In this paper, we develop an end-to-end automatic speech recognition (ASR) model designed for a common low-resource scenario: no pronunciation dictionary or phonemic transcripts, very limited transcribed speech, and much larger non-parallel text and speech corpora. Our semi-supervised model is built on top of an encoder-decoder model with attention and takes advantage of non-parallel speech and text corpora in several ways: a denoising text autoencoder that shares parameters with the ASR decoder, a speech autoencoder that shares parameters with the ASR encoder, and adversarial training that encourages the speech and text encoders to use the same embedding space. We show that a model with this architecture significantly outperforms the baseline in this low-resource condition. We additionally perform an ablation evaluation, demonstrating that all of our added components contribute substantially to the overall performance of our model. We propose several avenues for further work, noting in particular that a model with this architecture could potentially enable fully unsupervised speech recognition.
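
The parameter sharing described in the abstract can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the module names (`speech_encoder`, `text_encoder`, `decoder`, `speech_decoder`, `discriminator`), the task names, and the weighted-sum objective are all assumptions made for the example. It shows which modules each training task would touch, and how setting a task weight to zero mimics removing a component in an ablation.

```python
# Illustrative sketch only. Module names, task names, and the weighted-sum
# objective are assumptions for this example, not details from the paper.

# Hypothetical map from each training task to the modules it updates.
TASK_MODULES = {
    "asr": {"speech_encoder", "decoder"},                  # paired speech -> text
    "text_autoencoder": {"text_encoder", "decoder"},       # non-parallel text
    "speech_autoencoder": {"speech_encoder", "speech_decoder"},  # non-parallel speech
    "adversarial": {"speech_encoder", "text_encoder", "discriminator"},
}


def shared_modules(task_a, task_b):
    """Return the modules that both tasks update (their shared parameters)."""
    return TASK_MODULES[task_a] & TASK_MODULES[task_b]


def combined_loss(losses, weights=None):
    """Weighted sum of per-task losses.

    Tasks missing from `weights` default to weight 1.0; a weight of 0.0
    removes that task's contribution, as in an ablation.
    """
    weights = weights or {}
    return sum(weights.get(task, 1.0) * value for task, value in losses.items())
```

Under these assumptions, `shared_modules("asr", "text_autoencoder")` yields the decoder and `shared_modules("asr", "speech_autoencoder")` yields the speech encoder, matching the sharing pattern the abstract describes.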