2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper ID	SPE-32.2
Paper Title	A FURTHER STUDY OF UNSUPERVISED PRETRAINING FOR TRANSFORMER BASED SPEECH RECOGNITION
Authors	Dongwei Jiang, Wubo Li, Ruixiong Zhang, Miao Cao, Ne Luo, Yang Han, Wei Zou, Kun Han, Xiangang Li, Didi Chuxing, China
Session	SPE-32: Speech Recognition 12: Self-supervised, Semi-supervised, Unsupervised Training
Location	Gather.Town
Session Time:	Thursday, 10 June, 13:00 - 13:45
Presentation Time:	Thursday, 10 June, 13:00 - 13:45
Presentation	Poster
Topic	Speech Processing: [SPE-GASR] General Topics in Speech Recognition
IEEE Xplore Open Preview	Click here to view in IEEE Xplore
Virtual Presentation	Click here to watch in the Virtual Conference
Abstract	The construction of an effective good speech recognition system typically requires large amounts of transcribed data, which is expensive to collect. To overcome this problem, many unsupervised pretraining methods have been proposed. Among these methods, Masked Predictive Coding achieved significant improvements on various speech recognition datasets with BERT-like Masked Reconstruction loss and transformer backbone. However, many aspects of MPC have yet to be fully investigated. In this paper, we conduct a further study on MPC and focus on three important aspects: the effect of pretraining data speaking style, its extension on streaming model, and strategies for better transferring learned knowledge from pretraining stage to downstream tasks. The experimental results demonstrated that pretraining data with a matching speaking style is more useful on downstream recognition tasks. A unified training objective with APC and MPC provided an 8.46% relative error reduction on the streaming model trained on HKUST. Additionally, the combination of target data adaption and layerwise discriminative training facilitated the knowledge transfer of MPC, which realized 3.99% relative error reduction on AISHELL over a strong baseline.