SLT 2018 Paper Detail

Presentation #4
Session: ASR IV
Session Time:Friday, December 21, 13:30 - 15:30
Presentation Time:Friday, December 21, 13:30 - 15:30
Presentation: Poster
Topic: Speech recognition and synthesis
Paper Title: MULTICHANNEL ASR WITH KNOWLEDGE DISTILLATION AND GENERALIZED CROSS CORRELATION FEATURE
Authors: Wenjie Li; Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics 
 Yu Zhang; Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics 
 Pengyuan Zhang; Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics 
 Fengpei Ge; Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics 
Abstract: Multi-channel signal processing techniques have played an important role in far-field automatic speech recognition (ASR) as a separate front-end enhancement stage. However, they often suffer from a mismatch problem. In this paper, we propose a novel acoustic model architecture in which the multi-channel speech is used directly, without preprocessing. In addition, knowledge distillation and generalized cross correlation (GCC) adaptation are employed. We use knowledge distillation to transfer knowledge from a well-trained close-talking model to distant-talking scenarios for every frame of the multi-channel distant speech. Moreover, the GCC between microphones, which contains spatial information, is supplied as an auxiliary input to the neural network. We observe that the two techniques complement each other well. Evaluated on the AMI and ICSI meeting corpora, the proposed methods achieve relative WER improvements of 7.7% and 7.5% over a model trained directly on the concatenated multi-channel speech.
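
For readers who want a concrete picture of the two ingredients mentioned in the abstract, the sketch below is an illustrative example only, not the authors' implementation: the function names, frame handling, number of lags, temperature, and loss weight are all assumptions. It shows a standard GCC-PHAT computation between two microphone frames (the kind of spatial feature that could serve as an auxiliary network input) and a frame-level knowledge-distillation loss that mixes cross-entropy on hard labels with a KL term toward a close-talking teacher's softened posteriors.

    # Illustrative sketch only -- not the paper's implementation.
    # Frame length, n_lags, temperature, and kd_weight are assumed values.
    import numpy as np
    import torch
    import torch.nn.functional as F

    def gcc_phat_frame(frame_a, frame_b, n_lags=20):
        """GCC-PHAT between two microphone frames; returns the 2*n_lags+1 central lags."""
        n_fft = 2 * len(frame_a)                      # zero-pad to avoid circular wrap-around
        A = np.fft.rfft(frame_a, n=n_fft)
        B = np.fft.rfft(frame_b, n=n_fft)
        cross = A * np.conj(B)
        cross /= np.abs(cross) + 1e-8                 # PHAT weighting: keep phase, discard magnitude
        cc = np.fft.irfft(cross, n=n_fft)
        # Reorder so the lags run from -n_lags to +n_lags around zero delay.
        return np.concatenate((cc[-n_lags:], cc[:n_lags + 1])).astype(np.float32)

    def distillation_loss(student_logits, teacher_logits, hard_labels,
                          temperature=2.0, kd_weight=0.5):
        """Frame-level loss: cross-entropy on hard labels plus KL divergence
        toward the close-talking teacher's softened posteriors."""
        ce = F.cross_entropy(student_logits, hard_labels)
        t_log_p = F.log_softmax(teacher_logits / temperature, dim=-1)
        s_log_p = F.log_softmax(student_logits / temperature, dim=-1)
        kl = F.kl_div(s_log_p, t_log_p, reduction="batchmean",
                      log_target=True) * temperature ** 2
        return (1.0 - kd_weight) * ce + kd_weight * kl

In a setup like the one the abstract describes, a GCC vector of this kind would presumably be appended to the per-frame multi-channel acoustic features, while the teacher logits would come from the close-talking model applied to the parallel close-talk recording of the same frame.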