SLT 2018 • Technical Program • 2018 IEEE Workshop on Spoken Language Technology (SLT) | 18-21 December 2018

Paper Detail

Presentation #

Session:

ASR IV

Session Time:

Friday, December 21, 13:30 - 15:30

Presentation Time:

Friday, December 21, 13:30 - 15:30

Presentation:

Poster

Topic:

Speech recognition and synthesis:

Paper Title:

Multi-channel multi-speaker overlapped speech recognition with location guided speech extraction network

Authors:

Zhuo Chen; Microsoft Cloud & AI

Xiong Xiao; Microsoft Cloud & AI

Takuya Yoshioka; Microsoft Cloud & AI

Jinyu Li; Microsoft Cloud & AI

Hakan Erdogan; Microsoft Cloud & AI

Yifan Gong; Microsoft Cloud & AI

Abstract:

Although advances in close-talk speech recognition have resulted in relatively low error rates, recognition performance in far-field environments remains limited by low signal-to-noise ratio, reverberation, and, most difficult of all, overlapped speech from simultaneous speakers. Beamforming and speech separation networks have previously been proposed to address these problems. However, they usually suffer from the leaky speech phenomenon or from limited performance due to poor generalization. In this work, we propose a simple yet effective method for multi-channel far-field overlapped speech recognition. In the proposed system, three different features are formed for each target speaker, namely spectral, spatial, and angle features. A neural network is then trained on all of these features, with the clean speech of the target speaker as the training target. An iterative optimization procedure is also proposed, in which mask-based beamforming and mask estimation are performed alternately. The proposed system was evaluated on real recorded meetings with different overlap ratios. The results show that the proposed system achieves a relative word error rate (WER) reduction of more than 24% compared with fixed beamforming with oracle selection. Moreover, as the overlap ratio rises from 20% to 70+%, the WER of the proposed system increases by only 3.8%.
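
Illustrative sketch (not taken from the paper): the abstract does not define the angle feature, so the snippet below assumes one common formulation for a single microphone pair, in which the feature for a time-frequency bin is the cosine of the gap between the observed inter-channel phase difference and the phase difference expected for a far-field source at the target speaker's direction. The function names, the two-microphone geometry, and the speed-of-sound constant are all hypothetical choices made for illustration.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def expected_phase_difference(freq_hz, mic_distance_m, angle_rad):
    """Phase difference (radians) between two microphones for a far-field
    plane wave arriving at angle_rad, measured from the microphone axis."""
    delay_s = mic_distance_m * math.cos(angle_rad) / SPEED_OF_SOUND_M_S
    return 2.0 * math.pi * freq_hz * delay_s


def angle_feature(bin_mic1, bin_mic2, freq_hz, mic_distance_m, angle_rad):
    """Assumed angle feature for one time-frequency bin: close to 1 when the
    observed inter-channel phase difference matches the target direction,
    smaller when the bin is dominated by sound from elsewhere."""
    observed = cmath.phase(bin_mic1 * bin_mic2.conjugate())
    expected = expected_phase_difference(freq_hz, mic_distance_m, angle_rad)
    return math.cos(observed - expected)


# Usage: simulate one bin for a source at 60 degrees, 1 kHz, 5 cm spacing.
freq_hz, spacing_m = 1000.0, 0.05
source_angle = math.radians(60.0)
shift = expected_phase_difference(freq_hz, spacing_m, source_angle)
bin_mic1 = 1.0 + 0.0j
bin_mic2 = cmath.exp(-1j * shift)  # second microphone lags by the expected shift

matched = angle_feature(bin_mic1, bin_mic2, freq_hz, spacing_m, source_angle)
mismatched = angle_feature(bin_mic1, bin_mic2, freq_hz, spacing_m,
                           math.radians(120.0))
print(matched, mismatched)  # matched is close to 1.0; mismatched is lower
```

A feature of this kind gives the network a per-bin cue about which speaker a bin belongs to, which is one plausible reading of the "location guided" extraction named in the paper title.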