2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper ID	SS-13.3
Paper Title	CONVOLUTIVE TRANSFER FUNCTION INVARIANT SDR TRAINING CRITERIA FOR MULTI-CHANNEL REVERBERANT SPEECH SEPARATION
Authors	Christoph Boeddeker, Paderborn University, Germany; Wangyou Zhang, Shanghai Jiao Tong University, China; Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Naoyuki Kamo, NTT Corporation, Japan; Yanmin Qian, Shanghai Jiao Tong University, China; Reinhold Haeb-Umbach, Paderborn University, Germany
Session	SS-13: Recent Advances in Multichannel and Multimodal Machine Learning for Speech Applications
Location	Gather.Town
Session Time:	Thursday, 10 June, 16:30 - 17:15
Presentation Time:	Thursday, 10 June, 16:30 - 17:15
Presentation	Poster
Topic	Special Sessions: Recent Advances in Multichannel and Multimodal Machine Learning for Speech Applications
IEEE Xplore Open Preview	Click here to view in IEEE Xplore
Virtual Presentation	Click here to watch in the Virtual Conference
Abstract	Time-domain training criteria have proven to be very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural network supported multi-channel source separation with a time-domain training objective function. For the objective we propose to use a convolutive transfer function invariant Signal-to-Distortion Ratio (CI-SDR) based loss. While this is a well-known evaluation metric (BSS Eval), it has not been used as a training objective before. To show the effectiveness, we demonstrate the performance on LibriSpeech based reverberant mixtures. On this task, the proposed system approaches the error rate obtained on single-source non-reverberant input, i.e., LibriSpeech test_clean, with a difference of only 1.2 percentage points, thus outperforming a conventional permutation invariant training based system and alternative objectives like Scale Invariant Signal-to-Distortion Ratio by a large margin.