2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	SS-13.1
Paper Title	AN EMPIRICAL STUDY OF VISUAL FEATURES FOR DNN BASED AUDIO-VISUAL SPEECH ENHANCEMENT IN MULTI-TALKER ENVIRONMENTS
Authors	Shrishti Saha Shetu, Soumitro Chakrabarty, Emanuël Habets, Fraunhofer IIS, Germany
Session	SS-13: Recent Advances in Multichannel and Multimodal Machine Learning for Speech Applications
Location	Gather.Town
Session Time	Thursday, 10 June, 16:30 - 17:15
Presentation Time	Thursday, 10 June, 16:30 - 17:15
Presentation	Poster
Topic	Special Sessions: Recent Advances in Multichannel and Multimodal Machine Learning for Speech Applications
Abstract	Audio-visual speech enhancement (AVSE) methods use both audio and visual features for speech enhancement, and visual features have been shown to be particularly effective in multi-speaker scenarios. In most deep neural network (DNN) based AVSE methods, the audio and visual data are first processed separately by different sub-networks, and the learned features are then fused to exploit information from both modalities. Various studies have examined suitable audio input features and network architectures; however, to the best of our knowledge, no study in the literature has examined which visual features are best suited to this specific task. In this work, we perform an empirical study of the most commonly used visual features for DNN-based AVSE, examine the pre-processing requirements of each, and investigate their influence on performance. Our study shows that although embedding-based features perform better overall, their computationally intensive pre-processing makes them difficult to use in low-resource systems. For such systems, optical flow or raw pixel-based features are better suited.