Paper ID | MLSP-10.6
Paper Title | EFFICIENT SPEECH EMOTION RECOGNITION USING MULTI-SCALE CNN AND ATTENTION
Authors | Zixuan Peng, Yu Lu, Shengfeng Pan, Yunfeng Liu; Zhuiyi Technology, China
Session | MLSP-10: Deep Learning for Speech and Audio
Location | Gather.Town
Session Time | Tuesday, 08 June, 16:30 - 17:15
Presentation Time | Tuesday, 08 June, 16:30 - 17:15
Presentation | Poster
Topic | Machine Learning for Signal Processing: [MLR-LMM] Learning from multimodal data
Abstract | Emotion recognition from speech is a challenging task. Recent advances in deep learning have established the bi-directional recurrent neural network (Bi-RNN) with an attention mechanism as a standard method for speech emotion recognition: multi-modal features (audio and text) are extracted and attended to, then fused for downstream emotion classification. In this paper, we propose a simple yet efficient neural network architecture that exploits both acoustic and lexical information from speech. The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain both audio and text hidden representations. A statistical pooling unit (SPU) then further extracts the features in each modality. In addition, an attention module can be built on top of the MSCNN-SPU (audio) and MSCNN (text) to further improve performance. Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset with four emotion categories (angry, happy, sad, and neutral) in both weighted accuracy (WA) and unweighted accuracy (UA), with improvements of 5.0% and 5.2%, respectively, under the ASR setting.
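The MSCNN and SPU stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The kernel sizes (3, 5, 7), the feature dimensions, the random weights, and the choice of mean, max, and standard deviation as the pooled statistics are all assumptions made for illustration; the abstract does not specify them. The attention module and the audio-text fusion are omitted.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1-D convolution over time with zero padding and ReLU.
    x: (T, D_in), w: (k, D_in, D_out), b: (D_out,) -> (T, D_out)."""
    k = w.shape[0]
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    xp = np.pad(x, ((pad_left, pad_right), (0, 0)))
    T = x.shape[0]
    # Contract each length-k window with the kernel over (time, input channel).
    out = np.stack([
        np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1])) for t in range(T)
    ])
    return np.maximum(out + b, 0.0)

def mscnn(x, kernels):
    """Apply parallel convolutions with different kernel widths and
    concatenate their outputs along the channel axis."""
    return np.concatenate([conv1d_same(x, w, b) for w, b in kernels], axis=1)

def statistical_pooling(h):
    """Collapse the time axis into a fixed-size vector.
    Assumed statistics: mean, max, and standard deviation."""
    return np.concatenate([h.mean(axis=0), h.max(axis=0), h.std(axis=0)])

# Hypothetical input: 50 frames of 40-dimensional acoustic features.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 40))

# Hypothetical kernels: three scales, 8 output channels each.
kernels = [
    (rng.standard_normal((k, 40, 8)) * 0.1, np.zeros(8)) for k in (3, 5, 7)
]

h = mscnn(x, kernels)         # frame-level representation, shape (50, 24)
z = statistical_pooling(h)    # utterance-level vector, shape (72,)
print(h.shape, z.shape)
```

The pooled vector has a fixed length regardless of the number of frames, which is what lets a convolutional front end of this kind feed a classifier without a recurrent layer.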