Paper ID | SPE-30.2 |
Paper Title |
IMPROVING AUDIO ANOMALIES RECOGNITION USING TEMPORAL CONVOLUTIONAL ATTENTION NETWORK |
Authors |
Qiang Huang, Thomas Hain, University of Sheffield, United Kingdom |
Session | SPE-30: Speech Processing 2: General Topics |
Location | Gather.Town |
Session Time: | Wednesday, 09 June, 16:30 - 17:15 |
Presentation Time: | Wednesday, 09 June, 16:30 - 17:15 |
Presentation | Poster |
Topic |
Speech Processing: [SPE-SPER] Speech Perception and Psychoacoustics |
Abstract |
Anomalous audio in speech recordings is often caused by speaker voice distortion, external noise, or even electrical interference. These distortions pose a serious problem in fields such as high-quality music mixing and speech analysis. In this paper, a novel approach using a temporal convolutional attention network (TCAN) is proposed to tackle this problem. A temporal convolutional network (TCN) can capture long-range patterns using a hierarchy of temporal convolutional filters. To improve robustness to audio anomalies under different acoustic conditions, an attention mechanism is added to the TCN: a self-attention block follows each temporal convolutional layer, aiming to highlight target-related features. To evaluate the proposed model, audio recordings are collected from the TIMIT dataset and corrupted with five types of audio distortion: Gaussian noise, magnitude drift, random dropout, reduction of temporal resolution, and time warping. Distortions are mixed at six signal-to-noise ratios (SNRs): 5 dB, 10 dB, 15 dB, 20 dB, 25 dB, and 30 dB. The experimental results show that the proposed model yields better classification performance than several strong baselines, with approximately 3$\sim$10\% relative improvement. |
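The abstract describes the core building block as a temporal convolutional layer followed by a self-attention block. The PyTorch sketch below illustrates one plausible reading of that design; the channel width, kernel size, number of attention heads, and the residual/normalization details are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalConvAttentionBlock(nn.Module):
    """One TCN layer (dilated causal 1-D convolution) followed by a
    self-attention block, per the abstract's description. All
    hyperparameters here are illustrative assumptions."""
    def __init__(self, channels, kernel_size=3, dilation=1, n_heads=4):
        super().__init__()
        # Left-only (causal) padding keeps output length equal to input length.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.attn = nn.MultiheadAttention(channels, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                        # x: (batch, channels, time)
        h = F.pad(x, (self.pad, 0))              # pad on the left for causality
        h = torch.relu(self.conv(h))
        h = h.transpose(1, 2)                    # (batch, time, channels) for attention
        a, _ = self.attn(h, h, h)                # self-attention over time steps
        h = self.norm(h + a)                     # residual connection + layer norm
        return h.transpose(1, 2)                 # back to (batch, channels, time)

# Stacking blocks with exponentially growing dilation gives the hierarchy of
# temporal filters that captures long-range patterns.
net = nn.Sequential(*[TemporalConvAttentionBlock(32, dilation=2**i) for i in range(4)])
out = net(torch.randn(2, 32, 100))               # time dimension is preserved
```

The exponentially increasing dilation (1, 2, 4, 8) is the standard TCN recipe for widening the receptive field without deepening the network; the attention block after each layer can then re-weight time steps that are most relevant to the anomaly class.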