2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	IVMSP-29.4
Paper Title	AGGREGATION ARCHITECTURE AND ALL-TO-ONE NETWORK FOR REAL-TIME SEMANTIC SEGMENTATION
Authors	Kuntao Cao, Xi Huang, Jie Shao, University of Electronic Science and Technology of China, China
Session	IVMSP-29: Semantic Segmentation
Location	Gather.Town
Session Time	Friday, 11 June, 13:00 - 13:45
Presentation Time	Friday, 11 June, 13:00 - 13:45
Presentation	Poster
Topic	Image, Video, and Multidimensional Signal Processing: [IVARS] Image & Video Analysis, Synthesis, and Retrieval
Abstract	Deep convolutional neural networks have demonstrated outstanding performance in image semantic segmentation. However, the enormous computational complexity of existing high-precision networks limits their use in real-time segmentation tasks, so achieving a good trade-off between accuracy and speed remains a challenge. Existing solutions can be roughly divided into three categories according to network architecture: dilation, encoder-decoder, and multi-pathway, each of which has its advantages. In this paper, we make the following contributions: (i) Unlike the previous three architectures, we propose a new aggregation architecture as the network backbone. (ii) A multi-level auxiliary loss design is used during the training phase, which can improve the segmentation quality of the model. (iii) Based on this aggregation structure, we propose an all-to-one network (ATONet) for real-time semantic segmentation, which achieves a good trade-off between speed and accuracy by assembling the features of all blocks. (iv) The proposed network achieves 74.4% and 70.1% mIoU at inference speeds of 42.7 FPS and 93.5 FPS on the Cityscapes and CamVid datasets, respectively.
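
The accuracy figures in the abstract are reported as mIoU (mean intersection-over-union). As a reading aid, the following is a minimal sketch of how that metric is commonly computed from a confusion matrix; it is not taken from the paper. The function name `mean_iou`, the flat-list input format, and the `ignore_index=255` default are assumptions made for illustration only.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are flat sequences of integer class labels.
    ignore_index is an assumed "void" label that is skipped entirely.
    """
    # confusion[t][p] counts pixels of ground-truth class t predicted as p
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        confusion[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]                                        # true positives
        fn = sum(confusion[c]) - tp                                 # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp  # wrongly labeled as c
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both pred and target
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six "pixels", three classes, the last pixel is ignored.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoU values are 1/2, 2/3, and 1, so the mean is 13/18. Benchmark evaluations typically accumulate the confusion matrix over the whole test set before computing the per-class ratios, which this sketch supports by passing all pixels in a single call.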