2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper ID: IVMSP-29.4
Paper Title: AGGREGATION ARCHITECTURE AND ALL-TO-ONE NETWORK FOR REAL-TIME SEMANTIC SEGMENTATION
Authors: Kuntao Cao, Xi Huang, Jie Shao, University of Electronic Science and Technology of China, China
Session: IVMSP-29: Semantic Segmentation
Location: Gather.Town
Session Time: Friday, 11 June, 13:00 - 13:45
Presentation Time: Friday, 11 June, 13:00 - 13:45
Presentation: Poster
Topic: Image, Video, and Multidimensional Signal Processing: [IVARS] Image & Video Analysis, Synthesis, and Retrieval
Abstract: Deep convolutional neural networks have demonstrated outstanding performance in the field of image semantic segmentation. However, the enormous computational complexity of existing high-precision networks limits their application to real-time segmentation tasks, so achieving a good trade-off between accuracy and speed becomes a challenge. Existing solutions can be roughly divided into three categories according to the network architecture: dilation, encoder-decoder, and multi-pathway, each of which has its advantages. In this paper, we make the following contributions: (i) First, unlike the previous three architectures, we propose a new aggregation architecture as the network backbone. (ii) Second, a multi-level auxiliary loss design is used in the training phase, which improves segmentation performance. (iii) Third, based on this aggregation structure, an all-to-one network (ATONet) for real-time semantic segmentation is proposed, which achieves a good trade-off between speed and accuracy by assembling the features of all blocks. (iv) Finally, the proposed network achieves 74.4% and 70.1% mIoU at inference speeds of 42.7 FPS and 93.5 FPS on the Cityscapes and CamVid datasets, respectively.
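
The abstract describes two ideas at a high level: an "all-to-one" fusion in which features from every backbone block are assembled into a single representation, and a multi-level auxiliary loss used during training. The sketch below illustrates that general pattern in PyTorch. It is a minimal, hypothetical reconstruction for illustration only: the block structure, channel widths, fusion rule, and auxiliary-loss weight are assumptions and do not reproduce the authors' actual ATONet design.

```python
# Hypothetical sketch of an all-to-one feature aggregation with a
# multi-level auxiliary loss. Sizes and fusion choices are illustrative
# assumptions, not the configuration from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )


class AllToOneSketch(nn.Module):
    def __init__(self, num_classes=19, widths=(32, 64, 128, 256)):
        super().__init__()
        # Simple strided backbone standing in for the aggregation backbone.
        chans = [3] + list(widths)
        self.blocks = nn.ModuleList(
            ConvBNReLU(chans[i], chans[i + 1], stride=2) for i in range(len(widths))
        )
        # Project every block's output to a shared width before fusion.
        self.proj = nn.ModuleList(nn.Conv2d(w, 128, 1, bias=False) for w in widths)
        self.fuse = ConvBNReLU(128, 128)
        self.head = nn.Conv2d(128, num_classes, 1)
        # One auxiliary head per block for the multi-level auxiliary loss.
        self.aux_heads = nn.ModuleList(nn.Conv2d(w, num_classes, 1) for w in widths)

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        # "All-to-one": bring every level to the highest resolution and sum.
        size = feats[0].shape[-2:]
        fused = sum(
            F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        )
        out = self.head(self.fuse(fused))
        aux = [h(f) for h, f in zip(self.aux_heads, feats)]
        return out, aux


def multi_level_loss(out, aux, target, aux_weight=0.4):
    """Main cross-entropy loss plus weighted auxiliary losses (weight assumed)."""
    size = target.shape[-2:]

    def ce(logits):
        logits = F.interpolate(logits, size=size, mode="bilinear", align_corners=False)
        return F.cross_entropy(logits, target, ignore_index=255)

    return ce(out) + aux_weight * sum(ce(a) for a in aux)


if __name__ == "__main__":
    model = AllToOneSketch()
    img = torch.randn(2, 3, 256, 512)
    target = torch.randint(0, 19, (2, 256, 512))
    out, aux = model(img)
    print(out.shape, multi_level_loss(out, aux, target).item())
```

In this sketch the auxiliary heads contribute only to the training loss; at inference time only the fused prediction would be used, which is consistent with the speed/accuracy trade-off the abstract emphasizes.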