2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper ID: MLSP-16.3
Paper Title: CHANNEL-WISE MIX-FUSION DEEP NEURAL NETWORKS FOR ZERO-SHOT LEARNING
Authors: Guowei Wang, Tianjin University, China; Naiyang Guan, National Innovation Institute of Defense Technology, China; Hanjia Ye, Nanjing University, China; Xiaodong Yi, Hang Cheng, Junjie Zhu, National Innovation Institute of Defense Technology, China
Session: MLSP-16: ML and Graphs
Location: Gather.Town
Session Time: Wednesday, 09 June, 14:00 - 14:45
Presentation Time: Wednesday, 09 June, 14:00 - 14:45
Presentation: Poster
Topic: Machine Learning for Signal Processing: [MLR-TRL] Transfer learning
Abstract: Zero-shot learning (ZSL), with the assistance of seen-class images and additional semantic knowledge, generalizes its classification ability to unseen classes by aligning embeddings in the visual-semantic space. Previous methods have mostly investigated whether discriminative visual features help to recognize different classes, while neglecting the rich semantic information in the surrounding background. This paper proposes a channel-wise mix-fusion ZSL model (CMFZ) that contextualizes the ZSL classifier's discriminative information by incorporating much richer visual semantic information from both objects and their surrounding environments. In particular, a channel-wise connection module (CCM) learns to model the relationship between the object and its surroundings. A collaborative channel-wise activation module (CAM) is adopted to learn from a finer-scale image obtained from the cropping module; it highlights the most distinct channels representing the object's discriminative regions so as to suppress inadvertently introduced background noise. Furthermore, the representation ability of the learned mapping is enhanced by integrating the visual semantic features processed by CCM and CAM. Experimental results show that CMFZ outperforms state-of-the-art ZSL methods and verify the effectiveness of incorporating visual semantic information.
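The abstract describes CAM as highlighting the most distinct channels of a feature map to suppress background noise. The paper's exact formulation is not given here, but the general mechanism can be illustrated with a squeeze-and-excitation-style sketch: pool each channel to a scalar, pass the result through a small bottleneck MLP, and use the resulting per-channel gates to re-weight the feature map. All names, shapes, and the reduction ratio `r` below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_wise_activation(features, w1, w2):
    """Re-weight each channel of a (C, H, W) feature map by a learned gate.

    features: (C, H, W) array.
    w1: (C // r, C) bottleneck weight; w2: (C, C // r) expansion weight.
    (Illustrative SE-style gating, not the paper's exact CAM.)
    """
    # Squeeze: global average pooling collapses spatial dims to one scalar per channel.
    z = features.mean(axis=(1, 2))                 # shape (C,)
    # Excite: bottleneck MLP with ReLU, then sigmoid gives a gate in (0, 1) per channel.
    gate = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # shape (C,)
    # Re-scale: discriminative channels are amplified, noisy ones attenuated.
    return features * gate[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feats = rng.standard_normal((C, H, W))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
out = channel_wise_activation(feats, w1, w2)
print(out.shape)  # (8, 4, 4)
```

Because the gate is a single scalar per channel, the operation preserves the spatial pattern within each channel and only changes its relative magnitude, which is what lets such a module emphasize discriminative regions without altering their layout.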