2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	MLSP-16.3
Paper Title	CHANNEL-WISE MIX-FUSION DEEP NEURAL NETWORKS FOR ZERO-SHOT LEARNING
Authors	Guowei Wang, Tianjin University, China; Naiyang Guan, National Innovation Institute of Defense Technology, China; Hanjia Ye, Nanjing University, China; Xiaodong Yi, Hang Cheng, Junjie Zhu, National Innovation Institute of Defense Technology, China
Session	MLSP-16: ML and Graphs
Location	Gather.Town
Session Time	Wednesday, 09 June, 14:00 - 14:45
Presentation Time	Wednesday, 09 June, 14:00 - 14:45
Presentation	Poster
Topic	Machine Learning for Signal Processing: [MLR-TRL] Transfer learning
Abstract	Zero-shot learning (ZSL) uses seen-class images and additional semantic knowledge to generalize its classification ability to unseen classes by aligning visual and semantic embeddings. Previous methods have mostly examined whether discriminative visual features help to recognize different classes, while neglecting the rich semantic information in the surrounding background. This paper proposes a channel-wise mix-fusion ZSL model (CMFZ) that contextualizes the ZSL classifier’s discriminative information by incorporating richer visual semantic information from both objects and their surrounding environments. In particular, the channel-wise connection module (CCM) learns the relationship between the object and its surroundings. A collaborative channel-wise activation module (CAM) learns from a finer-scale image obtained from the cropping module; it highlights the most distinctive channels, which represent the object’s discriminative regions, to eliminate inadvertently introduced background noise. Furthermore, the representation ability of the learned mapping is enhanced by integrating the visual semantic features processed by CCM and CAM. Experimental results show that CMFZ outperforms state-of-the-art ZSL methods and verify the effectiveness of incorporating visual semantic information.
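
The ideas named in the abstract (channel-wise reweighting of a cropped object view, fusion with a full-image view, and classification by visual-semantic alignment) can be illustrated with a toy sketch. This is not the CMFZ architecture: the sigmoid gate, the concatenation used as fusion, the hand-picked projection matrix, and the class attribute vectors are all assumptions chosen for illustration, and the CCM is not modeled at all. All function and variable names are hypothetical.

```python
import math


def pool(feature_map):
    """Global average pooling: one scalar per channel of a C x H x W nested list."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def channel_gate(feature_map):
    """Reweight channels with a sigmoid gate on their mean-centred pooled response.

    Assumption: a simple stand-in for a channel-wise activation step, not the paper's CAM.
    """
    pooled = pool(feature_map)
    mean = sum(pooled) / len(pooled)
    # Channels that respond more strongly than average receive a gate above 0.5.
    gates = [1.0 / (1.0 + math.exp(-(p - mean))) for p in pooled]
    gated = [[[g * v for v in row] for row in ch] for ch, g in zip(feature_map, gates)]
    return gated, gates


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_unseen(visual, projection, class_attributes):
    """Project a visual feature into the semantic space and pick the closest class."""
    embedded = [sum(w * x for w, x in zip(row, visual)) for row in projection]
    scores = {name: cosine(embedded, attrs) for name, attrs in class_attributes.items()}
    return max(scores, key=scores.get), scores


# Toy inputs: 2 channels of 2 x 2 activations for each view (made-up values).
context_map = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]  # full image
object_map = [[[0.2, 0.2], [0.2, 0.2]], [[3.0, 3.0], [3.0, 3.0]]]   # cropped object

gated_map, gates = channel_gate(object_map)

# Fusion by concatenating pooled descriptors (an assumption; the paper's fusion may differ).
fused = pool(context_map) + pool(gated_map)

# Hypothetical 2 x 4 projection into a 2-dimensional attribute space.
projection = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# Hypothetical attribute vectors for two unseen classes.
class_attributes = {"zebra": [0.1, 1.0], "whale": [1.0, 0.1]}

label, scores = predict_unseen(fused, projection, class_attributes)
print(label, scores)
```

In a trained model the projection would be learned on seen classes and then applied unchanged to unseen-class attribute vectors; here it is fixed by hand only so that the example runs end to end.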