2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper IDHLT-10.4
Paper Title ALIGN OR ATTEND? TOWARD MORE EFFICIENT AND ACCURATE SPOKEN WORD DISCOVERY USING SPEECH-TO-IMAGE RETRIEVAL
Authors Liming Wang, University of Illinois, Urbana-Champaign, United States; Xinsheng Wang, Delft University of Technology, Netherlands; Mark Hasegawa-Johnson, University of Illinois, Urbana-Champaign, United States; Odette Scharenborg, Delft University of Technology, Netherlands; Najim Dehak, Johns Hopkins University, Netherlands
SessionHLT-10: Multi-modality in Language
LocationGather.Town
Session Time:Wednesday, 09 June, 16:30 - 17:15
Presentation Time:Wednesday, 09 June, 16:30 - 17:15
Presentation Poster
Topic Speech Processing: [SPE-GASR] General Topics in Speech Recognition
IEEE Xplore Open Preview  Click here to view in IEEE Xplore
Virtual Presentation  Click here to watch in the Virtual Conference
Abstract Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for a MWD system to learn meaningful word-level representation. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% alignment F1 scores respectively.