Paper ID | HLT-10.4 |
Paper Title |
ALIGN OR ATTEND? TOWARD MORE EFFICIENT AND ACCURATE SPOKEN WORD DISCOVERY USING SPEECH-TO-IMAGE RETRIEVAL |
Authors |
Liming Wang, University of Illinois, Urbana-Champaign, United States; Xinsheng Wang, Delft University of Technology, Netherlands; Mark Hasegawa-Johnson, University of Illinois, Urbana-Champaign, United States; Odette Scharenborg, Delft University of Technology, Netherlands; Najim Dehak, Johns Hopkins University, United States |
Session | HLT-10: Multi-modality in Language |
Location | Gather.Town |
Session Time | Wednesday, 09 June, 16:30 - 17:15 |
Presentation Time | Wednesday, 09 June, 16:30 - 17:15 |
Presentation | Poster |
Topic |
Speech Processing: [SPE-GASR] General Topics in Speech Recognition |
Abstract |
Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% alignment F1, respectively. |
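The alignment F1 metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's evaluation code; it simply assumes alignments are represented as hypothetical (word span, image concept) link pairs and scores predicted links against gold links in the standard precision/recall/F1 way.

```python
# Illustrative sketch of alignment F1 (not the authors' evaluation code).
# Alignments are assumed to be hashable links, e.g. (word_span, concept) pairs.
def alignment_f1(predicted, gold):
    """Return the F1 score of predicted alignment links against gold links."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    correct = len(pred & ref)          # links present in both sets
    precision = correct / len(pred)    # fraction of predicted links that are correct
    recall = correct / len(ref)        # fraction of gold links that were recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 2 of 3 predicted links match 2 of 3 gold links.
pred = {("a", "dog"), ("b", "ball"), ("c", "park")}
gold = {("a", "dog"), ("b", "ball"), ("d", "grass")}
print(round(alignment_f1(pred, gold), 3))  # → 0.667
```

Under this reading, the reported 2% and 5% gains correspond to absolute improvements in this harmonic mean of alignment precision and recall.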