| Paper ID | HLT-10.4 |
| Paper Title | ALIGN OR ATTEND? TOWARD MORE EFFICIENT AND ACCURATE SPOKEN WORD DISCOVERY USING SPEECH-TO-IMAGE RETRIEVAL |
| Authors | Liming Wang, University of Illinois, Urbana-Champaign, United States; Xinsheng Wang, Delft University of Technology, Netherlands; Mark Hasegawa-Johnson, University of Illinois, Urbana-Champaign, United States; Odette Scharenborg, Delft University of Technology, Netherlands; Najim Dehak, Johns Hopkins University, United States |
| Session | HLT-10: Multi-modality in Language |
| Location | Gather.Town |
| Session Time | Wednesday, 09 June, 16:30 - 17:15 |
| Presentation Time | Wednesday, 09 June, 16:30 - 17:15 |
| Presentation | Poster |
| Topic | Speech Processing: [SPE-GASR] General Topics in Speech Recognition |
| Abstract | Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that an alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1, respectively. |
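The abstract's central claim is that an explicit alignment/attention mechanism lets a retrieval model localize words in the speech signal. As a purely illustrative sketch (not the paper's actual architecture), the idea can be shown with dot-product attention between speech-frame and image-region embeddings, where the attention matrix is read as a soft frame-to-region alignment; all shapes and values below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: T speech-frame vectors and R image-region vectors,
# both assumed to be projected into a shared d-dimensional space
# (random values stand in for learned encoder outputs).
T, R, d = 6, 4, 8
speech = rng.standard_normal((T, d))
regions = rng.standard_normal((R, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Dot-product attention: each speech frame attends over image regions.
attn = softmax(speech @ regions.T, axis=1)   # shape (T, R), rows sum to 1

# Reading the attention matrix as an alignment: the argmax per frame
# gives a hard frame-to-region assignment, from which word discovery
# metrics such as alignment F1 can then be computed against a gold
# word-level alignment.
alignment = attn.argmax(axis=1)              # shape (T,)
print(attn.shape, alignment.shape)
```

Without such an alignment readout, a retrieval model can score whole utterance-image pairs well while never committing to which frames correspond to which visual concept, which is the gap the paper's analysis targets.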