Technical Program

Paper Detail

Presentation #7
Session: ASR II
Location: Kallirhoe Hall
Session Time: Thursday, December 20, 13:30 - 15:30
Presentation Time: Thursday, December 20, 13:30 - 15:30
Presentation: Poster
Topic: Multimodal processing
Paper Title: LSTM LANGUAGE MODEL ADAPTATION WITH IMAGES AND TITLES FOR MULTIMEDIA AUTOMATIC SPEECH RECOGNITION
Authors: Yasufumi Moriya, Gareth Jones, Dublin City University, Ireland
Abstract: Transcription of multimedia data sources is often a challenging automatic speech recognition (ASR) task. The incorporation of visual features as additional contextual information to improve ASR for this data has recently drawn attention from researchers. Our investigation extends existing work by using images and video titles to adapt a recurrent neural network (RNN) language model with long short-term memory (LSTM). Our language model is tested on an existing corpus of instruction videos and on a new corpus consisting of lecture videos. A consistent perplexity reduction of 5-10 was observed on both datasets. When the non-adapted model was combined with the image adaptation and video title adaptation models for n-best hypotheses re-ranking, word error rate (WER) decreased by around 0.5% on both datasets. Analysis of the model's output word probabilities showed that both image adaptation and video title adaptation give the model more confidence in its choice of contextually correct words.
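The abstract mentions combining a non-adapted language model with image- and title-adapted models to re-rank n-best ASR hypotheses. A minimal sketch of such re-ranking, assuming a simple weighted interpolation of log-probability scores (the function names and weights here are illustrative, not taken from the paper):

```python
# Hypothetical sketch of n-best re-ranking by interpolating scores from a
# non-adapted LM and two adapted LMs. All names and weights are illustrative.

def rerank_nbest(hypotheses, lm_base, lm_image, lm_title,
                 weights=(0.5, 0.25, 0.25)):
    """Re-rank ASR n-best hypotheses by a weighted sum of log-probabilities
    from a non-adapted LM and image-/title-adapted LMs."""
    w_base, w_img, w_ttl = weights

    def score(hyp):
        return (w_base * lm_base(hyp)
                + w_img * lm_image(hyp)
                + w_ttl * lm_title(hyp))

    # Highest combined score first
    return sorted(hypotheses, key=score, reverse=True)

# Toy usage with stand-in log-probability lookups
scores_base = {"play the video": -4.0, "play the radio": -3.5}
scores_img = {"play the video": -3.0, "play the radio": -5.0}
scores_ttl = {"play the video": -3.2, "play the radio": -4.8}

best = rerank_nbest(list(scores_base),
                    scores_base.get, scores_img.get, scores_ttl.get)
```

In this toy example the visually adapted scores outweigh the baseline's slight preference for the contextually wrong hypothesis, illustrating how the adapted models can shift the final ranking.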