Paper ID | SMR-4.11
Paper Title | DEEP AUDIO-VISUAL FUSION NEURAL NETWORK FOR SALIENCY ESTIMATION
Authors | Shunyu Yao, Xiongkuo Min, Guangtao Zhai, Shanghai Jiao Tong University, China
Session | SMR-4: Image and Video Sensing, Modeling, and Representation
Location | Area F
Session Time | Wednesday, 22 September, 08:00 - 09:30
Presentation Time | Wednesday, 22 September, 08:00 - 09:30
Presentation | Poster
Topic | Image and Video Sensing, Modeling, and Representation: Image & video representation
Abstract | In this work, we propose a deep audio-visual fusion model to estimate the saliency of videos. The model extracts visual and audio features with two separate branches and fuses them to generate the saliency map. We design a novel temporal attention module to exploit temporal information and a spatial feature pyramid module to fuse spatial information. A multi-scale audio-visual fusion method then integrates the two modalities. Furthermore, we propose a new dataset for audio-visual saliency estimation, consisting of 202 high-quality video sequences covering a wide range of motions, scenes, and object types; many of the videos exhibit strong audio-visual correspondence. Experiments conducted on several datasets demonstrate that our model outperforms previous state-of-the-art methods by a large margin and that the proposed dataset can serve as a new benchmark for audio-visual saliency estimation.
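The abstract describes a two-branch architecture: a visual branch over video frames, an audio branch over the soundtrack, and a fusion stage that decodes a saliency map. As a rough illustration only, below is a minimal PyTorch sketch of such a two-branch audio-visual fusion network. All module names, layer choices, channel sizes, and the simple broadcast-and-concatenate fusion are assumptions for illustration; the sketch does not reproduce the paper's temporal attention, spatial feature pyramid, or multi-scale fusion modules.

```python
import torch
import torch.nn as nn


class AudioVisualSaliencyNet(nn.Module):
    """Illustrative two-branch audio-visual saliency model (not the paper's architecture)."""

    def __init__(self, vis_ch=64, aud_ch=64):
        super().__init__()
        # Visual branch: 3D convolutions over a short RGB clip (B, 3, T, H, W).
        self.visual = nn.Sequential(
            nn.Conv3d(3, vis_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(vis_ch, vis_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Audio branch: 2D convolutions over a log-mel spectrogram (B, 1, F, T).
        self.audio = nn.Sequential(
            nn.Conv2d(1, aud_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # global audio descriptor (B, aud_ch, 1, 1)
        )
        # Fusion + decoder: concatenate modalities along channels, predict one saliency map.
        self.decoder = nn.Sequential(
            nn.Conv2d(vis_ch + aud_ch, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, frames, spectrogram):
        v = self.visual(frames).mean(dim=2)           # average over time -> (B, C, H, W)
        a = self.audio(spectrogram)                   # (B, C, 1, 1)
        a = a.expand(-1, -1, v.shape[2], v.shape[3])  # broadcast audio over spatial grid
        return self.decoder(torch.cat([v, a], dim=1))  # (B, 1, H, W) saliency map


if __name__ == "__main__":
    model = AudioVisualSaliencyNet()
    frames = torch.randn(2, 3, 8, 112, 112)  # two 8-frame RGB clips
    spec = torch.randn(2, 1, 64, 96)         # two log-mel spectrograms
    print(model(frames, spec).shape)          # torch.Size([2, 1, 112, 112])
```

The broadcast-and-concatenate step stands in for the paper's multi-scale fusion: it shows where audio information enters the spatial decoding path, which is the structural point the abstract makes.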