Paper ID | ARS-6.6 | ||
Paper Title | VIDEO MEMORABILITY PREDICTION VIA LATE FUSION OF DEEP MULTI-MODAL FEATURES | ||
Authors | Roberto Leyva, Victor Sanchez, University of Warwick, United Kingdom | ||
Session | ARS-6: Image and Video Interpretation and Understanding 1 | ||
Location | Area H | ||
Session Time | Tuesday, 21 September, 15:30 - 17:00 | ||
Presentation Time | Tuesday, 21 September, 15:30 - 17:00 | ||
Presentation | Poster | ||
Topic | Image and Video Analysis, Synthesis, and Retrieval: Image & Video Interpretation and Understanding | ||
Abstract | Video memorability is a cornerstone in social media platform analysis, as a highly memorable video is more likely to be noticed and shared. This paper proposes a new framework that fuses multi-modal information to predict the likelihood that a video will be remembered. The proposed framework relies on late fusion of text, visual, and motion features. Specifically, two neural networks extract features from the captions describing the video’s content, two ResNet models extract visual features from specific frames, and two 3DResNet models, combined with Fisher Vectors, extract features from the video’s motion information. The extracted features are used to compute several memorability scores via Bayesian Ridge regression, and these scores are then fused using weights found by a greedy search over the fusion parameters. Experiments on the MediaEval2019 dataset show that the proposed framework outperforms the state of the art. |
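
The late-fusion step described in the abstract can be illustrated with a minimal sketch. It assumes the per-model memorability scores have already been produced (for example by one Bayesian Ridge regressor per feature type) and only shows a greedy, coordinate-wise search for fusion weights. The abstract does not specify the search objective, the weight grid, or how weights are normalized, so the Spearman rank correlation objective, the 0.0 to 1.0 grid, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation (assumed objective, not from the paper)."""
    return pearson(ranks(x), ranks(y))


def fuse(score_lists, weights):
    """Weighted average of the per-model scores for each video."""
    total = sum(weights)
    return [
        sum(w * s[i] for w, s in zip(weights, score_lists)) / total
        for i in range(len(score_lists[0]))
    ]


def greedy_fusion_weights(score_lists, targets, grid=None, max_rounds=10):
    """Greedy coordinate-wise search over fusion weights (hypothetical helper).

    Starts from equal weights and changes one weight at a time, keeping a
    change only if it improves the Spearman correlation with the targets.
    """
    if grid is None:
        grid = [k / 10 for k in range(11)]  # assumed grid: 0.0, 0.1, ..., 1.0
    weights = [1.0] * len(score_lists)
    best = spearman(fuse(score_lists, weights), targets)
    for _ in range(max_rounds):
        improved = False
        for m in range(len(weights)):
            for candidate in grid:
                trial = weights[:m] + [candidate] + weights[m + 1:]
                if sum(trial) == 0:
                    continue  # all-zero weights cannot be normalized
                value = spearman(fuse(score_lists, trial), targets)
                if value > best + 1e-12:
                    weights, best, improved = trial, value, True
        if not improved:
            break
    return weights, best
```

In practice the weights would be searched on a held-out validation split and then applied unchanged to test videos. A greedy search of this kind is not guaranteed to find the globally best weights, only a combination that no single-weight change on the grid improves.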