Paper ID | AUD-21.6
Paper Title | MULTIMODAL METRIC LEARNING FOR TAG-BASED MUSIC RETRIEVAL
Authors | Minz Won, Universitat Pompeu Fabra, Spain; Sergio Oramas, Oriol Nieto, Fabien Gouyon, Pandora, United States; Xavier Serra, Universitat Pompeu Fabra, Spain
Session | AUD-21: Music Information Retrieval and Music Language Processing 4: Structure and Alignment |
Location | Gather.Town |
Session Time | Thursday, 10 June, 14:00 - 14:45
Presentation Time | Thursday, 10 June, 14:00 - 14:45
Presentation | Poster
Topic | Audio and Acoustic Signal Processing: [AUD-MIR] Music Information Retrieval and Music Language Processing
Abstract | Tag-based music retrieval is crucial for browsing large-scale music libraries efficiently. Hence, automatic music tagging has been actively explored, mostly as a classification task, which carries an inherent limitation: a fixed vocabulary. Metric learning, in contrast, enables flexible vocabularies by using pretrained word embeddings as side information. Metric learning has also proven suitable for cross-modal retrieval tasks in other domains (e.g., text-to-image) by jointly learning a multimodal embedding space. In this paper, we investigate three ideas for successfully introducing multimodal metric learning to tag-based music retrieval: elaborate triplet sampling, acoustic and cultural music information, and domain-specific word embeddings. Our experimental results show that the proposed ideas improve the retrieval system both quantitatively and qualitatively. Furthermore, we release MSD500, a subset of the Million Song Dataset (MSD) containing 500 cleaned tags, 7 manually annotated tag categories, and user taste profiles.
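The setup the abstract describes, a tag word embedding and track audio embeddings trained into one shared metric space with triplets, can be illustrated with a short sketch. This is a minimal illustration, not the authors' implementation: the `audio_encoder` module, the 300-dimensional word vectors, the 256-dimensional shared space, the 0.4 margin, and the cosine-similarity hinge form of the triplet loss are all illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultimodalEmbedding(nn.Module):
    """Projects audio and tag word vectors into one shared metric space.
    Hypothetical sketch: `audio_encoder` stands in for any network that
    maps a spectrogram batch to `embed_dim`-dimensional vectors."""
    def __init__(self, audio_encoder, word_dim=300, embed_dim=256):
        super().__init__()
        self.audio_encoder = audio_encoder              # spectrogram -> embed_dim
        self.tag_proj = nn.Linear(word_dim, embed_dim)  # pretrained word vector -> shared space

    def forward(self, spec, tag_vec):
        audio = F.normalize(self.audio_encoder(spec), dim=-1)  # unit-norm audio embedding
        tag = F.normalize(self.tag_proj(tag_vec), dim=-1)      # unit-norm tag embedding
        return audio, tag

def triplet_loss(tag, pos_audio, neg_audio, margin=0.4):
    """Hinge loss on cosine similarity: pulls a tag toward a track that
    carries it (positive) and away from a sampled track that does not
    (negative). The margin value here is an assumption."""
    pos_sim = (tag * pos_audio).sum(dim=-1)  # cosine similarity (inputs are unit-norm)
    neg_sim = (tag * neg_audio).sum(dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```

Once trained, retrieval reduces to nearest-neighbor search over precomputed track embeddings around a query tag's embedding, which is what makes the vocabulary flexible: any tag with a word vector can act as a query.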