Paper ID | SPE-46.4 |
Paper Title | AGE-VOX-CELEB: MULTI-MODAL CORPUS FOR FACIAL AND SPEECH ESTIMATION |
Authors | Naohiro Tawara, Atsunori Ogawa, Yuki Kitagishi, Hosana Kamiyama, NTT Corporation, Japan |
Session | SPE-46: Corpora and Other Resources |
Location | Gather.Town |
Session Time | Thursday, 10 June, 16:30 - 17:15 |
Presentation Time | Thursday, 10 June, 16:30 - 17:15 |
Presentation | Poster |
Topic | Speech Processing: [SPE-GASR] General Topics in Speech Recognition |
Abstract | Estimating a speaker's age from speech is more challenging than estimating it from the face, in part because few public corpora are available. To tackle this problem, we construct a new audio-visual age corpus named AgeVoxCeleb by annotating VoxCeleb2 videos with age labels. AgeVoxCeleb is the first large-scale, balanced, multi-modal age corpus that contains both video and speech of the same speakers across a wide age range. Using AgeVoxCeleb, our paper makes the following contributions: (i) by comparing state-of-the-art models in each task, we show that a facial age estimation model outperforms a speech age estimation model; (ii) we show that facial age estimation is more robust against differences between the training and test sets; (iii) we develop cross-modal transfer learning from facial to speech age estimation, showing that ages estimated by a facial age estimation model can be used to train a speech age estimation model. The proposed AgeVoxCeleb corpus will be published at https://github.com/nttcslab-sp/agevoxceleb. |
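The cross-modal transfer in contribution (iii) is essentially a pseudo-labeling scheme: a facial age estimator labels the videos, and those face-derived labels supervise a speech age estimator on the paired audio. The PyTorch sketch below illustrates only this idea; the toy models, the 80-dimensional acoustic features, and the L1 regression loss are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of cross-modal transfer for age estimation: a frozen
# facial age estimator produces pseudo age labels for each video, and a
# speech age estimator is trained on those pseudo labels. All names and
# hyperparameters here are illustrative placeholders.
import torch
import torch.nn as nn


class SpeechAgeEstimator(nn.Module):
    """Toy speech age regressor: mean-pooled frame features -> scalar age."""

    def __init__(self, feat_dim: int = 80, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) acoustic features, e.g. log-mels.
        pooled = feats.mean(dim=1)           # utterance-level embedding
        return self.net(pooled).squeeze(-1)  # predicted age in years


def pseudo_label_ages(face_model, face_images):
    """Run the frozen facial age model to obtain pseudo age labels."""
    with torch.no_grad():
        return face_model(face_images)


def train_step(speech_model, optimizer, speech_feats, pseudo_ages):
    """One supervised step on face-derived pseudo labels (L1 regression)."""
    optimizer.zero_grad()
    pred = speech_model(speech_feats)
    loss = nn.functional.l1_loss(pred, pseudo_ages)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Stand-in "facial age model": any frozen regressor mapping face
    # images to ages would go here; this dummy always predicts 35.
    face_model = lambda imgs: imgs.flatten(1).mean(dim=1) * 0 + 35.0
    speech_model = SpeechAgeEstimator()
    optimizer = torch.optim.Adam(speech_model.parameters(), lr=1e-3)

    faces = torch.rand(4, 3, 112, 112)  # face crops from the videos
    speech = torch.randn(4, 200, 80)    # paired utterance features

    ages = pseudo_label_ages(face_model, faces)
    print("loss:", train_step(speech_model, optimizer, speech, ages))

In practice the pseudo labels would come from a strong pretrained facial model, which is the asymmetry the paper exploits: the face modality, being easier and better-resourced, bootstraps supervision for the speech modality.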