SLT 2018 Paper Detail

Presentation #8
Session: Speaker Recognition/Verification
Session Time: Thursday, December 20, 10:00 - 12:00
Presentation Time: Thursday, December 20, 10:00 - 12:00
Presentation: Poster
Topic: Speaker/language recognition
Paper Title: ATTENTION MECHANISM IN SPEAKER RECOGNITION: WHAT DOES IT LEARN IN DEEP SPEAKER EMBEDDING?
Authors: Qiongqiong Wang; Koji Okabe; Kong Aik Lee; Hitoshi Yamamoto; Takafumi Koshinaka (all NEC Corporation)
Abstract: This paper presents an experimental study of deep speaker embedding with an attention mechanism, which has proven to be a powerful representation-learning technique in speaker recognition. In this framework, an attention model acts as a frame selector that computes an attention weight for each frame-level feature vector; the pooling layer of the speaker embedding network then uses these weights to produce an utterance-level representation. In general, the attention model is trained jointly with the speaker embedding network under a single objective function, so the two components are tightly bound to each other. In this paper, we consider the possibility that the attention model can be decoupled from its parent network and used to assist other speaker embedding networks, and even conventional i-vector extractors. We demonstrate this possibility through a series of experiments on a NIST Speaker Recognition Evaluation (SRE) task, obtaining a 9.0% EER reduction and a 3.8% minC_primary reduction when the attention weights are applied to i-vector extraction. A further experiment shows that DNN-based soft voice activity detection (VAD) can be effectively combined with the attention mechanism, yielding further minC_primary reductions of 6.6% and 1.6% in the deep speaker embedding and i-vector systems, respectively.
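
To make the pooling step described in the abstract concrete, the sketch below shows attention-weighted pooling over frame-level features in PyTorch: a small scoring network assigns each frame a scalar score, a softmax over time turns the scores into weights, and the weighted mean of the frames gives the utterance-level representation. This is an illustration under assumptions, not the authors' exact model; the class name AttentivePooling, the hidden size attn_dim, and the tanh scoring network are hypothetical choices, and the paper's actual architecture may differ.

import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention-weighted pooling: frames (batch, T, D) -> utterance vector (batch, D).

    Hypothetical sketch; the scoring network here is an assumption, not the paper's design.
    """

    def __init__(self, feat_dim: int, attn_dim: int = 128):
        super().__init__()
        # Small scoring network that assigns a scalar relevance score to each frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, feat_dim); softmax over the time axis yields per-frame weights.
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, T, 1)
        # Weighted mean over frames produces the utterance-level representation.
        return (weights * frames).sum(dim=1)                # (batch, feat_dim)

# Usage: pool 200 frames of 512-dim features into one utterance-level vector per utterance.
pool = AttentivePooling(feat_dim=512)
utterance = pool(torch.randn(8, 200, 512))  # -> shape (8, 512)

The same per-frame weights, once learned, could in principle be read out and used to re-weight the frame statistics of another system, which is the decoupling the abstract explores with i-vector extraction.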