SLT 2018 Paper Detail

Presentation #4
Session: ASR I
Session Time: Wednesday, December 19, 10:00 - 12:00
Presentation Time: Wednesday, December 19, 10:00 - 12:00
Presentation: Poster
Topic: Speech recognition and synthesis
Paper Title: EFFICIENT BUILDING STRATEGY WITH KNOWLEDGE DISTILLATION FOR SMALL-FOOTPRINT ACOUSTIC MODELS
Authors: Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Yoshikazu Yamaguchi, Manabu Okamoto, Yushi Aono (NTT Corporation)
Abstract: In this paper, we propose a novel training strategy for deep neural network (DNN) based small-footprint acoustic models. The accuracy of DNN-based automatic speech recognition (ASR) systems can be greatly improved by leveraging large amounts of data, and DNNs use many parameters to enhance recognition performance. Unfortunately, resource-constrained local devices are unable to run such complex DNN-based ASR systems. To build compact acoustic models, the knowledge distillation (KD) approach is often used: a large, well-trained model outputs target labels that are used to train a compact model. However, standard KD cannot fully exploit the large model's outputs when training compact models, because the soft logits provide only rough information. We assume that the large model can give more useful hints to the compact model. We propose an advanced KD that uses a mean squared error loss to minimize the discrepancies between the final hidden layer outputs of the two models. We evaluate our proposal on recorded speech data sets assuming car- and home-use scenarios, and show that our models achieve lower character error rates than either conventional KD or from-scratch training on computation-resource-constrained devices.
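The training objective sketched in the abstract combines standard soft-label distillation with a hidden-layer "hint" term. The following is a minimal, hedged NumPy sketch of such a combined loss; the function name `kd_loss`, the temperature `T`, and the weighting `alpha` are illustrative assumptions, not the paper's actual hyperparameters or implementation.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T yields softer distributions.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits,
            student_hidden, teacher_hidden,
            T=2.0, alpha=0.5):
    """Illustrative combined distillation loss (assumed form).

    Soft-label term: cross-entropy between temperature-softened
    teacher and student output distributions (standard KD).
    Hint term: mean squared error between the final hidden layer
    outputs of teacher and student, as the abstract proposes.
    `alpha` weights the two terms (a hypothetical choice).
    """
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft_ce = -(p_teacher * log_p_student).sum(axis=-1).mean()
    hint_mse = np.mean((np.asarray(student_hidden, dtype=float)
                        - np.asarray(teacher_hidden, dtype=float)) ** 2)
    return alpha * soft_ce + (1.0 - alpha) * hint_mse
```

In this sketch, matching the teacher's hidden representation drives the MSE term to zero, while the soft cross-entropy term pulls the student's output distribution toward the teacher's; only the weighting between the two is left as a free parameter.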