Presentation # | 1 |
Session: | ASR II |
Location: | Kallirhoe Hall |
Session Time: | Thursday, December 20, 13:30 - 15:30 |
Presentation Time: | Thursday, December 20, 13:30 - 15:30 |
Presentation: | Poster |
Topic: |
Speech recognition and synthesis |
Paper Title: |
Occam's Adaptation: A Comparison of Interpolation of Bases Adaptation Methods for Multi-Dialect Acoustic Modeling with LSTMs |
Authors: |
Mikaela Grace, Meysam Bastani, Eugene Weinstein, Google, United States |
Abstract: |
Multidialectal languages can pose challenges for acoustic modeling. Past research has shown that, given a large training corpus but no explicit modeling of inter-dialect variability, training individual per-dialect models yields better performance than training a single model on the combined data~\cite{li2018multi, diak}. Our goal was thus to create a single multidialect acoustic model that would rival the performance of the dialect-specific models. Working with deep Long Short-Term Memory (LSTM) acoustic models trained on up to 40K hours of speech, we explored several methods for training and incorporating dialect-specific information into the model, including 12 variants of interpolation-of-bases techniques related to the Cluster Adaptive Training (CAT) and Factorized Hidden Layer (FHL) methods. We found that, with our model topology and large training corpus, simply appending the dialect-specific information to the feature vector resulted in a more accurate model than any of the more complex interpolation-of-bases techniques, while requiring less model complexity and fewer parameters. This simple adaptation yielded a single unified model for all dialects that, in most cases, outperformed the individual models trained per dialect. |
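
The two families of adaptation contrasted in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the dialect-specific information is encoded, so the one-hot vector here is an assumption, and the function names `append_dialect_one_hot` and `interpolate_bases` are hypothetical. The second function shows only the general idea behind interpolation of bases (a layer's weight matrix formed as a dialect-weighted sum of basis matrices), not any of the 12 variants evaluated in the paper.

```python
import numpy as np


def append_dialect_one_hot(features, dialect_id, num_dialects):
    """Append a one-hot dialect vector to every frame of an utterance.

    features: array of shape (num_frames, feature_dim).
    The one-hot encoding is an assumption made for illustration.
    """
    one_hot = np.zeros(num_dialects, dtype=features.dtype)
    one_hot[dialect_id] = 1.0
    # Repeat the same dialect vector for each frame, then concatenate.
    tiled = np.tile(one_hot, (features.shape[0], 1))
    return np.concatenate([features, tiled], axis=1)


def interpolate_bases(bases, dialect_weights):
    """Form one weight matrix as a weighted sum of basis matrices.

    bases: array of shape (num_bases, out_dim, in_dim).
    dialect_weights: array of shape (num_bases,), one set per dialect.
    """
    return np.tensordot(dialect_weights, bases, axes=1)


# Feature appending: 3 frames of 4-dimensional features, dialect 1 of 3.
frames = np.ones((3, 4), dtype=np.float32)
augmented = append_dialect_one_hot(frames, dialect_id=1, num_dialects=3)

# Interpolation of bases: two 2x2 basis matrices mixed with weights 0.25 and 0.75.
bases = np.stack([np.eye(2), 2.0 * np.eye(2)])
mixed = interpolate_bases(bases, np.array([0.25, 0.75]))
```

In this sketch, feature appending adds only `num_dialects` extra input dimensions and leaves the rest of the network unchanged, whereas interpolation of bases stores `num_bases` copies of each adapted weight matrix. That difference mirrors the abstract's remark that the simpler method needs fewer parameters.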