Tue, 10 May, 14:00 - 14:45 UTC
Novelty
Voice assistants are becoming ubiquitous. In emerging contexts such as healthcare and finance, where the user base is broader than the market for consumer devices, poor performance for diverse populations will deter adoption. Models developed on ‘standard’ demographic datasets are not sufficiently robust to the variability of human language production and behaviour. Evaluation of core (and evolving) assistant performance is ad hoc and intermittent, providing insufficient insight to focus and prioritize engineering investment in areas of demographic underperformance.
We propose a multi-dimensional, quarterly benchmark to evaluate the evolution of voice assistant performance in both standard and diverse populations (a configuration sketch follows the lists below). Population dimensions include:
- age & gender
- regional dialects, ethnolects and regional sublects, e.g., of African American Language
- multilingualism, second-language & foreign-language accents
- intersections of all of the above
Environmental dimensions include:
- noise levels & background speech
- distance from mic
- indoor/outdoor setting
- device hardware specifications & device type (wearable, in-car, smart speaker, etc.)
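To make the evaluation space concrete, the sketch below enumerates evaluation cells as the cross-product of population and environmental dimensions; intersections of dimensions fall out naturally as individual cells. The dimension values shown are illustrative assumptions for the sketch, not the benchmark's actual taxonomy.

```python
from itertools import product

# Illustrative dimension values only; the real benchmark taxonomy is far richer.
POPULATION = {
    "age_gender": ["18-30 female", "65+ male"],
    "dialect": ["African American Language", "Scottish English"],
    "language_background": ["L1 English", "L2 English (Mandarin L1)"],
}
ENVIRONMENT = {
    "noise": ["quiet", "background speech"],
    "mic_distance_m": [0.5, 3.0],
    "setting": ["indoor", "outdoor"],
    "device": ["smart speaker", "in-car", "wearable"],
}

def benchmark_cells(population, environment):
    """Enumerate evaluation cells as the cross-product of all dimension values."""
    dims = {**population, **environment}
    names = list(dims)
    for values in product(*dims.values()):
        yield dict(zip(names, values))

cells = list(benchmark_cells(POPULATION, ENVIRONMENT))
print(f"{len(cells)} evaluation cells")  # 2 * 2 * 2 * 2 * 2 * 2 * 3 = 192
print(cells[0])
```

Even this toy configuration yields 192 cells, which illustrates why intersections of demographic and environmental factors are rarely covered by ad hoc evaluation.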
Relevance & attractiveness to ICASSP
Improving human-centric signal processing requires a re-orientation to the variability of human language and behavioural signals. We can design better models by comprehensively anticipating that variability. Better post-production evaluation methods, such as the proposed benchmark, can help developers understand how their models encourage or deter product interaction for both standard and diverse users.
Our benchmark is grounded in extensive experience of factors impacting signal variability in human language technology. We propose to cover core existing and emerging performance dimensions – Accuracy, Agreeableness, Adaptability and Acceleration – and core input dimensions – Skills (System Actions or Tasks), Environments and Demographics. All dimensions will be evaluated at a regular cadence to capture the impact of model upgrades and evolution of user behaviours.
Benchmark reports will combine human ratings, computational linguistic analysis and automated metrics to provide rich, actionable insights into the key factors impacting system performance. The intersection of linguistic variation, skill/task, and environment can affect performance in unforeseen ways. Two specific cases will be discussed: older adults and speakers of African American Language.
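As a concrete illustration of how human ratings and automated metrics might be blended into a single reportable score per cell, the sketch below combines a mean human rating, word error rate, and task success rate into a weighted composite. The field names, weights, and normalization are assumptions made for the sketch, not the benchmark's actual reporting schema.

```python
# Hypothetical per-cell results; field names and weights are illustrative
# assumptions, not the benchmark's actual reporting schema.
cell_results = {
    "human_rating_mean": 4.2,    # mean Likert rating (1-5) from human judges
    "wer": 0.12,                 # word error rate from automated scoring
    "task_success_rate": 0.87,   # fraction of skills completed correctly
}

def composite_score(r, weights=(0.4, 0.3, 0.3)):
    """Blend normalized metrics into a single 0-1 score (higher is better)."""
    human = (r["human_rating_mean"] - 1) / 4   # map the 1-5 rating onto 0-1
    accuracy = 1.0 - r["wer"]                  # invert the error rate
    success = r["task_success_rate"]
    return sum(w * m for w, m in zip(weights, (human, accuracy, success)))

print(round(composite_score(cell_results), 3))  # 0.845
```

Comparing such composite scores across cells and across quarters is one way to surface the demographic and environmental intersections where performance lags.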
Inspirations and motivations
Appen provides a wide range of HCI evaluation and training data services. We have observed that evaluation of voice assistant models rarely reflects the complexity of real-world deployment. Appen’s proposed benchmark will provide insightful feedback to support model tuning for better real-world performance.
Biographies
Ilia Shifrin is Senior Director of AI Specialists at Appen. He oversees a team of 70 distinguished data and language professionals who enable global NLP solutions and provide enterprise-scale multilingual and multimodal data collection, data annotation, and AI evaluation services to the world's largest corporations. Ilia is an avid data researcher and a localization and AI personalization perfectionist, with over 15 years of leadership and hands-on R&D experience in language and data engineering.
David Brudenell is VP of Solutions & Advanced Research at Appen. He works with many of the most accomplished and deeply knowledgeable solution architects, engineers, project managers, and technical and AI specialists in the machine learning and artificial intelligence industries.
Dr. MingKuan Liu is Senior Director of Data Science & Machine Learning at Appen. MingKuan has decades of industry R&D expertise in speech recognition, natural language processing, search & recommendation, fraud detection, and e-commerce.
Dr. Judith Bishop is Chair of the External Advisory Board for the MARCS Institute for Brain, Behaviour and Development. For over 17 years, Judith has led global teams delivering AI training data and evaluation products to multinational, government, enterprise, and academic technology developers.