IEEE ICASSP 2022

2022 IEEE International Conference on Acoustics, Speech and Signal Processing

7-13 May 2022
  • Virtual (all paper presentations)
22-27 May 2022
  • Main Venue: Marina Bay Sands Expo & Convention Center, Singapore
27-28 October 2022
  • Satellite Venue: Crowne Plaza Shenzhen Longgang City Centre, Shenzhen, China

Tutorials

Sun, 22 May, 14:00 - 17:30 China Time (UTC +8)
Sun, 22 May, 06:00 - 09:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Yingbin Liang, Shaofeng Zou, and Yi Zhou

Reinforcement learning (RL) has driven machine learning from basic data fitting to a new era of learning and planning through interaction with complex environments. Equipped with deep learning, RL has achieved tremendous success in many commercial applications, including autonomous driving, recommendation systems, wireless communications, robotics, and gaming. This success rests largely on foundational developments in scalable RL algorithms, which are inspired by optimization principles but were not thoroughly understood until recently. This tutorial provides a comprehensive overview of fundamental and advanced RL formulations and algorithms, which leverage stochastic approximation and optimization techniques to learn an optimal policy for the underlying Markov decision process. The tutorial is designed to meet the high and timely demand from researchers, students, and practitioners to understand the state of the art of this topic, to apply RL to practical applications (especially signal processing and communication problems), and to make further contributions to RL as well as to machine learning, deep learning, and optimization in general.

The tutorial will include six sections: (1) Introduction to reinforcement learning; (2) Value function evaluation and stochastic approximation; (3) Value-based control algorithms and performance guarantees; (4) Policy gradient algorithms and nonconvex optimization; (5) Advanced RL and associated optimization; and (6) Conclusions and open problems.
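For orientation, here is a minimal sketch (ours, not the presenters' material) of the kind of value-based control algorithm covered in section (3): tabular Q-learning, a stochastic-approximation scheme that learns the optimal action-value function from interaction alone. The environment interface (env.reset(), env.step()) is a hypothetical stand-in for a finite Markov decision process.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False          # hypothetical env interface
        while not done:
            # Epsilon-greedy exploration.
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)     # hypothetical env interface
            # Stochastic-approximation step toward the Bellman target.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```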

Sun, 22 May, 14:00 - 17:30 China Time (UTC +8)
Sun, 22 May, 06:00 - 09:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Christos Thrampoulidis, Samet Oymak, and Yue M. Lu

As a “blessing of dimensionality” in the age of big data, the very high-dimensional settings of many modern datasets allow one to develop and use powerful asymptotic methods to obtain precise characterizations that would be too complicated to derive in moderate dimensions. Such asymptotic analysis has led to breakthrough results in many estimation problems and, more recently, in many learning problems, and the underlying technical tools have generated significant interest within the signal processing community.

Motivated by recent successful applications of such methods in exploring and exploiting new high-dimensional phenomena in estimation and learning, this tutorial has the two-fold goal of:

A. providing a friendly and guided tour of the powerful underlying technical tools (including the replica method, leave-one-out analysis, approximate message passing, and Gaussian comparison inequalities), and,

B. demonstrating how these tools, combined with modern ideas from optimization, establish algorithmic tradeoffs and statistical foundations in modern estimation (e.g., LASSO, massive MIMO) and learning (e.g., double descent, model pruning) problems.
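As a concrete anchor for the LASSO example in item B, here is a proximal-gradient (ISTA) sketch in numpy; this is our own illustration rather than tutorial material. In the proportional regime where the number of measurements and unknowns grow together, the estimation error of exactly this estimator is what tools such as Gaussian comparison inequalities and approximate message passing characterize precisely.

```python
import numpy as np

def ista_lasso(A, y, lam, n_iter=500):
    """LASSO, min_x 0.5*||y - A x||^2 + lam*||x||_1, via ISTA."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L      # gradient step on the smooth part
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

# Toy high-dimensional instance: n noisy measurements of a sparse p-vector.
rng = np.random.default_rng(0)
n, p = 200, 400
A = rng.standard_normal((n, p)) / np.sqrt(n)
x_true = np.zeros(p)
x_true[:10] = 1.0
y = A @ x_true + 0.05 * rng.standard_normal(n)
x_hat = ista_lasso(A, y, lam=0.05)
```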

The tutorial welcomes graduate students of all levels and other signal-processing researchers who wish to make the methods, tools, and ideas discussed here part of their statistical signal-processing toolbox. Particular effort will be made to lower the “language barrier” caused by the differing terminologies of the disciplines where these methods originated: statistical physics, random matrix theory, high-dimensional probability, statistical learning theory, information theory, and signal processing.

Sun, 22 May, 14:00 - 17:30 China Time (UTC +8)
Sun, 22 May, 06:00 - 09:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Zhijian Ou

Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are radically different from other popular probabilistic models, such as hidden Markov models (HMMs), auto-regressive models, Generative Adversarial Networks (GANs), and Variational Auto-encoders (VAEs), which are all self-normalized (i.e., sum to one). In recent years, EBMs have attracted increasing interest not only from core machine learning but also from application domains such as speech, vision, and natural language processing (NLP), with significant theoretical and algorithmic progress. To the best of our knowledge, there are no tutorials about EBMs with applications to speech and language processing. The sequential nature of speech and language also presents special challenges and requires treatment different from processing fixed-dimensional data (e.g., images).
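Concretely, an EBM assigns every configuration x a scalar energy E_\theta(x) and defines an unnormalized density; the normalizing constant Z_\theta (a sum over all sequences in the discrete case) is in general intractable, which is precisely what separates EBMs from the self-normalized models above:

    p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta},
    \qquad Z_\theta = \sum_{x'} \exp(-E_\theta(x')).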

The purpose of this tutorial is to present a systematic introduction to energy-based models, covering both algorithmic progress and applications in speech and language processing; it is organized into four parts. First, we will introduce the basics of EBMs, including classic models, recent models parameterized by neural networks, and various learning algorithms from classic methods to the most advanced ones. The next three parts will present how to apply EBMs in three different scenarios: 1) EBMs for language modeling, 2) EBMs for natural language labeling and speech recognition, and 3) EBMs for semi-supervised natural language labeling. In addition, we will introduce open-source toolkits to help the audience become familiar with the techniques for developing and applying energy-based models.

Sun, 22 May, 14:00 - 17:30 China Time (UTC +8)
Sun, 22 May, 06:00 - 09:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Shoichi Koyama and Natsuki Ueno

Sound field estimation is a fundamental problem in acoustic signal processing that aims to reconstruct a spatial acoustic field from a discrete set of microphone measurements. This essential technology has a wide variety of applications, such as visualization/auralization of an acoustic field, spatial audio reproduction using a loudspeaker array or headphones, and active noise cancellation in a spatial region. In particular, VR/AR audio is one of the most attractive applications of this technology, because 6DoF VR systems require capturing a sound field over a large region with multiple microphones.

Sound field estimation has been studied for a number of years. Techniques based on wave-domain processing, i.e., the expansion of a sound field into plane-wave or spherical-wavefunction bases, were established in the last decade. In particular, spherical-harmonic-domain processing using a spherical microphone array has been intensively investigated. In recent years, wave-domain processing has been combined with advanced signal processing and machine learning techniques. For example, a sound field estimation method based on an infinite-dimensional expansion, which includes kernel methods as a special case, was recently proposed, and its advantages over conventional methods have been validated in various applications.
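As a point of reference for the wave-domain processing mentioned above, a source-free interior sound field at wavenumber k admits the standard spherical-wavefunction expansion

    p(r, \theta, \phi, k) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n}
        a_{nm}(k) \, j_n(kr) \, Y_n^m(\theta, \phi),

where j_n is the spherical Bessel function of order n and Y_n^m are the spherical harmonics; sound field estimation then amounts to recovering the coefficients a_{nm}(k) (or an infinite-dimensional analogue) from microphone measurements.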

This tutorial offers participants an introduction to recent sound field estimation techniques. We also introduce several application examples of sound field estimation, such as spatial audio reproduction and spatial active noise control.

Mon, 23 May, 10:00 - 13:30 China Time (UTC +8)
Mon, 23 May, 02:00 - 05:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Yonina C. Eldar, Deniz Gündüz, Nir Shlezinger, and Kobi Cohen

The dramatic success of deep learning is largely due to the availability of data. Data samples are often acquired on edge devices, such as smartphones, vehicles, and sensors, and in some cases cannot be shared due to privacy considerations. Federated learning is an emerging machine learning paradigm for training models across multiple edge devices holding local datasets, without explicitly exchanging the data. Learning in a federated manner differs from conventional centralized machine learning and poses several unique core challenges and requirements, which are closely related to classical problems studied in signal processing. Consequently, signal processing is expected to play an important role in the success of federated learning and in the transition of deep learning from centralized servers to mobile edge devices.

In this tutorial, we present the leading approaches for facilitating the implementation of federated learning at large scale using signal processing tools. We shall discuss how the federated learning paradigm can be viewed from a signal processing perspective, dividing its flow into three main steps: model distribution, local training, and global aggregation. We will first focus on the global aggregation step, which involves conveying the local model updates from the users to the central server. We divide this step into three phases carried out sequentially: (a) encoding of the local model updates at the edge users into messages conveyed to the server; (b) transmission of the model updates and allocation of channel resources among the users; and (c) combining (post-processing) at the server. Then, we elaborate on signal processing aspects relevant to the distribution of the global model among the participating users. For each stage, we elaborate on the specific aspects of federated learning that can benefit, with proper adaptation, from tools derived in the signal processing and communication literature.
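To make the three-step flow concrete, here is a minimal federated-averaging-style round in numpy, a sketch under simplifying assumptions (lossless uplink, uniform server aggregation); local_grad is a hypothetical placeholder for each user's local training step, and the encoding/transmission phases discussed above would add compression and channel-aware processing on top.

```python
import numpy as np

def fedavg_round(w_global, user_data, local_grad, lr=0.1, local_steps=5):
    """One federated round: distribute, train locally, aggregate globally."""
    updates = []
    for data in user_data:                 # each element: one edge user's data
        w = w_global.copy()                # step 1: model distribution
        for _ in range(local_steps):       # step 2: local training (SGD-like)
            w -= lr * local_grad(w, data)  # local_grad is hypothetical
        updates.append(w - w_global)       # phase (a): encode the model update
    # Step 3 / phase (c): global aggregation by uniform averaging.
    return w_global + np.mean(updates, axis=0)
```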

Mon, 23 May, 10:00 - 13:30 China Time (UTC +8)
Mon, 23 May, 02:00 - 05:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Osvaldo Simeone and Tianyi Chen

Deep learning has achieved remarkable success in many machine learning tasks such as image classification, speech recognition, and game playing. However, these breakthroughs are often difficult to translate into real-world applications such as communication systems, because deep learning models require a massive number of training samples, which are costly to obtain in practice. As a popular approach to addressing labeled-data scarcity, few-shot meta-learning aims to optimize efficient learning algorithms that can adapt to new tasks quickly. While meta-learning is gaining significant interest in the ML community, its theoretical understanding is still lacking, and applications to signal processing and communication systems are in their nascent stage. In this tutorial, we will first provide a gentle introduction to meta-learning and to popular meta-learning algorithms. Then, we will highlight some applications to communication systems. We will also introduce a unified bi-level optimization framework for solving meta-learning problems, and provide statistical learning-theoretic tools to analyze the resulting solutions. Finally, we will discuss open questions in both the theory and the applications of meta-learning.
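As a preview of the algorithms listed under Section 2 below and of the bi-level structure studied in Section 4, here is a first-order MAML-style meta-update in numpy; the per-task gradient oracle loss_grad and its support/query split are hypothetical placeholders for task-specific data and losses.

```python
import numpy as np

def fomaml_step(w, tasks, loss_grad, inner_lr=0.01, outer_lr=0.001, k=1):
    """First-order MAML: the inner loop adapts per task, the outer loop
    averages gradients taken at the adapted parameters (ignoring the
    second-derivative terms of exact MAML)."""
    meta_grad = np.zeros_like(w)
    for task in tasks:
        w_task = w.copy()
        for _ in range(k):                  # inner (lower-level) adaptation
            w_task -= inner_lr * loss_grad(w_task, task, split="support")
        # Outer (upper-level) gradient, evaluated after adaptation.
        meta_grad += loss_grad(w_task, task, split="query")
    return w - outer_lr * meta_grad / len(tasks)
```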

Part I (60 mins)

  • Section 1: Introduction and background (20 mins)
    • Meta-learning and its comparison to conventional and joint learning
  • Section 2: Popular algorithms for meta-learning (40 mins)
    • MAML, iMAML, Bayesian MAML, FOMAML, Reptile, Prox-MAML

Part II (70 mins)

  • Section 3: Meta-learning applications to communications (40 mins)
    • Applications to encoding/decoding
    • Applications to power allocation
    • Applications to precoding
  • Section 4: Optimization methods for meta-learning (30 mins)
    • History of bi-level optimization and its recent surge in meta-learning
    • Alternating stochastic gradient methods for bi-level optimization
    • Mathematical tools for analyzing finite-sample convergence

Part III (50 mins)

  • Section 5: Statistical learning theory for meta-learning (40 mins)
    • Information-theoretic bounds on meta-learning
    • PAC-Bayes bounds on meta-learning
    • Comparison between meta-learning and conventional learning in PAC bounds
  • Section 6: Challenging yet promising open research directions (10 mins)
    • Applications
    • Theory and algorithms

Mon, 23 May, 10:00 - 13:30 China Time (UTC +8)
Mon, 23 May, 02:00 - 05:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Yen-Chi Chen, Jun Qi, and Huck Yang

State-of-the-art machine learning (ML), particularly ML based on deep neural networks (DNNs), has enabled a wide spectrum of successful applications, ranging from the everyday deployment of speech recognition and computer vision to the frontier of scientific research in synthetic biology. Despite rapid theoretical and empirical progress in DNN-based regression and classification, DNN training algorithms are computationally expensive, pushing against the physical limits of classical hardware. The imminent advent of quantum computing devices opens up new possibilities for exploiting quantum machine learning (QML) to improve the computational efficiency of ML algorithms in new domains. In particular, advances in quantum hardware enable QML algorithms to run on noisy intermediate-scale quantum (NISQ) devices. Furthermore, we can employ hybrid quantum-classical models that rely on optimizing parametric quantum circuits, which are resilient to quantum noise errors and admit many practical QML implementations on NISQ devices. In this tutorial, we discuss how to set up quantum neural networks and put forth related applications in speech and language processing. The tutorial includes sections on an introduction to quantum machine learning, the optimization of quantum neural networks, and the use of variational quantum circuits for speech and language processing.
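For concreteness, the sketch below exactly simulates a one-qubit variational circuit in plain numpy and trains it with the parameter-shift rule, which obtains gradients purely from circuit evaluations; this is our own toy illustration, and practical QML work would target NISQ hardware through a dedicated library.

```python
import numpy as np

def ry(theta):
    """Single-qubit rotation about the Y axis, as a 2x2 real unitary."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def expval_z(theta, x):
    """Variational circuit: encode input x with RY(x), apply trainable
    RY(theta), measure <Z>. Simulated exactly on a 2-dim statevector."""
    state = ry(theta) @ ry(x) @ np.array([1.0, 0.0])
    Z = np.array([[1.0, 0.0], [0.0, -1.0]])
    return float(state @ Z @ state)

def parameter_shift_grad(theta, x):
    """Exact gradient of <Z> w.r.t. theta via the parameter-shift rule."""
    return 0.5 * (expval_z(theta + np.pi / 2, x)
                  - expval_z(theta - np.pi / 2, x))

# Fit <Z> to a target for one toy sample by gradient descent.
theta, x, target, lr = 0.3, 0.8, -1.0, 0.2
for _ in range(50):
    err = expval_z(theta, x) - target
    theta -= lr * 2.0 * err * parameter_shift_grad(theta, x)
```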

Mon, 23 May, 10:00 - 13:30 China Time (UTC +8)
Mon, 23 May, 02:00 - 05:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Feng Ji and Wee Peng Tay

There will be three parts, separated by two breaks:

  • Part 1: Dealing with high-dimensional structures
  • Part 2: Generalizing the graph signal domain
  • Part 3: Working with uncertainty

Graphs and networks are ubiquitous in science and technology, including image processing, research on the Internet of Things, and the analysis of social network data. Since its emergence, graph signal processing (GSP) has become an important tool in such areas. The basic tools of GSP include the graph Fourier transform, shift-invariant filter banks, and graph neural networks. Though a successful theory in many respects, GSP faces challenges in many circumstances and requires regular infusions of new ideas. In this tutorial, we discuss a few recent developments in advanced graph signal processing, focusing on high-dimensional signals and structures.
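To ground the basic toolkit, here is a minimal numpy sketch (ours, not the presenters') of the graph Fourier transform and an ideal low-pass graph filter, built from the eigendecomposition of the combinatorial Laplacian L = D - W:

```python
import numpy as np

def graph_fourier(W, x):
    """GFT of signal x on a graph with symmetric weighted adjacency W."""
    L = np.diag(W.sum(axis=1)) - W          # combinatorial Laplacian
    lam, U = np.linalg.eigh(L)              # eigenvectors = Fourier basis
    return lam, U, U.T @ x                  # frequencies, basis, coefficients

def lowpass(W, x, cutoff):
    """Shift-invariant filtering: keep graph frequencies below `cutoff`."""
    lam, U, x_hat = graph_fourier(W, x)
    h = (lam <= cutoff).astype(float)       # ideal low-pass response h(lambda)
    return U @ (h * x_hat)

# Toy 4-node path graph and a signal on it.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 2.5, 3.0])
x_smooth = lowpass(W, x, cutoff=1.0)
```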

Mon, 23 May, 10:00 - 13:30 China Time (UTC +8)
Mon, 23 May, 02:00 - 05:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Xu Tan and Tao Qin

Text to speech (TTS), which aims to synthesize natural and intelligible speech from a given text, has been a hot research topic in the speech community and has become an important commercial service in industry. With the development of deep learning, neural network-based TTS (neural TTS) has significantly improved the quality of synthesized speech in recent years. In this tutorial, we will introduce recent advances in neural TTS in four parts: 1) we first briefly overview the background and taxonomy of TTS; 2) we then introduce research advances in the key components of neural TTS, including text analysis, the acoustic model, and the vocoder; 3) next, we review research progress on advanced topics in neural TTS, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS; 4) at last, we describe several challenges of TTS and discuss future research directions. This tutorial can serve both academic researchers and industry practitioners working on TTS.
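For orientation, the sketch below wires the three key components from part 2) into a pipeline. Every function is a deliberately crude, hypothetical stand-in (real systems use trained neural networks at each stage), so only the interfaces and data flow carry information here.

```python
import numpy as np

def text_analysis(text):
    """Front end (toy): keep letters and map them to symbol ids."""
    return [ord(c) % 32 for c in text.lower() if c.isalpha()]

def acoustic_model(symbols, n_mels=80, frames_per_symbol=5):
    """Toy stand-in for an acoustic model: a flat spectrogram chunk per
    symbol. A real model predicts mel-spectrograms with a neural network."""
    chunks = [np.full((frames_per_symbol, n_mels), s / 32.0) for s in symbols]
    return np.concatenate(chunks)

def vocoder(mel, sr=16000, hop=200):
    """Toy stand-in for a vocoder: one sinusoid per frame, its pitch driven
    by mean mel energy. A real vocoder is a neural network."""
    freqs = 100.0 + 400.0 * mel.mean(axis=1)
    t = np.arange(hop) / sr
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs])

waveform = vocoder(acoustic_model(text_analysis("Hello ICASSP")))
```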

Mon, 23 May, 14:00 - 17:30 China Time (UTC +8)
Mon, 23 May, 06:00 - 09:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Stefan Vlaski, Roula Nassif, and Ali H. Sayed

The abundance of data and the proliferation of computational resources are driving a significant paradigm shift towards data-driven engineering design. While past practice often relied on physical models or approximations thereof, it is common nowadays to rely on datasets that represent the behavior of a complex system, rather than to employ an explicit mathematical model for it.

These datasets can arise from a multitude of sources. For example, data can be generated by mobile devices, by sensors scattered throughout “smart cities” and “smart grids”, or even by vehicles on the road. A common feature of such datasets is their heterogeneity, due to variations in data distributions at the local level. For example, variations in regional dialects within a country yield datasets that affect the training and performance of speech recognition models differently. Likewise, user preferences vary across regions of the world, and these variations affect the training and performance of recommender systems. In a similar vein, regional differences in weather patterns, power usage, and traffic affect the behavior and performance of many other monitoring systems.

Training a single model on heterogeneous data generally leads to poor sample efficiency and performance, resulting in models that perform “optimally” on average but can yield poor performance on any given local dataset. This fact has sparked significant research activity in recent years on multitask learning and meta-learning, where the purpose is to extract transferable information from heterogeneous data sources while allowing for local variability. In this tutorial, we will present a unifying and up-to-date overview of multitask and meta-learning, with a focus on learning from streaming data in federated and decentralized architectures.
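To fix ideas on learning from streaming data over decentralized architectures, here is a toy adapt-then-combine diffusion LMS sketch; the streaming source `stream` and the combination matrix A are assumptions of this illustration, and multitask variants would choose A (or add per-agent regularizers) so that only related agents are coupled.

```python
import numpy as np

def diffusion_lms(A, stream, w_init, mu=0.01, n_iter=1000):
    """Adapt-then-combine diffusion LMS over N networked agents.

    A      : N x N left-stochastic combination matrix; entry A[l, k] weights
             agent l's intermediate estimate in agent k's combine step.
    stream : hypothetical data source, stream(k, i) -> (u, d), giving agent
             k's regressor u and scalar observation d at time i.
    """
    N = A.shape[0]
    W = np.tile(w_init.astype(float), (N, 1))   # one estimate per agent
    for i in range(n_iter):
        psi = np.empty_like(W)
        for k in range(N):                      # adapt: local LMS step
            u, d = stream(k, i)
            psi[k] = W[k] + mu * u * (d - u @ W[k])
        for k in range(N):                      # combine: w_k = sum_l A[l,k] psi_l
            W[k] = A[:, k] @ psi
    return W
```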

Mon, 23 May, 14:00 - 17:30 China Time (UTC +8)
Mon, 23 May, 06:00 - 09:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Yuejie Chi, Yuting Wei, and Yuxin Chen

As a paradigm for sequential decision making in unknown environments, reinforcement learning (RL) has received a flurry of attention in recent years. However, the explosion of model complexity in emerging applications and the presence of nonconvexity exacerbate the challenge of achieving efficient RL in sample-starved situations, where data collection is expensive, time-consuming, or even high-stakes (e.g., in clinical trials, autonomous systems, and online advertising). Understanding and enhancing the sample and computational efficiencies of RL algorithms is thus of great interest and urgently needed. In this tutorial, we aim to present a coherent framework that covers important statistical and algorithmic developments in RL, highlighting the connections between new ideas and classical topics. Employing Markov decision processes (MDPs) as the central mathematical model, we start by introducing classical dynamic programming algorithms for settings where precise descriptions of the environment are available. Equipped with this background, we present three distinctive approaches: model-based algorithms, model-free value-based algorithms, and policy optimization. Our discussion gravitates around their sample complexity, computational efficiency, and function approximation, as well as information-theoretic and algorithm-dependent lower bounds. We will systematically introduce the optimism principle for online RL, the pessimism principle for offline RL, and variance-reduction strategies, which play a crucial role in achieving efficient RL in different scenarios.
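As a reference point for the dynamic-programming portion, the sketch below (our illustration of the textbook algorithm) runs classical value iteration on a tabular MDP whose transition tensor P and reward table r are known exactly:

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Value iteration for a known tabular MDP.

    P : (S, A, S) transition probabilities, r : (S, A) rewards.
    Returns the optimal value function and a greedy optimal policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality operator: Q(s, a) = r(s, a) + gamma * E[V(s')].
        Q = r + gamma * (P @ V)                 # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```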

Mon, 23 May, 14:00 - 17:30 China Time (UTC +8)
Mon, 23 May, 06:00 - 09:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Pin-Yu Chen and Chao-Han Huck Yang

Despite achieving high standard accuracy in a variety of machine learning tasks, neural-network-based prediction models have recently been shown to lack adversarial robustness. In particular, recent advances in deep learning models for speech and language introduce new challenges (e.g., end-to-end evaluation, data privacy) and opportunities (e.g., noise-aware adaptation).

This tutorial will provide an overview of recent advances in research on adversarial robustness, featuring both comprehensive research topics and technical depth. We will cover (1) the three fundamental pillars of adversarial robustness: attack, defense, and verification, and (2) recent advances in adversarial reprogramming. Attack refers to the efficient generation of adversarial examples for robustness assessment under different attack assumptions (e.g., white-box or black-box attacks). Defense refers to adversary detection and robust training algorithms that enhance model robustness. Verification refers to attack-agnostic metrics and certification algorithms for the proper evaluation and standardization of adversarial robustness.
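To make the attack pillar concrete, here is a white-box fast gradient sign method (FGSM) sketch against a logistic-regression model; this is our own toy illustration, and for deep networks the input gradient would come from backpropagation rather than the closed form used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps=0.1):
    """FGSM on a logistic model p(y=1|x) = sigmoid(w.x + b): move x by
    eps (in L-infinity norm) in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w           # d(cross-entropy)/dx for this toy model
    return x + eps * np.sign(grad_x)

# Toy example: perturb a confidently classified point.
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([1.0, -1.0]), 1.0
x_adv = fgsm(x, y, w, b, eps=0.5)
```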

For each pillar, we will emphasize the tight connection between signal processing and research on adversarial robustness, ranging from fundamental techniques (first-order and zeroth-order optimization, minimax optimization, geometric analysis, model compression, data filtering and quantization, subspace analysis, active sampling, and frequency-component analysis) to specific applications (computer vision, automatic speech recognition, natural language processing, and data regression).

Finally, motivated by studies in adversarial robustness, model reprogramming will be introduced as an emerging and powerful technique for data-efficient transfer learning with large pre-trained “foundation” models when target-domain data is limited.

Mon, 23 May, 14:00 - 17:30 China Time (UTC +8)
Mon, 23 May, 06:00 - 09:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe, Tara Sainath, Karen Livescu, Shang-Wen Li, Shu-wen Yang, and Katrin Kirchhoff

Although deep learning models have revolutionized the speech and audio processing field, they have forced the building of specialist models for individual tasks and application scenarios. Deep neural models have also created a bottleneck for dialects and languages with limited labeled data.

Self-supervised representation learning methods promise a single universal model that benefits a collection of tasks and domains. They have recently succeeded in the NLP and computer vision domains, reaching new performance levels while reducing the labels required for many downstream scenarios. Speech representation learning is experiencing similar progress, with three main categories of methods: generative, contrastive, and predictive. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation learning is still a nascent research area, it is closely related to acoustic word embeddings and learning with zero lexical resources. This tutorial will present self-supervised speech representation learning approaches and their connection to related research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we will review recent efforts on benchmarking learned representations, to extend their application beyond speech recognition. A hands-on component of this tutorial will provide practical guidance on building and evaluating speech representation models.
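To make the contrastive category concrete, here is a minimal InfoNCE-style loss in numpy (a generic sketch of ours, not a specific model from the tutorial): each anchor representation must score its paired positive, e.g. another view or a future frame of the same utterance, above the other in-batch pairs acting as negatives.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss for paired (B, D) representations: anchors[i] should
    match positives[i] against the other in-batch positives (negatives)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature      # (B, B) cosine-similarity matrix
    # Cross-entropy with the diagonal entries as the correct classes.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
loss = info_nce(z1, z2)
```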

Mon, 23 May, 14:00 - 17:30 China Time (UTC +8)
Mon, 23 May, 06:00 - 09:30 UTC
In-Person
Live-Stream
Tutorial

Presented by: Cédric Févotte and Vincent Y. F. Tan

More than twenty years have passed since nonnegative matrix factorization (NMF) was introduced in the seminal works of Paatero & Tapper (1994) and Lee & Seung (1999). Since then, NMF has had a major impact in fields such as audio source separation, hyperspectral unmixing, user recommendation, text information retrieval, biometrics, and network analysis. Although more complex architectures such as neural networks can outperform factorization models in some supervised settings, NMF is based on a generative and interpretable model that remains a popular choice in many cases, in particular for unmixing tasks with little or no training data (i.e., in unsupervised settings). The tutorial reviews some of the most important advances in NMF over the last decade, with a focus on recent advances in optimization for NMF (including state-of-the-art algorithms such as majorization-minimization, convergence properties, large-scale implementations, and sparse and temporal regularization), model selection for NMF (including the choice of a proper measure of fit and rank estimation), and recent extensions of NMF (including robust NMF in the presence of outliers, separable NMF, positive semi-definite matrix factorization, and NMF-based ranking models). The tutorial targets both beginners with no prior experience in NMF and more knowledgeable practitioners, via its more advanced material.
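For reference, the classic Lee & Seung multiplicative updates for the Frobenius-norm objective, one member of the majorization-minimization family covered in the optimization part, can be sketched as follows (a minimal illustration, not the presenters' code):

```python
import numpy as np

def nmf_mu(V, rank, n_iter=200, eps=1e-12, seed=0):
    """NMF via multiplicative updates, minimizing ||V - W H||_F^2.
    Each update is a majorization-minimization step: the objective never
    increases, and W, H stay entrywise nonnegative."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Toy usage on a random nonnegative matrix.
V = np.random.default_rng(1).random((20, 30))
W, H = nmf_mu(V, rank=5)
```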

Wed, 26 Oct, 08:00 - 11:30 China Time (UTC +8)
Wed, 26 Oct, 00:00 - 03:30 UTC
In-Person
Tutorial

Presented by: Zhijian Ou

Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are radically different from other popular probabilistic models, such as hidden Markov models (HMMs), auto-regressive models, Generative Adversarial Networks (GANs), and Variational Auto-encoders (VAEs), which are all self-normalized (i.e., sum to one). In recent years, EBMs have attracted increasing interest not only from core machine learning but also from application domains such as speech, vision, and natural language processing (NLP), with significant theoretical and algorithmic progress. To the best of our knowledge, there are no tutorials about EBMs with applications to speech and language processing. The sequential nature of speech and language also presents special challenges and requires treatment different from processing fixed-dimensional data (e.g., images).

The purpose of this tutorial is to present a systematic introduction to energy-based models, covering both algorithmic progress and applications in speech and language processing; it is organized into four parts. First, we will introduce the basics of EBMs, including classic models, recent models parameterized by neural networks, and various learning algorithms from classic methods to the most advanced ones. The next three parts will present how to apply EBMs in three different scenarios: 1) EBMs for language modeling, 2) EBMs for natural language labeling and speech recognition, and 3) EBMs for semi-supervised natural language labeling. In addition, we will introduce open-source toolkits to help the audience become familiar with the techniques for developing and applying energy-based models.

Wed, 26 Oct, 13:00 - 16:30 China Time (UTC +8)
Wed, 26 Oct, 05:00 - 08:30 UTC
In-Person
Tutorial

Presented by: Xu Tan and Tao Qin

Text to speech (TTS), which aims to synthesize natural and intelligible speech from a given text, has been a hot research topic in the speech community and has become an important commercial service in industry. With the development of deep learning, neural network-based TTS (neural TTS) has significantly improved the quality of synthesized speech in recent years. In this tutorial, we will introduce recent advances in neural TTS in four parts: 1) we first briefly overview the background and taxonomy of TTS; 2) we then introduce research advances in the key components of neural TTS, including text analysis, the acoustic model, and the vocoder; 3) next, we review research progress on advanced topics in neural TTS, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS; 4) at last, we describe several challenges of TTS and discuss future research directions. This tutorial can serve both academic researchers and industry practitioners working on TTS.