Plenary Speakers

Kristen Grauman

The University of Texas at Austin, Austin, TX, USA

Look and Listen: Audio-Visual Learning in Video

Abstract

Perception systems that can both see and hear have great potential to unlock real-world video understanding. When the two modalities work together, they can improve data efficiency for machine learning algorithms by connecting the dots between the interacting signals. I will present our recent work exploring audio-visual video analysis in terms of both semantic and spatial perception. First, we consider visually-guided audio source separation: given video with multiple sounding objects, which sounds come from which visual objects? The proposed methods can focus on a human speaker’s voice amidst busy ambient sounds, split the sounds of multiple instruments playing simultaneously, or simply provide a semantic prior for the category of a visible object. Then, turning to activity recognition, we leverage audio as a fast “preview” for an entire video clip in order to concentrate expensive visual feature computation where it is most needed. Finally, moving from those semantic tasks to spatial audio understanding, we introduce approaches for self-supervised feature learning that leverage sounds heard during training to embed geometric cues into visual encoders. The resulting representations benefit spatially grounded tasks like depth estimation, immersive 3D sound generation for video, and even audio source separation.
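
To make the separation setting concrete, here is a minimal sketch of the mask-based formulation common in the audio-visual separation literature (an illustration of the general idea, not the speaker's actual models): a network predicts a time-frequency mask over the mixture spectrogram, conditioned on an embedding of one visible object. All module names and dimensions are illustrative assumptions.

```python
# Hedged sketch of visually guided source separation: predict a spectrogram
# mask from the audio mixture plus a visual object embedding. Sizes and
# module names are assumptions for illustration only.
import torch
import torch.nn as nn

class VisuallyGuidedSeparator(nn.Module):
    def __init__(self, n_freq=256, visual_dim=512, hidden=256):
        super().__init__()
        # Audio branch: encode each spectrogram frame.
        self.audio_enc = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # Visual branch: compress an object embedding (e.g., from a CNN).
        self.visual_enc = nn.Linear(visual_dim, hidden)
        # Mask head: fuse the two streams, predict a mask in [0, 1].
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mixture_spec, visual_feat):
        # mixture_spec: (batch, time, n_freq); visual_feat: (batch, visual_dim)
        a = self.audio_enc(mixture_spec)                  # (B, T, H)
        v = self.visual_enc(visual_feat).unsqueeze(1)     # (B, 1, H)
        v = v.expand(-1, a.size(1), -1)                   # (B, T, H)
        mask = self.mask_head(torch.cat([a, v], dim=-1))  # (B, T, F)
        return mask * mixture_spec                        # separated source

# Toy usage: recover the source associated with one visible object.
model = VisuallyGuidedSeparator()
spec = torch.rand(2, 100, 256)   # magnitude spectrogram of the mixture
obj = torch.randn(2, 512)        # embedding of the sounding object
print(model(spec, obj).shape)    # torch.Size([2, 100, 256])
```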

Biography

Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Scientist at Facebook AI Research (FAIR). Her research in computer vision and machine learning focuses on visual recognition, video, and embodied perception. Before joining UT Austin in 2007, she received her Ph.D. at MIT. She is an IEEE Fellow, AAAI Fellow, Sloan Fellow, and recipient of the 2013 Computers and Thought Award. Kristen and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test-of-time award). She currently serves as an Associate Editor-in-Chief for PAMI and previously served as a Program Chair of CVPR 2015 and NeurIPS 2018.

David Minnen

Google

Current frontiers in neural image compression: the rate-distortion-computation trade-off and optimizing for subjective visual quality

Abstract

Learning-based methods for image and video compression hold the promise of significant advances in rate-distortion performance. To date, however, neural image compression has largely underdelivered on traditional measures of image quality despite using considerably more computation for decoding. Optimizing explicitly for alternative quality metrics that better correlate with human preferences leads to significant rate savings over standard codecs, but the resulting reconstructions often see little benefit in subjective evaluation tests. Similarly, smaller models can improve decode speed, but they typically lead to a commensurate drop in rate-distortion performance. This talk will cover state-of-the-art neural compression models along with two crucial research directions for learning-based compression. First, we’ll explore the rate-distortion-computation trade-off across different architectures to better understand how neural methods might achieve the decode speed required by typical applications. Second, we’ll discuss recent research that combines perceptual metrics and adversarial methods to boost subjective quality and provide nearly 50% rate savings over standard codecs.
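
As background, learned codecs of this kind are typically trained against a rate-distortion Lagrangian, and perceptually optimized variants extend the distortion term. The sketch below states the standard objective from the learned-compression literature; the weighting and exact terms are illustrative assumptions, not the specific loss from the talk.

```latex
% Standard rate-distortion objective for a learned codec: R is the bitrate of
% the quantized latents, D the distortion between input x and reconstruction
% \hat{x}; \lambda selects the operating point on the R-D curve.
\mathcal{L}_{\mathrm{RD}} = R + \lambda \, D(x, \hat{x})

% Perceptually optimized variants augment the distortion term, e.g.
% (illustrative weights, not the talk's exact loss):
\mathcal{L} = R + \lambda_1 D_{\mathrm{MSE}}(x, \hat{x})
                + \lambda_2 D_{\mathrm{perc}}(x, \hat{x})
                + \lambda_3 \mathcal{L}_{\mathrm{adv}}(\hat{x})
```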

Biography

David Minnen is a Staff Research Scientist at Google, where he works on deep learning for image and video compression. His research explores learning-based methods for nonlinear transform coding and has primarily looked at ways to improve neural architectures for entropy modeling and spatial adaptivity. His current work focuses on understanding the rate-distortion-computation trade-off for learning-based codecs and on optimizing compression models for rate vs. subjective quality using adversarial techniques and learned perceptual metrics.

Previously, Dr. Minnen developed user preference models for real-time frame selection in smart camera applications on Android. This research supported the automatic detection of action sequences, best-shot selection, and the generation of video summaries and collages. Earlier research includes a vision-based hand-tracking and pose-classification system for interactive gestural interfaces at Oblong Industries and an energy disaggregation algorithm to support whole-home energy monitoring, anomaly detection, and conservation at Belkin.

Dr. Minnen received his Ph.D. in 2008 from the Georgia Institute of Technology, where his research was funded by an NSF Graduate Research Fellowship. His thesis explored the unsupervised analysis of time series data to identify short, recurring temporal patterns. These patterns likely correspond to primitive components in the underlying process, such as phonemes in speech data or basic movements captured by on-body inertial sensors, and can be used to infer higher-level temporal structures with minimal supervision.

Mihaela van der Schaar

University of Cambridge, Cambridge, UK

From image processing to machine learning in healthcare

Abstract

Medicine stands apart from other areas where machine learning can be applied. While we have seen advances in other fields with lots of data, it is not the volume of data that makes medicine so hard; it is the challenges arising from extracting actionable information from the complexity of the data. It is these challenges that make medicine the most exciting area for anyone who is really interested in the frontiers of machine learning, giving us real-world problems whose solutions are societally important and potentially impact us all. Think COVID-19!

In this talk I will show how machine learning is transforming medicine and how medicine is driving new advances in machine learning as well as image processing, including new methodologies in automated machine learning, interpretable and explainable machine learning, dynamic forecasting, and causal inference.

Biography

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge, a Fellow at the Alan Turing Institute in London, and a Chancellor’s Professor at UCLA.

Mihaela was elected IEEE Fellow in 2009. She has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

Mihaela’s work has also led to 35 US patents (many widely cited and adopted in standards) and 45+ contributions to international standards, for which she received 3 ISO (International Organization for Standardization) Awards.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning, and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.

In addition to leading the van der Schaar Lab, Mihaela is founder and director of the Cambridge Centre for AI in Medicine (CCAIM).

John Tsotsos

York University, Toronto, Canada

Back to the Future: The Science of Computational Vision

Abstract

Sight and how we perceive our world have inspired philosophers, scientists, and engineers since ancient times. Since the early 1960s, researchers have approached the problem of how best to process and interpret images influenced, often quite strongly, by our knowledge of biological vision. One can see in many early papers the belief that, as a scientific goal, we might be able to capture important aspects of human vision computationally, sufficiently well to make testable predictions. In other words, that we could construct a falsifiable theory of human vision using the language of computation. This was an exciting challenge that pointed the community towards the future, but there were many skeptics of this possibility too. On the practical side, successes were initially few and far between, but this has rapidly improved to the point where computer vision is productively applied to all sorts of tasks.

What is not commonly observed is that computer vision methods were informed by knowledge that was current at the time of their inception. Our understanding of both visuospatial behavior and cognitive neurophysiology has changed, quite dramatically, in the intervening 60 years. What was once inspiring may no longer reflect reality, yet it surprisingly persists. From feature detection to the overall architecture of a vision system to learning strategies, I will traverse time from the mid-1900s to the present, connecting computational thinking to biological inspiration and providing updates to that inspiration that point to new directions for computational vision science. We will see that the original challenge, constructing a falsifiable theory of human vision using the language of computation, still grounds the science of computational vision.

Biography

John K. Tsotsos received his Hons. BASc in Engineering Science, his MSc in Computer Science, and his PhD in Computer Science from the University of Toronto in 1974, 1976, and 1980, respectively. He then joined the faculty of the University of Toronto in both the Department of Computer Science and the Department of Medicine. In 1980 he founded, and subsequently led, the internationally respected computer vision group at the University of Toronto. He moved to York University in 2000, where he is Distinguished Research Professor of Vision Science, while maintaining adjunct professorships at the University of Toronto in the Departments of Computer Science and of Ophthalmology and Vision Sciences. He directed York’s renowned Centre for Vision Research from 2000 to 2006 and is the founding Director of York’s Centre for Innovation in Computing at Lassonde.

Tsotsos’ research has always focused on how images are processed, understood and used. He and his lab have examined many aspects of attentional processing in machines and in humans and he is best known internationally for this body of work. His lab is also well-recognized as a pioneer in active object recognition and visual search. Both research threads led to embodiments on practical robots, notably an early children’s autonomous wheelchair project named PLAYBOT. Overall, his seminal contributions to computational vision span early and intermediate visual representations, computational complexity of perception, visual attention, robotic active perception, and medical image analysis, particularly in cardiology.

Honours include: Fellow, Artificial Intelligence and Robotics Program of the Canadian Institute for Advanced Research, 1985-1995; Tier I Canada Research Chair in Computational Vision, 2003-2024; Canadian Image Processing and Pattern Recognition Society Award for Research Excellence and Service, 2006; 1st President’s Research Excellence Award from York University, 2009; Fellow of the Royal Society of Canada, 2010; the Royal Society of Canada’s 2015 Sir John William Dawson Medal for sustained excellence in multidisciplinary research, the first computer scientist so honoured; and Fellow of the IEEE, 2018. He was a recipient of the 2020 Lifetime Achievement Award from Canada’s national computer science society, CS-Can | Info-Can.