Sample-efficient Audio-Visual Learning of Scene Acoustics

Arjun Somayazulu*, Sagnik Majumder*, Changan Chen,
Ziad Al-Halah, Kristen Grauman
UT Austin
*Equal contribution

An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Whereas traditional methods for constructing such models assume dense geometry and/or sound measurements throughout the environment, we explore how to infer room impulse responses (RIRs) from a sparse set of images and echoes observed in the space, and how to choose where to collect these audio-visual observations. Towards that goal, we first introduce a transformer-based method that uses self-attention to build a rich acoustic context, then infers the RIRs of arbitrary query source-receiver locations through cross-attention. Then, motivated by real-world physical constraints on collecting these observations, we further introduce active acoustic sampling, a new task in which a mobile agent jointly constructs the environment acoustic model and spatial occupancy map on-the-fly from sparse audio-visual observations. We train a reinforcement learning (RL) policy that guides agent navigation toward optimal acoustic data sampling positions, rewarding information gain for the full environment model. Evaluated on diverse unseen 3D indoor environments, our method outperforms the state-of-the-art and, in a major departure from traditional methods, generalizes to novel environments in a few-shot manner. Furthermore, when augmented with our active sampling policy, it successfully guides an embodied agent to acoustically informative positions given real-world exploration constraints, outperforming both traditional navigation agents and prior acoustic rendering methods.
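
Below is a minimal, illustrative PyTorch sketch of the environment acoustic model described above (not the released implementation): self-attention fuses a sparse set of audio-visual observations into an acoustic context, and a query source-receiver pose is decoded into an RIR via cross-attention. All module names, feature dimensions, and encoders here are assumptions for exposition.

```python
# Illustrative sketch of the transformer-based environment acoustic model:
# sparse audio-visual observations are fused into an acoustic context with
# self-attention; a query source-receiver pose attends to that context via
# cross-attention to predict its RIR. Dimensions/encoders are assumptions.

import torch
import torch.nn as nn


class SceneAcousticModel(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, rir_len=16000):
        super().__init__()
        # Hypothetical per-observation encoder; in practice this could fuse
        # a CNN feature of the egocentric image with an echo spectrogram feature.
        self.obs_encoder = nn.Linear(512, d_model)
        self.pose_encoder = nn.Linear(6, d_model)   # (source xyz, receiver xyz)
        # Self-attention over the sparse observation set builds the context.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, n_layers)
        # Cross-attention: the query pose attends to the acoustic context.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rir_head = nn.Linear(d_model, rir_len)  # decode to an RIR waveform

    def forward(self, obs_feats, query_pose):
        # obs_feats: (B, N, 512) features of N audio-visual observations
        # query_pose: (B, 6) arbitrary source/receiver location pair
        ctx = self.context_encoder(self.obs_encoder(obs_feats))  # (B, N, d)
        q = self.pose_encoder(query_pose).unsqueeze(1)           # (B, 1, d)
        fused, _ = self.cross_attn(q, ctx, ctx)                  # (B, 1, d)
        return self.rir_head(fused.squeeze(1))                   # (B, rir_len)


model = SceneAcousticModel()
rir = model(torch.randn(2, 10, 512), torch.randn(2, 6))  # -> shape (2, 16000)
```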
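
Likewise, a hedged sketch of the information-gain reward that trains the active sampling policy: each step, the agent is rewarded by how much its newest audio-visual sample reduces RIR prediction error across the full environment. The L1 waveform metric, the held-out evaluation queries, and the function interfaces are illustrative assumptions, not the paper's exact reward.

```python
# Hedged sketch of an information-gain reward for active acoustic sampling:
# reward = reduction in whole-environment RIR error after adding the new
# observation to the acoustic context. Builds on SceneAcousticModel above.

import torch


def rir_error(model, observations, eval_queries, gt_rirs):
    """Mean L1 error of predicted vs. ground-truth RIRs at held-out poses."""
    obs = torch.stack(observations).unsqueeze(0)  # (1, N, 512)
    errs = [
        (model(obs, q.unsqueeze(0)).squeeze(0) - gt).abs().mean()
        for q, gt in zip(eval_queries, gt_rirs)
    ]
    return torch.stack(errs).mean()


def information_gain_reward(model, obs_so_far, new_obs, eval_queries, gt_rirs):
    """Reward = drop in whole-environment error after adding the new sample."""
    with torch.no_grad():
        before = rir_error(model, obs_so_far, eval_queries, gt_rirs)
        after = rir_error(model, obs_so_far + [new_obs], eval_queries, gt_rirs)
    return (before - after).item()  # positive when the sample is informative
```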

Paper video

Task description, approach, prediction examples and downstream applications.


Copyright © 2025 University of Texas at Austin