Few-Shot Audio-Visual Learning of Environment Acoustics

Sagnik Majumder¹, Changan Chen¹,²*, Ziad Al-Halah¹*, Kristen Grauman¹,²
¹UT Austin, ²Facebook AI Research
* Equal contribution
Accepted to NeurIPS 2022

Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional methods, generalizing to novel environments in a few-shot manner.
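The two-stage attention pattern named in the abstract (self-attention over the sparse observations, then cross-attention from a query) can be illustrated with a minimal sketch. This is not the authors' model. It assumes the images and echoes have already been encoded into fixed-size vectors, uses a single attention head with no learned projections, and stops before any RIR decoding. The names `attention`, `predict_query_features`, `observations`, and `queries` are hypothetical and chosen for illustration only.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned weights)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def predict_query_features(observations, queries):
    """Hypothetical two-stage pipeline: context building, then querying."""
    # Stage 1: self-attention lets each observation embedding attend to
    # all others, producing a shared context.
    context = attention(observations, observations, observations)
    # Stage 2: cross-attention lets each query embedding (assumed to encode
    # a source-receiver location pair) read from that context.
    return attention(queries, context, context)


random.seed(0)
dim = 8
# Assumption: 5 sparse observations and 3 queries, already embedded.
observations = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
features = predict_query_features(observations, queries)
print(len(features), len(features[0]))  # 3 8
```

In this sketch the number of output vectors depends only on the number of queries, not on how many observations were collected, which is what allows arbitrary query locations to be evaluated against a fixed, sparse context. A full model would add learned projections, multiple heads and layers, and a decoder that maps each output vector to an RIR.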

Qualitative Results

Task description, prediction examples and downstream applications.


  title={Few-Shot Audio-Visual Learning of Environment Acoustics},
  author={Sagnik Majumder and Changan Chen and Ziad Al-Halah and Kristen Grauman},

Thanks to Tushar Nagarajan and Kumar Ashutosh for feedback on paper drafts.

Copyright © 2022 University of Texas at Austin