Active Audio-Visual Separation of Dynamic Sound Sources

Sagnik Majumder1, Ziad Al-Halah1, Kristen Grauman1,2
1UT Austin, 2Facebook AI Research

In review.

We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream emitted by an object of interest. The agent hears a mixed stream of multiple time-varying audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it must extract the target sound using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, improving its own estimates for past timesteps via self-attention. Using highly realistic acoustic SoundSpaces simulations in real-world scanned Matterport3D environments, we show that our model learns efficient behavior to continuously separate a time-varying audio target.
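The key idea in the abstract is that the agent's transformer memory self-attends over its stored per-timestep audio estimates, so an observation at the current step can also revise estimates made at earlier steps. A minimal numpy sketch of that self-attention mechanism follows; the function names, dimensions, and random weights are illustrative stand-ins, not the paper's actual architecture or parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(memory, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a memory.

    memory: (T, d) array, one embedding per past timestep's
    separated-audio estimate (hypothetical representation).
    Returns refined embeddings for ALL timesteps, so past
    estimates are updated alongside the current one.
    """
    Q, K, V = memory @ Wq, memory @ Wk, memory @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) pairwise scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (T, d) refined memory

# Toy example: a 5-step memory of 8-dim estimate embeddings.
rng = np.random.default_rng(0)
T, d = 5, 8
memory = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
refined = self_attend(memory, Wq, Wk, Wv)
print(refined.shape)  # (5, 8): every timestep's estimate is revised
```

Because every row of the memory attends to every other row, a cleaner observation gathered late in the episode can sharpen the estimates stored for earlier, noisier timesteps, which is the mechanism the abstract describes.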

Qualitative Results

Simulation demos and navigation examples.


BibTeX

@article{majumder2022active,
	title={Active Audio-Visual Separation of Dynamic Sound Sources},
	author={Majumder, Sagnik and Al-Halah, Ziad and Grauman, Kristen},
	journal={arXiv preprint arXiv:2202.00850},
	year={2022}
}


Thanks to Tushar Nagarajan and Kumar Ashutosh for feedback on paper drafts.

Copyright © 2022 University of Texas at Austin