SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

Changan Chen*¹,⁴, Carl Schissler*², Sanchit Garg*², Philip Kobernik², Alexander Clegg⁴,
Paul Calamia², Dhruv Batra³,⁴, Philip W Robinson², Kristen Grauman¹,⁴

¹UT Austin, ²Reality Labs at Meta, ³Georgia Tech, ⁴FAIR, Meta AI

NeurIPS 2022

[Website]

[Code]

[Paper]

[Bibtex]

We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, mapping, source localization and separation, and acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the advantages of allowing continuous spatial sampling, generalization to novel environments, and configurable microphone and material properties. To the best of our knowledge, this is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough to use for embodied learning. We showcase the simulator's properties and benchmark its performance against real-world audio measurements. In addition, we demonstrate its utility through two downstream tasks covering embodied navigation and far-field automatic speech recognition, highlighting sim2real performance for the latter. SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.
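As a rough illustration of what rendering a sound at a microphone location involves, a common approach in acoustic simulation is to convolve the dry source audio with a room impulse response (RIR) for the source-microphone pair. The sketch below is not the SoundSpaces 2.0 API: it uses a hand-made toy RIR, and the function name `render_at_microphone`, the sample rate, and the reflection delay are all hypothetical choices made for the example.

```python
import numpy as np


def render_at_microphone(dry_sound, impulse_response):
    """Convolve a dry (echo-free) signal with a room impulse response.

    Hypothetical helper for illustration only; a real simulator would
    compute the impulse response from the scene geometry and materials.
    """
    return np.convolve(dry_sound, impulse_response, mode="full")


sample_rate = 16000  # Hz, assumed for this toy example

# Toy impulse response: the direct path plus one attenuated reflection
# arriving 160 samples (10 ms at 16 kHz) later.
impulse_response = np.zeros(sample_rate // 10)
impulse_response[0] = 1.0
impulse_response[160] = 0.5

# Dry source: a single click at time zero.
dry_sound = np.zeros(100)
dry_sound[0] = 1.0

# What the microphone would record: the click followed by its echo.
received = render_at_microphone(dry_sound, impulse_response)
```

Because the toy source is a single click, `received` simply reproduces the impulse response; with speech or music as `dry_sound`, the same convolution would add the room's reflections to the recording.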

Teaser

5-minute Presentation Video

Supplementary Video

References

(1) Changan Chen*, Unnat Jain*, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman. SoundSpaces: Audio-Visual Navigation in 3D Environments. In ECCV 2020 [Bibtex]

(2) Ruohan Gao, Changan Chen, Carl Schissler, Ziad Al-Halah, Kristen Grauman. VisualEchoes: Spatial Image Representation Learning through Echolocation. In ECCV 2020 [Bibtex]

(3) Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman. Audio-Visual Waypoints for Navigation. In ICLR 2021 [Bibtex]

Acknowledgements

UT Austin is supported in part by the DARPA Lifelong Learning Machines program and the UT Austin IFML NSF AI Institute.