Novel-view Acoustic Synthesis

Abstract

We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos.

Two Large-scale Multi-view Audio-visual Datasets

We collect two first-of-their-kind large-scale multi-view audio-visual datasets, one real (Replay-NVAS) and one synthetic (SoundSpaces-NVAS). The former captures 46 scenarios from 8 different viewpoints, amounting to 37 hours of video data. The latter renders 1.3K audio-visual data for a total of 1,000 speakers, 120 3D scenes and 200K viewpoints.

Replay-NVAS
(wear headphone to hear the spatial sound)

SoundSpaces-NVAS

Presentation Video

Visually Guided Acoustic Synthesis (ViGAS)

Given the input audio, we first separate out the ambient sound to focus on the sound of interest. We take the source audio and source visual to localize the active speaker on the 2D image. We also extract the visual acoustic features of the environment by running an encoder on the source visual. We concatenate and fuse the active speaker feature, source visual features, and the target pose. We feed audio and visual features into the acoustic synthesis network, which has several stacked audio-visual fusion blocks. In each block, audio-visual features are processed by dilated conv1d layers. Lastly, the previously separated ambient sound is added back to the waveform.

Qualitative Results

In this SoundSpaces-NVAS example, we show that for the same source viewpoint, ViGAS predicts different target audio depending on the target viewpoint.

In this Replay-NVAS example, we show the source, target audio and our prediction. See more examples in Supp. video.

Supplementary Video

This video includes examples for the Replay-NVAS dataset and the SoundSpaces-NVAS dataset as well our model's prediction on both datasets. Listen with a headphone for the spatial sound.