[Concept figure]

Abstract

Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process. In particular, we develop a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream’s coherence with the sound source(s) positions, and the consistency in geometry of the sounding objects over time. Furthermore, we introduce two new large video datasets: one with realistic binaural audio simulated for real-world scanned environments, and the other with pseudo-binaural audio obtained from ambisonic sounds in YouTube 360° videos. On three datasets, we demonstrate the efficacy of our method, which achieves state-of-the-art results.
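To make the mono-to-binaural setup concrete, below is a minimal sketch of the formulation common to this line of work: a network predicts the left/right difference signal from the mono mixture, conditioned on visual features, and the two channels are then recovered arithmetically. This is illustrative only; `DifferenceNet` and its toy fusion layer are hypothetical placeholders, not the paper's geometry-aware architecture.

```python
# Minimal sketch of the common mono-to-binaural formulation. The network
# predicts the left/right difference signal from the mono mixture spectrogram,
# conditioned on visual features. `DifferenceNet` is a hypothetical stand-in
# for the paper's geometry-aware audio-visual network.
import torch
import torch.nn as nn

class DifferenceNet(nn.Module):
    """Hypothetical placeholder: maps (mixture spectrogram, visual features)
    to a complex scaling of the mixture, yielding the predicted difference."""
    def __init__(self, visual_dim=512):
        super().__init__()
        self.fuse = nn.Linear(visual_dim, 2)  # toy fusion; a real model would be a U-Net

    def forward(self, mix_spec, visual_feat):
        # mix_spec: (B, F, T) complex; visual_feat: (B, visual_dim)
        scale = self.fuse(visual_feat)                    # (B, 2)
        mask = torch.complex(scale[:, :1], scale[:, 1:])  # (B, 1) complex
        return mix_spec * mask.unsqueeze(-1)              # predicted difference spectrogram

def binauralize(mono, visual_feat, net, n_fft=512, hop=160):
    """Reconstruct left/right channels from mono audio.
    With mixture x_M = x_L + x_R and predicted difference x_D = x_L - x_R:
        x_L = (x_M + x_D) / 2,   x_R = (x_M - x_D) / 2
    """
    window = torch.hann_window(n_fft, device=mono.device)
    mix_spec = torch.stft(mono, n_fft, hop, window=window, return_complex=True)
    diff_spec = net(mix_spec, visual_feat)
    diff = torch.istft(diff_spec, n_fft, hop, window=window, length=mono.shape[-1])
    left = (mono + diff) / 2
    right = (mono - diff) / 2
    return torch.stack([left, right], dim=1)  # (B, 2, samples)

# Example: binauralize(torch.randn(1, 16000), torch.randn(1, 512), DifferenceNet())
# returns a (1, 2, 16000) stereo waveform.
```

A common motivation for predicting the difference rather than each channel directly is that the mixture already fixes the audio content, leaving the network to model only the spatialization.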

Qualitative Results

In the qualitative video, we show (a) examples from our SimBinaural and YouTube-Binaural datasets; (b) example results of the binaural audio prediction task on the SimBinaural, FAIR-Play, and YouTube-Binaural datasets; and (c) examples of the interface used in our user studies. Please wear headphones or earphones (both left and right) while watching the video.


SimBinaural Dataset

The SimBinaural Dataset is available for download here. If you have any questions, please send an email to Rishabh.


YouTube-Binaural Dataset

The YouTube-Binaural Dataset is available for download here. If you have any questions, please send an email to Rishabh.
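For either dataset, the sketch below shows one way to inspect a downloaded clip, assuming the clips are standard two-channel (left/right) WAV files; the filename is hypothetical.

```python
# Inspect a downloaded binaural clip, assuming standard two-channel WAV files.
# "example_clip.wav" is a hypothetical filename.
import soundfile as sf

audio, sr = sf.read("example_clip.wav")    # audio: (samples, 2) for binaural
assert audio.ndim == 2 and audio.shape[1] == 2, "expected left/right channels"
left, right = audio[:, 0], audio[:, 1]
mono = left + right                        # mono mixture used as model input
print(f"sample rate: {sr} Hz, duration: {len(audio) / sr:.2f} s")
```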