Learning Audio-Visual Dereverberation

¹University of Texas at Austin, ²FAIR, Meta AI

We propose to leverage the acoustic cues revealed by the visual input to help dereverberate the audio.

Abstract

Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed monaural sound and visual scene. In support of this new task, we develop a large-scale dataset SoundSpaces-Speech that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over audio-only methods.

Presentation Video

Method

We convert the input speech to a spectrogram and use overlapping sliding windows to obtain fixed-length segments. For visual inputs, we use ResNet18 networks to extract features. We feed each spectrogram segment to a UNet encoder, tile the visual features and concatenate them with the encoder's output, and then use the UNet decoder to predict the clean spectrogram. During inference, we stitch the predicted spectrograms back into a full spectrogram and use Griffin-Lim to reconstruct the output dereverberated waveform.
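As a rough illustration, below is a minimal PyTorch sketch of this forward pass. The class name, layer counts, channel sizes, and the exact point where the visual features are injected are our assumptions for illustration only; the actual VIDA architecture may differ in these details.

    import torch
    import torch.nn as nn
    import torchvision.models as tv_models

    class AudioVisualDereverb(nn.Module):
        """Sketch: a spectrogram UNet whose bottleneck is conditioned
        on tiled ResNet18 visual features (hypothetical layout)."""

        def __init__(self, visual_dim=512):
            super().__init__()
            # Visual branch: ResNet18 trunk, global-pooled to a feature vector.
            resnet = tv_models.resnet18(weights=None)
            self.visual_net = nn.Sequential(*list(resnet.children())[:-1])

            # UNet encoder over a magnitude-spectrogram segment.
            self.enc1 = self._conv(1, 64)
            self.enc2 = self._conv(64, 128)
            self.enc3 = self._conv(128, 256)

            # Decoder consumes the bottleneck concatenated with visual features,
            # plus skip connections from the encoder.
            self.dec3 = self._deconv(256 + visual_dim, 128)
            self.dec2 = self._deconv(128 + 128, 64)
            self.dec1 = nn.ConvTranspose2d(64 + 64, 1, 4, stride=2, padding=1)

        @staticmethod
        def _conv(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

        @staticmethod
        def _deconv(c_in, c_out):
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU())

        def forward(self, spec_segment, image):
            # spec_segment: (B, 1, F, T) magnitude spectrogram segment,
            #               F and T assumed divisible by 8
            # image:        (B, 3, H, W) view of the room
            e1 = self.enc1(spec_segment)
            e2 = self.enc2(e1)
            e3 = self.enc3(e2)

            # Tile the pooled visual feature over the bottleneck's spatial grid.
            v = self.visual_net(image).flatten(1)               # (B, 512)
            v = v[:, :, None, None].expand(-1, -1, *e3.shape[2:])

            d3 = self.dec3(torch.cat([e3, v], dim=1))
            d2 = self.dec2(torch.cat([d3, e2], dim=1))
            return self.dec1(torch.cat([d2, e1], dim=1))        # clean spectrogram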

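Similarly, here is a minimal sketch of the inference-time stitching and Griffin-Lim reconstruction. The STFT and windowing settings (N_FFT, HOP, SEG, STRIDE) are placeholders, not the paper's values, and overlapping predictions are simply averaged, which is one plausible stitching choice.

    import torch
    import torchaudio

    # Hypothetical settings; the page does not state the real values.
    N_FFT, HOP = 512, 160        # spectrogram with F = N_FFT // 2 + 1 bins
    SEG, STRIDE = 256, 128       # sliding-window length / hop, in frames

    griffin_lim = torchaudio.transforms.GriffinLim(
        n_fft=N_FFT, hop_length=HOP, power=1.0)  # power=1: magnitude input

    @torch.no_grad()
    def dereverberate(model, spec, image):
        # spec:  (1, F, T) magnitude spectrogram of the reverberant utterance
        # image: (3, H, W) view of the room
        T = spec.shape[-1]
        out = torch.zeros_like(spec)
        hits = torch.zeros(1, 1, T)

        # Enhance each overlapping segment, then average predictions where
        # windows overlap. For simplicity we assume T >= SEG and ignore the
        # tail frames that a real implementation would handle by padding.
        for start in range(0, T - SEG + 1, STRIDE):
            seg = spec[None, :, :, start:start + SEG]     # (1, 1, F, SEG)
            pred = model(seg, image[None])[0]             # (1, F, SEG)
            out[:, :, start:start + SEG] += pred
            hits[..., start:start + SEG] += 1.0

        out = out / hits.clamp(min=1.0)
        return griffin_lim(out)                           # dereverberated waveform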

Qualitative Results

We show qualitative results for one synthetic and one real example. For both, we display the panorama image, the clean audio, the reverberant audio, and our model's prediction.

Supplementary Video

This video includes examples of audio-visual data in simulation and in the real world. We demonstrate our dereverberation model applied to these inputs.

BibTeX

@inproceedings{chen2023av_dereverb,
    title = {Learning Audio-Visual Dereverberation},
    author = {Changan Chen and Wei Sun and David Harwath and Kristen Grauman},
    year = {2023},
    booktitle = {ICASSP}
}