VisualEchoes: Spatial Image Representation Learning through Echolocation

ECCV 2020

       Ruohan Gao       Changan Chen       Ziad Al-Halah       Carl Schissler       Kristen Grauman       

The University of Texas at Austin     Facebook Reality Labs     Facebook AI Research

[Main] [Supp] [Data] [Bibtex]


Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D indoor scene environments. Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning—monocular depth estimation, surface normal estimation, and visual navigation—with results comparable to, or even better than, those of heavily supervised pre-training. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.

1-Minute Overview

Echolocation Simulation

We show an example of the agent navigating in one Replica scene and performing echolocation. The agent emits 3 ms chirp signals sweeping from 20 Hz to 20 kHz and receives echo responses from the room. The echoes of the emitted chirps reflect the scene geometry.
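To make the emitted signal concrete, here is a minimal sketch of generating a 3 ms linear sweep from 20 Hz to 20 kHz with NumPy. The 44.1 kHz sample rate and the linear (rather than logarithmic) sweep are assumptions for illustration; they are not specified above.

```python
import numpy as np

# Assumed parameters (not stated on this page): 44.1 kHz sample rate,
# linear frequency sweep.
SAMPLE_RATE = 44_100             # samples per second
DURATION = 0.003                 # 3 ms chirp
F_START, F_END = 20.0, 20_000.0  # sweep from 20 Hz to 20 kHz

t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE
# Instantaneous phase of a linear chirp: 2*pi*(f0*t + (f1 - f0)/(2T) * t^2),
# so the instantaneous frequency rises linearly from f0 to f1 over T seconds.
phase = 2 * np.pi * (F_START * t + (F_END - F_START) / (2 * DURATION) * t**2)
chirp = np.sin(phase)

print(chirp.shape)  # (132,) samples: 3 ms at 44.1 kHz
```

Convolving such a chirp with a room impulse response would yield the kind of echo response the agent receives from the scene.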

10-Minute Talk


R. Gao, C. Chen, Z. Al-Halah, C. Schissler, K. Grauman. "VisualEchoes: Spatial Image Representation Learning through Echolocation". In ECCV, 2020. [bibtex]

  title = {VisualEchoes: Spatial Image Representation Learning through Echolocation},
  author = {Gao, Ruohan and Chen, Changan and Al-Halah, Ziad and Schissler, Carl and Grauman, Kristen},
  booktitle = {ECCV},
  year = {2020}

UT Austin is supported in part by DARPA Lifelong Learning Machines and ONR PECASE. RG is supported by a Google PhD Fellowship and an Adobe Research Fellowship.