Abstract

First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on their intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a human-centric model of a physical space (such as a kitchen) that captures (1) the primary spatial zones of interaction and (2) the likely activities they support. Our approach decomposes a space into a topological map derived from first-person activity, organizing an egocentric video into a series of visits to the different zones. Further, we show how to link zones across multiple related environments (e.g., from videos of multiple kitchens) to obtain a consolidated representation of environment functionality. On EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene affordances and anticipating future actions in long-form video.
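
To make the representation concrete, the sketch below shows one plausible way to build such a topological map from a sequence of zone visits: nodes are interaction zones that accumulate their visits and observed actions, and edges count transitions between zones. The function name, visit tuple layout, and attribute names are illustrative assumptions, not the released API.

import networkx as nx

# Sketch of an EGO-TOPO-style topological map: nodes are interaction
# zones, edges are transitions the camera wearer makes between them.
# All names here are illustrative, not the released code.
def build_topo_graph(visits):
    """visits: ordered (zone_id, start_frame, end_frame, actions) tuples."""
    G = nx.Graph()
    prev_zone = None
    for zone_id, start, end, actions in visits:
        if zone_id not in G:
            G.add_node(zone_id, visits=[], actions=set())
        G.nodes[zone_id]["visits"].append((start, end))
        G.nodes[zone_id]["actions"].update(actions)
        if prev_zone is not None and prev_zone != zone_id:
            # Weight edges by how often the wearer moves between zones.
            if G.has_edge(prev_zone, zone_id):
                G[prev_zone][zone_id]["weight"] += 1
            else:
                G.add_edge(prev_zone, zone_id, weight=1)
        prev_zone = zone_id
    return G

# Toy example: three visits spanning two zones in one kitchen video.
G = build_topo_graph([
    ("sink", 0, 120, {"wash pan"}),
    ("stove", 121, 400, {"fry egg"}),
    ("sink", 401, 520, {"rinse plate"}),
])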

Qualitative Results

Explore the constructed topological maps and learned scene affordances here.


Downloadable Data

Precomputed topological graphs: EGO-TOPO graphs precomputed for EPIC and EGTEA+ videos can be downloaded here, for use as auxiliary information in video understanding tasks on the two datasets. A readme file explaining the structure of each graph is included.
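
The bundled readme is the authoritative description of the graph format. Purely as an illustration, if each graph were serialized as a pickled networkx object, loading and inspecting one might look like the following (the filename and attribute names are hypothetical):

import pickle

# Hypothetical loading code: the actual serialization format is
# documented in the readme bundled with the download.
with open("P04_23_graph.pkl", "rb") as f:  # hypothetical filename
    G = pickle.load(f)

# Inspect per-zone data attached to each node (attribute names will
# depend on the release; see the readme).
for node, data in G.nodes(data=True):
    print(node, sorted(data.keys()))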

Affordance annotations: Crowdsourced affordance annotations collected for EPIC and EGTEA+ frames can be downloaded here. Each line is one instance, containing a frame reference (e.g., video P04_23, frame 347) and all interactions annotated for that frame (e.g., fill pan; mix egg; mix onion; mix pan ...). A readme file with more details is included. Pointers to model and evaluation code are here.
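
For illustration, assuming each annotation line pairs a frame identifier with a semicolon-separated interaction list as in the example above (the exact field layout is specified in the readme), a parser might look like:

# Hypothetical parser: assumes tab-separated fields per line, i.e.
# <video_id> \t <frame> \t <interaction>; <interaction>; ...
# Check the bundled readme for the actual layout.
def parse_annotations(path):
    instances = []
    with open(path) as f:
        for line in f:
            video_id, frame, interaction_str = line.rstrip("\n").split("\t")
            interactions = [x.strip() for x in interaction_str.split(";") if x.strip()]
            instances.append({
                "video": video_id,
                "frame": int(frame),
                "interactions": interactions,
            })
    return instances

# e.g. {"video": "P04_23", "frame": 347,
#       "interactions": ["fill pan", "mix egg", "mix onion", "mix pan"]}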


Cite

If you find this work useful in your own research, please consider citing:
@inproceedings{ego-topo,
    author = {Nagarajan, Tushar and Li, Yanghao and Feichtenhofer, Christoph and Grauman, Kristen},
    title = {EGO-TOPO: Environment Affordances from Egocentric Video},
    booktitle = {CVPR},
    year = {2020}
}