First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. We present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings to facilitate human-centric environment understanding. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that state-of-the-art video models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D.

Method overview


Annotations collected over Ego4D and HouseTours can be found here. These include:

Room annotations for Ego4D and HouseTours: Each entry corresponds to a "visit" with the following information:

  video_uid: Ego4D/HT video uid
  start_time: timestamp when the camera-wearer enters the room
  end_time: timestamp when the camera-wearer leaves the room
  label: room category (e.g., kitchen, bedroom, garage)
  instance: id for rooms of the same type (e.g., bedroom0, bedroom1 if there are two bedrooms)

NLQ annotations for HouseTours: Each entry corresponds to a natural language question asked about a video with the following information:

  video_uid: Ego4D/HT video uid
  query: natural language question to be grounded
  response_start: timestamp for the response start
  response_end: timestamp for the response end
  category: question type (e.g., visit_x, see_x_then_y)
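
As a rough illustration of how the two annotation types fit together, the sketch below models each entry as a Python dataclass with the fields listed above and looks up which room visits overlap a query's response window. It rests on several assumptions that are not confirmed by the release: that timestamps are numeric seconds, that entries can be loaded as dictionaries keyed by these field names, and that visits are treated as half-open intervals. The helper names (`RoomVisit`, `NLQEntry`, `room_at`, `rooms_for_query`), the video uid, and the query text are hypothetical and made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoomVisit:
    video_uid: str     # Ego4D/HT video uid
    start_time: float  # assumed: seconds, when the camera-wearer enters
    end_time: float    # assumed: seconds, when the camera-wearer leaves
    label: str         # room category, e.g. "kitchen"
    instance: str      # room instance id, e.g. "bedroom0"


@dataclass
class NLQEntry:
    video_uid: str
    query: str
    response_start: float  # assumed: seconds
    response_end: float    # assumed: seconds
    category: str          # question type, e.g. "visit_x"


def room_at(visits: List[RoomVisit], video_uid: str, t: float) -> Optional[RoomVisit]:
    """Return the visit in `video_uid` that covers time `t`, or None."""
    for v in visits:
        # Half-open interval [start_time, end_time) is an assumption.
        if v.video_uid == video_uid and v.start_time <= t < v.end_time:
            return v
    return None


def rooms_for_query(visits: List[RoomVisit], q: NLQEntry) -> List[str]:
    """Room instances whose visit overlaps the query's response window."""
    return [
        v.instance
        for v in visits
        if v.video_uid == q.video_uid
        and v.start_time < q.response_end
        and q.response_start < v.end_time
    ]


# Made-up records in the shape described above (not real annotations).
visits = [
    RoomVisit(**d)
    for d in [
        {"video_uid": "vid_a", "start_time": 0.0, "end_time": 12.5,
         "label": "kitchen", "instance": "kitchen0"},
        {"video_uid": "vid_a", "start_time": 12.5, "end_time": 30.0,
         "label": "bedroom", "instance": "bedroom0"},
    ]
]
query = NLQEntry("vid_a", "hypothetical question text", 10.0, 14.0, "see_x_then_y")

current_room = room_at(visits, "vid_a", 5.0)
overlapping = rooms_for_query(visits, query)
```

Here the response window spans the end of the kitchen visit and the start of the bedroom visit, so both instances are returned. If the released files use a different container or time unit, only the loading step and the comparisons would need to change.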


If you find this work useful in your own research, please consider citing:
  title={EgoEnv: Human-centric environment representations from egocentric video},
  author={Nagarajan, Tushar and Ramakrishnan, Santhosh Kumar and Desai, Ruta and Hillis, James and Grauman, Kristen},