Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

ECCV, 2024


UT Austin   FAIR at Meta

TL;DR: We propose a framework, Exo2Ego, for translating exocentric views to egocentric perspectives by integrating high-level structural transformation with pixel-level hallucination.

Abstract: We investigate exocentric-to-egocentric cross-view translation, which aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective. To this end, we propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation, which explicitly encourages cross-view correspondence between exocentric and egocentric views, and a diffusion-based pixel-level hallucination, which incorporates a hand layout prior to enhance the fidelity of the generated egocentric view. To pave the way for future advancements in this field, we curate a comprehensive exo-to-ego cross-view translation benchmark. It consists of a diverse collection of synchronized ego-exo tabletop activity video pairs sourced from three public datasets: H2O, Aria Pilot, and Assembly101. The experimental results validate that Exo2Ego delivers photorealistic video results with clear hand manipulation details and outperforms several baselines in terms of both synthesis quality and generalization ability to new actions.

A 6-minute supplementary video


Problem Overview


The goal is to synthesize the corresponding ego view of an actor from an exo video recording, with minimal assumptions on the viewpoint relationships (e.g., camera parameters or accurate geometric scene structure). Specifically, we focus on synthesizing ego tabletop activities that involve significant hand-object interactions, such as assembling toys or pouring milk.



Our Framework


Framework Overview. Our Exo2Ego framework comprises two modules: (a) High-level Structure Transformation, which trains a layout translator that, given an exo frame, predicts the ego layout capturing the location and rough contours of the visual concepts; and (b) Diffusion-based Pixel Hallucination, which enhances pixel-level details on top of the predicted ego layout using a conditional diffusion model.
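
To make the two-stage design concrete, below is a minimal PyTorch-style sketch of the pipeline. The module names, network widths, and the single denoising step are illustrative assumptions rather than the paper's implementation; in particular, ConditionalDenoiser stands in for the full conditional diffusion model and omits the hand layout prior and the noise schedule.

    # Minimal sketch of the two-stage Exo2Ego pipeline described above.
    # All names and layer choices are illustrative assumptions, not the authors' code.
    import torch
    import torch.nn as nn

    class LayoutTranslator(nn.Module):
        """Stage (a): predict a rough ego layout (location/contour map) from an exo frame."""
        def __init__(self, in_ch=3, layout_ch=1):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, layout_ch, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, exo_frame):
            return self.decoder(self.encoder(exo_frame))

    class ConditionalDenoiser(nn.Module):
        """Stage (b): toy stand-in for the conditional diffusion model that
        hallucinates ego pixels conditioned on the predicted ego layout."""
        def __init__(self, img_ch=3, layout_ch=1):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(img_ch + layout_ch, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, img_ch, 3, padding=1),
            )

        def forward(self, noisy_ego, ego_layout):
            # Condition the denoiser by concatenating the layout with the noised ego frame.
            return self.net(torch.cat([noisy_ego, ego_layout], dim=1))

    def exo2ego_step(exo_frame, noisy_ego, translator, denoiser):
        """One hallucination step: exo frame -> ego layout -> layout-conditioned denoising."""
        ego_layout = translator(exo_frame)
        return denoiser(noisy_ego, ego_layout)

    if __name__ == "__main__":
        translator, denoiser = LayoutTranslator(), ConditionalDenoiser()
        exo = torch.randn(1, 3, 128, 128)    # exocentric input frame
        noisy = torch.randn(1, 3, 128, 128)  # noised ego frame inside the diffusion loop
        print(exo2ego_step(exo, noisy, translator, denoiser).shape)  # (1, 3, 128, 128)

In the full method, the denoising step would be iterated over the diffusion noise schedule and additionally conditioned on the predicted hand layout; the sketch only shows how the stage (a) output feeds stage (b).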

Results

We evaluate our model on a new cross-view synthesis benchmark sourced from three public time-synchronized multi-view datasets: H2O, Aria Pilot, and Assembly101. As shown in the figure below, our model produces realistic hands with correct poses, most noticeable in the regions highlighted with yellow circles, even for actions unseen at training time, demonstrating its ability to generalize to new actions.


BibTeX

@inproceedings{luo2024exo2ego,
  title={Put myself in your shoes: Lifting the egocentric perspective from exocentric videos},
  author={Luo, Mi and Xue, Zihui and Dimakis, Alex and Grauman, Kristen},
  booktitle={European Conference on Computer Vision},
  pages={407--425},
  year={2024},
  organization={Springer}
}