Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning

CVPR, 2025 (Oral)


1 UT Austin   2 UC Berkeley   3 Bespoke Labs

TL;DR: We use paired ego-exo video data as a Rosetta Stone to unlock large-scale unpaired ego-exo video data for view-invariant representation learning.

Abstract: Egocentric and exocentric perspectives of human action differ significantly, yet overcoming this extreme viewpoint gap is critical in augmented reality and robotics. We propose ViewpointRosetta, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. Our framework introduces (1) a diffusion-based Rosetta Stone Translator (RST), which, leveraging a moderate amount of synchronized multi-view videos, serves as a translator in feature space to decipher the alignment between unpaired ego and exo data, and (2) a dual encoder that aligns unpaired data representations through contrastive learning with RST-based synthetic feature augmentation and soft alignment. To evaluate the learned features in a standardized setting, we construct a new cross-view benchmark using Ego-Exo4D, covering cross-view retrieval, action recognition, and skill assessment tasks. Our framework demonstrates superior cross-view understanding compared to previous view-invariant learning and ego video representation learning approaches, and opens the door to bringing vast amounts of traditional third-person video to bear on the more nascent first-person setting.

A 5-minute supplementary video


Problem Overview


Paired data is ideal for ego-exo view-invariant representation learning due to its perfect synchronization, but it is costly to collect. We explore leveraging both paired and unpaired data, taking advantage of the greater scale of unpaired videos. The key question is how to discover meaningful links within unpaired data and how to effectively align ego and exo representations.



Our Framework


Framework Overview. Left: Our ViewpointRosetta model acts as a bridge to align unpaired ego and exo videos. From an ego query video, the RST generates a corresponding exo feature. This hallucinated exo feature is then concatenated with narration embeddings to retrieve the closest match among the exo candidate videos. Right: We propose soft view-invariant representation learning. Unlike traditional video-text and video-video contrastive learning, our approach (1) assigns weights to pseudo-aligned ego-exo pairs, giving higher weights to pairs with greater semantic similarity (indicated by the dashed line), and (2) treats the RST-synthesized exo feature and the anchor ego feature as a positive pair to enhance feature alignment across views (highlighted by the blue line).
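To make the soft alignment objective concrete, the following is a minimal PyTorch sketch consistent with the description above. It is not the released implementation: the InfoNCE-style form of the loss, the semantic-similarity weighting, the temperature, and the feature shapes are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def soft_view_invariant_loss(ego_feats, exo_feats, rst_exo_feats, sem_sim, tau=0.07):
    """Illustrative sketch of the soft ego-exo alignment objective (assumed form).

    ego_feats:     (B, D) ego clip embeddings from the dual encoder
    exo_feats:     (B, D) pseudo-aligned exo clip embeddings (retrieved partners)
    rst_exo_feats: (B, D) RST-synthesized exo features for the same ego clips
    sem_sim:       (B,)   semantic similarity of each pseudo pair in [0, 1],
                          used to down-weight noisy retrieved pairs
    """
    ego = F.normalize(ego_feats, dim=-1)
    exo = F.normalize(exo_feats, dim=-1)
    rst = F.normalize(rst_exo_feats, dim=-1)

    # (1) Weighted contrastive term over pseudo-aligned ego-exo pairs:
    #     diagonal entries are positives, weighted by semantic similarity.
    logits = ego @ exo.t() / tau                       # (B, B)
    targets = torch.arange(ego.size(0), device=ego.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    soft_term = (sem_sim * per_pair).mean()

    # (2) RST-augmented term: the synthesized exo feature and its anchor ego
    #     feature always form a trusted positive pair.
    logits_rst = ego @ rst.t() / tau
    rst_term = F.cross_entropy(logits_rst, targets)

    return soft_term + rst_term
```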

Rosetta Stone Translator (RST): To bridge the gap between ego and exo videos, we introduce a diffusion-based Rosetta Stone Translator. Leveraging synchronized ego and exo videos, we extract features with a frozen video encoder. The RST is trained to predict exo features from ego ones, using a denoising network to reverse the diffusion process.
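As a rough illustration of this training recipe, here is a DDPM-style sketch in PyTorch under assumed details (an MLP denoiser over frozen encoder features, an x0-prediction objective, and a 768-d feature space); it is a sketch of the general technique, not the authors' RST implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDenoiser(nn.Module):
    """Toy denoising network: predicts the clean exo feature from its noisy
    version, conditioned on the paired ego feature and the diffusion timestep."""
    def __init__(self, dim=768, hidden=1024, num_steps=1000):
        super().__init__()
        self.time_emb = nn.Embedding(num_steps, dim)
        self.net = nn.Sequential(
            nn.Linear(3 * dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, noisy_exo, ego_cond, t):
        x = torch.cat([noisy_exo, ego_cond, self.time_emb(t)], dim=-1)
        return self.net(x)

def rst_training_step(denoiser, ego_feat, exo_feat, alphas_cumprod):
    """One diffusion training step on a synchronized ego-exo feature pair.
    Both features come from a frozen video encoder; the denoiser learns to
    recover the exo feature given the ego feature as conditioning."""
    B = ego_feat.size(0)
    a_bar_all = alphas_cumprod.to(ego_feat.device)            # (T,)
    t = torch.randint(0, a_bar_all.size(0), (B,), device=ego_feat.device)
    a_bar = a_bar_all[t].unsqueeze(-1)                        # (B, 1)
    noise = torch.randn_like(exo_feat)
    noisy_exo = a_bar.sqrt() * exo_feat + (1 - a_bar).sqrt() * noise
    pred_exo = denoiser(noisy_exo, ego_feat, t)               # x0-prediction
    return F.mse_loss(pred_exo, exo_feat)
```

At inference time, such a translator would start from Gaussian noise and iteratively denoise, conditioned on an ego feature, to synthesize the corresponding exo feature.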


Results

We evaluate our view-invariant representation on three downstream cross-view understanding tasks. * denotes methods with access to the same paired and unpaired data as ours; † denotes methods trained only on paired data. ViewpointRosetta markedly outperforms all baselines, with consistent gains across all three tasks.


Qualitative results of cross-view retrieval. The baseline VI Encoder, which is trained only on paired data, tends to retrieve exo videos that are visually similar to the input but often lack semantic alignment with the depicted action. In contrast, by leveraging unpaired data, our model goes beyond surface-level visual similarity and retrieves results with meaningful action-based alignment.


BibTeX

@inproceedings{luo2025viewpoint,
      title={Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning},
      author={Luo, Mi and Xue, Zihui and Dimakis, Alex and Grauman, Kristen},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year={2025}
}