Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

NeurIPS 2023


Zihui Xue, Kristen Grauman

UT Austin · FAIR, Meta

We present AE2, a self-supervised embedding approach that learns fine-grained action representations invariant to the ego-exo viewpoint.

AE2 embeddings capture the progress of an action well and are view-invariant: they can be used to align egocentric (first-person) and exocentric (third-person) videos recorded against diverse backgrounds and from dramatically different viewpoints.




We propose a new ego-exo benchmark for fine-grained action understanding, which consists of four action-specific datasets.


(A) Break Eggs


(B) Pour Milk


(C) Pour Liquid


(D) Tennis Forehand


Qualitative Examples

All videos are from the test set. We freeze the encoder, extract frame-wise embeddings for each video, and align two videos via nearest-neighbor matching in the embedding space.
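As a rough sketch of this alignment step (not the released code), the nearest-neighbor matching can be performed directly on the frame-wise embeddings. The encoder call and variable names below are illustrative assumptions.

import numpy as np


def align_by_nearest_neighbor(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """For every frame embedding of video A (T_a, D), return the index of the
    closest frame embedding of video B (T_b, D) under L2 distance."""
    # Pairwise squared L2 distances between all frames of A and all frames of B.
    dists = ((emb_a[:, None, :] - emb_b[None, :, :]) ** 2).sum(axis=-1)
    # Nearest neighbor in B for each frame of A.
    return dists.argmin(axis=1)


# Hypothetical usage: `encoder` is the frozen AE2 encoder that maps a video to
# one D-dimensional embedding per frame.
# emb_ego = encoder(frames_ego)                        # (T_ego, D)
# emb_exo = encoder(frames_exo)                        # (T_exo, D)
# matches = align_by_nearest_neighbor(emb_ego, emb_exo)
# matches[t] is the exo frame displayed next to ego frame t.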


Ego-Exo Alignment


Original (Unaligned) Videos

Temporally Aligned Videos using AE2 Embeddings


Besides ego-exo alignment, AE2 embeddings can also be used for same-view alignment.


Ego-Ego Alignment

Original (Unaligned) Videos

Temporally Aligned Videos using AE2 Embeddings

Exo-Exo Alignment

Original (Unaligned) Videos

Temporally Aligned Videos using AE2 Embeddings

Video

A silent video designed to supplement the paper

Abstract

The egocentric and exocentric viewpoints of a human activity look dramatically different, yet invariant representations that link them are essential for many potential applications in robotics and augmented reality. Prior work is limited to learning view-invariant features from paired synchronized viewpoints. We relax that strong data assumption and propose to learn fine-grained action features that are invariant to viewpoint by aligning egocentric and exocentric videos in time, even when they are not captured simultaneously or in the same environment.

To this end, we propose AE2, a self-supervised embedding approach with two key designs: (1) an object-centric encoder that explicitly focuses on regions corresponding to hands and active objects; (2) a contrastive alignment objective that leverages temporally reversed frames as negative samples. For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context, comprising four datasets (including an ego tennis forehand dataset we collected), along with dense per-frame labels we annotated for each dataset. On the four datasets, AE2 strongly outperforms prior work on a variety of fine-grained downstream tasks, in both regular and cross-view settings.
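To make the second design concrete, here is a minimal sketch of one way an alignment objective with temporally reversed negatives could be instantiated: a differentiable (soft-min) DTW alignment cost combined with a margin that prefers the forward sequence over its reversed copy. This is an illustration under our own assumptions (function names, gamma, margin), not the paper's exact loss.

import torch


def soft_dtw_cost(emb_a: torch.Tensor, emb_b: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    """Differentiable (soft-min) DTW alignment cost between two frame-embedding
    sequences of shape (T_a, D) and (T_b, D); lower means easier to align."""
    cost = torch.cdist(emb_a, emb_b) ** 2              # (T_a, T_b) pairwise distances
    T_a, T_b = cost.shape
    R = torch.full((T_a + 1, T_b + 1), float("inf"))
    R[0, 0] = 0.0
    for i in range(1, T_a + 1):
        for j in range(1, T_b + 1):
            prev = torch.stack([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            # Soft minimum over the three DTW predecessors keeps the recursion differentiable.
            R[i, j] = cost[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, dim=0)
    return R[T_a, T_b]


def reversed_negative_loss(emb_ego: torch.Tensor, emb_exo: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Margin loss: the forward exo sequence should align with the ego sequence
    more cheaply than its temporally reversed copy (the negative sample)."""
    pos_cost = soft_dtw_cost(emb_ego, emb_exo)
    neg_cost = soft_dtw_cost(emb_ego, torch.flip(emb_exo, dims=[0]))
    return torch.clamp(pos_cost - neg_cost + margin, min=0.0)

Intuitively, minimizing this kind of loss pushes the encoder toward embeddings that reflect the temporal direction of the action, since a reversed video can only be told apart from the forward one through cues about action progress.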

BibTeX

@article{xue2023learning,
  title={Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment},
  author={Xue, Zihui and Grauman, Kristen},
  journal={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2023}
}