EgoExo-WM:
Unlocking Exo Video for Ego World Models

The University of Texas at Austin
Indicates Equal Advising
arXiv 2026

Abstract

Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space, and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.

Humans Learn by Doing

Humans Also Learn By Watching

Humans are also capable of incorporating arbitrary viewpoints of human action for learning about the world.

Existing Ego World Models Rely Solely on Ego Video, Which Has Inherent Limitations

Limitations of egocentric video for human world models
Egocentric video often occludes the camera wearer's body, and the available egocentric data is dwarfed in scale by internet video, which is largely exocentric.
References
  1. Miech, Antoine, et al. "HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
  2. Grauman, Kristen, et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  3. Grauman, Kristen, et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
  4. YouTube. "YouTube at 15: My Personal Journey and the Road Ahead." YouTube Official Blog, 14 Feb. 2020, blog.youtube/news-and-events/youtube-at-15-my-personal-journey/.
  5. Damen, Dima, et al. "Scaling Egocentric Vision: The EPIC-KITCHENS Dataset." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

EgoExo-WM - A Framework for Incorporating Exo Videos

EgoExo-WM method overview
We use human motion as the unifying bridge for leveraging exocentric videos in egocentric world modeling. Specifically, we ground exocentric-to-egocentric conversion in human kinematic priors and use the resulting human motion as the action representation. We convert ~10 hours of internet exo video in which a person is visibly performing an action into ego video, drawing on HowTo100M (Miech et al., 2019), CrossTask (Zhukov et al., 2019), and 100 Days of Hands (Shan et al., 2020).

Planning with World Model

Planning with EgoExo-WM world model
From an observation and a visual goal, a trajectory sampler proposes candidate motion sequences, and the world model ranks them to select the one whose predicted outcome best matches the goal. We highlight downstream applications in robotics and augmented reality.
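
The sample-then-rank loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `plan_by_ranking`, `toy_world_model`, `toy_embed`, and `make_toy_sampler` are hypothetical names, the "frames" are plain vectors, and scoring by L2 distance in an embedding space is an assumed choice that the page does not specify.

```python
import numpy as np


def plan_by_ranking(observation, goal, sample_trajectories, world_model, embed,
                    num_candidates=16):
    """Return the candidate motion sequence whose predicted outcome is closest to the goal.

    All callables are hypothetical stand-ins:
      sample_trajectories(observation, n) -> list of n action arrays
      world_model(observation, actions)   -> predicted final observation
      embed(frame)                        -> feature vector used for comparison
    """
    candidates = sample_trajectories(observation, num_candidates)
    goal_emb = embed(goal)
    costs = []
    for actions in candidates:
        predicted = world_model(observation, actions)
        # Assumed scoring rule: L2 distance between predicted and goal embeddings.
        costs.append(float(np.linalg.norm(embed(predicted) - goal_emb)))
    best = int(np.argmin(costs))
    return candidates[best], costs


def toy_world_model(observation, actions):
    # Toy dynamics (assumption): the final state is the start plus the summed motion.
    return observation + actions.sum(axis=0)


def toy_embed(frame):
    # Toy embedding (assumption): identity on float vectors.
    return np.asarray(frame, dtype=float)


def make_toy_sampler(seed=0, horizon=4):
    # Toy sampler (assumption): Gaussian motion sequences of shape (horizon, state_dim).
    rng = np.random.default_rng(seed)

    def sample(observation, num_candidates):
        dim = observation.shape[0]
        return [rng.normal(size=(horizon, dim)) for _ in range(num_candidates)]

    return sample


if __name__ == "__main__":
    start = np.zeros(2)
    goal = np.array([1.0, 0.0])
    best_actions, costs = plan_by_ranking(
        start, goal, make_toy_sampler(), toy_world_model, toy_embed
    )
    print("lowest cost among candidates:", min(costs))
```

In a real system the toy functions would be replaced by a learned trajectory sampler, the action-conditioned world model, and a visual encoder; the ranking logic itself stays the same.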

Qualitative Results

EgoX-Body Qualitative Results

Qualitative examples of EgoX-Body on videos from the internet datasets HowTo100M (Miech et al., 2019), CrossTask (Zhukov et al., 2019), and 100 Days of Hands (Shan et al., 2020).


EgoX-Body vs. EgoX Qualitative Samples

EgoX-Body qualitatively compared with EgoX. EgoX-Body more faithfully represents the underlying human motion.

EgoX: misaligned hand-object interaction
EgoX-Body (Ours): consistent hand-object interaction

Planning Results

Given an observation and a visual goal, EgoExo-WM selects trajectories that match the ground-truth behavior more closely than Ego-WM.

Put Cup on Shelf

Quantitative Results

Open-loop World Model Rollout Evaluation

Open-loop world model rollout evaluation table
EgoExo-WM outperforms all PEVA variants and EgoControl under the matched 200-hour training budget, achieving lower L2 embedding error across all datasets for more accurate 2-second open-loop rollouts. Gains are largest on HOMAGE and LEMMA, where EgoExo-WM reduces average L2 error by over half compared to the strongest PEVA baseline. These improvements suggest that converting diverse exocentric videos into egocentric training data broadens motion, interaction, and environment coverage beyond Nymeria.
References
  1. Grauman, Kristen, et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
  2. Rai, Nishant, et al. "Home Action Genome: Cooperative Compositional Action Understanding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
  3. Jia, Baoxiong, et al. "LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities." European Conference on Computer Vision. Cham: Springer International Publishing, 2020.
  4. Pallotta, Enrico, et al. "EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses." arXiv preprint arXiv:2511.18173 (2025).
  5. Bai, Yutong, et al. "Whole-body Conditioned Egocentric Video Prediction." Advances in Neural Information Processing Systems 38 (2026): 164375-164418.
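
The open-loop L2 embedding error reported in the rollout evaluation above could be computed roughly as in the sketch below. This is an illustration under assumptions, not the paper's evaluation code: `open_loop_rollout` and `l2_embedding_error` are hypothetical names, the rollout operates directly on embedding vectors, and averaging the per-frame L2 distance over the rollout is an assumed convention. The encoder that produces the embeddings is not specified on this page and is left out.

```python
import numpy as np


def open_loop_rollout(step_fn, first_embedding, actions):
    """Roll a model forward, feeding its own predictions back in at every step.

    step_fn(state, action) -> next state is a hypothetical one-step predictor.
    No ground-truth frames are used after the first one, which is what makes
    the rollout open-loop.
    """
    state = np.asarray(first_embedding, dtype=float)
    rollout = []
    for action in actions:
        state = step_fn(state, np.asarray(action, dtype=float))
        rollout.append(state)
    return np.stack(rollout)


def l2_embedding_error(pred_embeddings, target_embeddings):
    """Mean per-frame L2 distance between predicted and ground-truth embeddings."""
    pred = np.asarray(pred_embeddings, dtype=float)
    target = np.asarray(target_embeddings, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("predicted and target embeddings must have the same shape")
    per_frame = np.linalg.norm(pred - target, axis=-1)
    return float(per_frame.mean())


if __name__ == "__main__":
    # Toy one-step model (assumption): the action is added to the current state.
    additive_step = lambda state, action: state + action
    actions = [[1.0, 0.0], [0.0, 1.0]]
    predicted = open_loop_rollout(additive_step, np.zeros(2), actions)
    ground_truth = np.array([[1.0, 0.0], [1.0, 1.0]])
    print("L2 embedding error:", l2_embedding_error(predicted, ground_truth))
```

Lower values mean the predicted rollout stays closer to the ground-truth sequence in embedding space.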

Planning Results

Planning results table
EgoExo-WM performs best across the evaluation datasets, suggesting its predictions are more useful for selecting goal-directed motions. These gains come from training on diverse converted exocentric videos, which broaden coverage of environments, body motions, and object-centric actions beyond egocentric data alone, improving both whole-body planning and wrist motion.
References
  1. Grauman, Kristen, et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
  2. Rai, Nishant, et al. "Home Action Genome: Cooperative Compositional Action Understanding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
  3. Jia, Baoxiong, et al. "LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities." European Conference on Computer Vision. Cham: Springer International Publishing, 2020.

BibTeX

@misc{tran2026egoexowmunlockingexovideo,
      title={EgoExo-WM: Unlocking Exo Video for Ego World Models}, 
      author={Danny Tran and Roberto Martín-Martín and Kristen Grauman},
      year={2026},
      eprint={2605.15477},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.15477}, 
}