EgoExo-WM: Unlocking Exo Video for Ego World Models

Abstract

Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space, and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.

Humans Learn by Doing

Humans Also Learn By Watching

Humans are also capable of incorporating arbitrary viewpoints of human action for learning about the world.

Existing Ego World Models Rely Solely on Ego video, which has inherent limitations...

Limitations of egocentric video for human world models

Egocentric video often occludes the wearer's body, and the available egocentric data is dwarfed by internet-scale video, which is largely exocentric.

References

Miech, Antoine, et al. "HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
Grauman, Kristen, et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Grauman, Kristen, et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
YouTube. "YouTube at 15: My Personal Journey and the Road Ahead." YouTube Official Blog, 14 Feb. 2020, blog.youtube/news-and-events/youtube-at-15-my-personal-journey/.
Damen, Dima, et al. "Scaling Egocentric Vision: The EPIC-KITCHENS Dataset." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

EgoExo-WM - A Framework for Incorporating Exo Videos

We use human motion as the unifying bridge for leveraging exocentric videos in egocentric world modeling. Specifically, we ground exocentric-to-egocentric conversion in human kinematic priors and use the resulting human motion as the action representation. We convert ~10 hours of internet exo video in which a person is visibly performing an action, drawn from HowTo100M (Miech et al., 2019), CrossTask (Zhukov et al., 2019), and 100 Days of Hands (Shan et al., 2020), into ego video.

Planning with World Model

From an observation and a visual goal, a trajectory sampler proposes candidate motion sequences, and the world model ranks them to select the one whose predicted outcome best matches the goal. We highlight downstream applications in robotics and augmented reality.
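
The sample-then-rank procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the names `plan_by_ranking`, `toy_world_model`, and `toy_sampler` are hypothetical, states and goals are plain vectors rather than images, and the goal-matching score is assumed to be an L2 distance.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def plan_by_ranking(
    observation: Vector,
    goal: Vector,
    sample_trajectory: Callable[[], List[Vector]],
    world_model: Callable[[Vector, List[Vector]], Vector],
    num_candidates: int = 64,
) -> List[Vector]:
    """Return the sampled trajectory whose predicted outcome is closest to the goal."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    best_trajectory: List[Vector] = []
    best_cost = math.inf
    for _ in range(num_candidates):
        trajectory = sample_trajectory()                 # propose a candidate motion sequence
        predicted = world_model(observation, trajectory)  # imagine its outcome
        cost = l2_distance(predicted, goal)               # assumed goal-matching score
        if cost < best_cost:
            best_trajectory, best_cost = trajectory, cost
    return best_trajectory


# Hypothetical stand-ins: the "world model" just adds each action to the state,
# and the sampler draws uniform random displacements.
def toy_world_model(observation: Vector, actions: List[Vector]) -> Vector:
    state = list(observation)
    for action in actions:
        state = [s + d for s, d in zip(state, action)]
    return state


rng = random.Random(0)


def toy_sampler(horizon: int = 4, dim: int = 2) -> List[Vector]:
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(horizon)]


start, goal = [0.0, 0.0], [2.0, -1.0]
best = plan_by_ranking(start, goal, toy_sampler, toy_world_model, num_candidates=256)
final = toy_world_model(start, best)
```

In a real system the sampler would propose body-pose sequences and the score would compare predicted and goal frames, likely in some learned feature space; both choices are left open here.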

Qualitative Results

EgoX-Body Qualitative Results

Qualitative examples of EgoX-Body on videos from the internet datasets HowTo100M (Miech et al., 2019), CrossTask (Zhukov et al., 2019), and 100 Days of Hands (Shan et al., 2020).

EgoX-Body vs. EgoX Qualitative Samples

Qualitative comparison of EgoX-Body and EgoX: EgoX-Body represents the underlying human motion more faithfully.

EgoX: misaligned hand-object interaction

EgoX-Body (Ours): consistent hand-object interaction

Planning Results

Using the planning procedure described above, EgoExo-WM chooses trajectories that better match the ground-truth behavior than Ego-WM.

Put Cup on Shelf

Quantitative Results

Open-loop World Model Rollout Evaluation

Under the matched 200-hour training budget, EgoExo-WM outperforms all PEVA variants and EgoControl, achieving lower L2 embedding error on 2-second open-loop rollouts across all datasets. Gains are largest on HOMAGE and LEMMA, where EgoExo-WM reduces average L2 error by over half compared to the strongest PEVA baseline. These improvements suggest that converting diverse exocentric videos into egocentric training data broadens motion, interaction, and environment coverage beyond Nymeria.
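
To make the metric concrete, here is a minimal sketch of an open-loop rollout scored by L2 embedding error. It rests on assumptions: the page does not specify the encoder or how errors are aggregated, so the per-step averaging, the helper names `open_loop_rollout` and `mean_l2_embedding_error`, and the toy `step` and `embed` functions are all hypothetical.

```python
import math
from typing import Callable, List

Frame = List[float]


def open_loop_rollout(
    step: Callable[[Frame, float], Frame],
    first_frame: Frame,
    actions: List[float],
) -> List[Frame]:
    """Predict a sequence by feeding each prediction back in, never the ground truth."""
    frames: List[Frame] = []
    current = first_frame
    for action in actions:
        current = step(current, action)
        frames.append(current)
    return frames


def mean_l2_embedding_error(
    predicted: List[Frame],
    target: List[Frame],
    embed: Callable[[Frame], List[float]],
) -> float:
    """Average, over time steps, of the L2 distance between frame embeddings."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("predicted and target must be non-empty and equally long")
    total = 0.0
    for pred_frame, true_frame in zip(predicted, target):
        e_pred, e_true = embed(pred_frame), embed(true_frame)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(e_pred, e_true)))
    return total / len(predicted)


# Toy stand-ins: a "frame" is a short vector, the model adds the action to every
# entry, and the embedding is the identity.
def toy_step(frame: Frame, action: float) -> Frame:
    return [x + action for x in frame]


def toy_embed(frame: Frame) -> List[float]:
    return list(frame)


predicted = open_loop_rollout(toy_step, [0.0, 0.0], [1.0, 1.0])
ground_truth = [[1.0, 1.0], [2.0, 5.0]]
error = mean_l2_embedding_error(predicted, ground_truth, toy_embed)
```

Because the rollout never sees ground-truth frames after the first one, prediction errors can compound over the horizon, which is what makes the open-loop setting a demanding test.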

References

Grauman, Kristen, et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Rai, Nishant, et al. "Home Action Genome: Cooperative Compositional Action Understanding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
Jia, Baoxiong, et al. "LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities." European Conference on Computer Vision. Cham: Springer International Publishing, 2020.
Pallotta, Enrico, et al. "EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses." arXiv preprint arXiv:2511.18173 (2025).
Bai, Yutong, et al. "Whole-body Conditioned Egocentric Video Prediction." Advances in Neural Information Processing Systems 38 (2026): 164375-164418.

Planning Results

EgoExo-WM performs best across the evaluation datasets, suggesting its predictions are more useful for selecting goal-directed motions. These gains come from training on diverse converted exocentric videos, which broaden coverage of environments, body motions, and object-centric actions beyond egocentric data alone, improving both whole-body planning and wrist motion.

References

Grauman, Kristen, et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Rai, Nishant, et al. "Home Action Genome: Cooperative Compositional Action Understanding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
Jia, Baoxiong, et al. "LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities." European Conference on Computer Vision. Cham: Springer International Publishing, 2020.

BibTeX

@misc{tran2026egoexowmunlockingexovideo,
      title={EgoExo-WM: Unlocking Exo Video for Ego World Models}, 
      author={Danny Tran and Roberto Martín-Martín and Kristen Grauman},
      year={2026},
      eprint={2605.15477},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.15477}, 
}
