Modern video diffusion models can paint a frame so realistic it could fool you. But ask them to drop an apple, slide a book, or pour a glass of water, and the illusion shatters. The pixels look right; the dynamics don't.
The issue isn't a lack of data or model scale. It's the objective. Diffusion training rewards reproducing pixels, which means it spends most of its capacity on textures, lighting, and background detail—the appearance of the world—rather than the underlying physical state that decides what should happen next. The model learns to render the world, not to predict it.
Self-supervised world models like V-JEPA 2 learn the opposite way: instead of reconstructing pixels, they predict future representations from past ones. That forces them to throw away decorative noise and hold on to the things that actually evolve over time—positions, contacts, motion. In other words, they already know something about physics that diffusion models do not.
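To make the contrast concrete, here is a toy sketch of the two objectives. Everything in it is illustrative: the "frames" are vectors with a 2-dimensional dynamic state plus high-dimensional appearance noise, the "encoder" is an idealized projection that keeps only the state, and the "predictor" is the true dynamics step. The point is structural: a pixel objective is dominated by appearance noise, while a latent-prediction objective can be driven to zero because it only has to track what actually evolves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frame: a low-dim dynamic state concatenated with texture-like noise.
def make_frame(state, appearance):
    return np.concatenate([state, appearance])

# Idealized frozen encoder (assumption): keeps only the state dimensions.
def encode(frame):
    return frame[:2]

delta    = np.array([0.1, -0.2])           # one step of simple dynamics
state_t  = np.array([0.0, 1.0])
state_t1 = state_t + delta
frame_t  = make_frame(state_t,  rng.normal(size=64))
frame_t1 = make_frame(state_t1, rng.normal(size=64))

# Pixel objective: reproduce the next frame -> dominated by appearance noise.
pixel_loss = np.mean((frame_t1 - frame_t) ** 2)

# JEPA-style objective: predict the next *representation* from the current one
# (here the predictor is idealized as the exact dynamics step).
predicted_latent = encode(frame_t) + delta
latent_loss = np.mean((predicted_latent - encode(frame_t1)) ** 2)

print(f"pixel loss:  {pixel_loss:.3f}")   # large: mostly texture noise
print(f"latent loss: {latent_loss:.3f}")  # zero: dynamics fully captured
```

Real JEPA training obviously has to learn both the encoder and the predictor, but the asymmetry shown here is the same one the paragraph above describes: capacity spent on appearance buys nothing for the latent objective.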
LDO (Latent Dynamics Optimization) is a simple idea built on this observation: take a frozen predictive world model and use it to teach a pretrained video diffusion model how the world moves. No retraining from scratch, no new dataset, no new sampler at inference time.
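One plausible shape for this is an auxiliary loss during fine-tuning: encode the diffusion model's generated frames with the frozen world model, and penalize disagreement with where the frozen predictor says the latent state should go. The sketch below shows only that loss structure; the encoder, predictor, weights, and `lambda_dyn` are hypothetical stand-ins, not the method's actual API or architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # toy latent dimension

# Frozen world model (illustrative stand-ins for an encoder and a latent
# predictor, e.g. in the spirit of V-JEPA 2; names are ours, not the paper's).
W_enc  = rng.normal(size=(D, D))
W_pred = np.eye(D) + 0.01 * rng.normal(size=(D, D))

def encode(frames):         # frames -> latents (frozen)
    return frames @ W_enc.T

def predict_next(latents):  # latent dynamics (frozen)
    return latents @ W_pred.T

# Stand-ins for context frames and the diffusion model's generated future.
past_frames = rng.normal(size=(4, D))
gen_future  = rng.normal(size=(4, D))

# Auxiliary dynamics loss (sketch): generated frames should land where the
# frozen world model says the latent state ought to go next.
target_latents = predict_next(encode(past_frames))
dyn_loss = np.mean((encode(gen_future) - target_latents) ** 2)

# Fine-tuning would combine this with the usual denoising objective, e.g.:
#   total_loss = denoise_loss + lambda_dyn * dyn_loss
# where lambda_dyn is a hypothetical weighting hyperparameter.
print(f"latent dynamics loss: {dyn_loss:.3f}")
```

Because only the diffusion model's weights receive gradients, the frozen world model acts purely as a teacher, which is what lets the recipe avoid retraining from scratch or changing the sampler at inference time.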