Modern video diffusion models can paint a frame so realistic it could fool you. But ask them to drop an apple, slide a book, or pour a glass of water, and the illusion shatters. The pixels look right; the dynamics don't.
The issue isn't a lack of data or model scale. It's the objective. Diffusion training rewards reproducing pixels, which means it spends most of its capacity on textures, lighting, and background detail—the appearance of the world—rather than the underlying physical state that decides what should happen next. The model learns to render the world, not to predict it.
Self-supervised world models like V-JEPA 2 learn the opposite way: instead of reconstructing pixels, they predict future representations from past ones. That forces them to throw away decorative noise and hold on to the things that actually evolve over time—positions, contacts, motion. In other words, they already know something about physics that diffusion models do not.
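To make the contrast concrete, here is a toy sketch of the two objectives. Everything in it is illustrative: the "frames" are vectors with a 2-dimensional dynamic state plus high-dimensional appearance noise, the "encoder" is an idealized projection that keeps only the state, and the "predictor" is the true dynamics step. The point is structural: a pixel objective is dominated by appearance noise, while a latent-prediction objective can be driven to zero because it only has to track what actually evolves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frame: a low-dim dynamic state concatenated with texture-like noise.
def make_frame(state, appearance):
    return np.concatenate([state, appearance])

# Idealized frozen encoder (assumption): keeps only the state dimensions.
def encode(frame):
    return frame[:2]

delta    = np.array([0.1, -0.2])           # one step of simple dynamics
state_t  = np.array([0.0, 1.0])
state_t1 = state_t + delta
frame_t  = make_frame(state_t,  rng.normal(size=64))
frame_t1 = make_frame(state_t1, rng.normal(size=64))

# Pixel objective: reproduce the next frame -> dominated by appearance noise.
pixel_loss = np.mean((frame_t1 - frame_t) ** 2)

# JEPA-style objective: predict the next *representation* from the current one
# (here the predictor is idealized as the exact dynamics step).
predicted_latent = encode(frame_t) + delta
latent_loss = np.mean((predicted_latent - encode(frame_t1)) ** 2)

print(f"pixel loss:  {pixel_loss:.3f}")   # large: mostly texture noise
print(f"latent loss: {latent_loss:.3f}")  # zero: dynamics fully captured
```

Real JEPA training obviously has to learn both the encoder and the predictor, but the asymmetry shown here is the same one the paragraph above describes: capacity spent on appearance buys nothing for the latent objective.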
LDO (Latent Dynamics Optimization) is a simple idea built on this observation: take a frozen predictive world model and use it to teach a pretrained video diffusion model how the world moves. No retraining from scratch, no new dataset, no new sampler at inference time.
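One plausible shape for this is an auxiliary loss during fine-tuning: encode the diffusion model's generated frames with the frozen world model, and penalize disagreement with where the frozen predictor says the latent state should go. The sketch below shows only that loss structure; the encoder, predictor, weights, and `lambda_dyn` are hypothetical stand-ins, not the method's actual API or architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # toy latent dimension

# Frozen world model (illustrative stand-ins for an encoder and a latent
# predictor, e.g. in the spirit of V-JEPA 2; names are ours, not the paper's).
W_enc  = rng.normal(size=(D, D))
W_pred = np.eye(D) + 0.01 * rng.normal(size=(D, D))

def encode(frames):         # frames -> latents (frozen)
    return frames @ W_enc.T

def predict_next(latents):  # latent dynamics (frozen)
    return latents @ W_pred.T

# Stand-ins for context frames and the diffusion model's generated future.
past_frames = rng.normal(size=(4, D))
gen_future  = rng.normal(size=(4, D))

# Auxiliary dynamics loss (sketch): generated frames should land where the
# frozen world model says the latent state ought to go next.
target_latents = predict_next(encode(past_frames))
dyn_loss = np.mean((encode(gen_future) - target_latents) ** 2)

# Fine-tuning would combine this with the usual denoising objective, e.g.:
#   total_loss = denoise_loss + lambda_dyn * dyn_loss
# where lambda_dyn is a hypothetical weighting hyperparameter.
print(f"latent dynamics loss: {dyn_loss:.3f}")
```

Because only the diffusion model's weights receive gradients, the frozen world model acts purely as a teacher, which is what lets the recipe avoid retraining from scratch or changing the sampler at inference time.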