World Models Explained: The Architecture That Could Replace Transformers

Photo by Markus Spiske on Unsplash

In the span of a few months, bridging late 2025 and early 2026, the signal became hard to ignore. Yann LeCun — one of the three original deep learning pioneers, and the researcher who helped invent the convolutional neural network — left Meta after twelve years to launch AMI Labs, raising over a billion dollars at a multi-billion dollar valuation before releasing a single product. Fei-Fei Li’s World Labs shipped Marble, the first commercially available world model. Google DeepMind released Genie 3, capable of generating navigable 3D environments in real time at 24 frames per second. NVIDIA’s Cosmos platform, purpose-built as open infrastructure for world model development, surpassed two million downloads.

None of this has broken through the mainstream coverage that follows every new LLM benchmark result. And for working engineers and ML practitioners, the architecture question has largely gone unanswered: what is a world model, exactly, and how is it different from a transformer?

This post answers that question — from the core architectural difference to what’s actually shipping today versus what remains theoretical.

TL;DR

  • LLMs predict the next token in a sequence. World models predict the next state of an environment, accounting for physics, causality, and spatial relationships.
  • The architectural difference is structural: LLMs operate in token space; world models operate in latent state space with explicit dynamics models.
  • Two competing schools have emerged — generative (Genie 3, Marble: pixel/voxel prediction) and non-generative (LeCun’s JEPA: latent-space prediction). They represent different bets on what “understanding” requires.
  • What’s shipping: Marble for 3D environment generation, Genie 3 for interactive simulation, and NVIDIA Cosmos for robotics/AV training. What’s still theoretical: world models that generalize across domains, models that robustly acquire causal and physical understanding from video alone.
  • For practitioners: the immediately relevant applications are simulation environments for training embodied agents, synthetic training data generation, and physics-aware planning.

The Core Architectural Difference

To understand why world models are architecturally distinct from transformers, you need to understand what each is actually predicting.

A large language model is, at its core, a next-token predictor. Given a sequence of tokens, it estimates the probability distribution over the next token. This is an enormously powerful objective — it turns out that predicting the next word well requires encoding vast amounts of world knowledge, reasoning patterns, and linguistic structure. But the prediction target is fundamentally linguistic. The model is modeling text about the world, not the world itself.

A world model has a different prediction target: the next state of an environment. Given a representation of the current state — what’s in the scene, where objects are, what’s moving — it predicts what the environment will look like after some action is taken or some time passes. This sounds like a small difference. It isn’t.

The distinction forces three structural requirements that LLMs don’t have:

A perception module. Before reasoning about state transitions, the model needs a compact representation of the current state. For LLMs, this is handled by the embedding layer: tokens become vectors. For world models, the raw inputs are richer: images, video, depth maps, and proprioceptive signals from robots. The perception module encodes these sensory inputs into a latent representation of the environment.

A dynamics model. This is the core of a world model — a learned function that maps from the current state and an action to the predicted next state. It captures causality and temporal structure. The keyword is “learned”: the dynamics aren’t hand-coded physics rules, they’re acquired from data. This is what makes world models potentially more general than physics simulators and more grounded than LLMs.

A planning (control) module. Given a goal, the planning module uses the dynamics model to simulate future trajectories and select actions that achieve it. This is where the payoff materializes — an agent that can plan over thousands of simulated steps before taking a single real action.

┌──────────────────────────────────────────────────────────┐
│                       WORLD MODEL                        │
│                                                          │
│  Raw Input              Latent State           Next State│
│  (video/img) ──▶ [Perception] ──▶ [Dynamics Model] ──▶   │
│                        │                 ▲               │
│                        │         Action  │               │
│                        ▼                 │               │
│                [Planning Module] ────────┘               │
│                        │                                 │
│             Output: action sequence                      │
└──────────────────────────────────────────────────────────┘

vs.

┌──────────────────────────────────────────────────────────┐
│                      LANGUAGE MODEL                      │
│                                                          │
│  Token sequence ──▶ [Transformer] ──▶ P(next token)      │
│                                                          │
│  No dynamics model. No state representation.             │
│  No planning module. Just token probabilities.           │
└──────────────────────────────────────────────────────────┘
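
To make the loop concrete, here is a minimal sketch of how the three modules fit together, using simple linear layers as stand-ins for the perception and dynamics networks and a basic random-shooting planner. The dimensions, names, and planner choice are illustrative assumptions, not any particular system's design:

# Minimal sketch of the perception / dynamics / planning loop
# (placeholder networks and a toy random-shooting planner; illustrative only)
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 64, 8
perception = nn.Linear(3 * 64 * 64, STATE_DIM)            # raw pixels -> latent state
dynamics = nn.Linear(STATE_DIM + ACTION_DIM, STATE_DIM)   # (state, action) -> next state

def plan(frame, goal_latent, horizon=10, num_candidates=256):
    """Simulate candidate action sequences in latent space; return the best first action."""
    state = perception(frame.flatten())                              # perceive the scene once
    candidates = torch.randn(num_candidates, horizon, ACTION_DIM)    # sample action sequences
    states = state.expand(num_candidates, STATE_DIM)
    for t in range(horizon):                                         # imagine forward; no real actions taken
        states = dynamics(torch.cat([states, candidates[:, t]], dim=-1))
    costs = (states - goal_latent).pow(2).sum(dim=-1)                # distance to goal in latent space
    return candidates[costs.argmin(), 0]                             # act on the best imagined trajectory

action = plan(torch.rand(3, 64, 64), goal_latent=torch.randn(STATE_DIM))

The key property is that everything between perceiving the frame and choosing an action happens inside the learned latent space; the real environment is never touched during planning.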

This structural difference is why LeCun has argued so forcefully that scaling LLMs will not produce AGI. As he put it in a January 2026 interview after leaving Meta: “You cannot reach human-level intelligence by scaling up a system that fundamentally does not understand the world.” The critique isn’t that LLMs are bad at language — they’re excellent at language. The critique is that language is an impoverished representation of physical reality, and no amount of text prediction recovers the causal structure of how the world works.

Two Schools, Two Bets

The world model field has fractured into two fundamentally different camps. Both agree on the goal — AI that understands physical reality. They disagree sharply on the architecture for achieving this goal.

School 1: Generative World Models

The generative school, represented by Google DeepMind (Genie 3) and World Labs (Marble), makes a bet on richness: build world models that predict at the pixel or voxel level, generating realistic video or 3D geometry as output. The model is “understanding” the world because it can render what the world looks like under different conditions, from different viewpoints, after different actions.

The advantage: these systems produce directly usable outputs — 3D environments you can walk around in, video you can watch, assets you can export to a game engine. The commercial pathway is clear.

The limitation: pixel-level prediction is expensive. Generating a photorealistic 3D environment requires generating millions of values that correspond to individual visual details that may be irrelevant to understanding the scene. A model that can perfectly render the texture of a wooden floor hasn’t necessarily understood that floors are weight-bearing horizontal surfaces.
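
The scale of that mismatch is easy to put in rough numbers. The latent size below is an assumed figure for the kind of abstract state vector the non-generative school (described next) predicts instead:

# Rough size of the prediction target per frame (illustrative arithmetic only)
pixels_per_frame = 1280 * 720 * 3      # one 720p RGB frame: ~2.76 million values
latent_dim = 1024                      # an assumed abstract-state vector size
print(pixels_per_frame // latent_dim)  # ~2,700x more values to predict in pixel space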

Genie 3 (Google DeepMind, August 2025) is the most technically mature generative world model. It generates navigable 3D environments from text prompts at 720p and 24 frames per second, with visual consistency maintained across several minutes of real-time interaction. Crucially, it learns the physics of its environments from training data rather than hard-coded rules — if you throw an object in a Genie 3 environment, it falls because the model learned from watching things fall, not because someone encoded gravity. DeepMind pairs Genie 3 with SIMA 2, a generalist agent that trains inside Genie-generated worlds and is tested across them — providing an end-to-end pipeline from environment generation to agent training.

Marble (World Labs, November 2025) takes a different commercial angle. Rather than real-time interactive simulation, Marble generates persistent, downloadable 3D environments from text prompts, images, video, or rough spatial sketches. The environments export to Unreal Engine and Unity. Pricing runs from free tiers to $95/month for professional use. The practical innovation is the Chisel editor: instead of modeling geometry first and applying textures later (the traditional 3D pipeline), Chisel lets creators block out spatial relationships conceptually — “cozy coffee shop with large windows and a corner reading nook” — then progressively refine. Geometric consistency is maintained throughout, preventing the spatial impossibilities that plague purely generative image approaches.

School 2: Non-Generative World Models (JEPA)

LeCun’s camp makes the opposite bet: don’t predict pixels. Predict abstract representations.

The reasoning: if you try to predict the exact pixel values of the next video frame, you spend enormous model capacity predicting details — the exact color of a shadow, the precise texture of grass — that are irrelevant to physical understanding. A child who understands that a ball will roll down a ramp doesn’t need to mentally render every frame of the rolling. They’ve internalized an abstract model of how gravity and inclined surfaces interact.

JEPA (Joint Embedding Predictive Architecture) operationalizes this intuition. Instead of predicting the next frame’s pixels, JEPA predicts the next frame’s representation — a compact latent vector in an abstract space. The model learns to predict abstract structure, not visual detail.

# Conceptual JEPA training objective
# (simplified and illustrative; simple linear layers stand in for the
#  real encoder and predictor networks)
import torch.nn as nn
import torch.nn.functional as F

encoder_network = nn.Linear(1024, 256)     # e.g., a vision transformer in practice
predictor_network = nn.Linear(256, 256)    # predicts in representation space

# Encoder: maps inputs to abstract representations
def encode(x):
    return encoder_network(x)

# Predictor: maps from context representation (and optional action)
# to the predicted target representation
def predict(context_repr, action=None):
    return predictor_network(context_repr)

# JEPA loss: predict the representation of the target, not the target itself
def jepa_loss(context_frames, target_frame, action=None):
    context_repr = encode(context_frames)
    target_repr = encode(target_frame)     # what we want to predict
    predicted_repr = predict(context_repr, action)

    # Loss is in representation space, not pixel space.
    # Key: target_repr is stop-gradient (detached), which prevents the
    # trivial solution where both sides collapse to a constant.
    return F.mse_loss(predicted_repr, target_repr.detach())

The resulting models are smaller, faster to train, and in early evaluations show better downstream performance on tasks requiring physical reasoning — precisely because they’re not wasting capacity on visual minutiae.

AMI Labs’ LeJEPA extends this to video and multimodal inputs, learning from hours of unlabelled video in the same way a child learns from observing the world. Meta’s VL-JEPA (Vision-Language JEPA, released before LeCun’s departure) demonstrated that this approach outperforms larger generative models on world-understanding benchmarks that require causal and physical reasoning.

The limitation: JEPA models don’t produce the visually compelling, directly usable outputs that generative models do. “I trained a model that predicts abstract representations of physical states” is a harder sell than “I can generate a navigable 3D environment from your sketch in 30 seconds.”

NVIDIA Cosmos: The Infrastructure Layer

While AMI Labs and World Labs compete on research frontiers, NVIDIA is building the infrastructure that both will eventually depend on.

Cosmos is an open-source world foundation model platform trained on 9,000 trillion tokens drawn from 20 million hours of real-world data: driving scenarios, industrial settings, robotics operations, and human-environment interactions. It comes in three model families optimized for different applications, with the February 2026 Cosmos Predict 2.5 release adding specialized checkpoints for autonomous vehicle perception.

The practical value: world models require training data that’s expensive to collect (real-world sensor data, robotics trajectories, industrial footage). Cosmos provides a pretrained foundation that robotics and AV teams can fine-tune rather than train from scratch. Companies including Figure AI, Uber, Agility Robotics, and XPENG are already using it for synthetic training data generation — producing simulated environments for rare or dangerous scenarios that would be impractical to collect in the real world.

This is where the most immediately practical value lies: not in the research-frontier architectures, but in the ability to generate synthetic training data for embodied systems at scale.

What’s Actually Shipping vs. What’s Still Theoretical

Being honest about this distinction matters, because the world model discourse routinely conflates shipped products with research aspirations.

What’s demonstrably shipping:

  • Marble (World Labs): generating persistent, editable 3D environments from text/image/video inputs. Commercially available with pricing. Exports to standard 3D tools. Real product with real users.
  • Genie 3 (DeepMind): real-time interactive 3D environment generation from text prompts. Physics-aware. Currently in limited research preview, not public release.
  • NVIDIA Cosmos: open-source world foundation model platform. Two million downloads. In active production use for robotics and AV training.
  • VL-JEPA (Meta): vision-language model that outperforms larger generative models on physical reasoning benchmarks. Research release, not productized.

What’s still largely theoretical or undemonstrated at scale:

  • World models that generalize across domains. Current systems specialize: Marble does 3D environments, Genie 3 does interactive games, Cosmos does autonomous driving and robotics. A single world model that understands both a manufacturing floor and a surgical suite with equal competence doesn’t exist yet.
  • Robust causal reasoning from video alone. JEPA architectures improve on LLMs for physical reasoning, but the claim that they’ve learned genuine causal understanding — as opposed to sophisticated pattern matching over spatial features — is still contested. The spurious correlation problem doesn’t disappear just because your inputs are video instead of text.
  • World models enabling reliable long-horizon planning. DreamerV3 and similar systems can plan thousands of steps ahead in simulated environments. Translating this to the real world, where the dynamics model is never perfectly accurate and errors accumulate, remains an open research problem; the toy sketch after this list shows why small per-step errors matter.
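
Here is that sketch, with made-up numbers: a dynamics model that is off by only 1% per step drifts far from reality over a few hundred imagined steps.

# Toy illustration of compounding dynamics error (made-up numbers)
true_state, predicted_state = 1.0, 1.0
per_step_model_bias = 1.01             # the learned dynamics is off by 1% per step
for _ in range(500):
    true_state *= 1.00                 # ground truth: the state does not change
    predicted_state *= per_step_model_bias
print(predicted_state / true_state)    # ~145x off after 500 imagined steps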

What This Means for Practitioners

If you’re building systems today, three applications are immediately viable.

Synthetic training data for embodied agents. If you’re training robotics policies or autonomous system components, world models let you generate training scenarios you can’t safely or affordably collect in the real world. Rare edge cases, dangerous failure modes, environmental variations — all generated at scale through Cosmos or fine-tuned world models. This is the highest-ROI near-term use.
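
Even before committing to a specific platform, the combinatorial logic of scenario coverage is easy to sketch. The parameter names below are hypothetical, not any platform's actual API:

# Sketch of combinatorial scenario coverage for synthetic data generation
# (hypothetical parameter names; the actual knobs depend on the platform)
import itertools
import random

weather = ["clear", "heavy rain", "dense fog", "snow"]
lighting = ["noon", "dusk", "night"]
hazards = ["jaywalking pedestrian", "debris on road", "stalled vehicle"]

# Enumerate rare combinations that would be impractical to collect on real roads
scenarios = list(itertools.product(weather, lighting, hazards))
random.shuffle(scenarios)

for w, l, h in scenarios[:5]:
    prompt = f"{w}, {l}, {h}"    # text conditioning for a world model or simulator
    print(prompt)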

Interactive simulation environments for agent training. Genie 3’s pairing with SIMA establishes the template: generate diverse environments, train agents inside them, evaluate cross-environment generalization. For teams building agents that operate in physical spaces (warehouse robots, field inspection drones, home assistants), this pipeline collapses the data collection bottleneck.
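
The shape of that pipeline is easy to sketch. In the runnable toy below, the "generated environment" is a stand-in whose dynamics depend on the prompt, and a crude search over a single controller parameter stands in for agent training; every name is hypothetical, since neither Genie 3 nor SIMA exposes a public API:

# Toy sketch of the generate -> train -> evaluate-generalization template
# (hypothetical stand-ins throughout; no real Genie 3 / SIMA API is used)
import random

class GeneratedEnv:
    """Stand-in for an environment a world model produces from a text prompt."""
    def __init__(self, prompt):
        rng = random.Random(hash(prompt))
        self.push_scale = rng.uniform(0.5, 1.5)   # the prompt determines the "physics"

    def rollout(self, gain, steps=10):
        """Drive a proportional controller toward position 1.0; return final error."""
        pos = 0.0
        for _ in range(steps):
            pos += gain * (1.0 - pos) * self.push_scale
        return abs(1.0 - pos)

def train(prompts, candidate_gains):
    """Pick the controller gain with the lowest mean error across generated envs."""
    def mean_error(gain):
        return sum(GeneratedEnv(p).rollout(gain) for p in prompts) / len(prompts)
    return min(candidate_gains, key=mean_error)

def evaluate(gain, heldout_prompts):
    """Cross-environment generalization: error on prompts never used for training."""
    return sum(GeneratedEnv(p).rollout(gain) for p in heldout_prompts) / len(heldout_prompts)

train_prompts = ["warehouse with forklifts", "cluttered kitchen", "hospital corridor"]
best_gain = train(train_prompts, candidate_gains=[0.1, 0.3, 0.5, 0.8, 1.0])
print("held-out error:", evaluate(best_gain, ["outdoor loading dock at night"]))

A real pipeline swaps the stand-ins for a world-model environment generator and a proper learning agent, but the structure is the same: train on a distribution of generated worlds, evaluate on held-out ones.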

3D content and asset generation. Marble’s commercial tier makes world model-quality 3D generation available today for game development, VFX, and architectural visualization. If your product pipeline involves 3D assets, this is already worth evaluating.

What’s not yet viable: using world models as a drop-in replacement for LLMs in reasoning tasks. The architectural advantages LeCun argues for — better causal reasoning, physical consistency, long-horizon planning — are real in research settings but not yet packaged in a form that’s deployable for general-purpose reasoning in production systems. That gap will close over the next two to three years. It hasn’t closed yet.

Gotchas Nobody Tells You

The “understanding physics” claim is overloaded. When Genie 3 is described as “learning physics from training data,” this means it produces outputs that are visually consistent with physical laws — objects fall, liquids flow, rigid bodies don’t interpenetrate. It does not mean it has internalized Newton’s laws in a way that generalizes to novel physical configurations far outside its training distribution. The distinction matters for deployment: these models will fail in physically unusual situations in ways that are hard to predict.

JEPA’s elegant theory has a delivery problem. LeCun’s architectural arguments are compelling, and the early research results are promising. But AMI Labs has raised over a billion dollars without shipping a product. World Labs shipped a product in under two years. “Architecturally superior in theory” has a poor track record of winning against “ships things people can use.” Watch what AMI Labs actually releases, not just what it claims.

The data moat is enormous and underappreciated. World models are only as good as the physical data they’re trained on. NVIDIA Cosmos has 20 million hours of real-world sensor data. Collecting that took years of partnerships with autonomous vehicle companies, robotics labs, and industrial facilities. New entrants face a significant disadvantage — not in architecture, but in training data. For practitioners evaluating fine-tuning options, this is the most important selection criterion: what physical domain was the base model trained on?

Conclusion

The architecture debate between world models and transformers is not primarily a research question — it’s a bet on what the ceiling of LLMs actually is. If you believe, as most of Silicon Valley currently does, that scaling language models will eventually produce systems that genuinely understand physical causality, then world models are an interesting research direction. If you believe, as LeCun does, that token prediction is structurally incapable of producing causal world understanding regardless of scale, then the transformer era is approaching its limit and world models are the successor architecture.

The honest answer is that neither camp has definitive proof. LLMs keep surprising their critics. World models keep delivering on their narrow benchmarks without yet demonstrating the broad generalization their proponents claim.

What’s not in dispute: the wave of concrete releases in 2025 and 2026 — Marble, Genie 3, Cosmos, VL-JEPA — has moved world models from a research concept to a set of usable tools. The immediate applications in robotics training, synthetic data generation, and 3D content creation are real today, regardless of how the deeper architectural debate resolves.

LeCun left a twelve-year position at one of the world’s most powerful AI labs because he believes the next paradigm is here. Whether or not he’s right about the architecture, the move itself is a data point worth taking seriously.

