Beyond Tokens: How JEPA Is Quietly Teaching AI to Understand the World

Why the most exciting idea in AI right now isn’t a bigger language model — it’s an architecture that learns the way we do.

If I drop a coffee cup off the edge of this desk, a two-year-old knows what happens next. A trillion-parameter language model, left to its own devices, does not.

That gap — the one between a toddler’s intuition and a frontier AI’s best guess — is the quiet scandal at the center of modern artificial intelligence. We have models that write sonnets, pass the bar exam, and generate photorealistic videos of things that never happened. And yet, asked to anticipate the simplest physical consequence in a world they’ve never seen, they fumble.

For the last three years, a small but increasingly loud group of researchers has been arguing that this isn’t a scaling problem. It’s an architecture problem. And the most provocative answer on the table has a strange, unglamorous name: JEPA — Joint Embedding Predictive Architecture.

If you’ve been watching the AI space carefully, you may have noticed JEPA quietly threading through Meta’s research announcements, showing up in robotics demos, and accumulating variants faster than anyone can blog about them. If you haven’t, this is the primer I wish I had a year ago, when I started building JEPA models for my own PhD research.

This post isn’t about my implementation (that one’s coming). It’s about the bigger story: where AI has been, where it’s stuck, and why JEPA might be the architecture that gets us unstuck.

The Generative Detour

Let’s rewind.

For most of the 2020s, the dominant recipe for AI progress has been astonishingly simple: take a very large neural network, train it to predict the next piece of something — a token, a pixel, a waveform sample — and scale until strange emergent behaviors appear. This recipe gave us GPT, DALL·E, Sora, and the entire generative AI boom.

And it worked well enough that an entire industry convinced itself this was the road to general intelligence.

But if you spend time inside these models, you start to notice the cracks.

Large language models can write convincingly about physics without being able to reason through a novel physical scenario. Video generators can render gorgeous waterfalls, yet they still put six fingers on a hand and can't keep a glass upright on a tilted tray. Ask any of them to plan across long horizons, and they degrade into confident-sounding improvisation.

The reason isn’t that the networks are too small. It’s that predicting the next pixel or the next token is a strange objective to try to learn the world from. Most of what’s in a pixel is noise: lighting variations, camera grain, textures that don’t matter. A model straining to predict every detail is forced to burn enormous capacity on what is essentially irrelevant. And when futures are inherently uncertain — which in the real world, they almost always are — pixel-level loss averages the possibilities together and hands you a blurry mush.
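
To make that last point concrete, here is a toy illustration (a small NumPy sketch of my own, not from any paper): if a bright spot is equally likely to end up on the left or the right of the frame, the single prediction that minimizes pixel-level MSE is the average of the two futures, a ghostly half-spot in both places at once.

```python
import numpy as np

# Toy illustration: two equally plausible "futures" for a bright spot
# that might move left or right. Each future is a tiny 1D "image".
future_left  = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # spot ends up on the left
future_right = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # spot ends up on the right

# A pixel-level MSE loss rewards the single prediction closest to both
# futures on average, which is their mean, not either plausible outcome.
candidates = {
    "commit to left":  future_left,
    "commit to right": future_right,
    "average (blur)":  (future_left + future_right) / 2,
}
for name, pred in candidates.items():
    expected_mse = 0.5 * np.mean((pred - future_left) ** 2) \
                 + 0.5 * np.mean((pred - future_right) ** 2)
    print(f"{name:>16}: expected MSE = {expected_mse:.3f}")
# The "average (blur)" prediction wins under MSE, even though it shows a
# half-spot in two places at once: the blurry mush described above.
```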

Yann LeCun, Meta’s Chief AI Scientist, has been saying this for years. His 2022 position paper, A Path Towards Autonomous Machine Intelligence, sketched a different blueprint — one in which generative pixel prediction was demoted from “the answer” to “a technique we tried and should move past.” At the heart of that blueprint was a new kind of model.

LeCun’s Bet

The bet LeCun made is easy to state and hard to take seriously until you sit with it for a while:

Don’t predict the data. Predict a representation of the data.

Here’s the intuition. When you watch a leaf fall, your brain isn’t reconstructing every photon that bounces off it. It’s building a compact, abstract sense of what kind of thing is happening — a leaf, falling, roughly this fast, in roughly this direction — and it’s using that abstraction to project what comes next. The messy pixel-level details are thrown away because they don’t matter for prediction.

JEPA takes that intuition and turns it into an architecture.

Given two related pieces of input — say, one part of an image and another part of the same image, or one video segment and a future one — a JEPA model does three things:

1. It encodes the context (what you can see) into an abstract embedding.
2. It encodes the target (what's missing or upcoming) into another embedding.
3. It predicts the target embedding from the context embedding.

The loss doesn’t compare pixels to pixels. It compares meanings to meanings. The model succeeds when its internal prediction matches the internal representation of the actual target — even if both are compact, lossy summaries that discard texture and noise.
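
In code, the core move is small. The sketch below is a deliberately generic JEPA training step in PyTorch; the module names, the stop-gradient placement, and the smooth L1 loss are simplifications of mine, not the exact I-JEPA or V-JEPA recipe.

```python
import torch
import torch.nn.functional as F

def jepa_step(context, target, context_encoder, target_encoder, predictor):
    """One generic JEPA training step (simplified sketch, not a specific paper's recipe).

    context: the visible part of the input (e.g. unmasked image patches)
    target:  the hidden or upcoming part (e.g. masked patches, a future clip)
    """
    # 1. Encode the context into an abstract embedding.
    z_context = context_encoder(context)

    # 2. Encode the target into another embedding. In practice the target encoder
    #    is typically a slowly-updated (EMA) copy of the context encoder, and no
    #    gradients flow through it; that is one of the standard anti-collapse tricks.
    with torch.no_grad():
        z_target = target_encoder(target)

    # 3. Predict the target embedding from the context embedding.
    z_pred = predictor(z_context)

    # The loss compares meanings to meanings: predicted embedding vs. actual
    # target embedding. No pixels are reconstructed anywhere.
    return F.smooth_l1_loss(z_pred, z_target)
```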

This sounds like a minor engineering tweak. It isn’t. It changes what the network is incentivized to care about. Because it’s no longer punished for failing to hallucinate every texture, it can focus on object permanence, physical plausibility, temporal structure — the stuff that actually matters for acting in the world.

And critically, it solves the blurry-future problem. If two different futures are equally plausible in pixel space, a generative model will blur them together. A JEPA model, operating in a compressed embedding space where irrelevant differences have already been discarded, can commit to the part of the future that is predictable and shrug off the part that genuinely isn’t.

(Figure: a simplified illustration of the JEPA architecture. It doesn't capture every component, but it conveys the general idea.)

The Family Grows

JEPA isn’t a single model. It’s a family of architectures that share the same core move — predict in representation space — applied to different modalities and problems. Since 2023, the family has expanded fast.

I-JEPA (2023) was the opening act. Given an image, it hides several large blocks of pixels and asks the model to predict the embeddings of those blocks from the visible context. No contrastive negatives. No data augmentation tricks. No pixel-level reconstruction. Just: given the representation of what you can see, guess the representation of what you can’t. Despite its simplicity, I-JEPA matched or beat the dominant self-supervised image methods of its time while using dramatically less compute.
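
To give a feel for what "hiding several large blocks" looks like, here is a rough sketch of multi-block masking over a grid of image patches. It is my own simplified version; the actual I-JEPA mask sampler has more knobs (scale and aspect-ratio ranges, a separate context mask), but the spirit is the same.

```python
import numpy as np

def sample_block_masks(grid_size=14, num_blocks=4, block_size=5, seed=0):
    """Mark a few large rectangular blocks of patches as prediction targets.

    Returns a boolean (grid_size x grid_size) array: True = hidden target patch,
    False = visible context patch. A 14x14 grid corresponds to a 224x224 image
    split into 16x16 patches, a common ViT setup.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((grid_size, grid_size), dtype=bool)
    for _ in range(num_blocks):
        top  = rng.integers(0, grid_size - block_size + 1)
        left = rng.integers(0, grid_size - block_size + 1)
        mask[top:top + block_size, left:left + block_size] = True
    return mask

mask = sample_block_masks()
print(f"{mask.sum()} of {mask.size} patches hidden "
      f"({100 * mask.mean():.0f}% of the image to be predicted in embedding space)")
```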

V-JEPA (2024) did the same thing for video. Mask large spatio-temporal regions across multiple frames, and predict the missing region’s embeddings from what remains. Because the masks were spatial regions sustained across time, the model couldn’t cheat by copying neighboring frames — it had to actually learn how things move, interact, and evolve. V-JEPA turned out to be remarkably data-efficient and surprisingly good at fine-grained distinctions, like telling the difference between picking up an object and pretending to pick it up.

Then came the moment the wider world started to pay attention.

V-JEPA 2 (June 2025) scaled this idea to over a million hours of internet video and crossed a line that generative models had been struggling to cross: it became useful for controlling real robots, with almost no robot-specific data.

That last point deserves its own section.

The V-JEPA 2 Moment

Here is what V-JEPA 2 did, in plain terms.

First, Meta pre-trained the model on over a million hours of internet video and a million images — no labels, no actions, just passive observation. This stage is essentially a machine watching YouTube and learning, in its own compressed, abstract way, how the physical world behaves.

Then, instead of collecting a massive labeled robotics dataset (the usual, expensive route), they took a much smaller pool: around 62 hours of unlabeled robot video paired with the control commands the robot was executing at the time. That’s a lab-sized dataset, not a Silicon Valley one. With that, they post-trained an action-conditioned predictor on top of the frozen video encoder.

The result — a model called V-JEPA 2-AC — can be handed a goal in the form of an image (this object should end up here), imagine a short sequence of candidate actions in its internal representation space, score each one by how close it gets to the goal, execute the best, and re-plan. Standard model-predictive control, but with a learned world model doing the imagining.
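
Spelled out in code, that loop is short. The sketch below uses random shooting over candidate action sequences, with placeholder names and shapes of my own; the actual V-JEPA 2-AC planner is more sophisticated, but the structure of imagine, score, execute, re-plan is the same.

```python
import torch

def plan_next_action(encoder, predictor, current_obs, goal_image,
                     horizon=5, num_candidates=256, action_dim=7):
    """One step of model-predictive control with a learned world model.

    Simplified random-shooting sketch. Assumes `encoder(obs)` returns a (1, D)
    embedding and `predictor(z, a)` returns the predicted next embedding.
    """
    with torch.no_grad():
        z_now  = encoder(current_obs)          # where we are, in embedding space
        z_goal = encoder(goal_image)           # where we want to end up

        # "Imagine" many candidate action sequences...
        actions = torch.randn(num_candidates, horizon, action_dim)

        # ...roll each sequence forward entirely in representation space...
        z = z_now.expand(num_candidates, -1)
        for t in range(horizon):
            z = predictor(z, actions[:, t])

        # ...and score each imagined future by how close it lands to the goal.
        scores = -torch.norm(z - z_goal, dim=-1)
        best = scores.argmax()

    # Execute only the first action of the best sequence, then re-plan.
    return actions[best, 0]
```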

The part that made robotics Twitter sit up was the zero-shot claim: deploy it on a Franka arm in a lab whose data was never in the training set, hand it objects it has never seen, and it succeeds at pick-and-place between 65% and 80% of the time. No retraining. No task-specific reward shaping. Just a world model that already understands enough about objects and contact to improvise.

There are benchmark numbers too, for the appropriately curious — state-of-the-art on Epic-Kitchens-100 action anticipation, competitive on Something-Something v2 motion understanding, strong video question-answering results when aligned with a language model. But the benchmarks aren’t really the headline. The headline is that a video model trained mostly on passive observation produced a usable world model for physical control, and did it with orders of magnitude less robot data than the field assumed was necessary.

If you’ve been around long enough to remember when Atari, then Go, then protein folding felt like the bar, this is that kind of moment. Not an AGI threshold — but a demonstration that a different route exists, and is open.

Why This Is a Different Kind of AI

It’s tempting to read V-JEPA 2 as “another big Meta model” and move on. That would miss the shift.

Generative AI is, at its core, a content engine. It outputs tokens, images, frames. Useful, yes, sometimes astonishing — but its relationship to the world is mediated entirely by the surface statistics of its training data.

JEPA, and the broader world-model research program it sits inside, is a prediction engine. It doesn’t try to show you the world. It tries to model it — to build an internal simulator you can ask questions of, plan inside, and act from.

That distinction matters because it changes what AI is for.

A content engine is a good fit for writing, design, search, and creative assistance. A prediction engine is a good fit for robotics, autonomous systems, and agents that have to take sequential actions in environments where mistakes have consequences. Self-driving cars. Warehouse automation. Surgical planning. Any domain where “what happens if I do this?” is the question that actually matters.

And critically, a prediction engine doesn't need labels at the same scale. It can learn from the same firehose of unlabeled video and sensor data that the world already generates for free. That opens the door to a kind of AI development that doesn't depend on ever-more-expensive annotation pipelines and reinforcement-learning-from-human-feedback loops.

It’s the return, after a long generative detour, of an idea older than the transformer: that intelligence is fundamentally about building a model of the world good enough to act inside.

The Road Ahead

JEPA in 2026 is still a research program, not a finished product. The field is moving in several directions at once, and the next 24 months will probably decide which branches become load-bearing.

A few threads worth watching.

Hierarchical JEPA. Right now, most JEPA models predict at one time scale. Humans and animals predict at many — the next footstep, the next minute, the next hour. Stacking JEPAs into a hierarchy that forecasts at multiple horizons is one of the open research frontiers, and is arguably necessary for long-horizon planning in the real world.

Action-conditioned and causal variants. V-JEPA 2-AC was the first large-scale demonstration of an action-conditioned JEPA. More recent work, including a line on causal JEPA architectures that learn through object-level latent interventions, is pushing toward models that don’t just predict what happens but understand why it happens — which is the difference between pattern-matching and reasoning.

Stability and simplicity. JEPA training has historically been finicky. Representations collapse, exponential moving averages have to be tuned, and losses carry half a dozen terms. A newer line of work — LeWorldModel, from a collaboration between Maes, Le Lidec, Scieur, LeCun, and Balestriero — has shown that you can train a stable end-to-end JEPA from raw pixels with a two-term loss and about 15 million parameters on a single GPU, planning up to an order of magnitude faster than foundation-model-based world models. In other words, JEPA is in the process of becoming something a PhD student can train in an afternoon — and that's the phase of a technology where interesting things start to happen.
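
For readers who haven't fought with this in practice, the exponential moving average in question is the slow copy of the encoder that produces the prediction targets. A typical update looks like the generic sketch below (not any specific codebase); the momentum value is exactly the kind of hyperparameter that has historically needed careful tuning, and that newer work is trying to do away with.

```python
import torch

@torch.no_grad()
def update_target_encoder(context_encoder, target_encoder, momentum=0.996):
    """Classic EMA update: the target encoder slowly tracks the context encoder.

    Generic sketch of the trick many JEPA-style models rely on to avoid
    representation collapse; `momentum` (and its schedule) is one of the
    fiddly knobs the text above is referring to.
    """
    for p_ctx, p_tgt in zip(context_encoder.parameters(), target_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p_ctx, alpha=1.0 - momentum)
```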

Multimodality. The obvious next step is fusing vision, audio, proprioception, and language into a single predictive representation — a unified world model that can be queried and acted upon across modalities. Early versions of this are already landing.

Integration with language models. JEPA doesn’t replace LLMs. It sits next to them. A world model tells you what will happen; a language model tells you how to describe it, reason about it symbolically, and communicate with humans. The most interesting agents of the next decade almost certainly have both.

What to Take From This

If you’re a researcher: JEPA is where self-supervised learning is going after MAE-style masked reconstruction plateaued. (I wrote about MAE previously — and part of what got me interested in JEPA was watching, in my own experiments, how much representation capacity gets wasted on pixel-level reconstruction that the downstream task never needed.) The transition from reconstructing pixels to predicting embeddings is one of those rare shifts that’s technically modest and conceptually enormous.

If you're a builder: world models are about to do to robotics what foundation models did to NLP. The direction of travel is clear — less supervised data, more passive observation, more transferable priors. If your product has a physical-world loop in it, this is the architecture family to be watching.

If you’re neither, and you’re just here because the AI news cycle is exhausting and you want to understand what’s actually happening underneath it: the short version is that after three years of scaling generative models, the field is quietly rediscovering that generating and understanding are not the same thing. JEPA is the bet that understanding, not generation, is what the next wave of AI will be built on.

“Predicting tokens is not enough,” LeCun has been saying for half a decade now. For a long time, that sounded like a philosophical objection. It’s increasingly starting to sound like a roadmap.

I’m a PhD candidate at Hacettepe University working on JEPA-based architectures, currently focused on representation learning. My next post will be a hands-on walkthrough of my own I-JEPA implementation — the design choices, the things that broke, and what the resulting representations actually look like. If that sounds useful, follow along.


Visuals in this post were created with AI tools, for illustration and clarity.

