
VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language

Darrell Thomas

Authors: Delong Chen*, Mustafa Shukor*, Theo Moutakanni*, Willy Chung*, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung
Affiliations: Meta FAIR, HKUST, Sorbonne Université, NYU
Published: December 2025 | arXiv:2512.10942v1
Source: arxiv.org/abs/2512.10942


The Big Idea

Every paper on this site so far has been about making the transformer work better. Attention replaced recurrence. Chinchilla showed how to train it right. Mamba challenged whether you need attention at all. DeepSeek proved you can train a world-class model for a fraction of the cost if you engineer around your constraints. R1 showed that reasoning can emerge from reward signals alone.

All of that is real progress. And all of it is still trying to predict the next token.

VL-JEPA is asking a different question: what if token prediction is the wrong objective for an AI that needs to understand the physical world?

This is Yann LeCun’s central argument, and it’s worth taking seriously. An autoregressive model learns to predict what comes next in a sequence of symbols. That’s a powerful capability. But understanding a scene, anticipating what happens when you push an object off a shelf, planning a sequence of physical actions — none of these reduce to predicting the next word. The world is not a sentence.

VL-JEPA proposes that instead of generating text word by word, a model should predict meaning directly — a compact continuous representation of what something means, not a sequence of symbols that describe it. Think of the difference between pointing to a location on a map versus reading out every turn-by-turn direction. The coordinates are faster, smaller, unambiguous, and directly usable by anything that needs to act on them.

The result is a 1.6B-parameter model that outperforms CLIP, SigLIP2, and Perception Encoder on classification and retrieval, matches VLMs 10x its size on visual question answering, beats GPT-4o, Claude-3.5, and Gemini-2 on world-state prediction tasks — and does it all with 50% fewer trainable parameters than a comparable token-predicting model. The performance is compelling. But the architecture is the story. This is the direction.


Why This Changes the Question

The papers before this one were optimizing within a paradigm. This one challenges the paradigm itself.

Token-predicting models have three problems that matter once you leave the domain of language and enter the domain of action:

They’re slow. Generating text token-by-token is sequential by design. Each word depends on the previous one. For a robot tracking objects in real time, or a drone monitoring a scene, waiting for a full sentence to finish before acting is a non-starter. The world doesn’t pause while the model finishes its sentence.

They waste capacity on surface form. “The lamp turned off,” “the light went out,” and “it went dark” mean the same thing. A token-predicting model must learn all three as separate prediction targets. That’s training compute spent on linguistic paraphrase — on the packaging, not the meaning. A model that predicts in representation space maps all three to nearly the same point. It learns once.

They’re optimizing for the wrong thing. Next-token prediction rewards models for matching the statistical distribution of text. But physical understanding — what will happen, what caused what, how objects interact — is not a property of text distributions. You can get very good at predicting text and still have no model of the world behind it.

VL-JEPA bypasses all three by producing a continuous stream of meaning embeddings rather than sequences of words. The understanding is in the embedding. Text is only generated on demand, as a translation step, when a human needs to read the output. The model itself is operating in a richer space than language.


LeCun’s Argument

VL-JEPA is the experimental validation of a thesis Yann LeCun has been advancing for years: the path to real machine intelligence runs through world models, not language models.

The argument has four parts:

Autoregressive prediction is the wrong objective for understanding. LLMs learn to predict the next word. This rewards surface statistics of language. But understanding a scene, planning a motion, or reasoning about physics has nothing to do with word probabilities. Predicting in embedding space forces the model to capture meaning, not syntax.

The real world is continuous, not discrete. Tokens are an artificial discretization of language. Images, video, sensor streams, robot states — the physical world is continuous. A model that reasons in continuous embedding space is architecturally aligned with the structure of reality in a way that token models are not.

Generative models waste capacity on irrelevant detail. An autoregressive model must predict every token, including noise, punctuation, stylistic variation, and everything else that doesn’t affect meaning. JEPA models predict only the abstract representation, ignoring what doesn’t matter. More efficient. Better generalization.

Non-autoregressive prediction enables real-time operation. For robotics, autonomous vehicles, and physical agents, you need always-on perception with low-latency response. JEPA’s continuous embedding stream — with selective decoding that only fires the text generator when meaning actually changes — is a natural fit for that requirement. Autoregressive models are not.

VL-JEPA is the proof of concept that this approach extends to the multimodal domain. Previous JEPA work validated it for images (I-JEPA) and video (V-JEPA). This paper extends it to the joint vision-language space for the first time. It isn’t a research curiosity. It’s a demonstrated architecture for a very different kind of AI.


Where This Goes

The next steps from this work are not incremental improvements on VL-JEPA. They are a different class of system entirely.

Action-conditioned world models. VL-JEPA currently predicts semantic embeddings of text given vision and a query. The direct extension is predicting embeddings of future world states given vision and a proposed action: “if I move my arm here, the scene should look like this.” That’s an internal simulator. A robot with that capability doesn’t need explicit programming for every situation. It can reason about consequences in its own representation space before committing to an action. VL-JEPA’s lead performance on WorldPrediction-WM benchmarks — surpassing GPT-4o, Claude-3.5, and Gemini-2 on understanding state transitions from visual input — suggests this is within reach.

Hierarchical planning without language. Current AI systems that plan do so by generating language chains-of-thought: reason step by step, in words, toward a goal. JEPA enables planning in embedding space instead — high-level goals decomposed into sub-goals, all in continuous representation space, with no tokenization bottleneck and no language overhead. This is faster, more compact, and more directly connected to the actions that need to follow.

Unified multi-modal perception. Audio, tactile sensing, proprioception, force and torque data — all of these could map into the same embedding space with different encoders feeding a unified predictor. A physical agent wouldn’t need separate models for vision, touch, and sound. It would have one model of the world, continuously updated, expressed in a shared representational space.

Scaling the predictor. The current predictor is initialized from a modest 1B-parameter language model with 8 layers. Scaling this while keeping the embedding-space objective intact could unlock more complex reasoning in latent space — what LeCun calls “thinking in representation space” rather than “thinking in language.” The architecture permits this. Nobody has done it yet at scale.


Technical Architecture

VL-JEPA operates on input triplets: a visual input X_V (image or video frames), a textual query X_Q, and a textual target Y (the answer to predict). Four components:

X-Encoder (Vision): A frozen V-JEPA 2 ViT-L (304M parameters). Compresses visual inputs into a sequence of visual embeddings. Video is sampled at 16 frames at 256x256. Images are duplicated to match the expected input shape.

Predictor (Core): The heart of the model. Initialized from the last 8 Transformer layers of Llama-3.2-1B (490M trainable parameters). Takes visual embeddings and tokenized query as input, jointly attending across both. Outputs a predicted embedding S_hat_Y of the target text. Note that causal attention is disabled — vision and query tokens attend to each other freely.

Y-Encoder (Text Target): EmbeddingGemma-300M. Maps the textual target into a continuous embedding S_Y in the same latent space as the predictor’s output. Not involved at inference time — only used during training to generate the target embeddings the predictor learns to match.

Y-Decoder (Text Output): A lightweight module that translates predicted embeddings back into human-readable text at inference time, and only when text output is needed. It is not involved in training at all.
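The four components wire together into a simple dataflow. Below is a minimal numpy sketch of that flow at inference time; every function here is a toy stand-in (the real model uses V-JEPA 2 ViT-L, eight Llama-3.2-1B layers, EmbeddingGemma-300M, and a lightweight text decoder), and the embedding dimension is an illustrative assumption, not the paper's.

```python
import numpy as np

D = 64  # illustrative embedding dimension, not the paper's

def x_encoder(frames):
    # Frozen vision encoder stub: frames -> one visual embedding per frame.
    return np.random.randn(len(frames), D)

def predictor(vis_emb, query_emb):
    # Stand-in for the 8-layer predictor: jointly attends over visual and
    # query tokens (non-causal) and emits one predicted embedding s_hat_y.
    # Stubbed here as mean-pooling both streams and summing.
    return vis_emb.mean(axis=0) + query_emb.mean(axis=0)

def y_encoder(text):
    # Training-only: maps target text to the embedding s_y the predictor
    # must match. Never called at inference.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(D)

def y_decoder(s_hat):
    # Inference-only: translates a predicted embedding to readable text,
    # and only when a human actually needs text output.
    return "<decoded text>"

# Inference: no Y-Encoder, no token-by-token generation.
frames = [None] * 16                  # 16 sampled video frames
query_emb = np.random.randn(4, D)     # tokenized query embeddings
s_hat = predictor(x_encoder(frames), query_emb)
answer = y_decoder(s_hat)             # fired on demand, not per token
```

The key structural point the stub preserves: the Y-Encoder appears only on the training side, and the Y-Decoder only on the inference side, so the predictor itself never touches tokens.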

Training Objective: Instead of cross-entropy loss over a token vocabulary, VL-JEPA minimizes bi-directional InfoNCE loss in embedding space — align the predicted embedding to the target embedding, while pushing non-matching embeddings apart to prevent collapse. The predictor and Y-Encoder are trained jointly.

The critical insight: in token space, “the lamp turned off” and “the room went dark” are different targets. In embedding space, the Y-Encoder maps them to nearly the same point. The prediction problem becomes well-posed. The model doesn’t have to learn an infinite set of correct phrasings — just one region of meaning space.
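The objective described above can be sketched in a few lines of numpy. This is a generic bi-directional InfoNCE over a batch, not the paper's code: the temperature value and function names are assumptions, and the positives are taken to be the matching (prediction, target) rows of the batch, with all other rows serving as negatives.

```python
import numpy as np

def info_nce_bidirectional(pred, target, temperature=0.07):
    """Bi-directional InfoNCE in embedding space.

    pred:   (B, D) predictor outputs s_hat_y
    target: (B, D) Y-Encoder outputs s_y; row i is the positive for
            pred[i], every other row in the batch is a negative.
    """
    # L2-normalise so the dot product is cosine similarity.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = (pred @ target.T) / temperature  # (B, B) similarity matrix

    def ce(l):
        # Cross entropy with the correct pair on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # Symmetric pull (pred -> target and target -> pred) plus the in-batch
    # negatives is what prevents all embeddings collapsing to one point.
    return 0.5 * (ce(logits) + ce(logits.T))
```

When predictions sit exactly on their targets, the diagonal dominates each softmax and the loss is near zero; mismatched pairs are pushed apart by the negatives.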


Multi-Task Capability

One architecture, three model families collapsed into one:

  • Text generation (captioning, VQA): Predictor outputs embeddings; Y-Decoder translates to text on demand.
  • Open-vocabulary classification: Candidate labels are encoded by Y-Encoder; predicted embedding is matched to nearest label.
  • Video and text retrieval: Videos or texts are encoded; results ranked by similarity to the query embedding.

Previously these required separate model families (CLIP for classification and retrieval, generative VLMs for text). VL-JEPA handles all three in a single forward pass.
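The classification path above reduces to nearest-neighbour matching in embedding space. A minimal sketch, with hand-picked toy embeddings standing in for real Y-Encoder outputs:

```python
import numpy as np

def classify(pred_emb, label_embs, labels):
    """Open-vocabulary classification: match the predicted embedding to
    the nearest candidate label embedding by cosine similarity.

    label_embs: (N, D) embeddings of the candidate label texts
                (produced by the Y-Encoder in the real model).
    """
    pred = pred_emb / np.linalg.norm(pred_emb)
    lab = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = lab @ pred                      # cosine similarity per label
    return labels[int(np.argmax(sims))], sims

# Toy 2-D example: the predicted embedding points mostly along "cat".
labels = ["cat", "dog", "car"]
label_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
pred_emb = np.array([0.9, 0.1])
best, sims = classify(pred_emb, label_embs, labels)
# best == "cat"
```

Because the candidate set is just a list of encoded strings, swapping in a new vocabulary requires no retraining, only re-encoding the labels.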


Two-Stage Training

Stage 1: Large-Scale Pretraining

Query-free vision-language alignment on massive caption data (PLM-Image-Auto, Datacomp, YFCC-100M for images; PLM-Video-Auto, Ego4D, Action100M/HowTo100M for video). Image-only training first (100K iterations), then joint image-video training at 16 frames per input. Trained for two weeks on 24 nodes of 8 NVIDIA H200 GPUs each. Result: VL-JEPA_BASE.

Stage 2: Supervised Finetuning

25M VQA samples, 2.8M captioning samples, 1.8M classification samples plus downsampled pretraining data. 35K steps at batch size 6K (~2 days on 24 nodes). Result: VL-JEPA_SFT.

Total model size: 1.6B parameters.


Key Results

Zero-Shot Classification & Retrieval (Table 1)

| Model | Params | Avg Classification (8 datasets) | Avg Retrieval R@1 (8 datasets) |
|---|---|---|---|
| CLIP (ViT-L) | 389M | 25.3 | 29.3 |
| SigLIP2 (ViT-L) | 375M | 38.7 | 45.4 |
| PE-Core (ViT-L) | 448M | 37.3 | 44.9 |
| VL-JEPA_BASE | 1.6B | 46.4 | 58.4 |
| VL-JEPA_SFT | 1.6B | 70.7 | 59.5 |

WorldPrediction-WM (Table 3)

This is the benchmark that matters most for where this is going.

| Model | Score |
|---|---|
| GPT-4o | 52.0 |
| Claude-3.5 | 53.3 |
| Gemini-2 | 55.6 |
| VL-JEPA_BASE | 63.9 |
| VL-JEPA_SFT | 65.7 |

A 1.6B model surpassing frontier LLMs on understanding physical world-state transitions. Orders of magnitude smaller. The advantage comes not from scale but from the right objective.

Embedding vs. Token Prediction — Controlled Comparison

Matched conditions: same vision encoder, same data, same batch size, same compute.

  • VL-JEPA reaches 14.7 CIDEr after 5M samples with 0.5B trainable parameters
  • Token-predicting VLM reaches 7.1 CIDEr after 15M samples with 1B trainable parameters

VL-JEPA reaches roughly double the CIDEr score with a third of the training samples and half the trainable parameters. The problem is genuinely easier in embedding space.

Selective Decoding

On long-form video streams, selective decoding at 0.35 Hz matches uniform decoding at 1 Hz — a 2.85x reduction in decoding operations with no quality loss. The model monitors its own embedding stream and only fires the text decoder when the meaning changes. A static scene requires no output. This is the behavior you need for a physical agent operating in real time.
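The trigger logic is simple to sketch: watch the embedding stream and fire the decoder only when the current embedding has drifted far enough from the last decoded one. The cosine-distance threshold below is an illustrative assumption, not the paper's mechanism in detail.

```python
import numpy as np

def selective_decode(embedding_stream, threshold=0.2):
    """Yield the time steps at which the text decoder should fire.

    Fires when the cosine distance between the current meaning embedding
    and the last *decoded* embedding exceeds `threshold`. A static scene
    (near-constant embeddings) therefore triggers no decoding at all.
    """
    last = None
    for t, e in enumerate(embedding_stream):
        e = e / np.linalg.norm(e)
        if last is None or 1.0 - float(e @ last) > threshold:
            last = e
            yield t, e

# A static scene for five steps, then the scene changes:
stream = [np.array([1.0, 0.0])] * 5 + [np.array([0.0, 1.0])] * 5
fired = [t for t, _ in selective_decode(stream)]
# fired == [0, 5]: decode once at the start, once when meaning changes
```

The decoder's cost then scales with how often the world changes, not with how long the stream is, which is the property the 0.35 Hz result reflects.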


Limitations

  1. Not a full VLM replacement. Tasks requiring multi-step reasoning, tool use, and complex instruction following — where autoregressive generation excels — are not evaluated here. The architecture doesn’t claim to replace language models for language tasks.
  2. Scaling unexplored. Results are promising, but the scaling behavior with larger predictors and more data is an open question.
  3. Y-Encoder sensitivity. The choice of text encoder has significant impact on results. EmbeddingGemma-300M works well; visually-aligned encoders show advantages in classification and retrieval tasks.
  4. Appearance-centric benchmarks. VL-JEPA_BASE is relatively weaker on appearance-centric video tasks compared to motion-centric ones, likely due to fewer vision-language training pairs than specialist models.

Key Takeaways

  1. Token prediction is not the only path, and may not be the right one for physical AI. The entire transformer scaling story — Attention, Chinchilla, DeepSeek — optimizes a model that predicts sequences of symbols. VL-JEPA demonstrates that predicting in continuous representation space is faster to train, more efficient at inference, and produces better world-state understanding. The objective matters as much as the architecture.

  2. Predicting embeddings beats predicting tokens. Under controlled conditions, learning in embedding space produces stronger performance with 50% fewer trainable parameters, higher sample efficiency, and a sharper learning curve. Collapsing semantically equivalent but surface-different targets into nearby points in embedding space makes the problem easier. That’s not a trick — it’s the model learning meaning instead of form.

  3. Selective decoding is what real-time physical AI requires. A continuous semantic embedding stream that only triggers text generation when meaning changes isn’t just an efficiency trick. It’s the right interface for a physical agent. Always-on perception, variable-latency response, no wasted compute on static scenes. Autoregressive models cannot do this.

  4. World-state prediction is where this architecture shines. Beating GPT-4o, Claude-3.5, and Gemini-2 on WorldPrediction-WM with a 1.6B model is not a benchmark artifact. It’s evidence that the JEPA objective — predict the meaning of what happens next, not the words that describe it — is better aligned with physical reasoning than token prediction.

  5. This is the on-ramp to embodied AI. The papers on this site trace a path from the architecture that made large models possible, through the training insights that made them practical, through the efficiency engineering that made them accessible. VL-JEPA is where that path arrives at a different kind of question: not “how do we build better language models?” but “how do we build AI that can act in the world?” The answer starts here.


Summary prepared by Darrell Thomas, 2026