JEPA and Self-Supervised Vision
JEPA (Joint Embedding Predictive Architecture) is Yann LeCun’s proposed alternative to generative vision pretraining: predict in embedding space, not pixel space. Early results suggest it learns strong semantic representations.
The JEPA idea
Take an image and mask a region. A context encoder embeds the visible patches, and a predictor is trained to predict the target encoder’s embedding of the masked region, not its pixels (in I-JEPA, the target encoder is an exponential moving average of the context encoder, which helps prevent representational collapse). Because the loss lives in embedding space, the model learns abstract representations rather than reconstructing exact textures.
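The objective can be sketched in a few lines of NumPy. Everything here is a hypothetical stand-in: the linear "encoders", the dimensions, and the pooled-context predictor are toy substitutes for the Vision Transformers and mask tokens real JEPA models use. The point is only the shape of the loss: an L2 distance between predicted and actual embeddings, with pixels never reconstructed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real networks (hypothetical; JEPA uses ViTs).
D_IN, D_EMB = 16, 8
W_context = rng.normal(size=(D_IN, D_EMB))   # context encoder
W_target = W_context.copy()                  # target encoder (EMA copy in I-JEPA)
W_pred = rng.normal(size=(D_EMB, D_EMB))     # predictor head

def jepa_loss(patches, context_idx, target_idx):
    """L2 loss in embedding space: predict the target patches'
    embeddings from the context patches' embeddings."""
    ctx = patches[context_idx] @ W_context          # embed visible patches
    tgt = patches[target_idx] @ W_target            # embed masked patches (no grad in practice)
    pred = ctx.mean(axis=0, keepdims=True) @ W_pred # predict from pooled context
    return float(np.mean((pred - tgt) ** 2))

patches = rng.normal(size=(10, D_IN))  # 10 flattened patches of one image
loss = jepa_loss(patches, context_idx=[0, 1, 2, 3], target_idx=[7, 8])
```

In training, gradients flow only through the context encoder and predictor; the target encoder is updated as a moving average, which is one way JEPA avoids the trivial solution of mapping everything to the same embedding.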
Vs generative vision
Generative models (diffusion, autoregressive) predict pixels, so they spend capacity on noise and texture detail that doesn’t matter for understanding. LeCun and colleagues argue this is why generative pretraining often underperforms on downstream classification and reasoning tasks.
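A toy calculation illustrates the claim. Take two images with identical content but different texture noise: a pixel-space loss between them is dominated by the noise, while a loss computed after even a crude pooling "encoder" (a hypothetical stand-in for a learned one) mostly ignores it.

```python
import numpy as np

rng = np.random.default_rng(1)

clean = rng.normal(size=(32, 32))
noisy = clean + 0.5 * rng.normal(size=(32, 32))  # same content, different noise

# Pixel-space loss: dominated by the noise term (~0.25 here).
pixel_mse = float(np.mean((noisy - clean) ** 2))

def encode(img):
    # Crude "encoder": 8x8 average pooling (hypothetical stand-in
    # for a learned encoder that discards texture detail).
    return img.reshape(4, 8, 4, 8).mean(axis=(1, 3))

# Embedding-space loss: averaging 64 pixels per cell shrinks the
# noise contribution by ~64x, so the loss reflects content, not texture.
emb_mse = float(np.mean((encode(noisy) - encode(clean)) ** 2))
```

A model trained on `pixel_mse` must model the noise to drive the loss down; a model trained on `emb_mse` can ignore it, which is the capacity argument in miniature.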
Where this stands in 2026
I-JEPA, V-JEPA, and follow-ups have shown strong representation quality on vision benchmarks, but adoption outside Meta’s research remains limited. The field is split between generative pretraining (still dominant) and predictive pretraining (gaining adherents); which side wins out will likely become clearer by 2027.