GitHub - keon/jepa: implementing minimal versions of joint-embedding predictive architecture (JEPA)

Reddit r/ArtificialInteligence Tools

Summary

A GitHub repository providing minimal, standalone PyTorch reimplementations of JEPA family models (I-JEPA, V-JEPA, V-JEPA 2, C-JEPA) for educational purposes, including tutorials and visualization tools.

I wrote minimal implementations of the JEPA family, most under 200 lines of code, for my own understanding and education. Distilling each paper down to the essence of its algorithm and running it on a toy dataset really helps with understanding. The models are small enough to run on a Mac. I added tutorials alongside the implementations. Let me know what you think!

Cached at: 05/13/26, 12:36 AM

keon/jepa

Source: https://github.com/keon/jepa

jepa

Minimal, single-file PyTorch reimplementations of the JEPA family, with paired tutorials.

| File | Method | Dataset | LOC | Tutorial |
|------|--------|---------|-----|----------|
| ijepa.py | I-JEPA | CIFAR-10 | 160 | ijepa_tutorial.md |
| vjepa.py | V-JEPA | Moving MNIST | 188 | vjepa_tutorial.md |
| vjepa2.py | V-JEPA 2 + V-JEPA 2-AC | synthetic moving digits | 278 | vjepa2_tutorial.md |
| cjepa.py | C-JEPA | 3-digit bouncing video | 174 | cjepa_tutorial.md |

Each algorithm file is standalone: it depends only on torch and torchvision, with no shared utilities. The matching <algo>_extras.py adds visualization (mask grids, loss curves, PCA/LDA/t-SNE evolution) and a linear probe.

Quick start

git clone git@github.com:keon/jepa.git
cd jepa
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt     # pinned versions, see below

python ijepa.py                     # train I-JEPA only (no plots)
python ijepa_extras.py              # train + write all visualizations + linear probe

Runs on CUDA, MPS, or CPU. CIFAR-10 / MNIST datasets auto-download to ./data/.
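
The device fallback described above can be sketched as follows. This is a minimal illustration, not the repo's actual code: `pick_device` is a hypothetical helper name, and the CUDA, then MPS, then CPU preference order is an assumption.

```python
import torch


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple-silicon MPS, then fall back to CPU (assumed order)."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # torch.backends.mps exists in recent PyTorch builds; getattr guards older ones.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
x = torch.randn(2, 3, device=device)  # tensor is created directly on the chosen device
print(device, tuple(x.shape))
```

Models are moved the same way with `model.to(device)`, so a script written like this runs unchanged on a GPU box or a laptop.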

Reproducibility

The repo pins exact versions in requirements.txt and pyproject.toml:

python >= 3.10  (tested on 3.13.5)
torch == 2.11.0
torchvision == 0.26.0
matplotlib == 3.10.9
scikit-learn == 1.8.0   # used by ijepa_extras for t-SNE
numpy == 2.4.4
pillow == 12.2.0

Alternatively, install the repo as an editable package instead of installing the requirements directly:

pip install -e .

What’s where

.
├── ijepa.py / ijepa_extras.py           # I-JEPA on CIFAR-10
├── vjepa.py / vjepa_extras.py           # V-JEPA on Moving MNIST
├── vjepa2.py / vjepa2_extras.py         # V-JEPA 2 + V-JEPA 2-AC (synthetic)
├── cjepa.py / cjepa_extras.py           # C-JEPA on 3-digit bouncing video
├── ijepa_tutorial.md                    # walk-throughs that match the code
├── vjepa_tutorial.md
├── vjepa2_tutorial.md
├── cjepa_tutorial.md
├── papers/                              # the four source PDFs
├── samples/                             # mask grids, loss curves, PCA/LDA/t-SNE plots
└── figs/                                # paper figures referenced by tutorials

The methods, in one paragraph each

I-JEPA (Assran et al. 2023) — predict embeddings of held-out image patches from embeddings of visible patches. EMA target encoder, multi-block masking, smooth-L1 loss. The canonical self-supervised JEPA.
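
One I-JEPA-style training step can be sketched roughly as below. This is a toy under stated assumptions, not the code in ijepa.py: linear layers stand in for the ViT encoders, the context is mean-pooled instead of being fed to a transformer predictor with positional tokens, and the sizes, the block indices, and the momentum `m = 0.996` are illustrative choices.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, N, P, D = 4, 16, 48, 32  # batch, patches per image, patch dim, embed dim (toy sizes)

context_encoder = nn.Linear(P, D)                # stand-in for the ViT context encoder
target_encoder = copy.deepcopy(context_encoder)  # EMA copy; never receives gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3
)

patches = torch.randn(B, N, P)             # fake patchified images
target_idx = torch.tensor([5, 6, 9, 10])   # one held-out block (hypothetical indices)
visible = torch.ones(N, dtype=torch.bool)
visible[target_idx] = False

# Targets: embeddings of the held-out patches, produced by the EMA encoder.
with torch.no_grad():
    targets = target_encoder(patches)[:, target_idx]                   # (B, 4, D)

# Context: embeddings of visible patches only, mean-pooled for brevity.
ctx = context_encoder(patches[:, visible]).mean(dim=1, keepdim=True)   # (B, 1, D)
preds = predictor(ctx).expand(-1, len(target_idx), -1)                 # (B, 4, D)

loss = F.smooth_l1_loss(preds, targets)
opt.zero_grad()
loss.backward()
opt.step()

# EMA update: target <- m * target + (1 - m) * context
m = 0.996
with torch.no_grad():
    for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
        pt.mul_(m).add_(pc, alpha=1 - m)
```

The loss lives entirely in embedding space: no pixels are reconstructed, and the target encoder only changes through the EMA update.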

V-JEPA (Bardes et al. 2024) — same recipe, but 3D tubelet patches over video. Two mask groups (short-range + long-range tubes), L1 loss, EMA 0.998 → 1.0.
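
The two V-JEPA-specific ingredients named above, tubelet tokens and a momentum schedule ending at 1.0, might look like this in miniature. The function names `tubelets` and `ema_momentum`, the 2x8x8 tubelet size, and the linear shape of the ramp are assumptions made for illustration; vjepa.py may do it differently.

```python
import torch


def tubelets(video: torch.Tensor, t: int = 2, p: int = 8) -> torch.Tensor:
    """Split a (B, C, T, H, W) video into flattened tubelets of size t x p x p.

    Returns a tensor of shape (B, num_tubelets, C * t * p * p).
    """
    B, C, T, H, W = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    x = video.reshape(B, C, T // t, t, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)  # (B, T/t, H/p, W/p, C, t, p, p)
    return x.reshape(B, (T // t) * (H // p) * (W // p), C * t * p * p)


def ema_momentum(step: int, total_steps: int,
                 start: float = 0.998, end: float = 1.0) -> float:
    """Linear ramp from start to end over training (assumed schedule shape)."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return start + (end - start) * frac


video = torch.randn(2, 1, 8, 32, 32)  # toy clip: 8 grayscale 32x32 frames
tokens = tubelets(video)
print(tuple(tokens.shape))            # (2, 64, 128): 4*4*4 tubelets of 1*2*8*8 values
print(ema_momentum(0, 1000), ema_momentum(999, 1000))
```

Each token now covers a small spatio-temporal volume, so masking a "tube" means hiding the same spatial location across several consecutive time steps.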

V-JEPA 2 (Assran et al. 2025) — two-phase: V-JEPA pretraining followed by V-JEPA 2-AC, an action-conditioned predictor trained on frozen-encoder latents with teacher forcing + rollout. The encoder is frozen in phase 2; no EMA.
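
The difference between teacher forcing and rollout in the second phase can be sketched like this. It is an illustrative toy, not the predictor in vjepa2.py: an MLP stands in for the action-conditioned transformer, random tensors stand in for frozen-encoder latents and actions, and the L1 loss and the unweighted sum of the two terms are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, D, A = 4, 6, 32, 2  # batch, frames, latent dim, action dim (toy sizes)

predictor = nn.Sequential(nn.Linear(D + A, 64), nn.GELU(), nn.Linear(64, D))

# Frozen-encoder latents and actions; random tensors stand in for real data.
z = torch.randn(B, T, D)      # z[:, t] is the latent of frame t (carries no gradient)
a = torch.randn(B, T - 1, A)  # a[:, t] is the action taken between frames t and t+1

# Teacher forcing: every step is conditioned on the true latent z_t.
tf_pred = predictor(torch.cat([z[:, :-1], a], dim=-1))  # (B, T-1, D)
tf_loss = F.l1_loss(tf_pred, z[:, 1:])

# Rollout: start from z_0 and feed the model its own predictions back in.
state, steps = z[:, 0], []
for t in range(T - 1):
    state = predictor(torch.cat([state, a[:, t]], dim=-1))
    steps.append(state)
rollout = torch.stack(steps, dim=1)                     # (B, T-1, D)
rollout_loss = F.l1_loss(rollout, z[:, 1:])

loss = tf_loss + rollout_loss
loss.backward()  # gradients reach only the predictor; the latents are fixed inputs
```

Teacher forcing gives a clean one-step signal, while the rollout term exposes the predictor to its own accumulated errors, which is the regime it faces when used for planning.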

C-JEPA (Nam et al. 2026) — object-level trajectory masking with an identity anchor at t=0. No EMA. Bidirectional transformer over flattened slot tokens. Built on top of a pretrained object-centric encoder (VideoSAUR in the paper; we use a frozen embedding lookup as a documented stand-in).
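
Object-level trajectory masking with an identity anchor can be illustrated with a small mask builder. `trajectory_mask` is a hypothetical helper written for this sketch, and uniform random selection of the masked objects is an assumption; cjepa.py may choose them differently.

```python
import torch


def trajectory_mask(T: int, num_slots: int, num_masked: int,
                    generator=None) -> torch.Tensor:
    """Return a (T, num_slots) bool mask where True marks a hidden token.

    Whole object trajectories are hidden, except at t=0, which stays visible
    as the identity anchor telling the model which object to predict.
    """
    masked_slots = torch.randperm(num_slots, generator=generator)[:num_masked]
    mask = torch.zeros(T, num_slots, dtype=torch.bool)
    mask[1:, masked_slots] = True  # hide the chosen objects for every t >= 1
    return mask


g = torch.Generator().manual_seed(0)
mask = trajectory_mask(T=5, num_slots=3, num_masked=1, generator=g)
print(mask.int())
flat = mask.flatten()  # lines up with a flattened sequence of T * num_slots slot tokens
```

Because the mask follows an object through time instead of covering a spatial block, the model has to infer a hidden object's motion from the other objects and from its own state at t=0.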

Caveats

These are educational reimplementations:

  • ViT-tiny, not ViT-Huge. CIFAR-10 / Moving MNIST / synthetic videos, not ImageNet / Kinetics.
  • I-JEPA hits ~52.7% linear probe on CIFAR-10 after 100 epochs. The paper’s numbers come from ViT-H/14 on ImageNet for 300 epochs — different planet of compute.
  • C-JEPA skips slot discovery (uses oracle positions). Real C-JEPA requires VideoSAUR pretraining (~100k steps) on top of frozen DINOv2 features.
  • V-JEPA 2-AC’s action-conditioning gap stays small in our toy setup because the data is too easy; the machinery is correct, but the signal needs richer data to show up.

Each tutorial discloses the specific deviations from its source paper.
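
For readers unfamiliar with the linear-probe number quoted above, the evaluation protocol can be sketched as follows. This is a generic illustration, not the probe in ijepa_extras.py: a single linear layer stands in for the pretrained encoder, random tensors stand in for CIFAR-10, and the optimizer settings are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for a pretrained encoder; frozen so that only the probe trains.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
encoder.requires_grad_(False).eval()

images = torch.randn(256, 3, 32, 32)   # random stand-ins for CIFAR-10 images
labels = torch.randint(0, 10, (256,))  # random stand-ins for the 10 class labels

with torch.no_grad():
    feats = encoder(images)            # features are computed once and reused

probe = nn.Linear(feats.shape[1], 10)  # the linear probe itself
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(50):
    loss = F.cross_entropy(probe(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

acc = (probe(feats).argmax(dim=1) == labels).float().mean().item()
print(f"train accuracy of the probe: {acc:.3f}")
```

On random data this number is meaningless; a real probe would fit on the training split and report accuracy on a held-out split. Because the encoder never updates, the score reflects how linearly separable the pretrained features are.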

License

MIT.
