@AlexiGlad: Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling But representation…

X AI KOLs Following 06/16/26, 04:41 PM Papers

representation-learning self-supervised-learning temporal-difference vision causality scaling

Summary

Introduces Temporal Difference in Vision (TDV), a new paradigm for representation learning that relies solely on causality, eliminating the need for augmentations, masking, or cropping, and matches state-of-the-art methods like DINO and iBOT on dense spatial tasks.


Cached at: 06/16/26, 09:40 PM

Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling

But representation learning has relied on strong assumptions like augmentations, masking, cropping, etc… until now!

Introducing Temporal Difference in Vision (TDV), a new paradigm for representation learning built on a single assumption: causality

TL;DR:

- We introduce TDV, the first approach to learn good representations without any augmentations, masking, cropping, or pixel-based reconstruction
- TDV matches SOTA recipes like DINO and iBOT on dense spatial tasks
- We show that as data scales, weaker assumptions work better

Thread:

[1/4] Why move away from assumptions?

Today’s self-supervised methods lean heavily on strong assumptions such as augmentations, masking, cropping, etc…

But as compute and data scale, the methods that assume the least have historically won out

We test this directly: high masking helps when data is scarce, but as data grows, lighter masking (a weaker assumption) wins!

[2/4] So what assumptions should we make that aren’t restrictive but still allow for learning?

Our answer: Causality! The simple idea that the future is predictable from the past

Unlike a vision-specific assumption like “augmented views should look the same,” causality holds across temporal data

[3/4] This points us toward learning representations from video rather than from still images, as video has a temporal dimension!

Using causality, we develop a simple objective: the current frame’s representation, plus the encoded motion, should equal the next frame’s representation.

By analogy to Temporal Difference in RL, we call it Temporal Difference in Vision (TDV).
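
To make the objective above concrete, here is a toy sketch in plain Python. It is not the TDV implementation: the thread does not specify the encoders, how motion is encoded, or the loss, so the linear encoders, the frame-difference input to the motion encoder, the squared-error loss, and all names (`linear`, `tdv_style_loss`, `example_loss`) are assumptions made purely for illustration.

```python
# Toy sketch of a "representation + encoded motion = next representation" objective.
# Everything here (linear encoders, frame-difference motion input, squared error)
# is an illustrative assumption, not the TDV implementation.
import random


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def tdv_style_loss(z_t, motion, z_next):
    """Mean squared error between (z_t + motion) and z_next."""
    predicted = [a + b for a, b in zip(z_t, motion)]
    return sum((p - q) ** 2 for p, q in zip(predicted, z_next)) / len(z_next)


def example_loss(seed=0, n_pixels=6, dim=3):
    """Compute the loss for one random pair of toy 'frames'."""
    rng = random.Random(seed)
    # Hypothetical encoders: random linear maps from pixels to features.
    frame_encoder = [[rng.gauss(0, 1) for _ in range(n_pixels)] for _ in range(dim)]
    motion_encoder = [[rng.gauss(0, 1) for _ in range(n_pixels)] for _ in range(dim)]
    frame_t = [rng.random() for _ in range(n_pixels)]
    frame_next = [rng.random() for _ in range(n_pixels)]
    # Assumption: motion is encoded from the pixel difference of the two frames.
    diff = [b - a for a, b in zip(frame_t, frame_next)]
    z_t = linear(frame_encoder, frame_t)
    z_next = linear(frame_encoder, frame_next)
    motion = linear(motion_encoder, diff)
    return tdv_style_loss(z_t, motion, z_next)


if __name__ == "__main__":
    print(example_loss())
```

In a real setup this loss would be minimized over the encoder parameters. Note that this naive version has a trivial solution (constant representations with zero encoded motion give zero loss); the thread does not say how TDV deals with that, so the paper is the place to look for the actual recipe.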

[4/4] This is continued in Ninad’s thread! https://x.com/ninaddaithankar/status/2066898901106397304?s=20…

Huge thanks to all collaborators @ninaddaithankar @ylecun @hengjinlp

More information is on the website: https://temporal-difference-vision.github.io https://huggingface.co/papers/2606.15956…

P.S. This project was insanely difficult implementation-wise—pushing on an entirely new paradigm for representation learning is not easy!

We therefore see TDV as laying the foundation for future representation learning approaches that aren’t reliant on strong assumptions

Huge shoutout to @ninaddaithankar for being able to push through these challenges


