@AlexiGlad: Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling But representation…
Summary
Introduces Temporal Difference in Vision (TDV), a new paradigm for representation learning that relies solely on causality, eliminating the need for augmentations, masking, or cropping, and matches state-of-the-art methods like DINO and iBOT on dense spatial tasks.
Cached at: 06/16/26, 09:40 PM
Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling
But representation learning has relied on strong assumptions like augmentations, masking, cropping, etc… until now!
Introducing Temporal Difference in Vision (TDV), a new paradigm for representation learning built on a single assumption: causality
TL;DR:
- We introduce TDV, the first approach to learn good representations without any augmentations, masking, cropping, or pixel-based reconstruction
- TDV matches SOTA recipes like DINO and iBOT on dense spatial tasks
- We show that as data scales, weaker assumptions work better
Thread:
[1/4] Why move away from assumptions?
Today’s self-supervised methods lean heavily on strong assumptions such as augmentations, masking, cropping, etc…
But as compute and data scale, the methods that assume the least have historically won out
We test this directly: heavy masking helps when data is scarce, but as data grows, lighter masking (a weaker assumption) wins!
[2/4] So what assumptions should we make that aren’t restrictive but still allow for learning?
Our answer: Causality! The simple idea that the future is predictable from the past
Unlike a vision-specific assumption like “augmented views should look the same,” causality holds across temporal data
[3/4] This points us toward learning representations from video rather than from still images, since video has a temporal dimension!
Using causality, we develop a simple objective: the current frame’s representation, plus the encoded motion, should equal the next frame’s representation.
By analogy to Temporal Difference in RL, we call it Temporal Difference in Vision (TDV).
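A minimal sketch of the objective described in [3/4], assuming a toy linear encoder and a motion encoder that only sees the pixel difference between two frames. The names `encode`, `encode_motion`, and `tdv_loss`, and the squared-error form of the loss, are illustrative assumptions and not the paper's implementation.

```python
def encode(frame, weights):
    """Toy linear encoder (assumption): flat frame -> representation vector."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def encode_motion(frame_t, frame_next, motion_weights):
    """Toy motion encoder (assumption): linear map of the pixel difference."""
    diff = [b - a for a, b in zip(frame_t, frame_next)]
    return [sum(w * d for w, d in zip(row, diff)) for row in motion_weights]


def tdv_loss(frame_t, frame_next, weights, motion_weights):
    """Squared error between (z_t + encoded motion) and z_{t+1}."""
    z_t = encode(frame_t, weights)
    z_next = encode(frame_next, weights)
    motion = encode_motion(frame_t, frame_next, motion_weights)
    return sum((a + m - b) ** 2 for a, m, b in zip(z_t, motion, z_next))


# Two tiny "frames" of three pixels each, and a 2x3 encoder matrix.
frame_t = [0.0, 1.0, 2.0]
frame_next = [1.0, 1.0, 3.0]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
no_motion = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_with_motion = tdv_loss(frame_t, frame_next, W, W)
loss_without_motion = tdv_loss(frame_t, frame_next, W, no_motion)
```

In this linear toy, reusing the encoder weights as the motion weights satisfies the objective exactly, while dropping the motion term leaves a nonzero error. The sketch only evaluates the loss; it does not cover training or how trivial solutions are avoided, which the thread does not describe.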
[4/4] This is continued in Ninad’s thread! https://x.com/ninaddaithankar/status/2066898901106397304?s=20…
Huge thanks to all collaborators @ninaddaithankar @ylecun @hengjinlp
More information is on the website: https://temporal-difference-vision.github.io https://huggingface.co/papers/2606.15956…
P.S. This project was insanely difficult implementation-wise. Pushing on an entirely new paradigm for representation learning is not easy!
We therefore see TDV as laying the foundation for future representation learning approaches that aren’t reliant on strong assumptions
Huge shoutout to @ninaddaithankar for being able to push through these challenges