@AlexiGlad: Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling But representation…


Summary

Introduces Temporal Difference in Vision (TDV), a new paradigm for representation learning that relies solely on causality, eliminating the need for augmentations, masking, or cropping, and matches state-of-the-art methods like DINO and iBOT on dense spatial tasks.


Cached at: 06/16/26, 09:40 PM

Progress in AI is driven by approaches that make weaker assumptions, which allows for better scaling

But representation learning has relied on strong assumptions like augmentations, masking, cropping, etc… until now!

Introducing Temporal Difference in Vision (TDV), a new paradigm for representation learning built on a single assumption: causality

TL;DR:

  • We introduce TDV, the first approach to learn good representations without any augmentations, masking, cropping, or pixel-based reconstruction
  • TDV matches SOTA recipes like DINO and iBOT on dense spatial tasks
  • We show that as data scales, weaker assumptions work better

Thread:

[1/4] Why move away from assumptions?

Today’s self-supervised methods lean heavily on strong assumptions such as augmentations, masking, cropping, etc…

But as compute and data scale, the methods that assume the least have historically won out

We test this directly: high masking helps when data is scarce, but as data grows, lighter masking (a weaker assumption) wins!

[2/4] So what assumptions should we make, that aren’t restrictive but still allow for learning?

Our answer: Causality! The simple idea that the future is predictable from the past

Unlike a vision-specific assumption like “augmented views should look the same,” causality holds across temporal data

[3/4] This points us to learn representations from video, rather than from still images, as video has a temporal dimension!

Using causality, we develop a simple objective: the current frame’s representation, plus the encoded motion, should equal the next frame’s representation.

By analogy to Temporal Difference in RL, we call it Temporal Difference in Vision (TDV).
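
The objective described above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the TDV implementation: the linear encoders, the squared-error loss, the flattened-pixel frames, and every name (`linear_map`, `tdv_style_loss`, `enc_weights`, `motion_weights`) are hypothetical. The thread does not specify the architecture, the loss form, or how trivial solutions are avoided, so none of that is modeled here.

```python
import random

def linear_map(vector, weights):
    """Apply a linear map given as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def tdv_style_loss(frame_t, frame_next, enc_weights, motion_weights):
    """Mean squared error between (z_t + motion) and z_next.

    Assumptions: frames are flat lists of pixel values, both encoders are
    linear, and the motion encoder sees the two frames concatenated.
    """
    z_t = linear_map(frame_t, enc_weights)
    z_next = linear_map(frame_next, enc_weights)
    motion = linear_map(frame_t + frame_next, motion_weights)  # list concatenation
    errors = [(a + m - b) ** 2 for a, m, b in zip(z_t, motion, z_next)]
    return sum(errors) / len(errors)

random.seed(0)
n_pixels, dim = 16, 8

# Two consecutive "frames" of random toy data.
frame_t = [random.gauss(0.0, 1.0) for _ in range(n_pixels)]
frame_next = [random.gauss(0.0, 1.0) for _ in range(n_pixels)]

# Randomly initialised (untrained) hypothetical encoders.
enc_weights = [[random.gauss(0.0, 1.0) for _ in range(n_pixels)] for _ in range(dim)]
motion_weights = [[random.gauss(0.0, 1.0) for _ in range(2 * n_pixels)] for _ in range(dim)]

loss = tdv_style_loss(frame_t, frame_next, enc_weights, motion_weights)

# A motion encoder that outputs exactly z_next - z_t satisfies the objective.
exact_motion_weights = [[-w for w in row] + list(row) for row in enc_weights]
exact_loss = tdv_style_loss(frame_t, frame_next, enc_weights, exact_motion_weights)

print(f"random motion encoder loss: {loss:.4f}")
print(f"exact motion encoder loss:  {exact_loss:.2e}")
```

In this toy setup the loss is driven to zero, up to floating-point rounding, when the motion encoder outputs exactly the change in representation between the two frames, which is the relation the objective asks for.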

[4/4] This is continued in Ninad’s thread! https://x.com/ninaddaithankar/status/2066898901106397304?s=20…

Huge thanks to all collaborators @ninaddaithankar @ylecun @hengjinlp

More information is on the website: https://temporal-difference-vision.github.io https://huggingface.co/papers/2606.15956…

P.S. This project was insanely difficult implementation-wise—pushing on an entirely new paradigm for representation learning is not easy!

We therefore see TDV as laying the foundation for future representation learning approaches that aren’t reliant on strong assumptions

Huge shoutout to @ninaddaithankar for being able to push through these challenges

