@waterloo_intern: After reading up a bit on ML research post transformer era, I was upset that it seems to have converged on hyper-optimi…

Summary

This tweet discusses the convergence of ML research on attention-based, matmul-optimized algorithms due to hardware constraints, drawing on the 'hardware lottery' concept and noting OpenAI's 9-month chip tape-out as a potential sign of hardware-research co-design.


Cached at: 06/29/26, 10:32 PM

After reading up a bit on ML research in the post-transformer era, I was upset that it seems to have converged on hyper-optimizing matmul-based algorithms: (MHA, MQA, MLA, SWA, DSA, GQA, SWA-GQA, ABCDA [only one of these is made up]).
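To make the "same algorithm, different acronym" point concrete, here is a minimal pure-Python sketch, not taken from the thread, of three of those variants. It assumes the standard definitions: MHA gives every query head its own key/value head, MQA shares one key/value head across all query heads, and GQA sits in between. Learned projections, masking, and batching are left out, and the function names are made up for illustration.

```python
import math

def matmul(a, b):
    # Plain row-by-column matrix product on nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax_rows(m):
    # Numerically stable softmax applied to each row.
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(q, k, v):
    # Scaled dot-product attention: two matmuls around a row-wise softmax.
    d = len(q[0])
    scores = matmul(q, transpose(k))                    # matmul 1: Q K^T
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)              # matmul 2: weights V

def grouped_query_attention(q_heads, k_heads, v_heads):
    # len(k_heads) == len(q_heads) -> MHA
    # len(k_heads) == 1            -> MQA
    # anything in between          -> GQA
    n_q, n_kv = len(q_heads), len(k_heads)
    assert n_q % n_kv == 0, "query heads must split evenly across KV heads"
    group = n_q // n_kv
    return [
        attention_head(q, k_heads[i // group], v_heads[i // group])
        for i, q in enumerate(q_heads)
    ]
```

In this sketch the only thing that changes between the three variants is how many key/value heads are passed in; the compute in each head is the same pair of matmuls.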

Surely, an algorithm that is not attention-based is sitting there waiting to be discovered.

“the researchers are just being lazy”

but this is a stupid conclusion.

How can you blame researchers when the hardware they train on is optimized for matmuls (tensor cores / systolic arrays)? Any algorithm that is not a matmul is literally bound to die, even if it’s twice as good as attention. Add compute constraints, and you have to be crazy to research any direction that is not attention-based (basically @sarahookr’s hardware lottery essay).
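A back-of-the-envelope sketch of why "twice as good" may not be enough. Every number below is a hypothetical placeholder, not a measurement of any real chip or model; the only assumption that matters is that matmul units offer far higher peak throughput than the general-purpose path a non-matmul algorithm would have to use.

```python
def wall_clock_seconds(flops, peak_flops_per_s, utilization):
    # Time to run a workload at a given fraction of peak throughput.
    return flops / (peak_flops_per_s * utilization)

# Hypothetical workload: the alternative needs half the FLOPs of attention.
attention_flops = 1.0e15
alternative_flops = attention_flops / 2

# Hypothetical hardware: matmul units are 15x faster than the general path.
matmul_peak = 300e12     # FLOP/s available to matmul-shaped work (assumed)
general_peak = 20e12     # FLOP/s available to everything else (assumed)
utilization = 0.5        # same achieved fraction of peak for both (assumed)

t_attention = wall_clock_seconds(attention_flops, matmul_peak, utilization)
t_alternative = wall_clock_seconds(alternative_flops, general_peak, utilization)

# How many times slower the "better" algorithm runs on this hardware.
slowdown = t_alternative / t_attention

# FLOP budget below which the alternative would actually win on wall clock.
break_even_flops = attention_flops * general_peak / matmul_peak
```

Under these assumed numbers the algorithm with half the FLOPs still runs about 7.5x slower, and it would need roughly 15x fewer FLOPs than attention just to break even.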

We talk about hardware-software co-design in inference, but it seems that, to get to the next leap in research, we’ll need hardware-research co-design. At first, it seems this will never happen, given typical multi-year hardware tape-out constraints.

But then you look at @OpenAI. 9-month tape-out. Better “training” and serving.

Why fab your own chip if it’s just going to be systolic-array based? Why not just buy Nvidia?

“But Nvidia GPUs are scarce”

Then buy TPUs/AMD/Qualcomm/Cerebras. Sure, the software is not that good, but if you’re OpenAI, you can hire an army of engineers to unlock the full capability.

Either they moved away from attention and have a new algorithm that they needed their own chips to train (unlikely, given that a 9-month tape-out with a TPU vendor implies reusing IP)… or research is dead and we’re never escaping attention / matmul-based algorithms.
