@waterloo_intern: After reading up a bit on ML research post transformer era, I was upset that it seems to have converged on hyper-optimi…

X AI KOLs Timeline 06/29/26, 03:35 AM News

ml-research transformers attention-mechanism hardware-software-co-design matmul openai custom-chip research-direction

Summary

This tweet discusses the convergence of ML research on attention-based, matmul-optimized algorithms due to hardware constraints, drawing on the 'hardware lottery' concept and noting OpenAI's 9-month chip tape-out as a potential sign of hardware-research co-design.

After reading up a bit on ML research in the post-transformer era, I was upset that it seems to have converged on hyper-optimizing matmul-based algorithms (MHA, MQA, MLA, SWA, DSA, GQA, SWA-GQA, ABCDA [only one of these is made up]). Surely, an algorithm that is not attention-based is sitting there waiting to be discovered.

> the researchers are just being lazy

But this is a stupid conclusion.

How can you blame researchers when the hardware they train on is optimized for matmuls (tensor cores / systolic arrays)? Any algorithm that is not a matmul is bound to die, even if it's twice as good as attention. Add compute constraints, and you have to be crazy to research any direction that is not attention-based (basically @sarahookr's hardware lottery essay).

We talk about hardware-software co-design in inference, but it seems that, to get to the next leap in research, we'll need hardware-research co-design. At first, it seems this will never happen, given typical multi-year hardware tape-out constraints.

But then you look at @OpenAI. 9-month tape-out. Better "training" and serving.

Why fab your own chip if it's just going to be systolic-array based? Why not just buy Nvidia?

> "But Nvidia GPUs are scarce"

Then buy TPUs/AMD/Qualcomm/Cerebras. Sure, the software is not that good, but if you're OpenAI, you can hire an army of engineers to unlock the full capability.

Either they moved away from attention and have a new algorithm that needed their own chips to train on (unlikely, given that a 9-month tape-out with a TPU vendor implies reusing IP)... or research is dead and we're never escaping attention / matmul-based algorithms.
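To make the "matmul-based" point concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is not taken from the tweet: the function names (`matmul`, `softmax_rows`, `attention`, `matmul_flops`) are hypothetical, and the FLOP count assumes the usual convention of 2 FLOPs per multiply-add. It only illustrates that the two heavy steps are matrix products, which is the shape of work that tensor cores and systolic arrays are built for.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose K to (d x n)
    scores = matmul(q, k_t)                       # matmul 1: (n x d) @ (d x n)
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)        # matmul 2: (n x n) @ (n x d)

def matmul_flops(n, d):
    """FLOPs spent in the two matmuls for n tokens and head dimension d
    (assumption: 2 FLOPs per multiply-add; softmax and scaling are ignored)."""
    return 2 * (2 * n * n * d)

if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(q, k, v))
    print(matmul_flops(n=4, d=8))
```

Under these assumptions, variants such as MQA or GQA mostly change how `k` and `v` are shared or windowed across heads; the core of the computation remains the same pair of matrix products.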

Original Article


Cached at: 06/29/26, 10:32 PM


Similar Articles

@Phoenixyin13: I think this is a top-notch work in ICML 2026. The attention mechanism of traditional Transformers is essentially point-to-point matching: it cuts input into a bunch of tokens (discrete points), computes similarity between Query and Key, and then weights the Value. In NLP...

@yoonholeee: https://x.com/yoonholeee/status/2064027464926716154

@cHHillee: In modern ML accelerators, FLOPS have absolutely exploded. Often though, the bottleneck is not FLOPS but memory bandwid…

The memory wall gets expensive: KV cache is why you should stop worshiping softmax attention

@SemiAnalysis_: Transformer’s Attention mechanism has come a long way. We’d like to thank the researchers and the engineers in the open…
