@waterloo_intern: After reading up a bit on ML research post transformer era, I was upset that it seems to have converged on hyper-optimi…
Summary
This tweet discusses the convergence of ML research on attention-based, matmul-optimized algorithms due to hardware constraints, drawing on the 'hardware lottery' concept and noting OpenAI's 9-month chip tape-out as a potential sign of hardware-research co-design.
Cached at: 06/29/26, 10:32 PM
After reading up a bit on ML research in the post-transformer era, I was upset that it seems to have converged on hyper-optimizing matmul-based algorithms (MHA, MQA, MLA, SWA, DSA, GQA, SWA-GQA, ABCDA; only one of these is made up).
Surely, an algorithm that is not Attention based is sitting there waiting to be discovered.
My first thought: the researchers are just being lazy.
But this is a stupid conclusion.
How can you blame researchers when the hardware they train on is optimized for matmuls (tensor cores / systolic arrays)? Any algorithm that is not a matmul is bound to die, even if it’s twice as good as attention. Add compute constraints, and you have to be crazy to research any direction that is not attention-based (basically @sarahookr’s hardware lottery essay).
We talk about hardware-software co-design in inference, but it seems that, to get to the next leap in research, we’ll need hardware-research co-design. At first, it seems this will never happen, given typical multi-year hardware tape-out constraints.
But then you look at @OpenAI. 9-month tape-out. Better “training” and serving.
Why fab your own chip if it’s just going to be systolic-array based? Why not just buy Nvidia?
“But Nvidia GPUs are scarce”
Then buy TPUs/AMD/Qualcomm/Cerebras. Sure, the software is not that good, but if you’re OpenAI, you can hire an army of engineers to unlock the full capability.
Either they moved away from attention and have a new algorithm that needs its own chips to train (unlikely, given that a 9-month tape-out with a TPU vendor implies reusing IP) … or research is dead and we’re never escaping attention/matmul-based algorithms.