@SemiAnalysis_: Transformer’s Attention mechanism has come a long way. We’d like to thank the researchers and the engineers in the open…

Summary

A Twitter thread by SemiAnalysis celebrates the progress of Transformer's Attention mechanism and thanks the open-source community for making AI accessible, inviting contributions to complete the open history of Attention.

Cached at: 06/29/26, 02:22 AM

Transformer’s Attention mechanism has come a long way. We’d like to thank the researchers and the engineers in the open-source community for continuing to make high-performance AI accessible. Please celebrate with us by sharing this post, tagging more contributors, and sharing anecdotes to complete the open history of Attention! (1/8)

In contrast to the slow decline of the Transformers movie series in 2017, the Transformer architecture in NLP showed immense potential. It introduced Multi-Head Attention (MHA) and dramatically improved perplexity scores. We thank @ashVaswani, @NoamShazeer, @YesThisIsLion, and @aidangomez for publishing this seminal work. (2/8)

The early variants of MHA include Multi-Query Attention (MQA), invented by Noam Shazeer, Grouped-Query Attention (GQA), introduced by Google Research and adopted early by the @MetaAI LLaMA team, and Sliding Window Attention (SWA), popularized by @MistralAI. MQA, GQA, and SWA build on MHA and significantly improve inference performance by reducing the key/value state that must be cached and read per token, with GQA serving as the foundation for many recent sparse attention variants in the agentic AI era. We thank the teams for openly sharing these techniques. (3/8)
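
A back-of-the-envelope sketch of why fewer key/value heads help at inference time: the KV cache scales linearly with the number of KV heads. This calculation is not from the thread, and the model dimensions (32 layers, 32 query heads, head size 128, 4,096-token context, 16-bit values) are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model configuration (assumed for illustration only).
NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

# MHA keeps one KV head per query head, GQA shares one KV head per group
# of query heads, and MQA shares a single KV head across all query heads.
kv_heads_by_variant = {"MHA": NUM_QUERY_HEADS, "GQA (8 KV heads)": 8, "MQA": 1}

sizes_mib = {
    name: kv_cache_bytes(NUM_LAYERS, kv_heads, HEAD_DIM, SEQ_LEN) / 2**20
    for name, kv_heads in kv_heads_by_variant.items()
}

for name, size in sizes_mib.items():
    print(f"{name}: {size:.0f} MiB per sequence")
```

Under these assumptions the cache shrinks from 2,048 MiB (MHA) to 512 MiB (GQA) to 64 MiB (MQA). SWA targets the other factor in the same formula: it limits how many past tokens each layer attends to, so the effective `seq_len` is capped at the window size.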

One of the greatest leaps since MHA was FlashAttention by @tri_dao. FlashAttention dramatically reduced memory requirements for both the forward and backward passes of attention, unlocking major performance gains and enabling efficient training on long contexts. Since the original release, three new versions have followed, each optimized for the latest GPUs at the time. We thank Tri Dao, @tensorcore, Jay Shah, @tedzadouri, and many others for pushing the frontier of system efficiency in the open-source community. (4/8)
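
The memory saving comes from processing keys and values in tiles while maintaining a running ("online") softmax, so the full score matrix is never materialized. Below is a simplified plain-Python sketch of that idea for a single query vector. It is not the FlashAttention kernel, and the function names and `block_size` are illustrative assumptions.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute all scores first, then softmax, then the weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but only one block of scores is held at a time."""
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # unnormalized output, relative to running_max

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        rescale = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * rescale + sum(probs)
        acc = [
            a * rescale + sum(p * v[d] for p, v in zip(probs, v_blk))
            for d, a in enumerate(acc)
        ]
        running_max = new_max

    return [a / running_sum for a in acc]


# Tiny made-up example: 5 keys/values of dimension 2.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
max_abs_diff = max(abs(a - b) for a, b in zip(reference, tiled))
```

The two functions agree up to floating-point rounding. The real kernels apply the same rescaling trick per tile of queries in on-chip memory, which this sketch does not model.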

Innovation in attention mechanisms did not stop, even though MHA/GQA/SWA remain hard to beat. In late 2024 and early 2025, DeepSeek-V3 and R1 demonstrated near-frontier capabilities, proving the effectiveness of their in-house Multi-Head Latent Attention (MLA). At a time when OpenAI’s o1 appeared far ahead, DeepSeek-R1 reignited hope that the open-source community could close the gap. MLA has since become the de facto attention mechanism for many open-weight models. We thank @deepseek_ai for their outstanding work and commitment to the open ethos. (5/8)

The long-context demands of agentic AI accelerated attention research aimed at overcoming the context wall. Over the past year, linear attention has become mainstream, most notably with Gated Delta Networks (GDNs) by @songlinyang4 gaining strong traction among open-weight models. Following the invention of GDN, @Alibaba_Qwen adopted it in Qwen 3.5, while @kimi_moonshot improved upon it with Kimi Delta Attention. On the sparse attention side, @deepseek_ai again led open research with Native Sparse Attention and DeepSeek Sparse Attention (DSA). Inspired by DSA, @MiniMax_AI developed MiniMax Sparse Attention, and @ZhipuAI advanced it further with IndexShare in GLM-5.2. Finally, the SWA-GQA hybrid attention approach was popularized by @cohere and recently refined by the @xiaomimimo team, which released detailed ablation studies. We thank all the open model training labs for sharing their knowledge, research, and excellent open-weight models. (6/8)

As ChatGPT exploded in popularity, research on LLM serving became highly active. Efficient LLM serving remained a major challenge until the invention of KV cache-managing Attention methods, such as Radix Attention. The Radix Attention authors built an inference engine around it, which became @sgl_project; the team later founded Radix Ark. Thank you @ying11231, @zhyncs42, and @banghuaz for building such a fast and capable inference engine. (7/8)

Around the same time, the vLLM inference engine and its underlying Paged Attention took the open-source community by storm. Started by @woosuk_k, the @vllm_project has become one of the most widely used inference engines. @simon_mo_, @kaichaoyou and @rogerw0108 from Inferact, along with @robertshaw21 and @mgoin_ from Red Hat, have been key maintainers who continue to push the project and community forward. We are deeply grateful to the Inferact and Red Hat teams. (8/8)
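
The core idea behind Paged Attention is to store the KV cache in fixed-size blocks and map each sequence's logical blocks to physical ones through a block table, much like virtual-memory paging. The toy allocator below is a sketch of that bookkeeping only. It is not vLLM's API, and every class, method, and parameter name is hypothetical.

```python
class PagedKVCache:
    """Toy block allocator: tracks where each token's KV entry would live."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new physical block is needed only when the last one is full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    slot_a = cache.append_token("a")   # 3 tokens occupy 2 blocks
slot_b = cache.append_token("b")       # 1 token occupies 1 block
blocks_used_by_a = len(cache.block_tables["a"])
free_before = len(cache.free_blocks)
cache.free("a")
free_after = len(cache.free_blocks)
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partially filled block, and freed blocks can be reused immediately by other requests.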
