Deficient executive control in transformer attention
Summary
The article examines a deficiency in executive control within transformer attention mechanisms and highlights limitations in how transformers manage sequential dependencies.
Similar Articles
Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers
This paper shows that attention heads meeting common criteria for mechanistic role claims (necessity, linear decodability, ablation reversibility) routinely fail to transfer computations across prompts, and introduces the KID (Knowing/Intent/Doing) framework and a three-stage pipeline for more rigorous role assignment.
Your transformer's attention entropy collapse isn't a bug. It's the model doing exactly what you trained it to do. Here's how to fix it with a three-line temperature schedule. arXiv-able. Self-contained proof. No citations needed.
The article explains that attention entropy collapse in deep transformer layers is a geometric consequence of training, not a bug, and proposes a three-line temperature schedule to prevent it.
One of the authors of "Attention is All You Need" just argued we should move past it. Pathway’s Post-Transformer debate is worth watching
A co-author of the seminal "Attention is All You Need" paper has argued that the field should move beyond transformers, and a debate hosted by Pathway explores this topic.
Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management
This position paper clarifies that claims of Transformer Turing-completeness often rely on unrealistic scaling assumptions, and argues that in real-world fixed models, context management is the critical factor determining computational power.
Temporal Attention for Adaptive Control of Euler-Lagrange Systems with Unobservable Memory
This paper proposes a meta-control architecture using temporal self-attention for adaptive control of Euler-Lagrange systems with unobservable memory states. It demonstrates improved tracking performance over baseline methods on a 2-DOF manipulator while identifying failure modes in long-memory regimes.