This paper analyzes hallucination in large language models as a structural consequence of three architectural decisions: self-attention's co-occurrence learning, the maximum likelihood estimation training objective, and autoregressive decoding's left-to-right commitment. It maps each mechanism to specific hallucination types and argues that dataset pathologies amplify but do not cause these vulnerabilities.
This work introduces Kuramoto attention, a self-attention layer in which hidden states are phase angles on a torus and synchronize through gated cosine similarity and circular mean updates. The layer performs comparably to standard transformers on character-level language modeling.
This work introduces Contribution Weights, a projection-based metric that accounts for attention weight, value magnitude, and directional alignment to measure token importance in transformer LLMs more faithfully. The metric reveals that attention sinks play active functional roles.
This paper presents a mathematical framework that treats Transformer dynamics as a nonlinear control system on probability measures. It proves that Gaussian distributions remain Gaussian under the flow, which reduces the problem to finite-dimensional bilinear control, and establishes reachability conditions and asymptotic stability results.
This article provides a visual guide to the Transformer architecture in Large Language Models, covering self-attention, causal self-attention, masked multi-head attention, and the output layer with step-by-step explanations and examples.
StateKV is an inference-time method that enables linear-time video prefill for long-video vision-language models by carrying cross-frame context in a fixed-capacity recurrent state, maintaining accuracy close to full self-attention without fine-tuning.
This paper studies retrieval-augmented generation as an in-context optimization process, showing that linear self-attention can implement gradient descent on a unified RAG objective. It proposes a lightweight method for frozen RAG LLMs that predicts context-conditioned updates, improving performance across multiple QA benchmarks.
This article presents a new paper on Elastic Attention Cores for Vision Transformers, proposing a core-periphery block-sparse attention structure that improves scalability and accuracy compared to dense self-attention methods like DINOv3.
This paper proposes a meta-control architecture using temporal self-attention for adaptive control of Euler-Lagrange systems with unobservable memory states. It demonstrates improved tracking performance over baseline methods on a 2-DOF manipulator while identifying failure modes in long-memory regimes.
Lecture notes from an Efficient AI course covering Transformer and LLM fundamentals, including multi-head attention, positional encoding, KV cache, and the connection between model architecture and inference efficiency. The content explains how design choices in transformers affect memory, latency, and hardware efficiency.
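To make the link between architecture and inference memory concrete, here is a minimal sketch of the usual back-of-the-envelope KV cache estimate. It is not taken from the lecture notes: the function name `kv_cache_bytes`, the example configuration, and the choice to vary the number of key/value heads are all assumptions made for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes (hypothetical helper).

    Each layer stores one key tensor and one value tensor, hence the
    leading factor of 2. bytes_per_elem=2 assumes 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model configuration, chosen only for illustration.
full_heads = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                            head_dim=128, seq_len=4096)
fewer_heads = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                             head_dim=128, seq_len=4096)

print(f"32 KV heads: {full_heads / 2**30:.2f} GiB")
print(f" 8 KV heads: {fewer_heads / 2**30:.2f} GiB")
```

Under these assumptions the cache grows linearly with sequence length, batch size, and the number of key/value heads, so cutting the number of KV heads by a factor of four cuts the cache by the same factor.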