self-attention

#self-attention

From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

arXiv cs.AI · 2026-06-11

This paper analyzes hallucination in large language models as a structural consequence of three architectural decisions: co-occurrence learning in self-attention, the maximum likelihood estimation training objective, and the left-to-right commitment of autoregressive decoding. It maps each mechanism to specific hallucination types and argues that dataset pathologies amplify but do not cause these vulnerabilities.

Kuramoto Attention: Synchronizing Self-Attention on the Torus

arXiv cs.LG · 2026-06-11

Introduces Kuramoto attention, a self-attention layer where hidden states are phase angles on a torus, enabling synchronization through gated cosine similarity and circular mean updates. The layer performs comparably to standard transformers on character-level language modeling.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

arXiv cs.LG · 2026-06-09

Introduces Contribution Weights, a projection-based metric that accounts for attention weight, value magnitude, and directional alignment to more faithfully measure token importance in transformer LLMs, revealing active functional roles of attention sinks.

Reachability and asymptotics of Gaussian Transformer dynamics

arXiv cs.LG · 2026-06-09

This paper presents a mathematical framework that treats Transformer dynamics as a nonlinear control system on probability measures. It proves that Gaussian distributions remain Gaussian under the flow, which reduces the problem to finite-dimensional bilinear control, and it establishes reachability conditions and asymptotic stability results.

@_rohit_tiwari_: https://x.com/_rohit_tiwari_/status/2063982924714901858

X AI KOLs Timeline · 2026-06-08

This article provides a visual guide to the Transformer architecture in Large Language Models, covering self-attention, causal self-attention, masked multi-head attention, and the output layer with step-by-step explanations and examples.

Linear Scaling Video VLMs for Long Video Understanding

Hugging Face Daily Papers · 2026-05-29

StateKV is an inference-time method that enables linear-time video prefill for long-video vision-language models by carrying cross-frame context in a fixed-capacity recurrent state, maintaining accuracy close to full self-attention without fine-tuning.

In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective

arXiv cs.CL · 2026-05-27

This paper studies retrieval-augmented generation as an in-context optimization process, showing that linear self-attention can implement gradient descent on a unified RAG objective. It proposes a lightweight method for frozen RAG LLMs that predicts context-conditioned updates, improving performance across multiple QA benchmarks.

Elastic Attention Cores for Scalable Vision Transformers [R]

Reddit r/MachineLearning · 2026-05-13

This article presents a new paper on Elastic Attention Cores for Vision Transformers, proposing a core-periphery block-sparse attention structure that improves scalability and accuracy compared to dense self-attention methods like DINOv3.

Temporal Attention for Adaptive Control of Euler-Lagrange Systems with Unobservable Memory

arXiv cs.LG · 2026-05-11

This paper proposes a meta-control architecture using temporal self-attention for adaptive control of Euler-Lagrange systems with unobservable memory states. It demonstrates improved tracking performance over baseline methods on a 2-DOF manipulator while identifying failure modes in long-memory regimes.

@ickma2311: Efficient AI Lecture 12: Transformer and LLM This lecture is not only about how LLMs work. It also explains the buildin…

X AI KOLs Timeline · 2026-05-09

Lecture notes from an Efficient AI course covering Transformer and LLM fundamentals, including multi-head attention, positional encoding, KV cache, and the connection between model architecture and inference efficiency. The content explains how design choices in transformers affect memory, latency, and hardware efficiency.
