attention

#attention

The Wiola Architecture for Efficient Small Language Models

arXiv cs.AI ↗ · 2d ago Cached

Wiola is a novel Small Language Model architecture introducing five independently designed components—SRPE, GCLA, ATM, DSFF, and WiolaRMSNorm—aimed at improving efficiency and coherence, released in sizes from 120M to 1.5B parameters and integrated with HuggingFace Transformers.

Multi-Head Recurrent Memory Agents

arXiv cs.LG ↗ · 2d ago Cached

This paper identifies memory retention as the bottleneck in recurrent memory agents for long contexts and proposes Multi-Head Recurrent Memory (MHM), a training-free framework that partitions memory into independent heads with a select-then-update strategy. The lightweight instantiation MHM-LRU significantly improves retention and end-to-end accuracy across 100K–1M token ranges, raising retention from below 30% to 73.96% on RULER-HQA at 896K tokens.

The risk of KV cache compression

arXiv cs.LG ↗ · 2d ago Cached

This paper theoretically characterizes the minimax risk of KV cache compression in transformers, providing design principles for accurate compression under causal masking, and instantiates them in a practical algorithm with promising results on LongBench.

PARTREP: Learning What to Repeat for Decoder-only LLMs

arXiv cs.CL ↗ · 2d ago Cached

PartRep proposes a selective prompt repetition method for decoder-only LLMs that appends only the most informative tokens (selected via NLL) instead of the full prompt, reducing KV cache and prefill FLOPs while retaining most of the accuracy gains across multiple benchmarks.
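To make the idea of selective repetition concrete, here is a minimal sketch, not the paper's implementation. It assumes per-token NLL scores are already available from a scoring pass, that a higher NLL marks a more informative token, and that a fixed fraction of tokens is kept; the function names and the `keep_ratio` parameter are hypothetical.

```python
import math


def select_tokens_by_nll(tokens, nlls, keep_ratio=0.25):
    """Return the highest-NLL tokens, preserving their original order.

    Assumption: higher NLL means "more informative". The summary does not
    state the selection direction or the budget, so both are illustrative.
    """
    if len(tokens) != len(nlls):
        raise ValueError("tokens and nlls must have the same length")
    if not tokens:
        return []
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank positions by NLL, highest first, and keep the top k.
    ranked = sorted(range(len(tokens)), key=lambda i: nlls[i], reverse=True)[:k]
    # Restore prompt order so the repeated span reads left to right.
    return [tokens[i] for i in sorted(ranked)]


def build_partial_repeat(tokens, nlls, keep_ratio=0.25):
    """Append only the selected tokens instead of repeating the full prompt."""
    return list(tokens) + select_tokens_by_nll(tokens, nlls, keep_ratio)
```

With an eight-token prompt and `keep_ratio=0.25`, the repeated suffix holds two tokens rather than eight, which is where the KV cache and prefill savings would come from.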

MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering

arXiv cs.CL ↗ · 2d ago Cached

Introduces MultAttnAttrib, a training-free method for multimodal attribution in long document QA, along with the MultAttrEval benchmark. It outperforms prompting-based methods and matches frontier models like GPT-5.4.

The future of social media: AI-generated personalized media, on the spot, based on user's data

Reddit r/ArtificialInteligence ↗ · 3d ago

Discusses the potential for AI-generated personalized media to populate social media feeds without users' consent, raising concerns about manipulation and the attention economy.

@athleticKoder: A 1600-word note on how llm inference work: Covering: 1. Attention - the only place tokens interact 2. KV caching - why…

X AI KOLs Timeline ↗ · 3d ago Cached

A detailed thread explaining key concepts of LLM inference: attention, KV caching, chunked prefill, and batching techniques, including continuous batching used in vLLM and SGLang.
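For readers new to KV caching, the following toy sketch shows why it helps during decoding. It is a simplified single-head illustration, not code from the thread or from vLLM or SGLang: each token vector stands in for its own query, key, and value (identity projections), and the `KVCache` class and helper names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


class KVCache:
    """Stores keys and values of past tokens so they are never recomputed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Only the new token's key and value are added at each decode step.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


def decode_with_cache(tokens):
    cache = KVCache()
    return [cache.step(t, t, t) for t in tokens]


def decode_without_cache(tokens):
    # Rebuilds the whole causal prefix at every step; in a real model this
    # would mean re-running the key/value projections for every past token.
    outputs = []
    for i in range(len(tokens)):
        prefix = tokens[: i + 1]
        outputs.append(attend(tokens[i], prefix, prefix))
    return outputs
```

Both functions return the same outputs. The cached version trades memory that grows with sequence length for less repeated work per step, which is the trade-off that batching and paging schemes in serving engines have to manage.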

@AaronWeiHuang: Our new blog looks at how FP4 is moving beyond compression into a practical primitive for training and inference across…

X AI KOLs Following ↗ · 5d ago Cached

NVIDIA's blog details how FP4, with the NVFP4 format and Blackwell hardware, has evolved from a compression trick to a practical primitive for training and inference across LLMs and diffusion models, achieving near 16-bit accuracy.

FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models

arXiv cs.AI ↗ · 5d ago Cached

This paper proposes FADE, a training-free method that mitigates hallucinations in Large Vision-Language Models by attenuating FFN outputs at critical layers to reduce language-prior dominance, demonstrating effectiveness across multiple benchmarks.

A Path-Space Formulation of Prediction in World Models: From a Single Action to Prediction, Planning, and Irreversibility

arXiv cs.LG ↗ · 5d ago Cached

This paper proposes a path-space formulation of prediction in AI world models, treating the distribution over future trajectories as the fundamental predictive object. It shows that prediction, planning, and uncertainty emerge as operations on a single action functional, and demonstrates that attention asymmetry in learned models correlates with irreversibility in the data.

Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory

arXiv cs.CL ↗ · 5d ago Cached

This paper investigates memory-managed long-context attention, a research direction that separates efficient state compression from explicit editable memory slots. Experiments show that a hybrid approach combining fast recurrent/sparse backbones with explicit memory management outperforms pure fixed-state or pure sparse methods across synthetic tasks and long-context benchmarks.

Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling

arXiv cs.CL ↗ · 6d ago Cached

Introduces LPES, a layer-specific positional embedding scaling method that mitigates the 'lost-in-the-middle' problem in LLMs by assigning distinct scaling factors per layer using a genetic algorithm with Bézier curves, achieving up to 11.2% accuracy gain without fine-tuning or latency increase.

@badlogicgames: recommended reading.

X AI KOLs Timeline ↗ · 2026-06-28 Cached

This article discusses the concept of bounded cognition in software engineering, highlighting the limitations of human memory and attention, and how software systems are built despite these constraints.

@VukRosic99: When a small model learns from a big one, half the lesson is wasted. The setup: a small "student" model writes an answer…

X AI KOLs Timeline ↗ · 2026-06-28 Cached

The paper identifies position bias in on-policy distillation for language models, where later tokens in student-generated answers receive degraded supervision. The proposed Importance-Weighted On-Policy Distillation (IW-OPD) weights corrections based on accumulated drift, improving learning speed and final performance.

@rohanpaul_ai: This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs. Rea…

X AI KOLs Following ↗ · 2026-06-27 Cached

The paper introduces Grouped Query Experts, which improves long-context attention by routing each token to only a few query-head experts on top of grouped-query attention, achieving 1.7-1.8x faster prefill while matching accuracy.
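The routing idea can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's method: it assumes a router has already produced one score per query head for a token, that the top-k heads are kept, and that query heads map to shared key/value heads in contiguous groups as in standard grouped-query attention. All function names are hypothetical.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Map a query head to the key/value head shared by its group."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def route_token(router_scores, top_k):
    """Pick the top_k query heads for one token, returned in index order."""
    ranked = sorted(
        range(len(router_scores)), key=lambda h: router_scores[h], reverse=True
    )
    return sorted(ranked[:top_k])


def plan_token(router_scores, num_kv_heads, top_k):
    """Return (query_head, kv_head) pairs that this token would compute."""
    num_query_heads = len(router_scores)
    return [
        (h, kv_head_for(h, num_query_heads, num_kv_heads))
        for h in route_token(router_scores, top_k)
    ]
```

With eight query heads, two key/value heads, and `top_k=2`, a token computes attention for two query heads instead of eight, while the key/value heads stay shared across each group.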

MathFormer: Testing whether symbolic math is pattern matching or reasoning [D]

Reddit r/MachineLearning ↗ · 2026-06-27

MathFormer is a small seq2seq model that achieves ~98.6% accuracy on symbolic math tasks, suggesting that mathematical reasoning in LLMs may be large-scale structured pattern completion rather than true reasoning.

Comparing Transformers and Hybrid Models at the Token Level

Lobsters Hottest ↗ · 2026-06-27 Cached

This paper analyzes token-level prediction differences between transformers and hybrid attention-recurrent models using Olmo 3 and Olmo Hybrid, finding that hybrids improve on semantic state tracking while transformers excel at n-gram copying and syntactic bracket matching.

@_avichawla: A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on lo…

X AI KOLs Timeline ↗ · 2026-06-27 Cached

Explains why evicting 90% of KV cache tokens fails to free GPU memory when serving reasoning models on vLLM, due to paged attention fragmentation, and introduces NVIDIA's TriAttention as a solution that achieves 2.5x speedup and 10.7x memory reduction.

Engineering for Bounded Cognition

Hacker News Top ↗ · 2026-06-26 Cached

The article explores the limitations of human cognition—such as holding only about four items in working memory—and how these constraints shape software engineering, arguing that many 'human errors' are actually design failures.

Epiphany-Aware KV Cache Eviction Without the Attention Matrix

arXiv cs.LG ↗ · 2026-06-26 Cached

This paper introduces EpiKV, a KV cache eviction method that scores token importance via changes in internal representations (epiphany score) instead of attention weights, avoiding the need to materialize the attention matrix. It achieves competitive performance on reasoning benchmarks while enabling up to 16× longer context lengths.
