attention

#attention

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

arXiv cs.CL ↗ · 2026-06-10 Cached

This paper introduces ADAS, a training-free reranking rule for parallel masked diffusion decoding that uses attention to discount tokens that strongly attend to uncertain positions, improving low-NFE performance on reasoning and code tasks with minimal runtime overhead.

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Hugging Face Daily Papers ↗ · 2026-06-10 Cached

InternVideo3 introduces Multimodal Contextual Reasoning (MCR) and efficient attention mechanisms to enhance long-horizon multimodal tasks, achieving strong results on video understanding benchmarks and demonstrating video agent capabilities.

@rohanpaul_ai: Interesting, this paper shows that Transformers may not need separate key and value projections to work well. This pape…

X AI KOLs Timeline ↗ · 2026-06-09 Cached

This paper investigates whether Transformers need separate key and value projections, finding that sharing them can reduce the KV cache by 50% at a cost of only 3.1% higher perplexity, with further reductions when combined with GQA or MQA.
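To make the 50% figure concrete, here is a rough back-of-the-envelope sketch. It assumes a hypothetical model shape (the layer count, head count, head dimension, and 2-byte storage are illustrative values, not taken from the paper) and only counts cached tensors: a standard cache stores keys and values separately, while a shared projection stores one tensor that serves as both.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, shared_kv=False):
    """Approximate KV cache size in bytes for a single sequence."""
    # Separate K and V means two cached tensors per layer; sharing means one.
    tensors_per_layer = 1 if shared_kv else 2
    per_token = n_layers * n_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token * tensors_per_layer


# Hypothetical shape, chosen for illustration only.
shape = dict(seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)

separate = kv_cache_bytes(**shape)
shared = kv_cache_bytes(**shape, shared_kv=True)

# Grouped-query attention with fewer KV heads (8 here, also an assumption)
# shrinks the cache further when combined with sharing.
gqa_shared = kv_cache_bytes(**{**shape, "n_kv_heads": 8}, shared_kv=True)

print(shared / separate)      # fraction of the baseline cache kept by sharing
print(gqa_shared / separate)  # fraction kept with sharing plus fewer KV heads
```

The halving comes purely from storing one tensor instead of two, so it is independent of the shape chosen; the perplexity cost reported in the summary is an empirical result that this arithmetic does not model.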

@pallavishekhar_: Learn LLM internals step by step - from tokenization to attention to inference optimization - BPE - Tokenization - Tran…

X AI KOLs Timeline ↗ · 2026-06-09 Cached

A tweet promoting a resource for learning LLM internals step by step, covering tokenization, attention, and optimization techniques.

@Potatoloogs: How LLMs Actually Work Inside: From Token to Next-Token – A Complete Overview of Nine Core Mechanisms a) Tokenization: The model doesn't read text, it reads integers · Text is first split into subword pieces, then mapped to integer IDs; modern LLM vocabularies typically have tens of thousands to...

X AI KOLs Timeline ↗ · 2026-06-08 Cached

This article systematically outlines nine core mechanisms inside modern LLMs, from tokenization to next-token prediction (embedding, positional encoding, attention, multi-head attention, and feed-forward networks, among others), and compares architectural differences across models.

WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

arXiv cs.LG ↗ · 2026-06-08 Cached

This paper introduces Multi-Resolution Residual Routing (WAV v1), an extension of Block Attention Residuals that augments block representations with directional detail bases, improving deep decoder-only Transformer training.

How LLMs Actually Work

Lobsters Hottest ↗ · 2026-06-07 Cached

An in-depth walkthrough of how modern LLMs work, covering core mechanisms from tokenization to next-token prediction, without heavy math.

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Hugging Face Daily Papers ↗ · 2026-06-06 Cached

This paper identifies that failures in visual reasoning often stem from breakdowns in dynamic cross-modal coordination between visual and textual evidence during chain-of-thought generation. It introduces DyCo-RL, a reinforcement learning framework that rewards effective cross-modal coordination, leading to improved reasoning performance.

Multi-Granularity Reasoning for Natural Language Inference

arXiv cs.CL ↗ · 2026-06-05 Cached

Proposes a Multi-Granularity Reasoning Network (MGRN) that explicitly leverages hierarchical semantic features for natural language inference, outperforming strong baselines on multiple benchmarks.

@akshay_pachaar: Extending the context window isn't just about larger matrices. In a traditional transformer, expanding tokens by 8x inc…

X AI KOLs Following ↗ · 2026-06-03 Cached

Explains the memory challenge of expanding transformer context windows due to quadratic attention complexity, and hints at solutions.
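The quadratic growth can be illustrated with a small sketch. It assumes dense self-attention that materializes the full score matrix, and the base length of 4096 tokens is an arbitrary illustrative value; kernels that avoid storing the matrix reduce memory, though the number of pairwise scores computed is unchanged.

```python
def attention_score_entries(seq_len, n_heads=1):
    """Number of query-key scores in dense self-attention for one layer."""
    # Every token attends to every token, so each head has seq_len x seq_len scores.
    return n_heads * seq_len * seq_len


base_len = 4096  # illustrative base context length (assumption)
base = attention_score_entries(base_len)
longer = attention_score_entries(8 * base_len)

growth = longer // base
print(growth)  # how many times more scores an 8x longer context needs
```

Multiplying the sequence length by a factor k multiplies the score count by k squared, which is why longer contexts cost far more than the token count alone suggests.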

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

arXiv cs.CL ↗ · 2026-06-03 Cached

This paper investigates whether value vectors in the deep layers of transformer attention need context from the residual stream. It proposes Bank of Values (BoV), which uses context-free, token-specific value vectors in the last third of layers, improving validation loss and benchmark scores over standard attention.

Your transformer's attention entropy collapse isn't a bug. It's the model doing exactly what you trained it to do. Here's how to fix it with a three-line temperature schedule. arXiv-able. Self-contained proof. No citations needed.

Reddit r/ArtificialInteligence ↗ · 2026-06-02

The article explains that attention entropy collapse in deep transformer layers is a geometric consequence of training, not a bug, and proposes a three-line temperature schedule to prevent it.

I think AI is making me dumber and I have proof

Reddit r/artificial ↗ · 2026-06-01

The author recounts a personal experience where their reasoning test scores dropped significantly after two years of daily AI tool usage, raising concerns about long-term cognitive trade-offs for short-term productivity gains.

The solution might be cancelling my AI subscription

Simon Willison's Blog ↗ · 2026-05-31 Cached

A reflective blog post discusses the problem of using AI to rapidly create numerous projects, which can lead to attention fragmentation and lack of meaningful follow-through, while also noting that some people with ADHD find AI helps them focus and complete tasks.

The solution might be cancelling my AI subscription

Hacker News Top ↗ · 2026-05-31 Cached

The author recounts how heavy use of AI tools like Claude and Codex led to an overwhelming number of unfinished projects and exacerbated attention issues, ultimately deciding to cancel their AI subscription.

Llama Surgery: Continuous Sparsification of Pre-Trained Language Models via Differentiable Ultrametric Topology Injection

Reddit r/artificial ↗ · 2026-05-31

Llama Surgery injects learned block-sparse attention topologies into pre-trained Llama 3.1 8B without retraining from scratch, using a Dynamic Topology Router with Gumbel-Softmax routing, temperature annealing, and a Straight-Through Estimator to avoid gradient collapse, achieving stable convergence and coherent output.

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

Hugging Face Daily Papers ↗ · 2026-05-31 Cached

LongAttnComp adapts AttnComp for long-context reasoning by fine-tuning lightweight cross-attention layers and introducing token-level chunking, a top-p algorithm, positional reordering, and a query parser. It achieves strong performance on long-context tasks like code debugging and transfers across multiple model families.

@zhaoran_wang: for me, the coolest finding is that you can connect/interpolate all softmax/linear variants and give a promising direct…

X AI KOLs Timeline ↗ · 2026-05-30 Cached

Discussion of a finding that all softmax/linear attention variants can be interpolated, and that the Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Includes link to paper and code.

@royvanrijn: For curious developers I built "The Anatomy of an LLM", an interactive explainer showing how text becomes tokens, vecto…

X AI KOLs Timeline ↗ · 2026-05-28 Cached

An interactive visual guide that explains how large language models work, from tokenization through attention, transformer blocks, and text generation, built by Roy van Rijn.

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hugging Face Daily Papers ↗ · 2026-05-28 Cached

VideoMLA replaces per-head KV caches in video diffusion models with a shared low-rank latent and decoupled 3D-RoPE positional keys, reducing per-token KV memory by 92.7% and improving throughput by 1.23x on a B200 while maintaining quality on VBench benchmarks.
