Tag
This paper introduces ADAS, a training-free reranking rule for parallel masked diffusion decoding that uses attention to discount tokens that strongly attend to uncertain positions, improving low-NFE performance on reasoning and code tasks with minimal runtime overhead.
InternVideo3 introduces Multimodal Contextual Reasoning (MCR) and efficient attention mechanisms to enhance long-horizon multimodal tasks, achieving strong results on video understanding benchmarks and demonstrating video agent capabilities.
This paper investigates whether Transformers need separate key and value projections, finding that sharing them can reduce the KV cache by 50% with only 3.1% higher perplexity, with further reductions when combined with GQA and MQA.
A tweet promoting a resource for learning LLM internals step by step, covering tokenization, attention, and optimization techniques.
This article systematically outlines the nine core mechanisms inside modern LLMs, from tokenization to next-token prediction, including embedding, positional encoding, attention, multi-head attention, and feed-forward networks, among others, and compares architectural differences between models.
This paper introduces Multi-Resolution Residual Routing (WAV v1), an extension of Block Attention Residuals that augments block representations with directional detail bases, improving deep decoder-only Transformer training.
An in-depth walkthrough of how modern LLMs work, covering core mechanisms from tokenization to next-token prediction, without heavy math.
This paper identifies that failures in visual reasoning often stem from breakdowns in dynamic cross-modal coordination between visual and textual evidence during chain-of-thought generation. It introduces DyCo-RL, a reinforcement learning framework that rewards effective cross-modal coordination, leading to improved reasoning performance.
Proposes a Multi-Granularity Reasoning Network (MGRN) that explicitly leverages hierarchical semantic features for natural language inference, outperforming strong baselines on multiple benchmarks.
Explains the memory challenge that quadratic attention complexity poses for expanding transformer context windows, and hints at solutions.
This paper investigates whether deep-layer value vectors in transformer attention need context from the residual stream. It proposes Bank of Values (BoV), which uses context-free, token-specific value vectors in the last third of layers, improving validation loss and benchmark scores over standard attention.
The article explains that attention entropy collapse in deep transformer layers is a geometric consequence of training, not a bug, and proposes a three-line temperature schedule to prevent it.
The author recounts a personal experience where their reasoning test scores dropped significantly after two years of daily AI tool usage, raising concerns about long-term cognitive trade-offs for short-term productivity gains.
A reflective blog post discusses the problem of using AI to rapidly create numerous projects, which can lead to attention fragmentation and lack of meaningful follow-through, while also noting that some people with ADHD find AI helps them focus and complete tasks.
The author recounts how heavy use of AI tools like Claude and Codex led to an overwhelming number of unfinished projects and exacerbated attention issues, and how they ultimately decided to cancel their AI subscription.
Llama Surgery injects learned block-sparse attention topologies into pre-trained Llama 3.1 8B without retraining from scratch, using a Dynamic Topology Router with Gumbel-Softmax routing, temperature annealing, and a Straight-Through Estimator to avoid gradient collapse, achieving stable convergence and coherent output.
LongAttnComp adapts AttnComp for long-context reasoning by fine-tuning lightweight cross-attention layers and introducing token-level chunking, a top-p algorithm, positional reordering, and a query parser. It achieves strong performance on long-context tasks like code debugging and transfers across multiple model families.
Discussion of a finding that all softmax/linear attention variants can be interpolated, and that the Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Includes link to paper and code.
An interactive visual guide that explains how large language models work, from tokenization through attention, transformer blocks, and text generation, built by Roy van Rijn.
VideoMLA replaces per-head KV caches in video diffusion models with a shared low-rank latent and decoupled 3D-RoPE positional keys, reducing per-token KV memory by 92.7% and improving throughput by 1.23x on a B200 while maintaining quality on VBench benchmarks.