Tag
Introduces HDD-RoPE, an extension of rotary positional embeddings that uses high-dimensional chunks and data-dependent rotation rates, showing faster convergence on TinyStories compared to xPos.
The article discusses how rising DDR5 memory prices signal a broader memory bottleneck in AI, particularly the KV cache in softmax attention for LLMs, and highlights post-transformer architectures like linear attention and state space models that aim to reduce memory usage.
Introduces Nexus Sampling, a training-free KV-cache eviction method that replaces deterministic top-k selection with weighted reservoir sampling. It improves long-context LLM inference under fixed memory budgets and matches dense-attention performance at 80% eviction.
Explains that super weights in large language models arise from the SoftMax-Attention interaction creating a 'Nothing Dump' token that serves as a stable reference point; removing these weights cripples performance.
A reflection on the landmark 'Attention Is All You Need' paper, highlighting how removing recurrence and relying solely on attention mechanisms revolutionized AI and led to modern LLMs like GPT and Claude.
A comprehensive free guide explaining LLMs from first principles, covering tokens, transformers, attention, fine-tuning, and local deployment.
Introduces ITNet, a neural architecture based on a learnable integral transform that unifies convolution, attention, and recurrence, achieving strong results across multiple modalities.
Proposes Distance-Adaptive Representation (DAR) which reduces key-value dimensionality for distant tokens while preserving full dimensionality for nearby tokens, improving KV cache efficiency without performance loss.
This paper introduces QG-MIL, a gated transformer aggregator that mitigates attention concentration in multiple instance learning for medical imaging, achieving domain-agnostic performance without auxiliary losses.
Grouped Query Experts (GQE) improves Transformer efficiency by applying a mixture-of-experts layer on top of grouped-query attention, selectively activating query heads per token while keeping key-value cache benefits, matching baseline accuracy with half the query-head compute at 250M parameter scale.
HydraHead is a novel attention hybridization architecture that combines Full and Linear Attention at the head level, achieving superior long-context performance with reduced training overhead via interpretability-driven selection and scale-normalized fusion.
A repository that builds a transformer from scratch without high-level libraries, explaining attention mechanisms and the full training pipeline, trainable in a day on free Colab.
An implementation of the top-k kernel from the kernel design section of the MSA paper, using exp-free comparison and warp-level tree merging with CUDA shuffles. The code is available on GitHub.
A Chinese-language science tweet that gives an intuitive walk-through of the core pipeline of LLMs (large language models), from tokens, embeddings, positional encoding, attention, and FFN to the residual stream and next-token prediction, helping readers without a math background understand AI papers.
A GitHub guide published by Fluyeporlaweb shows how to build and train a Transformer model from scratch without high-level libraries, implementing attention, multi-head attention, embeddings, and post-training algorithms (SFT, PPO, DPO, GRPO), with training on The Pile dataset.
MiniMaxAI releases MSA, a library for dense and sparse attention kernels optimized for NVIDIA SM100 GPUs, enabling efficient processing of million-token contexts with FlashAttention and sparse top-k attention.
This tweet shares a well-made explanation of the internal workings of LLMs, covering tokens, embeddings, positional encoding, attention, and feed-forward networks, via a blog post by 0xkato.
A student introduces Silia, a novel transformer architecture that combines attention and FFN into a unified operation to save parameters at scales ≤10M, achieving comparable performance to GPT-2 with fewer parameters despite limited compute resources.
The article discusses a deficiency in executive control within transformer attention mechanisms, highlighting limitations in how transformers manage sequential dependencies.
Introduces Blurry Window Attention (BLA), a novel attention method with bounded-memory control that reconstructs a blurry KV history via Dirichlet kernel interpolation, achieving 8x state efficiency over Sliding Window Attention on the Multi-Query Associative Recall task.
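
As a rough illustration of the contrast drawn in the Nexus Sampling summary above (deterministic top-k versus weighted reservoir sampling for choosing which KV-cache positions to keep), here is a minimal sketch. It is not the paper's implementation. The function names, the use of a per-position importance score as the sampling weight, and the choice of the Efraimidis-Spirakis keying scheme are all assumptions made for illustration.

```python
import heapq
import random


def top_k_keep(scores, budget):
    """Deterministic baseline: keep the `budget` highest-scoring positions."""
    best = heapq.nlargest(budget, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


def weighted_reservoir_keep(scores, budget, seed=0):
    """Hypothetical sampler: keep `budget` positions, favoring high scores.

    Uses Efraimidis-Spirakis weighted sampling without replacement:
    position i draws u ~ Uniform[0, 1) and receives key u ** (1 / w_i);
    the positions with the largest keys are kept.
    """
    rng = random.Random(seed)
    keyed = []
    for i, w in enumerate(scores):
        if w <= 0:
            continue  # assumption: zero-weight positions are never kept
        keyed.append((rng.random() ** (1.0 / w), i))
    kept = heapq.nlargest(budget, keyed)
    return sorted(i for _, i in kept)


# Toy importance scores for 8 cached positions (made-up numbers).
scores = [0.02, 0.40, 0.05, 0.20, 0.01, 0.15, 0.07, 0.10]
print(top_k_keep(scores, 3))                       # always the same 3 positions
print(weighted_reservoir_keep(scores, 3, seed=1))  # varies with the seed
```

Under these assumptions, low-scoring positions still have a nonzero chance of surviving eviction, whereas top-k always discards them. That is the general motivation for sampling over hard selection.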