This paper performs full Jacobian eigendecomposition across production-scale LLMs, revealing a learned spectral gradient from rotation-dominated early layers to symmetric late layers, along with a low-rank bottleneck that compresses perturbations. The results link perturbation propagation and compression to network functional topology.
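A minimal sketch of this kind of analysis, assuming a toy MLP block standing in for a real LLM layer: compute the input-output Jacobian at a point, split it into symmetric and antisymmetric parts as a rough rotation-vs-symmetry proxy, and inspect its eigenvalues. All module names and sizes here are illustrative, not the paper's setup.

```python
import torch

torch.manual_seed(0)
d = 16
# Toy stand-in for one transformer sublayer.
layer = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))

x = torch.randn(d)
J = torch.autograd.functional.jacobian(layer, x)  # (d, d) input-output Jacobian at x

# Symmetric / antisymmetric split; a large antisymmetric share suggests
# rotation-dominated local dynamics.
S = 0.5 * (J + J.T)
A = 0.5 * (J - J.T)
rotation_share = A.norm() / (S.norm() + A.norm())

# Complex eigenvalues of the full Jacobian also indicate rotational behaviour.
eigvals = torch.linalg.eigvals(J)
print(f"antisymmetric share: {rotation_share:.3f}")
print(f"max |Im(lambda)|:    {eigvals.imag.abs().max():.3f}")
```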
This paper applies TopK Sparse Autoencoders to three EEG foundation models (SleepFM, REVE, LaBraM) to extract interpretable feature dictionaries and introduces a framework for concept steering, revealing representational failures and clinical entanglements.
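For context, a minimal TopK sparse autoencoder of the general kind applied here, assuming a plain linear encoder/decoder; the `TopKSAE` class, dimensions, and stand-in activations are hypothetical, not taken from the paper.

```python
import torch

class TopKSAE(torch.nn.Module):
    """Minimal TopK sparse autoencoder: keep only the k largest latents per sample."""
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_dict)
        self.dec = torch.nn.Linear(d_dict, d_model)
        self.k = k

    def forward(self, x):
        z = torch.relu(self.enc(x))
        topk = torch.topk(z, self.k, dim=-1)
        # Zero everything except the top-k activations (hard sparsity constraint).
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.dec(z_sparse), z_sparse

sae = TopKSAE(d_model=64, d_dict=1024, k=16)
x = torch.randn(8, 64)                               # stand-in for foundation-model activations
recon, latents = sae(x)
loss = torch.nn.functional.mse_loss(recon, x)        # reconstruction objective
```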
Anthropic is hiring for 70+ roles in 2026, noting that half of its engineers joined without formal ML degrees and used AI to compress their learning curve; top compensation packages reach $920K.
Discussion of whether there is a practical cap on the active parameter count in Mixture-of-Experts (MoE) models beyond which quality stops improving.
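A back-of-the-envelope sketch of how total versus active parameter counts diverge in a top-k MoE; the formula and the Mixtral-like configuration below are illustrative assumptions, not figures from the discussion.

```python
def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k, vocab=32000):
    """Rough parameter arithmetic for a SwiGLU-style top-k MoE (illustrative only)."""
    attn = 4 * d_model * d_model                 # Q, K, V, O projections
    expert = 3 * d_model * d_ff                  # gate, up, down projections per expert
    per_layer_total = attn + n_experts * expert
    per_layer_active = attn + top_k * expert     # only top_k experts run per token
    embed = 2 * vocab * d_model                  # input + output embeddings
    return (n_layers * per_layer_total + embed,
            n_layers * per_layer_active + embed)

# Mixtral-like configuration, used purely as an example.
total, active = moe_param_counts(d_model=4096, d_ff=14336, n_layers=32, n_experts=8, top_k=2)
print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")
```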
New course 'Transformers in Practice' from deeplearning.ai and AMD teaches practical understanding of transformer-based LLMs, covering text generation, attention mechanisms, and inference optimization techniques like quantization and KV caching.
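As a flavour of the KV-caching topic the course covers, here is a minimal sketch of a decode loop that appends new keys/values to a cache instead of recomputing them for the whole sequence; the shapes and the `attend` helper are hypothetical, not course code.

```python
import torch

def attend(q, k_cache, v_cache):
    """Single-query scaled dot-product attention over all cached keys/values."""
    scores = q @ k_cache.transpose(-1, -2) / k_cache.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v_cache

d = 64
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

for step in range(5):                              # autoregressive decoding loop
    k_new, v_new, q = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)
    k_cache = torch.cat([k_cache, k_new])          # append: past keys are never recomputed
    v_cache = torch.cat([v_cache, v_new])
    out = attend(q, k_cache, v_cache)              # attend over the full cached history
```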
Transformer-based chess models trained for rating buckets from 800 to 2500+ that predict moves, thinking time, and game outcome, achieving strong accuracy with only 9M parameters; the thinking-time prediction head is a novel addition.
This paper introduces Variational Linear Attention (VLA), a method that stabilizes memory states in linear attention mechanisms for long-context transformers. VLA reframes memory updates as an online regularized least-squares problem, proving bounded state norms and demonstrating significant speedups and improved retrieval accuracy over standard linear attention and DeltaNet.
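A generic sketch of the underlying idea of treating the linear-attention memory as an online ridge-regularized least-squares problem; this is not VLA's actual update rule, and the learning rate and regularization constants below are arbitrary assumptions.

```python
import torch

def regularized_memory_update(S, k, v, lr=0.5, lam=0.1):
    """One online gradient step on ||S k - v||^2 + lam * ||S||_F^2 for the memory matrix S.
    The weight-decay term lam keeps the state norm bounded over long sequences."""
    pred_err = S @ k - v
    grad = pred_err.unsqueeze(-1) @ k.unsqueeze(0) + lam * S
    return S - lr * grad

d = 32
S = torch.zeros(d, d)
for _ in range(100):
    k = torch.nn.functional.normalize(torch.randn(d), dim=0)   # key
    v = torch.randn(d)                                          # value to store
    S = regularized_memory_update(S, k, v)
print(f"||S||_F after 100 updates: {S.norm():.2f}")             # stays bounded thanks to lam
```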
This paper identifies 'attention drift' in autoregressive speculative decoding models, where drafters' attention shifts from the prompt to their own generated tokens. The authors propose architectural changes, such as post-norm and RMSNorm, which improve acceptance rates and robustness across various benchmarks.
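For reference, a minimal RMSNorm module of the kind the authors advocate swapping in; this is the standard formulation, not code from the paper.

```python
import torch

class RMSNorm(torch.nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, with no mean-centering."""
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```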
Stanford's CS336 course on modern neural language models, covering topics like MoEs and RLHF, is being released on YouTube with a two-week delay.
Nous Research introduces Lighthouse Attention, a training-only subquadratic wrapper for scaled dot-product attention that accelerates long-context pretraining and can be removed before deployment to preserve vanilla inference efficiency.
The article reports a spectral ratio between MLP and attention norms that predicts geometric stability in transformer models, with an optimal range of 0.5–2 for preventing rank collapse.
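One plausible way to measure such a ratio, reading "norms" as the per-block output norms of the two branches; the `ToyBlock` module and its `.attn`/`.mlp` names are hypothetical stand-ins for a real transformer block, not the article's definition.

```python
import torch

class ToyBlock(torch.nn.Module):
    """Stand-in transformer block with separate attention and MLP branches."""
    def __init__(self, d=64, n_heads=4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )

@torch.no_grad()
def mlp_attn_norm_ratio(block, x):
    attn_out, _ = block.attn(x, x, x)
    mlp_out = block.mlp(x)
    return (mlp_out.norm() / attn_out.norm()).item()

x = torch.randn(1, 16, 64)                     # (batch, seq, d_model)
ratio = mlp_attn_norm_ratio(ToyBlock(), x)
print(f"MLP/attention output-norm ratio: {ratio:.2f}")   # article's proposed stable band: 0.5-2
```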
This paper proposes distributional spectral diagnostics to localize grokking transitions in Transformer models before test accuracy rises. It uses empirical distributions and Hankel dynamic mode decomposition to create a monitoring signal that discriminates between grokking and non-grokking runs.
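A generic sketch of Hankel dynamic mode decomposition applied to a scalar training signal, to illustrate the diagnostic idea; the delay, rank, and synthetic signal are arbitrary choices, not the paper's pipeline.

```python
import numpy as np

def hankel_dmd_eigs(series, delay=10, rank=5):
    """DMD on a delay-embedded (Hankel) matrix of a scalar training signal.
    Eigenvalues near |lambda| = 1 indicate persistent slow dynamics."""
    H = np.stack([series[i:i + len(series) - delay] for i in range(delay)])  # (delay, T-delay)
    X, Y = H[:, :-1], H[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    A_tilde = U.T @ Y @ Vh.T @ np.diag(1.0 / s)      # reduced linear evolution operator
    return np.linalg.eigvals(A_tilde)

signal = np.sin(np.linspace(0, 20, 400)) + 0.1 * np.random.randn(400)  # stand-in metric trace
print(np.abs(hankel_dmd_eigs(signal)))
```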
This paper introduces Attractor Models, which use fixed-point solving and implicit differentiation for efficient iterative refinement, achieving superior language modeling and reasoning performance with reduced computational costs compared to traditional transformers.
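A simplified sketch of the fixed-point-refinement pattern: iterate a small update function to convergence without storing the graph, then backpropagate through only the final step as a cheap stand-in for full implicit differentiation. This is a generic illustration, not the Attractor Model architecture itself.

```python
import torch

class FixedPointLayer(torch.nn.Module):
    """Iterate z <- f(z, x) toward a fixed point; only the last step is differentiable."""
    def __init__(self, d=64, n_iters=30):
        super().__init__()
        self.f = torch.nn.Sequential(torch.nn.Linear(2 * d, d), torch.nn.Tanh())
        self.n_iters = n_iters

    def forward(self, x):
        z = torch.zeros_like(x)
        with torch.no_grad():                          # refinement loop, no graph stored
            for _ in range(self.n_iters):
                z = self.f(torch.cat([z, x], dim=-1))
        z = self.f(torch.cat([z, x], dim=-1))          # one differentiable step at the fixed point
        return z

layer = FixedPointLayer()
out = layer(torch.randn(4, 64))
out.sum().backward()                                   # gradients flow through the final step only
```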
This paper introduces a theoretical framework for geometric factual recall in transformers, demonstrating that embeddings can encode relational structure via linear superpositions while MLPs act as selectors. It provides empirical and theoretical evidence that this mechanism allows for efficient memorization of facts and multi-hop queries.
This paper demonstrates that mean-pooled cosine similarity is not length-invariant under anisotropic representations, showing it artificially inflates similarity with sequence length. It argues for using Centered Kernel Alignment (CKA) as a default metric to correct biases in cross-lingual and cross-representation analysis.
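For reference, linear CKA can be computed in a few lines; the sketch below assumes two (samples × features) activation matrices and checks the metric's invariance to rotation and rescaling.

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two (n_samples, d) representations."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

a = torch.randn(100, 64)
Q, _ = torch.linalg.qr(torch.randn(64, 64))
b = 2.0 * a @ Q                                   # rotated and rescaled copy of the same representation
print(f"CKA(a, b) = {linear_cka(a, b):.3f}")      # close to 1: invariant to rotation and scale
```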
The paper introduces FocuSFT, a bilevel optimization framework that enhances long-context language model performance by addressing attention dilution through parametric memory. It demonstrates significant improvements in accuracy and context engagement on benchmarks like BABILong and RULER.
The paper introduces Mela, a memory-augmented transformer architecture inspired by human memory consolidation, featuring a Hierarchical Memory Module that improves long-context language modeling performance.
An educational blog post by Amit Shekhar explaining the mathematical mechanics of the Attention mechanism, specifically detailing Query, Key, and Value matrices with a step-by-step numeric example.
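In the same spirit as the post's walk-through, a tiny numeric example of single-head scaled dot-product attention with three tokens and toy projection matrices (values chosen arbitrarily for illustration):

```python
import numpy as np

# Three token embeddings and toy projection weights (d_k = 2).
X  = np.array([[1., 0.], [0., 1.], [1., 1.]])
Wq = np.array([[1., 0.], [0., 1.]])
Wk = np.array([[0., 1.], [1., 0.]])
Wv = np.array([[1., 1.], [0., 1.]])

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores  = Q @ K.T / np.sqrt(K.shape[-1])                                 # scaled dot-product scores
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)    # row-wise softmax
output  = weights @ V                                                    # attention-weighted mix of values
print(weights.round(2))
print(output.round(2))
```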
The author expresses frustration with the industry's reliance on prompt engineering and scaling to fix logical reasoning deficits in transformer-based LLMs, arguing that these probabilistic models fundamentally lack the architecture for deterministic logic.
This project is a systematic deep learning notes repository covering PyTorch, Transformers, generative models, and more. It aims to address the fragmentation of learning materials and provides code implementations along with practical guides.