This paper performs full Jacobian eigendecomposition across production-scale LLMs, revealing a learned spectral gradient from rotation-dominated early layers to symmetric late layers, along with a low-rank bottleneck that compresses perturbations. The results link perturbation propagation and compression to network functional topology.
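A minimal sketch of this kind of analysis, assuming a toy MLP block standing in for a real LLM layer: compute the input-output Jacobian at a point, split it into symmetric and antisymmetric parts as a rough rotation-vs-symmetry proxy, and inspect its eigenvalues. All module names and sizes here are illustrative, not the paper's setup.

```python
import torch

torch.manual_seed(0)
d = 16
# Toy stand-in for one transformer sublayer.
layer = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))

x = torch.randn(d)
J = torch.autograd.functional.jacobian(layer, x)  # (d, d) input-output Jacobian at x

# Symmetric / antisymmetric split; a large antisymmetric share suggests
# rotation-dominated local dynamics.
S = 0.5 * (J + J.T)
A = 0.5 * (J - J.T)
rotation_share = A.norm() / (S.norm() + A.norm())

# Complex eigenvalues of the full Jacobian also indicate rotational behaviour.
eigvals = torch.linalg.eigvals(J)
print(f"antisymmetric share: {rotation_share:.3f}")
print(f"max |Im(lambda)|:    {eigvals.imag.abs().max():.3f}")
```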
This paper applies TopK Sparse Autoencoders to three EEG foundation models (SleepFM, REVE, LaBraM) to extract interpretable feature dictionaries and introduces a framework for concept steering, revealing representational failures and clinical entanglements.
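For context, a minimal TopK sparse autoencoder of the general kind applied here, assuming a plain linear encoder/decoder; the `TopKSAE` class, dimensions, and stand-in activations are hypothetical, not taken from the paper.

```python
import torch

class TopKSAE(torch.nn.Module):
    """Minimal TopK sparse autoencoder: keep only the k largest latents per sample."""
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_dict)
        self.dec = torch.nn.Linear(d_dict, d_model)
        self.k = k

    def forward(self, x):
        z = torch.relu(self.enc(x))
        topk = torch.topk(z, self.k, dim=-1)
        # Zero everything except the top-k activations (hard sparsity constraint).
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.dec(z_sparse), z_sparse

sae = TopKSAE(d_model=64, d_dict=1024, k=16)
x = torch.randn(8, 64)                               # stand-in for foundation-model activations
recon, latents = sae(x)
loss = torch.nn.functional.mse_loss(recon, x)        # reconstruction objective
```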
Anthropic is hiring for 70+ roles in 2026, noting that half of its engineers joined without formal ML degrees and used AI to compress their learning curve; top compensation packages reach $920K.
Discussion of whether there is a practical cap on the active parameter count in Mixture-of-Experts (MoE) models beyond which quality stops improving.
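A back-of-the-envelope sketch of how total versus active parameter counts diverge in a top-k MoE; the formula and the Mixtral-like configuration below are illustrative assumptions, not figures from the discussion.

```python
def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k, vocab=32000):
    """Rough parameter arithmetic for a SwiGLU-style top-k MoE (illustrative only)."""
    attn = 4 * d_model * d_model                 # Q, K, V, O projections
    expert = 3 * d_model * d_ff                  # gate, up, down projections per expert
    per_layer_total = attn + n_experts * expert
    per_layer_active = attn + top_k * expert     # only top_k experts run per token
    embed = 2 * vocab * d_model                  # input + output embeddings
    return (n_layers * per_layer_total + embed,
            n_layers * per_layer_active + embed)

# Mixtral-like configuration, used purely as an example.
total, active = moe_param_counts(d_model=4096, d_ff=14336, n_layers=32, n_experts=8, top_k=2)
print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")
```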
New course 'Transformers in Practice' from deeplearning.ai and AMD teaches practical understanding of transformer-based LLMs, covering text generation, attention mechanisms, and inference optimization techniques like quantization and KV caching.
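As a flavour of the KV-caching topic the course covers, here is a minimal sketch of a decode loop that appends new keys/values to a cache instead of recomputing them for the whole sequence; the shapes and the `attend` helper are hypothetical, not course code.

```python
import torch

def attend(q, k_cache, v_cache):
    """Single-query scaled dot-product attention over all cached keys/values."""
    scores = q @ k_cache.transpose(-1, -2) / k_cache.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v_cache

d = 64
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

for step in range(5):                              # autoregressive decoding loop
    k_new, v_new, q = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)
    k_cache = torch.cat([k_cache, k_new])          # append: past keys are never recomputed
    v_cache = torch.cat([v_cache, v_new])
    out = attend(q, k_cache, v_cache)              # attend over the full cached history
```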
Transformer-based chess models trained for rating buckets from 800 to 2500+ that predict moves, thinking time, and game outcome, achieving strong accuracy with only 9M parameters; the thinking-time prediction head is a novel addition.
This paper introduces Variational Linear Attention (VLA), a method that stabilizes memory states in linear attention mechanisms for long-context transformers. VLA reframes memory updates as an online regularized least-squares problem, proving bounded state norms and demonstrating significant speedups and improved retrieval accuracy over standard linear attention and DeltaNet.
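A generic sketch of the underlying idea of treating the linear-attention memory as an online ridge-regularized least-squares problem; this is not VLA's actual update rule, and the learning rate and regularization constants below are arbitrary assumptions.

```python
import torch

def regularized_memory_update(S, k, v, lr=0.5, lam=0.1):
    """One online gradient step on ||S k - v||^2 + lam * ||S||_F^2 for the memory matrix S.
    The weight-decay term lam keeps the state norm bounded over long sequences."""
    pred_err = S @ k - v
    grad = pred_err.unsqueeze(-1) @ k.unsqueeze(0) + lam * S
    return S - lr * grad

d = 32
S = torch.zeros(d, d)
for _ in range(100):
    k = torch.nn.functional.normalize(torch.randn(d), dim=0)   # key
    v = torch.randn(d)                                          # value to store
    S = regularized_memory_update(S, k, v)
print(f"||S||_F after 100 updates: {S.norm():.2f}")             # stays bounded thanks to lam
```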
This paper identifies 'attention drift' in autoregressive speculative decoding models, where drafters' attention shifts from the prompt to their own generated tokens. The authors propose architectural changes, such as post-norm and RMSNorm, which improve acceptance rates and robustness across various benchmarks.
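For reference, a minimal RMSNorm module of the kind the authors advocate swapping in; this is the standard formulation, not code from the paper.

```python
import torch

class RMSNorm(torch.nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, with no mean-centering."""
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```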
Stanford's CS336 course on modern neural language models, covering topics like MoEs and RLHF, is being released on YouTube with a two-week delay.
Nous Research introduces Lighthouse Attention, a training-only subquadratic wrapper for scaled dot-product attention that accelerates long-context pretraining and can be removed before deployment to preserve vanilla inference efficiency.
The article reports a spectral ratio between MLP and attention norms that predicts geometric stability in transformer models, with an optimal range of 0.5–2 for preventing rank collapse.
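One plausible way to measure such a ratio, reading "norms" as the per-block output norms of the two branches; the `ToyBlock` module and its `.attn`/`.mlp` names are hypothetical stand-ins for a real transformer block, not the article's definition.

```python
import torch

class ToyBlock(torch.nn.Module):
    """Stand-in transformer block with separate attention and MLP branches."""
    def __init__(self, d=64, n_heads=4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )

@torch.no_grad()
def mlp_attn_norm_ratio(block, x):
    attn_out, _ = block.attn(x, x, x)
    mlp_out = block.mlp(x)
    return (mlp_out.norm() / attn_out.norm()).item()

x = torch.randn(1, 16, 64)                     # (batch, seq, d_model)
ratio = mlp_attn_norm_ratio(ToyBlock(), x)
print(f"MLP/attention output-norm ratio: {ratio:.2f}")   # article's proposed stable band: 0.5-2
```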
This paper proposes distributional spectral diagnostics to localize grokking transitions in Transformer models before test accuracy rises. It uses empirical distributions and Hankel dynamic mode decomposition to create a monitoring signal that discriminates between grokking and non-grokking runs.
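A generic sketch of Hankel dynamic mode decomposition applied to a scalar training signal, to illustrate the diagnostic idea; the delay, rank, and synthetic signal are arbitrary choices, not the paper's pipeline.

```python
import numpy as np

def hankel_dmd_eigs(series, delay=10, rank=5):
    """DMD on a delay-embedded (Hankel) matrix of a scalar training signal.
    Eigenvalues near |lambda| = 1 indicate persistent slow dynamics."""
    H = np.stack([series[i:i + len(series) - delay] for i in range(delay)])  # (delay, T-delay)
    X, Y = H[:, :-1], H[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    A_tilde = U.T @ Y @ Vh.T @ np.diag(1.0 / s)      # reduced linear evolution operator
    return np.linalg.eigvals(A_tilde)

signal = np.sin(np.linspace(0, 20, 400)) + 0.1 * np.random.randn(400)  # stand-in metric trace
print(np.abs(hankel_dmd_eigs(signal)))
```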
This paper introduces Attractor Models, which use fixed-point solving and implicit differentiation for efficient iterative refinement, achieving superior language modeling and reasoning performance with reduced computational costs compared to traditional transformers.
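A simplified sketch of the fixed-point-refinement pattern: iterate a small update function to convergence without storing the graph, then backpropagate through only the final step as a cheap stand-in for full implicit differentiation. This is a generic illustration, not the Attractor Model architecture itself.

```python
import torch

class FixedPointLayer(torch.nn.Module):
    """Iterate z <- f(z, x) toward a fixed point; only the last step is differentiable."""
    def __init__(self, d=64, n_iters=30):
        super().__init__()
        self.f = torch.nn.Sequential(torch.nn.Linear(2 * d, d), torch.nn.Tanh())
        self.n_iters = n_iters

    def forward(self, x):
        z = torch.zeros_like(x)
        with torch.no_grad():                          # refinement loop, no graph stored
            for _ in range(self.n_iters):
                z = self.f(torch.cat([z, x], dim=-1))
        z = self.f(torch.cat([z, x], dim=-1))          # one differentiable step at the fixed point
        return z

layer = FixedPointLayer()
out = layer(torch.randn(4, 64))
out.sum().backward()                                   # gradients flow through the final step only
```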
This paper introduces a theoretical framework for geometric factual recall in transformers, demonstrating that embeddings can encode relational structure via linear superpositions while MLPs act as selectors. It provides empirical and theoretical evidence that this mechanism allows for efficient memorization of facts and multi-hop queries.
This paper demonstrates that mean-pooled cosine similarity is not length-invariant under anisotropic representations, showing it artificially inflates similarity with sequence length. It argues for using Centered Kernel Alignment (CKA) as a default metric to correct biases in cross-lingual and cross-representation analysis.
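For reference, linear CKA can be computed in a few lines; the sketch below assumes two (samples × features) activation matrices and checks the metric's invariance to rotation and rescaling.

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two (n_samples, d) representations."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

a = torch.randn(100, 64)
Q, _ = torch.linalg.qr(torch.randn(64, 64))
b = 2.0 * a @ Q                                   # rotated and rescaled copy of the same representation
print(f"CKA(a, b) = {linear_cka(a, b):.3f}")      # close to 1: invariant to rotation and scale
```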
The paper introduces FocuSFT, a bilevel optimization framework that enhances long-context language model performance by addressing attention dilution through parametric memory. It demonstrates significant improvements in accuracy and context engagement on benchmarks like BABILong and RULER.
The paper introduces Mela, a memory-augmented transformer architecture inspired by human memory consolidation, featuring a Hierarchical Memory Module that improves long-context language modeling performance.
An educational blog post by Amit Shekhar explaining the mathematical mechanics of the Attention mechanism, specifically detailing Query, Key, and Value matrices with a step-by-step numeric example.
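In the same spirit as the post's walk-through, a tiny numeric example of single-head scaled dot-product attention with three tokens and toy projection matrices (values chosen arbitrarily for illustration):

```python
import numpy as np

# Three token embeddings and toy projection weights (d_k = 2).
X  = np.array([[1., 0.], [0., 1.], [1., 1.]])
Wq = np.array([[1., 0.], [0., 1.]])
Wk = np.array([[0., 1.], [1., 0.]])
Wv = np.array([[1., 1.], [0., 1.]])

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores  = Q @ K.T / np.sqrt(K.shape[-1])                                 # scaled dot-product scores
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)    # row-wise softmax
output  = weights @ V                                                    # attention-weighted mix of values
print(weights.round(2))
print(output.round(2))
```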
The author expresses frustration with the industry's reliance on prompt engineering and scaling to fix logical reasoning deficits in transformer-based LLMs, arguing that these probabilistic models fundamentally lack the architecture for deterministic logic.
This project is a systematic deep learning notes repository covering PyTorch, Transformers, generative models, and more. It aims to address the fragmentation of learning materials and provides code implementations along with practical guides.