Tag
The article argues that open-weight AI models are catching up to closed ones not through distillation but through the modularisation of the AI stack. Stable interfaces (the Transformer architecture, OpenAI-compatible APIs, agentic harnesses) let innovations diffuse rapidly across the ecosystem, shrinking the capability gap while open-weight models keep a massive price advantage, which could lead to the commoditisation of frontier AI.
A comprehensive 15-part series covering LLM internals from tokenization to serving, grounded in Gemma 4 12B's actual config.
OneRank proposes a Transformer-native multi-task ranking framework that integrates feature encoding and prediction to reduce inter-task interference and improve ranking performance in recommender systems.
This article provides a visual guide to the Transformer architecture in Large Language Models, covering self-attention, causal self-attention, masked multi-head attention, and the output layer with step-by-step explanations and examples.
This paper introduces a Jacobian-PCA-Grassmann framework to analyze the geometric structure of expert specialization in Mixture-of-Experts (MoE) Transformers. It finds that experts exhibit strong functional decorrelation while their representations overlap, and that routing sparsity significantly influences this geometry.
This paper demonstrates that cosine similarity is a poor proxy for assessing layer importance in LLMs, and proposes using the actual accuracy drop from layer removal as a more robust metric.
This paper presents a theoretical framework interpreting Transformer components (attention, residual connections, normalization) as arising from a spherical state estimation problem using Radial-Tangential SDEs.
This article explains the concept of KV Cache in Large Language Models, detailing how it optimizes text generation by storing and reusing key-value pairs to avoid redundant computations during inference.
DeepSeek has published a paper introducing mHC (Manifold-Constrained Hyper-Connections), a fundamental rewrite of the Transformer architecture that stabilizes large models by replacing standard residual connections with mathematically constrained multi-stream pathways.
A 40-minute walkthrough explains the complete Transformer architecture via whiteboard diagrams and demonstrates a practical implementation in C using Vim.
A 22-year-old founder claims, via an open-source project, to have cracked open the architecture black box of Anthropic’s Claude Mythos model, and speculates that it uses a recurrent-depth Transformer design rather than simply scaling up parameters.
ResBM introduces a transformer-based architecture with residual encoder-decoder bottlenecks for pipeline-parallel training, achieving 128× activation compression while maintaining convergence. The work advances decentralized, internet-grade distributed training by reducing inter-stage communication overhead.
A research paper introducing the Three-Phase Transformer (3PT), which applies Tesla's polyphase geometry to transformer architectures by organizing the residual stream into three phases offset by 120°. The approach achieves a 7.2% perplexity improvement on WikiText-103 with minimal parameter overhead (0.00124%) and a 1.93× convergence speedup.
Motif-Video 2B is a 2B-parameter text-to-video generation model that scores 83.76% on VBench, surpassing Wan2.1 14B with 7x fewer parameters. It was trained on fewer than 10M clips using less than 100,000 H200 GPU hours. The model uses a specialized architecture with shared cross-attention and a three-part backbone that separates prompt alignment, temporal consistency, and detail refinement.