Tag
The Marin team pre-registered a predicted loss of 2.252 for a 129B parameter MoE model training run, and the actual result landed at 2.234, demonstrating accurate loss prediction before training.
Magma is an open-source repository from Microsoft Research for building multimodal AI agents that integrate vision, language, and action, providing model links, inference examples, training instructions, and demos.
A thread explaining why understanding number formats in memory is crucial for learning LLM quantization, covering gradient NaN debugging, numerical stability, and quantization distortion.
SupraLabs released Supra-50M, a compact 50M-parameter causal language model with base and instruct versions, trained on 20B tokens from fineweb-edu, achieving competitive benchmarks against larger models like GPT-2 and SmolLM.
Introduces CODA, a GPU kernel abstraction that expresses Transformer operations as GEMM-plus-epilogue programs to reduce data movement, covering nearly all non-attention computation in a Transformer block.
The paper proposes High-Entropy Sum (HES), a training-free metric for selecting high-quality reasoning data for LLM training, validated across SFT, RFT, and RL paradigms.
ACC converts multi-turn agent trajectories into long-context QA pairs to train LLMs on long-range reasoning without additional annotation, achieving significant gains on MRCR and GraphWalks benchmarks while preserving general capabilities.
A tweet suggests that scaling the embedding learning rate by model width can replace the need for µP (micro-parameterization), referencing Muon optimizer for hidden layers and Adam for the rest.
Modal announces that AppliedCompute is using its platform to train custom agent workforces for companies like DoorDash, Mercor, and Cognition, highlighting the shift from frontier models to specialized models.
A Stanford class on Human-Centered LLMs releases a 60+ page report covering design, data sourcing, training, evaluation, and deployment for developing AI that humans can meaningfully work with.
TideGS introduces an out-of-core training framework that enables 3D Gaussian Splatting with over one billion primitives on a single GPU by managing parameters across SSD-CPU-GPU hierarchy via block-virtualization, asynchronous pipeline, and differential streaming techniques.
This paper proposes φ-balancing, a principled framework for load balancing in Mixture-of-Experts models that directly targets population-level expert balance using convex duality and mirror descent, achieving more stable expert utilization and outperforming prior methods on reasoning and code generation benchmarks.
A personal project led to an ACL 2026 paper introducing TIME, a method training Qwen3 models to engage in short, context-triggered thinking rather than excessive reasoning. The work uses QLoRA and a four-phase curriculum, with all data and code released open-source.
This paper introduces proxy metrics based on token-level statistics from expert-written solutions to forecast downstream LLM performance, significantly outperforming loss-based methods in model selection, pretraining data selection, and training-time forecasting.
Researchers introduce symmetry-compatible optimizers that respect the equivariance structures of neural network parameters, improving training stability and performance over traditional methods like Adam. The approach is validated on various language model architectures including Qwen3-0.6B, Gemma 3 1B, and OLMoE-1B-7B.
The article analyzes the economics of AI, focusing on the war for GPU resources, contrasts human inference spikes with agentic continuous workloads, and argues that current infrastructure is optimized for human usage, not agentic inference, which is more demanding.
A Twitter thread reacts to Anthropic's 2-hour training video on building Claude agents, highlighting the 'Skills' feature as a way to persist workflow and expertise, and lamenting previous manual repetition.
Anthropic released a comprehensive 2-hour training on building Claude agents, hosted by the engineer behind Claude Code, covering agent structuring, terminal access, memory management, and hallucination prevention.
dLLM is an open-source Python library that allows converting any autoregressive language model into a diffusion language model with minimal compute, unifying training and evaluation.
This paper introduces DynMuon, a dynamic spectral shaping optimizer that schedules the update parameter p from positive to mildly negative during training, consistently achieving lower validation loss and requiring 10.6-26.5% fewer steps than the standard Muon optimizer.