training

#training

@percyliang: Not only do we want to train a good model, we want to know it'll be good before we even start training. About a month a…

X AI KOLs Following · 2026-05-24

The Marin team pre-registered a predicted loss of 2.252 for a 129B-parameter MoE training run; the actual result was 2.234, 0.018 below the prediction, showing that loss can be forecast accurately before training starts.
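
As a quick check on how close that forecast was, the gap can be worked out from the two numbers quoted above. This is only arithmetic on the reported figures; the variable names are illustrative and nothing here reflects how the Marin team produced its prediction.

```python
# Figures taken from the summary above; variable names are illustrative.
predicted_loss = 2.252
actual_loss = 2.234

# Positive value means the run finished below (better than) the prediction.
abs_error = predicted_loss - actual_loss
rel_error = abs_error / actual_loss

print(f"absolute error: {abs_error:.3f}")  # absolute error: 0.018
print(f"relative error: {rel_error:.2%}")  # relative error: 0.81%
```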

@DanKornas: Most AI agents still split vision, language, and action across separate systems. Magma is a Microsoft Research foundati…

X AI KOLs Timeline · 2026-05-23

Magma is an open-source repository from Microsoft Research for building multimodal AI agents that integrate vision, language, and action, providing model links, inference examples, training instructions, and demos.

@jino_rohit: before you start learning quantization for llms, you need to understand how different number formats are represented in…

X AI KOLs Timeline · 2026-05-23

A thread explaining why understanding number formats in memory is crucial for learning LLM quantization, covering gradient NaN debugging, numerical stability, and quantization distortion.
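
To make the point about number formats concrete, here is a minimal sketch using only Python's standard `struct` module. It is not taken from the thread; the helper names `float32_fields` and `roundtrip_fp16` are hypothetical. It splits a 32-bit float into its sign, exponent, and mantissa fields, shows the rounding error introduced by storing a value in 16 bits, and shows one common way a NaN arises.

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Split x, stored as IEEE 754 binary32, into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF      # 23 fraction bits
    return sign, exponent, mantissa

def roundtrip_fp16(x: float) -> float:
    """Store x as IEEE 754 binary16 and read it back."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

print(float32_fields(1.0))   # (0, 127, 0)
print(float32_fields(-2.0))  # (1, 128, 0)

# 0.1 has no exact binary representation, so 16-bit storage distorts it.
print(roundtrip_fp16(0.1) - 0.1)

# Once a value overflows to infinity, further arithmetic can produce NaN.
overflowed = float("inf")
print(overflowed - overflowed)  # nan
```

The same kind of rounding error, at coarser bit widths, is what shows up as quantization distortion.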

[NEW] Supra-50M Released!

Reddit r/LocalLLaMA · 2026-05-22

SupraLabs released Supra-50M, a compact 50M-parameter causal language model in base and instruct versions. It was trained on 20B tokens from FineWeb-Edu and is reported to be competitive on benchmarks with larger models such as GPT-2 and SmolLM.

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Hacker News Top · 2026-05-22

Introduces CODA, a GPU kernel abstraction that expresses Transformer operations as GEMM-plus-epilogue programs to reduce data movement, covering nearly all non-attention computation in a Transformer block.

Unified Data Selection for LLM Reasoning

arXiv cs.CL · 2026-05-22

The paper proposes High-Entropy Sum (HES), a training-free metric for selecting high-quality reasoning data for LLM training, validated across SFT, RFT, and RL paradigms.

ACC: Compiling Agent Trajectories for Long-Context Training

arXiv cs.CL · 2026-05-22

ACC converts multi-turn agent trajectories into long-context QA pairs to train LLMs on long-range reasoning without additional annotation, achieving significant gains on MRCR and GraphWalks benchmarks while preserving general capabilities.

@maximelabonne: Turns out you never really needed µP, you just needed to scale the embedding learning rate by model width I'm no nanoGP…

X AI KOLs Following · 2026-05-21

A tweet suggests that scaling the embedding learning rate by model width can replace µP (Maximal Update Parametrization), referencing the Muon optimizer for hidden layers and Adam for the remaining parameters.

@modal: Frontier models set the floor. Specialized models raise the ceiling. With Modal, @AppliedCompute is training custom age…

X AI KOLs Following · 2026-05-20

Modal announces that AppliedCompute is using its platform to train custom agent workforces for companies like DoorDash, Mercor, and Cognition, highlighting the shift from frontier models to specialized models.

@Diyi_Yang: The next frontier of AI is not only more capable model; it is an AI that humans can meaningfully live and work with :…

X AI KOLs Following · 2026-05-20

A Stanford class on Human-Centered LLMs releases a 60+ page report covering design, data sourcing, training, evaluation, and deployment for developing AI that humans can meaningfully work with.

TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

Hugging Face Daily Papers · 2026-05-19

TideGS introduces an out-of-core training framework that enables 3D Gaussian Splatting with over one billion primitives on a single GPU. It manages parameters across an SSD-CPU-GPU memory hierarchy using block virtualization, an asynchronous pipeline, and differential streaming.

$\phi$-Balancing for Mixture-of-Experts Training

arXiv cs.LG · 2026-05-18

This paper proposes φ-balancing, a principled framework for load balancing in Mixture-of-Experts models that directly targets population-level expert balance using convex duality and mirror descent, achieving more stable expert utilization and outperforming prior methods on reasoning and code generation benchmarks.
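
For readers unfamiliar with what "expert utilization" measures, the sketch below computes the share of tokens routed to each expert and a simple imbalance ratio. It is not the paper's φ-balancing method; the function names, the imbalance measure, and the toy routing list are all assumptions made for illustration.

```python
from collections import Counter

def expert_load(assignments: list[int], num_experts: int) -> list[float]:
    """Fraction of tokens routed to each expert (hypothetical helper)."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def imbalance(load: list[float]) -> float:
    """Busiest expert's share relative to a uniform split; 1.0 is balanced."""
    return max(load) * len(load)

# Toy routing decisions for 8 tokens over 4 experts (made-up data).
routing = [0, 0, 0, 1, 2, 0, 3, 0]
load = expert_load(routing, num_experts=4)

print(load)             # [0.625, 0.125, 0.125, 0.125]
print(imbalance(load))  # 2.5
```

A load-balancing method aims to push this ratio toward 1.0 without hurting model quality.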

I trained TIME: short context-triggered thinking on Qwen model instead of overthinking

Reddit r/LocalLLaMA · 2026-05-18

A personal project led to an ACL 2026 paper introducing TIME, a method that trains Qwen3 models to use short, context-triggered thinking instead of excessive reasoning. The work uses QLoRA and a four-phase curriculum, and all data and code are released as open source.

Forecasting Downstream Performance of LLMs With Proxy Metrics

Hugging Face Daily Papers · 2026-05-18

This paper introduces proxy metrics based on token-level statistics from expert-written solutions to forecast downstream LLM performance, significantly outperforming loss-based methods in model selection, pretraining data selection, and training-time forecasting.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Hugging Face Daily Papers · 2026-05-18

Researchers introduce symmetry-compatible optimizers that respect the equivariance structures of neural network parameters, improving training stability and performance over traditional methods like Adam. The approach is validated on various language model architectures including Qwen3-0.6B, Gemma 3 1B, and OLMoE-1B-7B.

AI economics part 2 (11 minute read)

TLDR AI · 2026-05-18

The article analyzes the economics of AI, focusing on the competition for GPU resources. It contrasts spiky, human-driven inference with continuous agentic workloads and argues that current infrastructure is optimized for human usage rather than for the more demanding agentic inference.

@KaitoEtLIA: - I use Claude every day - I think I'm pretty good at it - I watch two Anthropic engineers for 2 HOURS - Claude's engin…

X AI KOLs Timeline · 2026-05-16

A thread on X reacts to Anthropic's 2-hour training video on building Claude agents, highlighting the 'Skills' feature as a way to persist workflows and expertise instead of repeating them manually.

@Jouhatsu_ai: Anthropic has released a complete 2-HOUR TRAINING on building Claude agents. Hosted by the engineer who builds Claude C…

X AI KOLs Timeline · 2026-05-16

Anthropic released a comprehensive 2-hour training on building Claude agents, hosted by the engineer behind Claude Code, covering agent structuring, terminal access, memory management, and hallucination prevention.

@DailyDoseOfDS_: Turn any Autoregressive LLM into a Diffusion LM. dLLM is a Python library that unifies the training & evaluation of dif…

X AI KOLs Timeline · 2026-05-16

dLLM is an open-source Python library that converts any autoregressive language model into a diffusion language model with minimal compute and unifies training and evaluation.

DynMuon: A Dynamic Spectral Shaping View of Muon

Hugging Face Daily Papers · 2026-05-16

This paper introduces DynMuon, a dynamic spectral shaping optimizer that schedules the update parameter p from positive to mildly negative during training. It consistently achieves lower validation loss and requires 10.6% to 26.5% fewer steps than the standard Muon optimizer.
