llm-training

Notes on pretraining parallelisms and failed training runs (12 minute read)

TLDR AI ↗ · 2026-05-18 Cached

A technical deep-dive into common causes of failed pretraining runs in large language models, including causality-breaking issues in expert routing and numerical precision bugs, with examples from Llama 4, Gemini 2 Pro, and GPT-4.

@tom_doerr: Trains billion-parameter LLMs from scratch on a single GPU https://github.com/FareedKhan-dev/train-llm-from-scratch…

X AI KOLs Timeline ↗ · 2026-05-17 Cached

A GitHub repository provides scripts to train billion-parameter language models from scratch on a single GPU using PyTorch, based on the Transformer architecture.

@LakshyAAAgrawal: Learning from rich textual feedback (errors, traces, partial reasoning) beats scalar reward alone for LLM optimization.…

X AI KOLs Following ↗ · 2026-05-13

Fast-Slow Training (FST) interleaves context optimization (via GEPA) with model weight updates via RL, achieving 3× sample efficiency over RL alone on math, code, and physics reasoning while preserving plasticity and enabling continual learning.

Rotation-Preserving Supervised Fine-Tuning

arXiv cs.LG ↗ · 2026-05-13 Cached

This paper introduces Rotation-Preserving Supervised Fine-Tuning (RPSFT), a method that improves out-of-domain generalization by preserving projected rotations in pretrained singular subspaces during fine-tuning.

Learning Agentic Policy from Action Guidance

arXiv cs.CL ↗ · 2026-05-13 Cached

The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.

YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning

arXiv cs.CL ↗ · 2026-05-13 Cached

This paper introduces YFPO, a neuron-guided preference optimization framework that uses internal activation signals to improve mathematical reasoning in large language models.

Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

arXiv cs.CL ↗ · 2026-05-13 Cached

This paper proposes LayerTracer, an interpretable framework for layer allocation in continued pre-training, and reports that freezing deep layers while training shallow ones outperforms full-parameter fine-tuning. It offers a low-cost, actionable strategy for resource-constrained teams optimizing large language models.

@songhan_mit: Explore lightening OPD for efficient LLM post training:

X AI KOLs Following ↗ · 2026-05-12

The post introduces a method for making OPD lighter-weight, aimed at efficient post-training of large language models.

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

arXiv cs.CL ↗ · 2026-05-12 Cached

This paper introduces On-Policy Harness Self-Distillation (OPHSD), a method that internalizes the capabilities of inference-time reasoning harnesses into the base model through self-distillation. The approach improves standalone performance on complex reasoning tasks, allowing the model to retain reasoning scaffolds without permanent external dependencies.

DataArc-SynData-Toolkit: A Unified Closed-Loop Framework for Multi-Path, Multimodal, and Multilingual Data Synthesis

arXiv cs.LG ↗ · 2026-05-12 Cached

This paper introduces DataArc-SynData-Toolkit, an open-source framework for multi-path, multimodal, and multilingual synthetic data generation. It aims to lower technical barriers and improve usability when building training data for large language models through a unified, configuration-driven pipeline.

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Hugging Face Daily Papers ↗ · 2026-05-12 Cached

This paper introduces Pion, a novel spectrum-preserving optimizer for large language model training that uses orthogonal equivalence transformations to maintain singular values during weight updates, offering stable performance comparable to standard optimizers.

@UnslothAI: We’re excited to share that Unsloth has joined the PyTorch Ecosystem! Unsloth is an open-source project that makes trai…

X AI KOLs Following ↗ · 2026-05-11 Cached

Unsloth, an open-source library for efficient LLM training and inference, has officially joined the PyTorch Ecosystem to enhance accessibility and performance. The announcement highlights new features like Unsloth Studio and optimized kernels for reduced VRAM usage.

How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

arXiv cs.LG ↗ · 2026-05-11 Cached

This paper proposes Shadow Mask Distillation (SMD) to solve the off-policy bias caused by KV cache compression during reinforcement learning post-training for large language models. It introduces a mechanism that ensures on-policy alignment and improves memory efficiency for long-context reasoning tasks.

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Hugging Face Daily Papers ↗ · 2026-05-11 Cached

This paper introduces a training-free diagnostic framework to analyze per-token distillation signals for reasoning models, revealing that guidance is more beneficial on incorrect rollouts and depends on student capacity and task context.

G-Zero: Self-Play for Open-Ended Generation from Zero Data

Hugging Face Daily Papers ↗ · 2026-05-11 Cached

This paper introduces G-Zero, a verifier-free framework that enables autonomous large language model self-improvement through co-evolutionary training using intrinsic rewards and hint-based guidance. It aims to overcome the limitations of proxy LLM judges in open-ended tasks by deriving supervision from internal distributional dynamics.

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Hugging Face Daily Papers ↗ · 2026-05-11 Cached

This paper introduces RLRT, a method that reverses teacher signals in self-distillation to reinforce successful student deviations, enhancing reasoning exploration in large language models.

SFT, RL, and On-Policy Distillation Through a Distributional Lens (19 minute read)

TLDR AI ↗ · 2026-05-11 Cached

This article analyzes post-training methods for language models through a distributional perspective, comparing how SFT, RL, and on-policy distillation reshape model distributions and impact phenomena like catastrophic forgetting.

@RohOnChain: Anthropic pays $750,000+ a year for engineers who can train LLMs to do exactly what your prompt says. Stanford broke do…

X AI KOLs Timeline ↗ · 2026-05-10 Cached

The post claims that Stanford has released a free technique for training LLMs to adhere strictly to prompts, a skill for which Anthropic reportedly pays engineers more than $750,000 a year. It urges readers to bookmark the resource before it is removed.

Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s

Hacker News Top ↗ · 2026-05-10 Cached

The author details the process of optimizing custom matrix multiplication kernels in Swift to train a large language model on Apple Silicon, aiming to outperform C implementations by using the CPU, SIMD, AMX, and GPU.

@0xLogicrw: MiniMax published a technical blog post detailing the root cause analysis for its M2 series large models' inability to output the person's name "Ma Jiaqi". Starting from a single case study, the investigation ultimately revealed a systematic degradation issue affecting nearly 5% of the entire vocabulary. The root cause was a severe disconnect in data coverage between the two training stages of the large model. In the first stage (pre-training), massive amounts of internet text were used to cre…

X AI KOLs Timeline ↗ · 2026-05-10

MiniMax published a technical blog post analyzing the systematic vocabulary degradation behind its M2 series models' inability to output specific personal names. It attributes the problem to parameter shifts caused by a disconnect in data coverage between the pre-training and post-training stages, and proposes remediation with full-scale synthetic data.
