llm-training

#llm-training

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Hugging Face Daily Papers · 3d ago

CLI-Universe is a synthesis engine that generates verifiable terminal-agent tasks via a multi-dimensional capability taxonomy and evidence-guided research, producing a distilled dataset of 6,000 trajectories. Fine-tuning Qwen3-32B on this dataset achieves 33.4% on Terminal-Bench 2.0, setting a new state of the art for open-source models at or below 32B parameters.

@didier_lopes: Incredible how Z. ai literally has their RL infrastructure open source. The entire OPD post-training of GLM-5.2 took on…

X AI KOLs Following · 6d ago

Z.ai has open-sourced its RL infrastructure, the slime framework, which enabled efficient OPD post-training of GLM-5.2 in about two days. slime is an LLM post-training framework for RL scaling that integrates Megatron and SGLang, and has been battle-tested by frontier models like GLM, Qwen, DeepSeek, and Llama.

@rohanpaul_ai: This was long needed for AI in finance. Making SEC filings readable for machines without flattening the accounting logi…

X AI KOLs Following · 2026-06-17

Researchers from Stanford, UC, and Nanjing University release SEFD, a dataset of 152B tokens from SEC filings converted to layout-faithful MultiMarkdown, preserving table structure for LLM training with minimal overlap with Common Crawl.

@akshay_pachaar: Train your own LLM from scratch. This repo builds a GPT-style transformer from the ground up, without using any high-le…

X AI KOLs Following · 2026-06-15

A repository that builds a GPT-style transformer from scratch without high-level libraries, covering everything from data preprocessing to generation, and includes guides for SFT and RLHF.

The Culture Funnel: You Can't Align What isn't in the Data

arXiv cs.CL · 2026-06-15

This paper introduces the 'culture funnel' concept, demonstrating that cultural signals in LLM training data sharply decline during post-training stages. The authors release a 5.6M-sample tagged dataset to help preserve cultural grounding in model alignment.

@ActuallyIsaak: Here is a real-life run, end-to-end from training to using the trained LLM in LM Studio by @lmstudio MLX-LoRA-Studio gi…

X AI KOLs Following · 2026-06-14

MLX-LoRA-Studio is a native macOS app for fine-tuning LLMs on Apple Silicon, offering a user-friendly interface and support for various training algorithms including SFT, DPO, and QAT. It is fully open-source and allows local, private fine-tuning without cloud dependency.

@PierceZhang34: Train a Small Model in 10 Seconds! First Look at the LLM Training Tool: http://llm.istanbul Recently discovered a super fun open-source style tool website — http://llm.istanbul, which claims to be a WebGPU LLM Workbench, meaning it fully...

X AI KOLs Timeline · 2026-06-12

Introduces llm.istanbul, a WebGPU LLM workbench that lets you train small models, train tokenizers, and generate text entirely in the browser, no server required, fully local.

Improving Cross-Format Robustness in Language Models with Multi-Format Training

arXiv cs.CL · 2026-06-11

This paper introduces FormatMix, a multi-format training approach that improves LLM consistency across different answer formats by expanding a subset of training items into multiple equivalent formats, showing that format diversity is key to robustness.
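The expansion step in this summary can be pictured with a minimal sketch. It assumes a hypothetical item layout (`question`, `choices`, `answer_index`), three illustrative answer formats, and an `expand_fraction` parameter; none of these names or values come from the paper.

```python
import random

LETTERS = "ABCDEFGH"

def expand_formats(item):
    """Render one multiple-choice item in several equivalent answer formats.

    The item layout and the three formats are illustrative assumptions.
    """
    choices = item["choices"]
    gold = item["answer_index"]
    listing = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return [
        # Letter format: the target is the option letter.
        {"prompt": f"{item['question']}\n{listing}\nAnswer with a letter.",
         "target": LETTERS[gold]},
        # Free-text format: the target is the option text itself.
        {"prompt": f"{item['question']}\nAnswer in a few words.",
         "target": choices[gold]},
        # Verification format: the target confirms a proposed answer.
        {"prompt": f"{item['question']}\nProposed answer: {choices[gold]}\nTrue or false?",
         "target": "True"},
    ]

def build_mixture(items, expand_fraction=0.3, seed=0):
    """Expand a random subset of items into all formats; keep the rest as-is."""
    rng = random.Random(seed)
    mixture = []
    for item in items:
        variants = expand_formats(item)
        if rng.random() < expand_fraction:
            mixture.extend(variants)      # item appears in every format
        else:
            mixture.append(variants[0])   # item appears in letter format only
    return mixture
```

Only a subset of items is expanded, which matches the summary's description; how the paper actually selects that subset is not stated here.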

@akshay_pachaar: https://x.com/akshay_pachaar/status/2064700531600458093

X AI KOLs Following · 2026-06-10

This article explains how to use GRPO to fine-tune an LLM (Qwen3-8B) for reliable JSON structured output, improving schema accuracy from 62% to 82%, surpassing GPT-4.1's 58%.
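To make the idea concrete, here is a minimal sketch of the two pieces a GRPO setup for structured output typically needs: a verifiable reward that checks a completion against a schema, and group-relative advantages computed within one prompt's sampled completions. The `SCHEMA` dictionary, the binary reward, and the function names are assumptions for illustration, not the article's actual code.

```python
import json
from statistics import mean, pstdev

# Hypothetical schema: required field name -> expected Python type.
SCHEMA = {"name": str, "age": int}

def schema_reward(completion, schema=SCHEMA):
    """Return 1.0 if the completion parses as a JSON object matching the schema, else 0.0."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    # Every required field must be present with the expected type.
    ok = all(isinstance(obj.get(key), typ) for key, typ in schema.items())
    return 1.0 if ok else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Usage: score a group of completions sampled for the same prompt.
group = ['{"name": "Ada", "age": 36}', '{"name": "Ada"}', "not json", "[1, 2]"]
rewards = [schema_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
```

Completions that satisfy the schema receive a positive advantage relative to their group, and the rest receive a negative one; if every completion in a group scores the same, all advantages are zero and that group contributes no learning signal.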

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Hugging Face Daily Papers · 2026-06-10

The paper introduces RACES, a recursive automated composition framework that treats verifiable environments as composable building blocks to scale reinforcement learning for LLMs, enabling efficient reasoning generalization through compositional operators.

@neural_avb: If you think about it, LLM training in 2026 is really a 3-step loop : - train it on some data - dogfood it/run categori…

X AI KOLs Timeline · 2026-06-08

The tweet outlines a 3-step loop for LLM training in 2026: train on data, run evals, and add synthetic data for underperforming tasks. It emphasizes the accessibility of legal distillation via open source models and cheap APIs, noting that training on reasoning traces alone can achieve high scores.
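The loop described above can be sketched as a small driver function. The callables `train`, `evaluate`, and `synthesize`, along with the `threshold` and `max_rounds` parameters, are hypothetical placeholders; the tweet does not specify any of them.

```python
def improvement_loop(data, train, evaluate, synthesize, threshold=0.8, max_rounds=3):
    """Train, evaluate per task, then add synthetic data for weak tasks.

    train(data) -> model
    evaluate(model) -> dict mapping task name to a score in [0, 1]
    synthesize(tasks) -> list of new training examples for those tasks
    All three callables are assumptions supplied by the caller.
    """
    model = None
    for _ in range(max_rounds):
        model = train(data)                        # step 1: train on current data
        scores = evaluate(model)                   # step 2: run evals
        weak = [task for task, s in scores.items() if s < threshold]
        if not weak:
            break                                  # every task meets the bar
        data = data + synthesize(weak)             # step 3: add synthetic data
    return model, data
```

The loop stops early once no task falls below the threshold, so the synthetic-data step only runs while there is a measured weakness to target.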

@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587

X AI KOLs Timeline · 2026-06-07

The author shares learnings from training a 160M parameter LLM from scratch, experimenting with architectures like multi-token prediction and hierarchical reasoning models. They emphasize the importance of fast iteration, simplifying ideas, and understanding why architectures work.

Learned Subspace Compression for Communication-Efficient Pipeline Parallelism

arXiv cs.LG · 2026-06-05

This paper introduces MAPL, a method for learned orthogonal compression of activations in pipeline parallelism, reducing communication overhead while maintaining performance via Stiefel manifold constraints and per-stage factorized anchor embeddings.

On-policy distillation: one of the hottest terms on PapersWithCode [R]

Reddit r/MachineLearning · 2026-06-04

Hugging Face's Niels introduces On-policy Distillation (OPD), a key post-training technique used in models like Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4, now featured on PapersWithCode with a linked whiteboard explanation by Sasha Rush and Dwarkesh Patel.
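For readers new to the term, one common formulation of on-policy distillation scores tokens sampled by the student with a per-token reverse-KL signal against the teacher. The sketch below illustrates that formulation only; the post itself does not spell out a loss, and the function names are assumptions.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """Exact KL(student || teacher) over one next-token distribution.

    Assumes both lists cover the same vocabulary and the teacher assigns
    nonzero probability wherever the student does.
    """
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )

def opd_token_rewards(student_logprobs, teacher_logprobs):
    """Per-token reward on student-sampled tokens: teacher log-prob minus student log-prob.

    Averaged over student samples, the negative of this quantity estimates
    the reverse KL, so maximizing the reward pulls the student toward the teacher.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]
```

Because the tokens come from the student's own samples, the signal is dense (one value per token) and is computed on the states the student actually visits, which is the "on-policy" part of the name.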

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

arXiv cs.LG · 2026-06-04

Harvard researchers challenge the standard LLM training pipeline by showing that RL can be applied effectively during pre-training rather than only after SFT. They find that data composition matters more than model scale, and they propose parallel averaging of RL and SFT objectives, which outperforms sequential approaches while preserving general capabilities.

Self-Distilled Policy Gradient

arXiv cs.LG · 2026-06-04

SDPG (Self-Distilled Policy Gradient) is a new RL training framework for LLMs that combines group-relative verifier advantages with on-policy self-distillation and KL regularization to address sparse rewards and instability in RLVR training. The method uses a shared model as both student and teacher by conditioning on privileged context, showing improved stability and performance over RLVR and self-distillation baselines.

Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

arXiv cs.CL · 2026-06-04

This paper proposes a difficulty-aware SFT-then-RL framework for training small language models (≤3B parameters) on reasoning tasks, arguing that data difficulty should be strategically aligned with the distinct roles of SFT (learning new skills) and RL (consolidating partial skills). The authors introduce a Bridge mechanism for hard SFT samples and Critique Fine-Tuning for RL failures, showing consistent improvements across five reasoning benchmarks.

@yvbbrjdr: I recommend everyone to read the MAI-Thinking-1 technical paper. It contains detailed (almost all) information on how to train a SOTA LLM. https://microsoft.ai/wp-content/uploads/2026/06/ma…

X AI KOLs Timeline · 2026-06-02

Recommended reading: the MAI-Thinking-1 technical paper, which details almost all the steps to train a SOTA large language model.

I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size

Reddit r/LocalLLaMA · 2026-06-02

The author trained a 75M-parameter LLM called KeyLM from scratch on 18B tokens, achieving competitive instruction-following scores against larger models while using fewer parameters and less data.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Hugging Face Daily Papers · 2026-05-31

OmniOPD introduces a logit-free on-policy distillation method that uses chunk-level semantic similarity and speculative verification to train student models with black-box teachers, achieving up to +28.64% improvement on math benchmarks over standard OPD.
