language-model-training

Dispersion loss counteracts embedding condensation in small language models

Hacker News Top · 2026-07-03

This paper observes that token embeddings in small language models condense into a narrow cone-like subspace, a phenomenon termed embedding condensation, and proposes a dispersion loss to counteract it, improving generalization.

First Steps Toward Automated AI Research (12 minute read)

TLDR AI · 2026-06-12

Recursive releases an automated AI research system that achieves state-of-the-art results on three benchmarks: fixed-budget language model training, small-model training speed, and GPU kernel optimization. The system automates the research loop and open-sources artifacts from its runs.

@ChengleiSi: Excited to share these preliminary results on our internal autoresearch system @Recursive_SI, where we achieve SOTA on …

X AI KOLs Following · 2026-06-11

Recursive's automated AI research system achieves state-of-the-art results on NanoChat, NanoGPT Speedrun, and GPU kernel benchmarks by automating the research loop without task-specific adaptations. Artifacts from the runs are open-sourced for further inspection.

@Recursive_SI: https://x.com/Recursive_SI/status/2064980090702962699

X AI KOLs Timeline · 2026-06-11

Recursive releases early results from its automated AI research system, which achieves state-of-the-art results in fixed-budget language model training, small-model training speed, and GPU kernel optimization. The company also open-sources artifacts from the runs.

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Hugging Face Daily Papers · 2026-06-02

The S2L-PO framework uses smaller models as natural explorers to enhance policy diversity in GRPO for training large language models. It achieves faster convergence and improves accuracy on mathematical reasoning benchmarks while reducing rollout compute.

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Hugging Face Daily Papers · 2026-05-14

This paper investigates how subword tokenization affects LLM training efficiency and performance through controlled byte-level pretraining experiments. It identifies training throughput and the use of subword boundaries as linguistic priors as key factors behind the benefits of subword tokenization.

