ExpRL: Exploratory RL for LLM Mid-Training

Hugging Face Daily Papers 06/15/26, 12:00 AM Papers

Summary

ExpRL is a new RL-based mid-training method that uses human-written reference solutions as dense reward scaffolds (never shown to the policy) to improve LLM reasoning, achieving significant gains on hard math benchmarks like AIME-2026.

Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, or self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it remains unclear whether such primitive coverage is enough for much harder problems, which require combining these skills into broader solution strategies. We study a more automated approach: RL-based mid-training using large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, our method, ExpRL, uses them as reward scaffolds: references are hidden from the policy and used only to construct problem-specific grading rubrics for judging on-policy reasoning traces. The policy samples from the original problem prompt, while an LLM judge compares the sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments further suggest that ExpRL can extend beyond the original math-only setting.
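The reward flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `Example`, `build_rubric`, `judge_score`, `process_rewards`, and `rollout` are hypothetical names, and the substring-matching "judge" is a toy stand-in for the LLM judge the abstract describes. What it does show is the information asymmetry: the policy receives only the prompt, while the reference is used solely on the grading side.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    prompt: str     # the only field the policy ever sees
    reference: str  # hidden from the policy; used only to build the rubric


def build_rubric(reference: str) -> List[str]:
    """Toy rubric (assumption): one item per non-empty line of the reference."""
    return [line.strip().lower() for line in reference.splitlines() if line.strip()]


def judge_score(trace: str, rubric: List[str]) -> float:
    """Toy stand-in for the LLM judge: fraction of rubric items found in the trace."""
    if not rubric:
        return 0.0
    text = trace.lower()
    return sum(item in text for item in rubric) / len(rubric)


def process_rewards(steps: List[str], rubric: List[str]) -> List[float]:
    """Process-level variant (assumption): score every prefix of the trace."""
    return [judge_score("\n".join(steps[:i]), rubric) for i in range(1, len(steps) + 1)]


def rollout(example: Example, policy: Callable[[str], str]) -> Tuple[str, float]:
    """Sample on-policy from the prompt alone, then grade against the hidden reference."""
    trace = policy(example.prompt)            # the reference is never passed to the policy
    rubric = build_rubric(example.reference)  # the reference is used only for grading
    return trace, judge_score(trace, rubric)  # outcome-level dense reward in [0, 1]
```

With a reference whose two lines are "factor the quadratic" and "x = 2 or x = 3", a trace that factors correctly but states the wrong roots would score 0.5 here, whereas a final-answer check would give it 0.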

Original Article

Cached at: 06/16/26, 03:32 PM

Paper page - ExpRL: Exploratory RL for LLM Mid-Training

Source: https://huggingface.co/papers/2606.17024 ExpRL: Exploratory RL for LLM Mid-Training

What if reference solutions could prime a model for RL without ever being shown to the policy?

We introduce ExpRL, an RL-based mid-training method that uses human-written reference solutions as dense reward scaffolds rather than imitation targets. The reference stays hidden from the policy: an LLM judge scores the model’s own on-policy reasoning against it, rewarding partial progress and useful intermediate steps that sparse final-answer rewards never upweight. The trick is a simple asymmetry: models verify progress against a reference far better than they generate full solutions from scratch.
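One way to see why sparse rewards can fail to upweight partial progress is to look at group-relative advantages of the kind used in GRPO-style training. The sketch below is an assumption-laden illustration, not the paper's training code: `group_advantages` is a hypothetical helper, and the reward values are made up. When every sampled trace in a group misses the final answer, all sparse rewards are zero and so are all advantages; dense judge scores for the same traces still rank them.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four traces for one hard problem; none reaches the correct final answer.
sparse_rewards = [0.0, 0.0, 0.0, 0.0]  # final-answer check only
dense_rewards = [0.1, 0.6, 0.3, 0.0]   # made-up judge scores for partial progress

sparse_adv = group_advantages(sparse_rewards)  # all zeros: no learning signal
dense_adv = group_advantages(dense_rewards)    # trace 1 gets the largest advantage
```

The design point is that the dense scores do not need to be perfectly calibrated; they only need to order traces within a group well enough to produce a useful gradient direction.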

Two variants: ExpRL-Outcome (dense reward on the full trace) and ExpRL-Process (dense rewards on prefixes for finer credit assignment). Both expand the model’s coverage over solution strategies (pass@k) before sparse-reward RL even starts.
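For readers unfamiliar with pass@k as a coverage measure: it is the probability that at least one of k sampled solutions is correct. The source does not say which estimator the authors use; the sketch below assumes the unbiased estimator commonly used since Chen et al. (2021), computed from n samples of which c are correct. The function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn for a problem
    c: number of those samples that are correct
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A higher pass@k at large k before sparse-reward RL means that correct strategies are already present somewhere in the model's sample distribution, which is the kind of coverage the post says both variants expand.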

On hard math reasoning (4B Qwen3-Instruct policy), ExpRL beats SFT, sparse GRPO, and self-distillation as RL priming and gives a stronger initialization for downstream RL: ExpRL-Process reaches ~63% on AIME-2026 vs 58.75% for the strongest baseline. It also increases verification, self-correction, and backtracking, and extends to mixed-domain settings (math / science / coding) and to a small 4B judge priming a larger 8B policy.

Happy to discuss — feedback welcome!

Similar Articles

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
