ExpRL: Exploratory RL for LLM Mid-Training
Summary
ExpRL is a new RL-based mid-training method that uses human-written reference solutions as dense reward scaffolds (never shown to the policy) to improve LLM reasoning. On hard math benchmarks it outperforms the compared priming baselines, with ExpRL-Process reaching ~63% on AIME-2026 versus 58.75% for the strongest baseline.
Source: https://huggingface.co/papers/2606.17024
What if reference solutions could prime a model for RL without ever being shown to the policy?
We introduce ExpRL, an RL-based mid-training method that uses human-written reference solutions as dense reward scaffolds rather than imitation targets. The reference stays hidden from the policy: an LLM judge scores the model’s own on-policy reasoning against it, rewarding partial progress and useful intermediate steps that sparse final-answer rewards never upweight. The trick is a simple asymmetry: models verify progress against a reference far better than they generate full solutions from scratch.
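To make the contrast between sparse and dense rewards concrete, here is a minimal Python sketch. It is not the paper's implementation: `toy_judge_reward` is a hypothetical stand-in that uses substring matching where ExpRL uses an LLM judge, and the reference steps and trace are made-up toy data. The point is only that a trace with a wrong final answer earns nothing under a sparse reward but can still earn partial credit when scored against a reference the policy never sees.

```python
from typing import List


def sparse_reward(final_answer: str, reference_answer: str) -> float:
    """Sparse final-answer reward: 1.0 only if the final answer matches."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def toy_judge_reward(trace_steps: List[str], reference_steps: List[str]) -> float:
    """Hypothetical stand-in for the LLM judge (an assumption, not the paper's judge).

    Returns the fraction of reference steps that appear, as case-insensitive
    substrings, somewhere in the policy's own trace. The reference is only
    read here, inside the reward function; it is never given to the policy.
    """
    if not reference_steps:
        return 0.0
    trace_text = " ".join(trace_steps).lower()
    hits = sum(1 for step in reference_steps if step.lower() in trace_text)
    return hits / len(reference_steps)


# Made-up toy data: a hidden reference and an on-policy trace that gets
# two of the three reference steps right but ends with a wrong answer.
reference_steps = ["factor the quadratic", "x = 2 or x = 3", "sum of roots is 5"]
trace = ["First, factor the quadratic.", "So x = 2 or x = 3.", "The sum is 6."]

sparse = sparse_reward(final_answer="6", reference_answer="5")
dense = toy_judge_reward(trace, reference_steps)
print(f"sparse={sparse:.2f} dense={dense:.2f}")
```

In this toy case the sparse reward gives no learning signal, while the dense score still credits the two correct intermediate steps.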
Two variants: ExpRL-Outcome (dense reward on the full trace) and ExpRL-Process (dense rewards on prefixes for finer credit assignment). Both expand the model’s coverage over solution strategies (pass@k) before sparse-reward RL even starts.
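The two variants and the pass@k coverage metric can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the summary does not say how prefixes are chosen or scored, so `process_rewards` simply scores every step-level prefix, and `make_toy_judge` is a hypothetical substring-matching stand-in for the LLM judge. `pass_at_k` uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct; whether the paper uses this exact estimator is also an assumption.

```python
from math import comb
from typing import Callable, List

Judge = Callable[[List[str]], float]


def make_toy_judge(reference_steps: List[str]) -> Judge:
    """Hypothetical judge: fraction of hidden reference steps found in the trace."""
    def judge(steps: List[str]) -> float:
        text = " ".join(steps).lower()
        hits = sum(1 for ref in reference_steps if ref.lower() in text)
        return hits / len(reference_steps)
    return judge


def outcome_reward(steps: List[str], judge: Judge) -> float:
    """Outcome-style variant: one dense score for the full trace."""
    return judge(steps)


def process_rewards(steps: List[str], judge: Judge) -> List[float]:
    """Process-style variant (assumed form): one dense score per prefix."""
    return [judge(steps[:i]) for i in range(1, len(steps) + 1)]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up toy data for illustration only.
judge = make_toy_judge(["factor the quadratic", "x = 2 or x = 3", "sum of roots is 5"])
trace = ["First, factor the quadratic.", "So x = 2 or x = 3.", "The sum is 6."]

print(outcome_reward(trace, judge))
print(process_rewards(trace, judge))
print(pass_at_k(n=10, c=3, k=5))
```

The per-prefix scores show where progress stalls (here the last step adds nothing), which is the kind of finer credit assignment the process variant is meant to provide.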
On hard math reasoning (4B Qwen3-Instruct policy), ExpRL beats SFT, sparse GRPO, and self-distillation as a priming stage for RL and gives a stronger initialization for downstream RL: ExpRL-Process reaches ~63% on AIME-2026 vs 58.75% for the strongest baseline. It also increases verification, self-correction, and backtracking, and extends to mixed-domain settings (math, science, coding) and to a small 4B judge priming a larger 8B policy.
Happy to discuss — feedback welcome!
Similar Articles
RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training
Harvard researchers challenge the standard LLM training pipeline by showing RL can be effectively applied during pre-training rather than only after SFT, finding that data composition matters more than model scale, and proposing parallel averaging of RL and SFT objectives that outperforms sequential approaches while preserving general capabilities.
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
This paper challenges the assumption that RL teaches new reasoning capabilities to LLMs, arguing instead that it performs sparse policy selection at high-entropy decision points. It introduces ReasonMaxxer, an RL-free method that matches full RL performance with significantly lower training costs.
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
This paper introduces LLM-as-Environment-Engineer, a framework where LLMs design their own training environments for reinforcement learning in multi-agent reasoning tasks, enabling self-improving training that surpasses larger proprietary models.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
This paper introduces ScaleLogic, a framework demonstrating that RL training compute scales as a power law with reasoning depth in LLMs. It highlights that logical expressiveness is key to improving downstream transfer and training efficiency.
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
This paper proposes the LLM-as-Environment-Engineer framework, where a policy model analyzes failures to automatically redesign the training environment for reinforcement learning, and introduces MAPF-FrozenLake as a controllable testbed. The framework, using Qwen3-4B, outperforms larger models like GPT and Gemini, showing that policy learning improves the model's ability to diagnose weaknesses.