Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Summary
This paper proposes mid-training language models on self-generated diverse reasoning traces before reinforcement learning, showing improved RL performance on math benchmarks by exposing models to multiple valid solution approaches.
Source: https://huggingface.co/papers/2605.08472
Excited to share our new paper 🚀
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
We study a simple question: Can we make RL more effective by first teaching models multiple correct ways to solve the same problem? Instead of reinforcing a single reasoning trajectory, can we expose the model to a richer space of valid approaches before RL begins?
Our investigation is simple. Before RL, we mid-train the model on multiple correct ways of solving the same problem, so that when RL begins, it operates over a richer set of priors rather than a single narrow reasoning mode. Importantly, these reasoning traces are self-generated by the same base model that is later trained with RL. No human-written chains of thought, and no distillation from a stronger teacher model.
To make the solutions diverse, we use problem-solving heuristics inspired by George Pólya’s How to Solve It. For each question, the model is prompted to solve it using different approaches: analogy, working backward, decomposition, introducing auxiliary elements, logical step-by-step justification, bright ideas, and more. This gives us structurally distinct reasoning traces for the same underlying problem.
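As a rough illustration of heuristic-guided prompting, the sketch below builds one prompt per heuristic for a single question. The prompt wording, the `HEURISTICS` table, and the `build_prompts` helper are all hypothetical; the post does not give the actual prompts used.

```python
# Hypothetical prompt hints, loosely paraphrasing four of the heuristics named above.
# These are NOT the prompts used in the paper.
HEURISTICS = {
    "analogy": "Relate the problem to a similar one you already know how to solve.",
    "working_backward": "Start from the desired result and work backward to the given data.",
    "decomposition": "Break the problem into smaller subproblems and solve each one.",
    "auxiliary_elements": "Introduce an auxiliary element, such as a new variable, that simplifies the problem.",
}


def build_prompts(question: str) -> dict[str, str]:
    """Return one prompt per heuristic for the same question."""
    return {
        name: (
            f"Problem: {question}\n"
            f"Approach: {hint}\n"
            "Show your reasoning, then state the final answer."
        )
        for name, hint in HEURISTICS.items()
    }


prompts = build_prompts("Find all real x with x^2 - 5x + 6 = 0.")
```

Each prompt would then be sent to the base model itself, so that every question ends up with several structurally different candidate solutions.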
The generated solutions are filtered in two steps. First, rule-based verification keeps only responses with the correct final answer. Then, a reward model scores how well the response follows the intended heuristic. The highest-scoring correct response per (question, heuristic) pair is selected, giving us multiple correct, heuristic-specific solution traces per question. 🧠
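The two-step filter can be sketched as follows. This is a minimal sketch under assumptions: the `Candidate` fields, the exact-match answer check, and the precomputed `adherence_score` (standing in for the reward model's output) are illustrative and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question_id: str
    heuristic: str
    response: str
    final_answer: str
    adherence_score: float  # assumed: reward-model score for following the heuristic


def select_traces(candidates: list[Candidate], gold: dict[str, str]) -> dict[tuple[str, str], Candidate]:
    """Keep correct responses only, then pick the best-scoring one per (question, heuristic)."""
    best: dict[tuple[str, str], Candidate] = {}
    for cand in candidates:
        # Step 1: rule-based verification (here: exact match on the final answer).
        if cand.final_answer.strip() != gold[cand.question_id].strip():
            continue
        # Step 2: keep the highest adherence score for each (question, heuristic) pair.
        key = (cand.question_id, cand.heuristic)
        if key not in best or cand.adherence_score > best[key].adherence_score:
            best[key] = cand
    return best


candidates = [
    Candidate("q1", "analogy", "trace A", "5", 0.6),
    Candidate("q1", "analogy", "trace B", "5", 0.9),
    Candidate("q1", "analogy", "trace C", "7", 0.99),  # wrong answer, dropped
    Candidate("q1", "decomposition", "trace D", "5", 0.4),
]
selected = select_traces(candidates, gold={"q1": "5"})
```

Note that the incorrect response is discarded even though it has the highest adherence score, which mirrors the order of the two steps described above.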
Why should this help RL? Our theoretical view: mid-training on n correct approaches creates multiple high-probability continuations at reasoning branch points, that is, an n-modal distribution. Under a positive gradient, RL can meaningfully update across all n modes rather than sharpening a single one. Under a negative gradient, mass removed from the sampled approach redistributes to the remaining n-1 dominant modes, i.e., to the other valid approaches the model knows. This is the mechanism by which RL learns to combine the approaches introduced during mid-training.
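The negative-gradient case can be illustrated with a toy example. The sketch below assumes a softmax policy over eight continuations at one branch point, four of them dominant, and applies a standard REINFORCE-style logit update; it is an illustration of the intuition, not the paper's derivation, and all numbers are made up.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, sampled, advantage, lr=1.0):
    """One REINFORCE-style update: d log p(sampled) / d z_j = 1[j == sampled] - p_j."""
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if j == sampled else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]


# Toy branch point: 4 dominant modes (learned approaches) and 4 low-probability continuations.
logits = [2.0] * 4 + [-2.0] * 4
before = softmax(logits)
after = softmax(policy_gradient_step(logits, sampled=0, advantage=-1.0))

lost_by_sampled = before[0] - after[0]
gained_by_other_modes = sum(after[j] - before[j] for j in range(1, 4))
print(f"sampled mode lost {lost_by_sampled:.3f}, other modes gained {gained_by_other_modes:.3f}")
```

In this toy setting, the probability removed from the penalized approach lands almost entirely on the other three dominant modes, because each logit rises in proportion to its current probability.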
Empirically, this improves GRPO-based RL. On Llama-3.2-3B-Instruct, models initialized with our heuristic-guided mid-training consistently outperform vanilla RL and STaR+RL across six math benchmarks, with gains becoming clearer at larger pass@k. At pass@64, the average improves from 44.21 for vanilla RL to 48.09 with n=16. 📊
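For readers unfamiliar with pass@k, the sketch below implements the commonly used unbiased estimator introduced with HumanEval. The post does not say which estimator the paper uses, so treat this as an assumption; the argument names are illustrative.

```python
from math import comb


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples is correct,
    given num_correct correct answers among num_samples generations."""
    if k > num_samples:
        raise ValueError("k cannot exceed the number of samples")
    if num_samples - num_correct < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


# One correct answer among 64 samples barely registers at k=1 but fully counts at k=64.
low_k = pass_at_k(64, 1, 1)
high_k = pass_at_k(64, 1, 64)
```

This is why larger k rewards models that keep several distinct solution approaches alive: a problem counts as solved if any one of the sampled approaches succeeds.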
One of our most interesting findings: RL doesn’t just use the individual approaches from mid-training. It composes them. We analyze reasoning traces using an LLM-based classifier across 64 Pólya-style heuristics. At n=16, RL-trained models combine multiple problem-solving approaches in 56.7% of chains, vs. only 23.3% before RL. This composition rate grows as n increases. Combinations like Bolzano + Decompose or Restate + Decompose + Carry-Out emerge consistently after RL, even though they were never observed together during mid-training. RL is doing the composition. 🔗
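The composition rate can be computed as shown in this minimal sketch, assuming each reasoning chain has already been labeled with a set of heuristic names by the classifier. The `composition_rate` helper and the example labels are illustrative only.

```python
def composition_rate(chain_labels: list[set[str]]) -> float:
    """Fraction of chains that combine two or more distinct heuristics."""
    if not chain_labels:
        return 0.0
    combined = sum(1 for labels in chain_labels if len(labels) >= 2)
    return combined / len(chain_labels)


# Illustrative labels, not data from the paper.
chains = [
    {"Decompose"},
    {"Bolzano", "Decompose"},
    {"Restate", "Decompose", "Carry-Out"},
    {"Analogy"},
]
rate = composition_rate(chains)
```

Under this definition, the reported figures would correspond to a rate of 0.567 after RL versus 0.233 before RL at n=16.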
Four additional findings from our analysis:
More approaches beat more problems. Under a fixed instance-level budget, 16 approaches on 463 questions outperform 1 approach on 7,408 questions, a relative improvement of around 7% after RL. During mid-training, learning more problem-solving approaches is therefore more beneficial than learning to solve more problems.
Correctness vs. diversity. Diverse but incorrect reasoning traces fall below vanilla RL, and performance worsens further as the number of incorrect problem-solving approaches grows. Diversity alone is not enough; correctness is pivotal.
More diverse than distillation. Our self-generated data scores Vendi 13.81 vs. 10.95 for QwQ-32B distillation, and gives better post-RL performance despite coming from a much weaker model.
Generalizes beyond math. Despite math-centric heuristics, gains on HumanEval (code) and MuSR (narrative reasoning) show that Pólya's problem-solving approaches transfer.
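The fixed-budget comparison in the first finding above can be checked with simple arithmetic. The sketch assumes that "instance-level budget" means the total number of (question, solution) training instances; that reading is an assumption, and the helper name is illustrative.

```python
def instance_budget(num_questions: int, approaches_per_question: int) -> int:
    """Total mid-training instances: one solution trace per (question, approach) pair."""
    return num_questions * approaches_per_question


diverse_budget = instance_budget(num_questions=463, approaches_per_question=16)
broad_budget = instance_budget(num_questions=7408, approaches_per_question=1)
print(diverse_budget, broad_budget)  # both configurations use 7408 instances
```

Both configurations therefore train on the same number of instances, so the difference in post-RL performance reflects how the budget is spent rather than how large it is.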
Takeaway: RL performance depends not only on the RL stage itself, but also on the distribution the model is exposed to beforehand. Mid-training on diverse, self-generated, correct reasoning traces improves subsequent RL, and the effect is driven by RL learning to compose the approaches introduced during mid-training.