Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination
Summary
Introduces Atomic Decomposition and Recombination (ADR), a framework that generates novel and challenging verifiable code tasks by decomposing and recombining atomic elements, enabling scalable reinforcement learning with verifiable rewards for large language models.
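The recombination idea can be made concrete with a minimal Python sketch. It assumes each seed task has already been decomposed into a set of atomic elements. The task names, element labels, and the helpers `atomic_pool`, `is_novel`, and `recombine` are hypothetical and do not come from the paper. The sketch treats a k-element combination as novel only if no single seed task already contains all of its elements, which is one simple way to "control" recombination. ADR's actual decomposition and recombination steps are likely model-driven and may differ.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Hypothetical seed tasks, each already decomposed into atomic elements.
# In practice this decomposition step would be automated; here it is hand-written.
SEED_TASKS: Dict[str, Set[str]] = {
    "two_sum": {"hash map", "array scan"},
    "range_sum_query": {"prefix sums", "array scan"},
    "shortest_path": {"graph", "priority queue"},
}

def atomic_pool(tasks: Dict[str, Set[str]]) -> List[str]:
    """Sorted union of all atomic elements across the seed tasks."""
    pool: Set[str] = set()
    for elements in tasks.values():
        pool |= elements
    return sorted(pool)

def is_novel(combo: Tuple[str, ...], tasks: Dict[str, Set[str]]) -> bool:
    """Novel if no single seed task already contains every element of the combo."""
    return not any(set(combo) <= elements for elements in tasks.values())

def recombine(tasks: Dict[str, Set[str]], k: int) -> List[Tuple[str, ...]]:
    """Enumerate k-element combinations of atoms that no seed task already covers."""
    return [c for c in combinations(atomic_pool(tasks), k) if is_novel(c, tasks)]

# Usage: 5 distinct atoms give C(5, 2) = 10 pairs; the 3 pairs that already
# co-occur in a seed task are filtered out, leaving 7 novel pairs.
novel_pairs = recombine(SEED_TASKS, k=2)
print(len(novel_pairs))  # 7
print(novel_pairs[0])    # ('array scan', 'graph')
```

With n atomic elements, there are C(n, k) candidate combinations, so the candidate pool grows combinatorially as atoms are added. A real pipeline would still need to turn each combination into a task statement with verifiable tests.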
Cached at: 06/05/26, 06:07 AM
Source: https://huggingface.co/papers/2605.31058
Abstract
The Atomic Decomposition and Recombination (ADR) framework generates novel and challenging verifiable code tasks for scalable reinforcement learning with verifiable rewards in large language models.
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the remarkable coding abilities of Large Language Models (LLMs). However, the scalability of RLVR is severely constrained by the scarcity of sufficiently challenging verifiable code tasks that lie near the model's edge of competence. Prior studies often rely on heuristic seed expansions for data synthesis, which severely limits both novelty and difficulty. Consequently, the training value of such data fails to scale proportionally with the volume of data synthesized. To this end, we propose Atomic Decomposition and Recombination (ADR), a novel framework that generates verifiable code tasks via decomposition into atomic elements and controlled recombination, thereby enabling the generation of genuinely novel and challenging verifiable code tasks. Experiments and analysis demonstrate that ADR achieves superior originality, difficulty, diversity, and test quality over existing baselines, and consistently delivers larger gains in coding ability from RLVR across diverse downstream domains, including algorithmic programming, tool usage, and data science. Our work sheds light on a new paradigm for novel code task synthesis and scalable RLVR training.
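The notion of verifiable code tasks that lie near the model's edge of competence can be illustrated with a short sketch. It assumes a binary reward (1.0 only if a candidate solution passes every test case) and treats a task as near the edge when the model's empirical pass rate over several sampled solutions falls strictly between two thresholds. The function names, the threshold band, and the use of plain Python callables in place of sandboxed execution of generated source code are all illustrative assumptions, not ADR's published filtering procedure.

```python
from typing import Callable, List, Tuple

# A test case is (positional arguments, expected return value).
Case = Tuple[tuple, object]

def verifiable_reward(solution: Callable[..., object], tests: List[Case]) -> float:
    """Binary reward: 1.0 only if the solution passes every test, else 0.0."""
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failed test
    return 1.0

def pass_rate(samples: List[Callable[..., object]], tests: List[Case]) -> float:
    """Fraction of sampled solutions that earn the full reward."""
    if not samples:
        raise ValueError("need at least one sampled solution")
    return sum(verifiable_reward(s, tests) for s in samples) / len(samples)

def near_edge(samples, tests, low: float = 0.0, high: float = 1.0) -> bool:
    """Hypothetical filter: keep tasks the model solves sometimes, but not always."""
    rate = pass_rate(samples, tests)
    return low < rate < high

# Usage: a toy "add two numbers" task with three sampled solutions.
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
samples = [
    lambda a, b: a + b,          # correct
    lambda a, b: a * b,          # wrong: 2 * 3 != 5
    lambda a, b: a - b + 2 * b,  # correct, written differently
]
print(pass_rate(samples, tests))  # two of three samples pass
print(near_edge(samples, tests))  # True: neither trivially easy nor unsolved
```

Tasks that every sample solves (pass rate 1.0) or none solves (pass rate 0.0) yield no reward variance across samples, which is one common rationale for keeping only tasks inside such a band.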
Similar Articles
@adithya_s_k: https://x.com/adithya_s_k/status/2054961319179420035
An analysis of why RL for coding tasks is gaining traction thanks to verifiable rewards, and of how the emerging framework Harbor addresses the bottleneck of environment complexity in RL training.
CodeAlchemy: Synthetic Code Rewriting at Scale
CodeAlchemy is a synthetic data generation framework that transforms publicly available code into semantically rich training data using five strategies, producing over 500 billion tokens and enabling small models to outperform much larger ones on code benchmarks.
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
AgentV-RL introduces an Agentic Verifier framework that enhances reward modeling through bidirectional verification with forward and backward agents augmented with tools, achieving a 25.2% improvement over state-of-the-art ORMs. The approach addresses error propagation and grounding issues in verifiers for complex reasoning tasks through multi-turn deliberative processes combined with reinforcement learning.
REVES: REvision and VErification--Augmented Training for Test-Time Scaling
Proposes REVES, a two-stage iterative framework that alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems.
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
This paper introduces RLRT, a method that reverses teacher signals in self-distillation to reinforce successful student deviations, enhancing reasoning exploration in large language models.