TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition

arXiv cs.LG Papers

Summary

Proposes TD-Grokking, a training-time decomposition framework that recursively breaks down intractable zero-reward problems into verifiable subproblems, enabling LLMs to learn from failed trajectories. Outperforms vanilla GRPO and baselines on mathematical and medical reasoning tasks.

arXiv:2606.09883v1 Announce Type: new Abstract: Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinforcement learning with verifiable rewards (RLVR). However, a critical bottleneck persists: RLVR fails on highly challenging zero-reward problems, where all sampled reasoning trajectories yield uniformly failed outcomes, providing no optimization signal to drive model improvement. Prior efforts to address this limitation, such as dense process supervision, partial reward assignment, or prefix-guided exploration, suffer from inherent task constraints or do not fully equip the policy model with the capabilities necessary to solve the original intractable problems. To address this, we propose TD-Grokking, a training-time decomposition framework for zero-reward problems. It recursively decomposes intractable root problems into self-contained, verifiable subproblems, forming hierarchical trees where solvable leaves provide non-zero rewards. Evaluations on mathematical and medical tasks show that TD-Grokking outperforms vanilla GRPO as well as all baseline approaches. Together with detailed analysis, these results confirm that training-time decomposition effectively converts zero-reward examples into usable training signals, enabling consistent performance gains. Our code and datasets are available at https://anonymous.4open.science/r/TD-Grokking-6567/.

# TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition
Source: [https://arxiv.org/html/2606.09883](https://arxiv.org/html/2606.09883)
###### Abstract

Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinforcement learning with verifiable rewards (RLVR). However, a critical bottleneck persists: RLVR fails on highly challenging zero-reward problems, where all sampled reasoning trajectories yield uniformly failed outcomes, providing no optimization signal to drive model improvement. Prior efforts to address this limitation, such as dense process supervision, partial reward assignment, or prefix-guided exploration, suffer from inherent task constraints or do not fully equip the policy model with the capabilities necessary to solve the original intractable problems. To address this, we propose TD-Grokking, a training-time decomposition framework for zero-reward problems. It recursively decomposes intractable root problems into self-contained, verifiable subproblems, forming hierarchical trees where solvable leaves provide non-zero rewards. Evaluations on mathematical and medical tasks show that TD-Grokking outperforms vanilla GRPO as well as all baseline approaches. Together with detailed analysis, these results confirm that training-time decomposition effectively converts zero-reward examples into usable training signals, enabling consistent performance gains. Our code and datasets are available at [https://anonymous.4open.science/r/TD-Grokking-6567/](https://anonymous.4open.science/r/TD-Grokking-6567/).

¹ Department of Data Science, City University of Hong Kong.
² Hong Kong Institute of AI for Science, City University of Hong Kong.
³ Li Auto Inc.
† Corresponding author: Ning Miao (ningmiao@cityu.edu.hk).

## 1 Introduction

Large language models (LLMs) have achieved remarkable progress in mathematical and algorithmic reasoning, with performance continuing to advance rapidly following elaborate post-training paradigms (Shao et al., [2024](https://arxiv.org/html/2606.09883#bib.bib15); Guo et al., [2025](https://arxiv.org/html/2606.09883#bib.bib16); Yang et al., [2025](https://arxiv.org/html/2606.09883#bib.bib42)). As an essential component of post-training, reinforcement learning with verifiable rewards (RLVR) unlocks the inherent reasoning capabilities of LLMs by contrasting successful and unsuccessful reasoning trajectories (Shao et al., [2024](https://arxiv.org/html/2606.09883#bib.bib15); Guo et al., [2025](https://arxiv.org/html/2606.09883#bib.bib16)). This naturally raises a fundamental research question: how can LLMs acquire the ability to tackle highly challenging problems where no successful trials are available? Learning from such zero-reward problems poses a critical bottleneck: training on these questions offers no optimization signal, as all sampled rollouts yield uniformly failed outcomes. The resulting constant zero reward signal stagnates model optimization and renders standard RLVR ineffective.

Prior research has made considerable efforts to alleviate the unlearnability of zero-reward problems. For example, Lightman et al. ([2024](https://arxiv.org/html/2606.09883#bib.bib36)) demonstrate that dense process supervision, such as process reward models (PRMs), provides scores for intermediate steps in a generated reasoning trajectory, thereby refining credit assignment beyond final-answer supervision. In code generation tasks, Sun et al. ([2026](https://arxiv.org/html/2606.09883#bib.bib40)) introduce the notion of partial correctness by giving partial credit to code generations that pass a subset of test cases. Their observations reveal a grokking-like phenomenon: full-pass reward starts to increase after roughly 450 RL training steps, mirroring the *grokking* phenomenon observed in supervised learning. Despite their empirical effectiveness, process or partial reward strategies suffer from inherent limitations and task constraints. In mathematical reasoning, training accurate PRMs on frontier problems remains difficult, and evaluating partial correctness for a solution is often infeasible.

A separate line of research facilitates exploration on hard problems by conditioning the model generation on partial solution trajectories or privileged hints (Li et al., [2026](https://arxiv.org/html/2606.09883#bib.bib19); Zhang et al., [2026](https://arxiv.org/html/2606.09883#bib.bib20); Chen et al., [2026](https://arxiv.org/html/2606.09883#bib.bib21); Liao et al., [2026](https://arxiv.org/html/2606.09883#bib.bib22); Xia et al., [2026](https://arxiv.org/html/2606.09883#bib.bib23)). Such prefix-guided exploration allows the policy model to focus on the latter part of the solution, which increases the chance of obtaining non-zero rewards. While effective as an exploration aid, this approach shortens the horizon of the current rollout rather than endowing the model with all core capabilities required to solve the original problem. In other words, the model learns to continue reasoning from an intermediate state with partially solved trajectories, but the upstream reasoning required to reach those intermediate states remains unmodeled.

![Refer to caption](https://arxiv.org/html/2606.09883v1/x1.png)

Figure 1: Overview of TD-Grokking. (a) Regular GRPO stagnates on challenging zero-reward problems. (b) Through training-time decomposition, TD-Grokking obtains dense training signals.

In this work, we propose a training-time decomposition framework, TD-Grokking, which is tailored specifically for effective learning from challenging zero-reward problems. For each intractable zero-reward root problem, we use a decomposition generator to construct self-contained, verifiable subproblems that encapsulate the sub-capabilities essential for solving the original problem. If a generated subproblem remains beyond the capability of the current policy model, we recursively apply the decomposition procedure. This pipeline transforms intractable zero-reward problems into hierarchical decomposition trees, where solvable leaf subproblems provide non-zero outcome rewards under standard final-answer verification. RL training starts from these reward-enriched leaf nodes, and progressive optimization propagates upward to enhance performance on parent nodes, including root nodes, once child subproblems are reliably solved. Figure [1](https://arxiv.org/html/2606.09883#S1.F1) illustrates our decomposition-based training pipeline.

Empirically, we evaluate TD-Grokking across mathematical and medical domains, both of which contain highly challenging datasets with zero-reward problems. On mathematical benchmarks, TD-Grokking improves the accuracy on AIME 24 and 25 by approximately 4% over vanilla GRPO (Shao et al., [2024](https://arxiv.org/html/2606.09883#bib.bib15)), outperforming all baseline approaches. On medical tasks, TD-Grokking surpasses vanilla GRPO by up to 6.2%. These results demonstrate that decomposition acts as an effective training-time mechanism to convert zero-reward examples into usable learning signals, yielding consistent performance gains on challenging benchmarks.

## 2 Related Work

We now introduce previous efforts to help LLMs tackle hard problems\. We start with inference\-time boosting of LLMs, and then discuss two major approaches to learning to solve hard problems\.

#### Inference\-time problem decomposition\.

Decomposition has also been widely used at inference time. Chain-of-thought aggregates intermediate reasoning paths (Wei et al., [2022](https://arxiv.org/html/2606.09883#bib.bib30)). Least-to-most prompting, decomposed prompting, Self-Ask, and Plan-and-Solve further structure inference through subproblems, plans, modular prompts, or tool-mediated steps (Zhou et al., [2023](https://arxiv.org/html/2606.09883#bib.bib31); Khot et al., [2023](https://arxiv.org/html/2606.09883#bib.bib32); Press et al., [2023](https://arxiv.org/html/2606.09883#bib.bib33); Wang et al., [2023](https://arxiv.org/html/2606.09883#bib.bib34)). Tree-of-thought methods extend this idea by searching over multiple intermediate reasoning states with lookahead, backtracking, or self-evaluation (Yao et al., [2023](https://arxiv.org/html/2606.09883#bib.bib35)).

These methods use decomposition as an inference\-time scaffold for a fixed model\. They can improve how the model organizes reasoning, but they do not directly train required but missing reasoning skills\. They are therefore most effective when the required subskills are already within the model’s reachable repertoire\. Our setting is different: the initial model receives nearly zero verified reward on the root problems\. We therefore use decomposition as a training\-time mechanism, where each subproblem becomes an independent RL instance rather than an intermediate prompt within a single inference attempt\.

#### Process\-level and partial feedback\.

When a question is so hard that the LLM cannot reach a correct answer after multiple trials, the reward will be zero for all generations, which makes learning impossible. The first approach to mitigate this is to give rewards to partially correct solutions. For example, process supervision trains reward models to score intermediate reasoning steps instead of only final answers, and has been shown to improve mathematical reasoning and verification (Lightman et al., [2024](https://arxiv.org/html/2606.09883#bib.bib36)). Follow-up work reduces the annotation cost by automatically constructing step-level labels or collecting process reward data through search, as in Math-Shepherd and OmegaPRM (Wang et al., [2024](https://arxiv.org/html/2606.09883#bib.bib37); Luo et al., [2024](https://arxiv.org/html/2606.09883#bib.bib38)). However, it is very costly to train a separate process reward model, and the accuracy and generalization of existing PRMs make them unsuitable for the RL training of SOTA models (Zheng et al., [2025](https://arxiv.org/html/2606.09883#bib.bib39)). In coding tasks, special partial-correctness signals can also be obtained from subsets of test cases, allowing RL to reward programs before full correctness is achieved (Sun et al., [2026](https://arxiv.org/html/2606.09883#bib.bib40)).

#### Training\-time hints and scaffolded exploration\.

A related line of work improves exploration by making difficult problems easier during training. QuestA augments hard questions with partial solution sketches (Li et al., [2026](https://arxiv.org/html/2606.09883#bib.bib19)); Scaf-GRPO injects hierarchical hints when GRPO encounters all-fail rollout groups (Zhang et al., [2026](https://arxiv.org/html/2606.09883#bib.bib20)); NuRL and self-hinting methods use generated cues to move hard prompts into a learnable region (Chen et al., [2026](https://arxiv.org/html/2606.09883#bib.bib21); Liao et al., [2026](https://arxiv.org/html/2606.09883#bib.bib22)); and HiLL studies whether hinted success transfers to the no-hint setting (Xia et al., [2026](https://arxiv.org/html/2606.09883#bib.bib23)).

These methods are effective exploration aids, but a hint\-conditioned trajectory is not the same as solving the original root problem\. The model may learn to continue from a useful hint without learning to produce the hidden reasoning state itself\. In contrast, our method uses decomposition not merely to shorten the rollout horizon, but to convert hidden reasoning requirements into subproblems with independent verifiable rewards\.

## 3 Method

The core objective of TD\-Grokking is to extract usable training signal from challenging reasoning problems that produce no useful outcome reward under standard direct RLVR\. Given a difficult reasoning problem, TD\-Grokking does not modify the final\-answer verifier, introduce learned process reward models, or make root rollouts easier by supplying privileged hints\. Instead, it reorganizes the fundamental training unit: zero\-reward problems are expanded into smaller, self\-contained, fully verifiable subproblems that capture the reasoning demands required to solve the original root problem\. Reinforcement learning is then performed on these subproblems with standard outcome\-based rewards, enabling the model to escape the zero\-reward regime and recover performance on the original root problems\.

### 3.1 Problem Setting

Let $x$ denote a root reasoning problem with ground-truth verifiable answer $y$. A policy model $\pi_{\theta}$ generates a solution trajectory $o\sim\pi_{\theta}(\cdot\mid x)$, from which an answer $\hat{y}(o)$ is extracted. The standard outcome reward is

$$R(x,o)=\mathbf{1}\left[\operatorname{Verify}\bigl(\hat{y}(o),y\bigr)=1\right].$$

For any verifiable problem $q$ with target answer $a$, we define its empirical verified accuracy under policy $\pi$ and sampling budget $K$ as

$$\operatorname{Acc}^{\pi}_{K}(q)=\frac{1}{K}\sum_{k=1}^{K}\mathbf{1}\left[\operatorname{Verify}\bigl(\hat{a}(o_{k}),a\bigr)=1\right],\qquad o_{k}\sim\pi(\cdot\mid q).$$

The problem $q$ may be either an original root problem or a decomposed subproblem. For an initial policy $\pi_{\theta_{0}}$, a root problem $x$ is called *zero-reward* under budget $K$ if

$$\operatorname{Acc}^{\pi_{\theta_{0}}}_{K}(x)=0.$$

This definition is policy- and budget-dependent: the problem is not assumed to be intrinsically unsolvable, but it is uninformative for direct outcome-only RL because all sampled rollouts receive zero reward.
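The definitions above can be sketched in a few lines of Python. This is only an illustration: the toy answer extractor, the exact-match verifier, and the `group_relative_advantages` helper are assumptions introduced here, and the helper uses the mean/std group normalization commonly associated with GRPO rather than any objective restated in this paper. It shows why a group of uniformly failed rollouts carries no gradient signal under such a normalization.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence


def empirical_accuracy(
    rollouts: Sequence[str],
    target: str,
    extract: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> float:
    """Acc_K: fraction of the K sampled rollouts whose extracted answer verifies."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    hits = sum(1 for o in rollouts if verify(extract(o), target))
    return hits / len(rollouts)


def is_zero_reward(accuracy: float) -> bool:
    """A problem is zero-reward for this policy and budget when Acc_K == 0."""
    return accuracy == 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list[float]:
    """Assumed GRPO-style normalization within one rollout group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def extract_last(rollout: str) -> str:
    """Hypothetical toy extractor: the answer is whatever follows the last '='."""
    return rollout.rsplit("=", 1)[-1].strip()


def exact_match(pred: str, gold: str) -> bool:
    """Hypothetical toy verifier: plain string equality."""
    return pred == gold


rollouts = ["x = 3", "so x = 5", "x = 7", "x = 9"]  # K = 4 samples, target answer "4"
acc = empirical_accuracy(rollouts, "4", extract_last, exact_match)
advantages = group_relative_advantages([0.0] * len(rollouts))
```

Here every rollout fails, so `acc` is zero and every advantage is zero as well; once a single rollout in the group verifies, the advantages become non-zero and the group is informative again.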

### 3.2 Constructing Verifiable Subproblems

For each hard root problem $x$, TD-Grokking generates a set of labeled candidate instances

$$\mathcal{D}(x)=\{(s_{x,1},a_{x,1}),\ldots,(s_{x,m_{x}},a_{x,m_{x}})\},$$

where $s_{x,j}$ denotes a candidate subproblem and $a_{x,j}$ is its target answer used for verification. We use the term *subproblem* to refer to the question $s_{x,j}$ itself, while the pair $(s_{x,j},a_{x,j})$ denotes the corresponding verifiable training instance.

A candidate subproblem $s_{x,j}$ is retained only if it satisfies three conditions. First, it is *root-conditioned*: it corresponds to a local reasoning requirement used in solving the parent root, rather than to a generic skill label. Second, it is *self-contained*: all assumptions needed to solve it are stated in the subproblem itself, with no dependence on the root solution or on other subproblem answers. Third, it is *verifiable*: its final answer has a well-defined target $a_{x,j}$ and can be checked by the same outcome-style verifier used for RLVR, after the usual answer extraction and normalization.

The decomposition pipeline is implemented as a sequence of structured calls to the decomposition generator, followed by validation\. Exact prompt templates, parsing rules, and generation hyperparameters are reported in the appendix\. The main method has six stages\.

#### 1\. Hard\-root selection\.

We first identify roots on which the starting policy receives zero verified reward under repeated sampling\. This focuses decomposition on examples that direct outcome\-only RL is least able to exploit\. The selection criterion is tied to the base policy, verifier, and sampling budget; a root can leave the zero\-reward set after training\.

#### 2\. Guide preparation\.

For each selected root, TD-Grokking obtains a decomposition guide $g$, that is, a reference solution path for the root. When a dataset-provided solution is available, the guide is taken from the source data after answer-consistency checking. When it is unavailable, the decomposition generator first produces a solution sketch whose final answer must verify against the known target answer. This step makes the subsequent decomposition solution-guided rather than purely associative: subproblems are extracted from a concrete reasoning path instead of being sampled as generic skills that merely look related to the root.

#### 3\. Solution\-guided segmentation\.

Given the root, its answer, and the guide, $(x,y,g)$, the generator segments the guide into a short ordered list of local reasoning requirements. A segment is not a training label and is not exposed to the student. Its purpose is to locate the operations that make the root difficult as an end-to-end problem: for example, identifying a hidden constraint, deriving a recurrence, simplifying a symbolic expression, resolving a case split, or computing a key intermediate value.

#### 4\. Requirement extraction\.

For each segment, the generator extracts the minimal information needed to state the corresponding local requirement as a separate task. This includes relevant conditions from the original problem statement and, when necessary, intermediate facts from the guide. The extraction step separates *given conditions*, which should appear in the subproblem statement, from *target conclusions*, which should remain for the policy to solve. This distinction is what prevents decomposition from becoming either underspecified local hints or a copied solution trace.

#### 5\. Self\-contained rewriting\.

The extracted requirement is rewritten into a problem-answer pair $(s_{x,j},a_{x,j})$. The rewritten problem must contain all assumptions needed for solving it, avoid references to neighboring subproblems, and end with an answer that can be automatically checked. This step turns a latent operation in the root solution into an independent RL instance.

#### 6\. Structured validation\.

The raw decomposition is not used directly\. TD\-Grokking treats validation as part of the method, because the useful signal comes from retaining only those local tasks that are simultaneously root\-derived, self\-contained, and verifiable\. The validator checks candidates at multiple levels\.

At the *format level*, each candidate must expose a parseable problem, answer, and optional solution field; the answer must be non-empty and concise rather than a paragraph of explanation. At the *dependency level*, the problem must be self-contained, must include all necessary assumptions, and must avoid references to earlier subproblems or to the guide. At the *verifier level*, the answer must be compatible with the same answer-extraction and normalization interface used for the root. At the *structure level*, the retained decomposition should cover meaningful local requirements from the guide without excessive duplication, over-fragmentation, direct root copying, or answer-revealing shortcuts.

Only candidates that pass validation are used for training\. If too few candidates pass, or if the retained candidates do not cover the major reasoning requirements in the guide, the root is regenerated with stricter instructions\. This retry loop is a quality\-control step: it improves the reliability of the decomposed training pool while preserving the invariant that every training node is a self\-contained, verifiable task\.
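As a rough illustration of the multi-level validation, the sketch below filters candidates with simple rule-based checks. Everything in it is an assumption made for illustration: the `Candidate` fields, the phrase list in `DEPENDENCY_PATTERN`, and the `MAX_ANSWER_WORDS` threshold are hypothetical, and the verifier-level check and coverage check are omitted. The actual prompts and parsing rules are those reported in the appendix.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    problem: str
    answer: str
    solution: str = ""  # optional field


# Hypothetical phrases that signal dependence on other subproblems or on the guide.
DEPENDENCY_PATTERN = re.compile(
    r"\b(previous (sub)?problem|as (shown|derived) above|from part \w+|the guide)\b",
    re.IGNORECASE,
)
MAX_ANSWER_WORDS = 12  # hypothetical conciseness threshold


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical statements compare equal."""
    return " ".join(text.lower().split())


def format_ok(c: Candidate) -> bool:
    """Format level: non-empty problem and a short, non-empty answer."""
    answer = c.answer.strip()
    return bool(c.problem.strip()) and bool(answer) and len(answer.split()) <= MAX_ANSWER_WORDS


def dependency_ok(c: Candidate) -> bool:
    """Dependency level: no references to neighbouring subproblems or the guide."""
    return DEPENDENCY_PATTERN.search(c.problem) is None


def structure_ok(c: Candidate, root_problem: str, kept: list[Candidate]) -> bool:
    """Structure level (partial): reject direct root copies and exact duplicates."""
    norm = normalize(c.problem)
    if norm == normalize(root_problem):
        return False
    return all(norm != normalize(k.problem) for k in kept)


def validate(candidates: list[Candidate], root_problem: str) -> list[Candidate]:
    """Keep only candidates that pass every implemented level, in input order."""
    kept: list[Candidate] = []
    for c in candidates:
        if format_ok(c) and dependency_ok(c) and structure_ok(c, root_problem, kept):
            kept.append(c)
    return kept


root = "Find the number of ordered pairs (a, b) with a + b = 7 and a, b positive integers."
raw = [
    Candidate("Compute 3 + 4.", "7"),
    Candidate("Using the previous subproblem, find x.", "2"),  # depends on a neighbour
    Candidate("Compute 3 + 4.", "7"),                          # duplicate
    Candidate(root, "6"),                                      # direct root copy
    Candidate("Solve x + 1 = 3.", ""),                         # empty answer
]
retained = validate(raw, root)
```

A retry loop would then regenerate the decomposition whenever `retained` is too small or fails to cover the guide.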

### 3.3 Difficulty Calibration and Recursive Expansion

After validation, TD-Grokking calibrates the difficulty of each retained node against the current policy. For a retained subproblem $s_{x,j}$, we estimate $\operatorname{Acc}^{\pi}_{K_{c}}(s_{x,j})$ using the same outcome verifier as above, where $K_{c}$ is the calibration budget. A subproblem with at least one verified rollout under the calibration budget, i.e., $\operatorname{Acc}^{\pi}_{K_{c}}(s_{x,j})>0$, is treated as an *active trainable leaf*: it is not already solved by construction, but it can provide nonzero reward events for RL. A retained subproblem with $\operatorname{Acc}^{\pi}_{K_{c}}(s_{x,j})=0$ is an unresolved node for the current policy. Such a node can be decomposed again using the same guide-preparation, segmentation, rewriting, and validation procedure, or deferred until the policy improves.

Thus, TD\-Grokking naturally defines a decomposition tree\. The original hard problem is the root, unresolved subproblems become internal nodes, and calibrated trainable subproblems become leaves\. Although the procedure allows recursive expansion, in practice we found that most first\-level retained subproblems already produced verified rollouts under the starting policy\.
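The calibration rule and the recursive expansion can be sketched as a small tree-building routine. This is a minimal sketch under stated assumptions: `accuracy` stands in for the calibration estimate under the current policy, `decompose` stands in for the full generate-and-validate pipeline, and the `max_depth` cap and the status labels are hypothetical choices made here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    problem: str
    answer: str
    children: list["Node"] = field(default_factory=list)
    status: str = "uncalibrated"  # becomes "active_leaf", "internal", or "deferred"


def expand(
    node: Node,
    accuracy: Callable[[Node], float],        # stand-in for the calibration accuracy
    decompose: Callable[[Node], list[Node]],  # stand-in for generation plus validation
    max_depth: int = 2,                       # hypothetical cap on recursion depth
    depth: int = 0,
) -> Node:
    """Label nodes as active leaves or unresolved nodes, expanding the latter recursively."""
    if accuracy(node) > 0:
        node.status = "active_leaf"  # at least one verified rollout: trainable now
        return node
    if depth >= max_depth:
        node.status = "deferred"     # wait until the policy improves
        return node
    node.children = [
        expand(child, accuracy, decompose, max_depth, depth + 1)
        for child in decompose(node)
    ]
    node.status = "internal" if node.children else "deferred"
    return node


def active_leaves(node: Node) -> list[Node]:
    """Collect trainable leaves in left-to-right order."""
    if node.status == "active_leaf":
        return [node]
    return [leaf for child in node.children for leaf in active_leaves(child)]


# Toy data: accuracies and decompositions are made up for the example.
toy_acc = {"root": 0.0, "s1": 0.25, "s2": 0.0, "s2a": 0.5, "s2b": 0.125}
toy_children = {"root": ["s1", "s2"], "s2": ["s2a", "s2b"]}


def toy_accuracy(n: Node) -> float:
    return toy_acc[n.problem]


def toy_decompose(n: Node) -> list[Node]:
    return [Node(p, "?") for p in toy_children.get(n.problem, [])]


tree = expand(Node("root", "?"), toy_accuracy, toy_decompose)
leaf_names = [n.problem for n in active_leaves(tree)]
```

In the toy tree, `s2` is unresolved at the first level and is expanded once more, so the trainable leaves are `s1`, `s2a`, and `s2b`.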

### 3.4 Training on the Decomposition Tree

Given a decomposition tree, TD\-Grokking trains the policy bottom\-up\. At a given training stage, the active set consists of retained nodes that are verifiable and appropriate for the current policy’s difficulty level\. Training first emphasizes active leaves, because they provide outcome rewards while remaining tied to the original hard root\. After training, the policy is re\-evaluated on parent nodes and on the root\. If a parent begins to produce verified rollouts, it can be promoted into the active set so that the model practices recomposing the learned local requirements into a larger reasoning unit\. If a parent remains zero\-reward, the same decomposition operation can be applied again or the node can remain deferred\. This yields a curriculum in which the model first learns local requirements and then recomposes them into progressively larger reasoning units\.
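The promotion step of this bottom-up curriculum might look like the following sketch. The function name, the `parent_of` mapping, and the use of a plain accuracy dictionary are assumptions for illustration; the re-evaluation itself (sampling and verifying rollouts) is taken as given.

```python
def update_active_set(
    active: set[str],
    parent_of: dict[str, str],
    accuracy: dict[str, float],
) -> set[str]:
    """Promote a parent into the active set once it starts producing verified rollouts."""
    promoted = set(active)
    for child in active:
        parent = parent_of.get(child)
        if parent is not None and accuracy.get(parent, 0.0) > 0:
            promoted.add(parent)
    return promoted


parent_of = {"s1": "root", "s2": "root"}
leaves = {"s1", "s2"}

# Before training the root is still zero-reward, so only the leaves are active.
before = update_active_set(leaves, parent_of, {"root": 0.0, "s1": 0.25, "s2": 0.5})

# After training on the leaves, re-evaluation finds verified root rollouts.
after = update_active_set(leaves, parent_of, {"root": 0.125, "s1": 0.75, "s2": 0.875})
```

A parent that stays at zero accuracy is simply left out, which corresponds to decomposing it again or keeping it deferred.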

All nodes use the same final-answer RLVR interface. For a node $z$, whether it is a root or a subproblem, the rollout receives reward only if its extracted final answer verifies against the node's target $a_{z}$. TD-Grokking therefore does not require process labels, partial-credit rewards, or task-specific reward shaping. The only change is which verifiable items are presented to RL and in what order.

Because calibration showed that first-level subproblems were already trainable, our controlled experiments stop expansion after the first level to isolate the effect of root-derived subproblem training. We report three static training views over the same source roots. *Root-only* is exactly vanilla RLVR on the original root problems, using the same outcome verifier, reward definition, prompting format, and optimizer, but without any decomposition-derived items. *Sub-only* trains only on the retained subproblems and serves as the diagnostic for local-to-global recomposition, because any root-level recovery cannot come from direct root rewards. *Mixed* trains on the union of roots and their subproblems, preserving pressure on the final task while adding local outcome-reward signal from easier, root-derived items. In addition, we evaluate a *Dynamic* variant that maintains an accuracy-based active set: subproblems are introduced when their parent remains uninformative and are retired once they become reliably solved, reducing training cost while following the same decomposition tree. These views are experimental controls over the same decomposition tree, rather than separate reward objectives.
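The three static views and the dynamic active set can be illustrated as simple pool constructions. This is a sketch under assumptions: items are represented by string ids, the `retire_at` threshold for "reliably solved" is a hypothetical value, "uninformative" is read as zero measured accuracy, and the dynamic variant is assumed to keep roots in the mixture.

```python
def build_views(roots: list[str], subproblems: dict[str, list[str]]) -> dict[str, list[str]]:
    """Static training pools over the same source roots."""
    subs = [s for r in roots for s in subproblems.get(r, [])]
    return {
        "root_only": list(roots),      # vanilla RLVR on the original roots
        "sub_only": subs,              # no root rewards during training
        "mixed": list(roots) + subs,   # union of roots and their subproblems
    }


def dynamic_active_set(
    roots: list[str],
    subproblems: dict[str, list[str]],
    accuracy: dict[str, float],
    retire_at: float = 0.75,           # hypothetical "reliably solved" threshold
) -> list[str]:
    """Add subproblems only under uninformative parents; drop ones already solved."""
    active = list(roots)
    for r in roots:
        if accuracy.get(r, 0.0) == 0.0:
            active += [s for s in subproblems.get(r, []) if accuracy.get(s, 0.0) < retire_at]
    return active


roots = ["r1", "r2"]
subproblems = {"r1": ["a", "b"], "r2": ["c"]}
accuracy = {"r1": 0.0, "r2": 0.25, "a": 0.9, "b": 0.3, "c": 0.1}

views = build_views(roots, subproblems)
active = dynamic_active_set(roots, subproblems, accuracy)
```

In this toy example `a` is retired because it is already reliably solved, and `c` is never introduced because its parent `r2` already yields verified rollouts, which is how the dynamic variant trims generated tokens relative to the full mixed pool.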

## 4 Experiments

We apply TD\-Grokking to improve training on zero\-reward problems in two distinct domains\. On competition\-level math, we test the main hypothesis: whether training\-time decomposition can turn otherwise uninformative roots into useful RL training signals\. We then turn to the medical domain to test whether TD\-Grokking remains useful outside mathematical reasoning\. Across all experiments, decomposition is performed only during training\-data construction; evaluation is performed on the original benchmark questions without inference\-time decomposition\.

### 4.1 Mathematical Reasoning

#### Setup\.

All mathematical TD-Grokking variants start from Qwen3-1.7B and use GRPO with a final-answer verifiable reward (Yang et al., [2025](https://arxiv.org/html/2606.09883#bib.bib42); Shao et al., [2024](https://arxiv.org/html/2606.09883#bib.bib15)). The training pool is constructed from 3,788 DeepScaleR-hard questions, and the corpus is described in Appendix [A.1](https://arxiv.org/html/2606.09883#A1.SS1). We compare our model with multiple baselines on benchmarks including AIME 2024/2025, AMC23, OlympiadBench, MATH500, and GPQA-Diamond (Zhang and Math-AI Team, [2024](https://arxiv.org/html/2606.09883#bib.bib43), [2025](https://arxiv.org/html/2606.09883#bib.bib44); Math-AI Team, [2026](https://arxiv.org/html/2606.09883#bib.bib45); He et al., [2024](https://arxiv.org/html/2606.09883#bib.bib46); Hendrycks et al., [2021b](https://arxiv.org/html/2606.09883#bib.bib47); Rein et al., [2024](https://arxiv.org/html/2606.09883#bib.bib48)). We report four variants of TD-Grokking: Root-only, Sub-only, Mixed, and Dynamic. Root-only, reported as Vanilla in the following results, trains on the original zero-reward roots and is identical to standard RLVR on the root problems. Sub-only trains only on root-derived subproblems and serves as the diagnostic for local-to-global recomposition, because the checkpoint never receives root rewards during training. Mixed trains on the union of roots and subproblems, preserving pressure on the final root task while adding trainable local leaves.

To verify that the performance gain from TD-Grokking is not simply an artifact of adding more training examples compared with GRPO on original questions, we add a comparative experiment named GRPO-simple. Specifically, we train GRPO-simple by adding a comparable-scale, comparable-difficulty MWP-RLVR set whose examples have no decomposition relationship to the selected zero-reward roots. The set contains 11,896 examples from Dolphin18K-clean-single-numeric, 2,373 from MAWPS, and 1,218 from ASDiv-A, for a total of 15,487 training rows. These source corpora are drawn from established math word-problem datasets (Huang et al., [2016](https://arxiv.org/html/2606.09883#bib.bib50); Koncel-Kedziorski et al., [2016](https://arxiv.org/html/2606.09883#bib.bib51); Miao et al., [2020](https://arxiv.org/html/2606.09883#bib.bib52)).

We further compare TD\-Grokking with an SFT model tuned on the same reference solutions used for subproblem decomposition\.

#### Main results\.

Figure [2](https://arxiv.org/html/2606.09883#S4.F2) summarizes both the root-level training dynamics and the generated-token cost of dynamic mixing. The training-accuracy panel is evaluated only on original root questions. We can see that training only on very hard original questions leads to little performance gain. In comparison, RL training on their subproblems activates learning and equips the LLM with abilities that transfer back to original problems. By mixing original questions with subproblems, the LLM can practice local skills while remaining aligned with the target root-question distribution, leading to stronger root-level performance.

Table [1](https://arxiv.org/html/2606.09883#S4.T1) summarizes the performance of TD-Grokking and baseline approaches on mainstream mathematical benchmarks. Among the three primary static TD-Grokking variants, mixed training achieves the strongest performance on all reported mathematical benchmarks except MATH500. For example, on AIME 2024/2025, mixed TD-Grokking improves avg@8 accuracy by 6.25 and 3.75 percentage points compared with vanilla GRPO training on original problems. These results show that subproblems provide denser reward signals, which makes training on originally zero-reward roots more effective. Original problems are also important to keep the policy aligned with the target question distribution.
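For reference, the two AIME metrics defined in the Table 1 caption can be computed from a per-problem matrix of rollout outcomes as sketched below. The function names and the toy outcome matrix are illustrative assumptions; `pass_at_k` here is the plain "at least one of the k samples is correct" count described in the caption, not an unbiased pass@k estimator.

```python
def avg_at_k(correct: list[list[bool]]) -> float:
    """Mean correctness over the k sampled rollouts per problem, as a percentage."""
    per_problem = [sum(row) / len(row) for row in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


def pass_at_k(correct: list[list[bool]]) -> float:
    """Percentage of problems solved by at least one of the k sampled rollouts."""
    return 100.0 * sum(any(row) for row in correct) / len(correct)


# Toy outcomes: two problems, eight rollouts each.
outcomes = [
    [True] + [False] * 7,   # solved by one rollout out of eight
    [False] * 8,            # never solved
]
avg8 = avg_at_k(outcomes)    # 6.25
pass8 = pass_at_k(outcomes)  # 50.0
```

The gap between the two numbers in the toy example shows why both are reported: pass@8 credits a problem for a single lucky rollout, while avg@8 reflects how reliably it is solved.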

Table 1: Experimental results on mathematical and science reasoning benchmarks. All entries are percentages. For AIME, avg@8 is mean correctness over eight sampled rollouts per problem, and pass@8 counts whether at least one sampled rollout solves the problem. Vanilla, sub-only, and mixed are the primary static TD-Grokking views. Bold marks the best completed result in each column.

![Refer to caption](https://arxiv.org/html/2606.09883v1/x2.png)

![Refer to caption](https://arxiv.org/html/2606.09883v1/x3.png)

Figure 2: Training-time root accuracy and generated-token cost. Left: accuracy on original root prompts over 20-step windows. Vanilla GRPO denotes direct GRPO on the original root problems without subproblem augmentation; Mixed GRPO keeps root prompts in the training mixture, and Sub-only GRPO is shown as the subproblem-only comparison. Right: cumulative generated response-token AUC from run start to the selected checkpoints, showing that dynamic mixing uses fewer generated tokens than the full mixed run in this accounting.

Through dynamic problem selection, dynamic mixing can achieve performance comparable to fully mixed training with lower generated-token cost. In the checkpoint-cumulative accounting in Figure [2](https://arxiv.org/html/2606.09883#S4.F2), dynamic mixing saves about 35% of generated response tokens compared with fully mixed training. Further comparisons with GRPO-simple and SFT show that the improvement of TD-Grokking is not merely a result of more training data or the use of reference CoT solutions. The training-time decomposition strategy is the most critical factor in the improved reasoning performance.

### 4.2 Medical-Domain Instantiation

In this section, we investigate whether TD-Grokking remains useful outside mathematics. To this end, we choose the medical domain, which requires probabilistic and case-based reasoning and is very different from mathematical reasoning. We compare TD-Grokking with all baseline approaches on the challenging MedBullet dataset (Chen et al., [2025](https://arxiv.org/html/2606.09883#bib.bib57)), following the same setup as in the mathematical-reasoning experiments. We evaluate on MedQA, MedMCQA, PubMedQA, and MMLU medical subsets (Jin et al., [2021](https://arxiv.org/html/2606.09883#bib.bib53); Pal et al., [2022](https://arxiv.org/html/2606.09883#bib.bib54); Jin et al., [2019](https://arxiv.org/html/2606.09883#bib.bib55); Hendrycks et al., [2021a](https://arxiv.org/html/2606.09883#bib.bib56)).

Table 2: Medical-domain results on MedBullet. All entries are percentages. MMLU Medical is the macro-average over five medical-related MMLU domains.

Table [2](https://arxiv.org/html/2606.09883#S4.T2) shows the same qualitative pattern as the mathematical experiments. Mixed training outperforms the best non-mixed control on all four medical evaluation rows, improving the average from 43.57% to 46.78%. The gains are modest but consistent: +2.12 on MedQA, +2.08 on MedMCQA, +4.30 on PubMedQA, and +2.47 on the MMLU medical macro-average. The good performance on medical reasoning indicates that the gains from TD-Grokking do not rely on mathematical notation, contest-style structure, or math-specific reward engineering. It supports the broader TD-Grokking claim: when a domain supplies hard root questions with verifiable answers, training-time decomposition can create useful dense learning signal, leading to better learning outcomes.

## 5 Analyzing the Learning Mechanism of TD-Grokking

In this section, we conduct an in\-depth analysis of how training subproblems influence the model’s behavior on root problems\.

![Refer to caption](https://arxiv.org/html/2606.09883v1/x4.png)Figure 3: Composition of zero-reward problems before and after sub-only training.

#### Training on subproblems alone can activate learning on root problems.

As Figure [2](https://arxiv.org/html/2606.09883#S4.F2) (Left) already shows, training only on subproblems can increase the probability that the LLM solves root problems. Figure [3](https://arxiv.org/html/2606.09883#S5.F3) shows more clearly how zero-reward problems are affected. Before training, the base model has zero accuracy on 507 of 514 root problems, meaning 98.6% of roots are in the zero-reward region. After sub-only RL, 84 roots have nonzero accuracy. Among these, 77 are newly recovered from the base zero-reward set, giving a zero-recovery rate of 77/507 = 15.2%. At the item level, among all 514 root questions, 197 improve, 6 decline, and 311 remain unchanged. Although many zero-reward root problems remain unsolved, sub-only training nevertheless activates root-level learning despite receiving no root-level rewards.
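The percentages in this paragraph follow directly from the reported counts. The short sketch below re-derives them as a consistency check; the variable names are illustrative and not taken from the released code.

```python
# Counts reported in the paragraph above (variable names are illustrative).
total_roots = 514
base_zero_reward = 507       # roots with zero accuracy before training
nonzero_after = 84           # roots with nonzero accuracy after sub-only RL
recovered_from_zero = 77     # of those, roots that were zero-reward at the start
improved, declined, unchanged = 197, 6, 311

zero_reward_share = base_zero_reward / total_roots
zero_recovery_rate = recovered_from_zero / base_zero_reward

# The item-level buckets partition the full root set.
assert improved + declined + unchanged == total_roots
# The 84 - 77 = 7 remaining nonzero roots match the 514 - 507 = 7 roots
# that were already outside the zero-reward region.
assert nonzero_after - recovered_from_zero == total_roots - base_zero_reward

print(f"zero-reward share:  {zero_reward_share:.1%}")   # 98.6%
print(f"zero-recovery rate: {zero_recovery_rate:.1%}")  # 15.2%
```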

#### Higher subproblem gains, higher root problem gains\.

To further verify our hypothesis that learning subskills can directly lead to better solutions to root problems, we investigate changes in model behavior after training from a more fine-grained perspective. Specifically, we split the DeepScaleR-hard dataset into 10 subdomains to observe the relationship between the performance gains of subproblems and root problems. As shown in Figure [4](https://arxiv.org/html/2606.09883#S5.F4), the solution rates of both subproblems and root problems increase consistently across all 10 subdomains, with gains ranging from 3.9 to 23.1 percentage points.

![Refer to caption](https://arxiv.org/html/2606.09883v1/x5.png)Figure 4: Accuracy changes of subproblems and root problems after sub-only RL in 10 subdomains.

#### Direct training on root problems is still necessary\.

We now check whether successful root-problem solving is merely a deterministic consequence of solving all associated subproblems. If so, the accuracy of root problems would be a deterministic function of their subproblem accuracies. Instead, the correlations are positive but weak. Under the trained sub-only checkpoint, Pearson correlations between root accuracy and child-accuracy summaries are 0.25 for the product, 0.27 for the minimum, and 0.23 for the log-product; the corresponding Spearman correlations are 0.28, 0.30, and 0.27.

These weak correlations imply that successful root\-problem solving depends not only on mastering subskills, but also on learning more advanced skills, such as choosing the right subskills, planning feasible reasoning paths, and dealing with inconsistencies between different steps\. As a result, mixed training on both subproblems and root problems is necessary to achieve the best performance\.
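The correlation diagnostic described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis script: the data are made up, the helper names are hypothetical, and the `eps` clip that keeps the log-product finite when a child accuracy is zero is an assumption the paper does not specify.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def child_summaries(child_acc, eps=1e-3):
    """Product, minimum, and log-product of a root's child accuracies.

    The eps clip is an assumption made here so that log(0) never occurs.
    """
    product = math.prod(child_acc)
    minimum = min(child_acc)
    log_product = sum(math.log(max(p, eps)) for p in child_acc)
    return product, minimum, log_product


# Hypothetical (root accuracy, child accuracies) pairs, for illustration only.
roots = [
    (0.000, [0.50, 0.25, 0.75]),
    (0.125, [0.875, 0.50, 0.625]),
    (0.000, [1.00, 0.875, 0.75]),
    (0.250, [0.75, 0.75, 1.00]),
    (0.000, [0.25, 0.125, 0.50]),
]
root_acc = [acc for acc, _ in roots]
for idx, name in enumerate(["product", "minimum", "log-product"]):
    summary = [child_summaries(children)[idx] for _, children in roots]
    print(name, round(pearson(root_acc, summary), 2), round(spearman(root_acc, summary), 2))
```

If root accuracy were a deterministic monotone function of any one summary, the corresponding Spearman correlation would be close to 1, which is the contrast the weak reported values are drawn against.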

## 6 Conclusion, Limitations, and Future Work

#### Conclusion\.

We introduced TD\-Grokking, a training\-time decomposition framework for extracting useful RLVR signal from hard examples that are initially zero\-reward under direct outcome supervision\. Instead of changing the verifier, adding process rewards, or providing decomposition at inference time, TD\-Grokking converts each hard root into self\-contained and verifiable subproblems that can be optimized with the same final\-answer reward\. Across mathematical reasoning and medical QA benchmarks, mixed training consistently outperforms direct root\-only training, while the sub\-only diagnostic shows that root\-derived subproblem practice can recover a non\-trivial subset of previously unsolved roots even without direct root rewards\. These results support the central claim that decomposition is not only an inference scaffold, but also an effective training\-time mechanism for making hard problems learnable\.

#### Limitations\.

The main limitation of TD\-Grokking is that decomposition is not free: it requires an external construction model, validation passes, and difficulty calibration before RL training begins\. However, our experiments show that this additional computation is worthwhile for zero\-reward hard\-example pools, where direct root\-only training provides little usable signal\. By spending compute up front to expose trainable subproblem tasks, TD\-Grokking improves final benchmark performance and the dynamic variant further reduces generated\-token cost by retiring mastered subproblems\. Due to limited computational resources, we were not able to conduct broader experiments across more model families, domains, and decomposition quality regimes\.

#### Future work\.

Future work should study recursive decomposition on larger models and harder domains where first\-level subproblems may still remain zero\-reward\. It would also be useful to develop automatic policies for deciding when the decomposition cost is justified and how much computation should be allocated to construction, validation, calibration, and RL training\.

## References

- H. Chen, Z. Fang, Y. Singla, and M. Dredze (2025). Benchmarking large language models on answering and explaining challenging medical questions. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, New Mexico, pp. 3563–3599. External Links: ISBN 979-8-89176-189-6, [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.182), [Link](https://aclanthology.org/2025.naacl-long.182/). Cited by: [§4.2](https://arxiv.org/html/2606.09883#S4.SS2.p1.1).
- J. C. Chen, B. X. Peng, P. K. Choubey, K. Huang, J. Zhang, M. Bansal, and C. Wu (2026). Nudging the boundaries of llm reasoning. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=hfNnQHkTtv). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p3.1), [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px3.p1.1).
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025). DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. Nature 645, pp. 633–638. External Links: [Document](https://dx.doi.org/10.1038/s41586-025-09422-z), [Link](https://www.nature.com/articles/s41586-025-09422-z). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p1.1).
- C. He, R. Luo, Y. Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun (2024). OlympiadBench: a challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 3828–3850. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.211), [Link](https://aclanthology.org/2024.acl-long.211/). Cited by: [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p1.1).
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021a). Measuring massive multitask language understanding. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=d7KBjmI3GmQ). Cited by: [§4.2](https://arxiv.org/html/2606.09883#S4.SS2.p1.1).
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021b). Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. External Links: [Link](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html). Cited by: [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p1.1).
- D. Huang, S. Shi, C. Lin, J. Yin, and W. Ma (2016). How well do computers solve math word problems? large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 887–896. External Links: [Document](https://dx.doi.org/10.18653/v1/P16-1084), [Link](https://aclanthology.org/P16-1084/). Cited by: [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p2.4).
- D. Jin, E. Pan, N. Oufattole, W. Weng, H. Fang, and P. Szolovits (2021). What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences 11(14), pp. 6421. External Links: [Document](https://dx.doi.org/10.3390/app11146421). Cited by: [§4.2](https://arxiv.org/html/2606.09883#S4.SS2.p1.1).
- Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu (2019). PubMedQA: a dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 2567–2577. External Links: [Link](https://pubmedqa.github.io/). Cited by: [§4.2](https://arxiv.org/html/2606.09883#S4.SS2.p1.1).
- T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal (2023). Decomposed prompting: a modular approach for solving complex tasks. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=_nGgzQjzaRy). Cited by: [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px1.p1.1).
- R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman, and H. Hajishirzi (2016). MAWPS: a math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 1152–1157. External Links: [Document](https://dx.doi.org/10.18653/v1/N16-1136), [Link](https://aclanthology.org/N16-1136/). Cited by: [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p2.4).
- J. Li, H. Lin, H. Lu, K. Wen, Z. Yang, J. Gao, Y. Wu, and J. Zhang (2026). QuestA: expanding reasoning capacity in llms via question augmentation. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=3MifB0f7qR). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p3.1), [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px3.p1.1).
- B. Liao, H. Dong, X. Xu, C. Monz, and J. Bian (2026). Self-hinting language models enhance reinforcement learning. arXiv preprint arXiv:2602.03143. External Links: [Link](https://arxiv.org/abs/2602.03143). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p3.1), [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px3.p1.1).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024). Let’s verify step by step. In International Conference on Learning Representations. External Links: [Link](https://proceedings.iclr.cc/paper_files/paper/2024/hash/aca97732e30bcf1303bc22ac3924fd16-Abstract-Conference.html). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p2.1), [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px2.p1.1).
- L. Luo, Y. Liu, R. Liu, S. Phatale, M. Guo, H. Lara, Y. Li, L. Shu, Y. Zhu, L. Meng, J. Sun, and A. Rastogi (2024). Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592. External Links: [Link](https://arxiv.org/abs/2406.06592). Cited by: [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px2.p1.1).
- Math-AI Team (2026). AMC23 dataset. Note: [https://huggingface.co/datasets/math-ai/amc23](https://huggingface.co/datasets/math-ai/amc23), Dataset. External Links: [Link](https://huggingface.co/datasets/math-ai/amc23). Cited by: [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p1.1).
- S. Miao, C. Liang, and K. Su (2020). A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 975–984. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.92), [Link](https://aclanthology.org/2020.acl-main.92/). Cited by: [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p2.4).
- A. Pal, L. K. Umapathi, and M. Sankarasubbu (2022). MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, Proceedings of Machine Learning Research, Vol. 174, pp. 248–260. External Links: [Link](https://proceedings.mlr.press/v174/pal22a.html). Cited by: [§4.2](https://arxiv.org/html/2606.09883#S4.SS2.p1.1).
- O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis (2023). Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, pp. 5687–5711. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.378), [Link](https://aclanthology.org/2023.findings-emnlp.378/). Cited by: [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px1.p1.1).
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024). GPQA: a graduate-level google-proof Q&A benchmark. In Proceedings of the First Conference on Language Modeling. External Links: [Link](https://openreview.net/forum?id=Ti67584b98). Cited by: [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p1.1).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. External Links: [Link](https://arxiv.org/abs/2402.03300). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p1.1), [§1](https://arxiv.org/html/2606.09883#S1.p5.1), [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p1.1).
- Y. Sun, Y. Cao, P. Huang, H. Bai, H. Hajishirzi, N. Dziri, and D. Song (2026). RL grokking recipe: how does rl unlock and transfer new algorithms in llms? In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=CJJ8VxOWbG). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p2.1), [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px2.p1.1).
- L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K. Lee, and E. Lim (2023). Plan-and-solve prompting: improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, pp. 2609–2634. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.147), [Link](https://aclanthology.org/2023.acl-long.147/). Cited by: [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px1.p1.1).
- P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui (2024). Math-shepherd: verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 9426–9439. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.510), [Link](https://aclanthology.org/2024.acl-long.510/). Cited by: [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px2.p1.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, Vol. 35, pp. 24824–24837. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html). Cited by: [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px1.p1.1).
- Y. Xia, C. Xu, Z. Yao, J. McAuley, and Y. He (2026). Learning to hint for reinforcement learning. arXiv preprint arXiv:2604.00698. External Links: [Link](https://arxiv.org/abs/2604.00698). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p3.1), [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px3.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2505.09388), [Link](https://arxiv.org/abs/2505.09388). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p1.1), [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p1.1).
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan (2023). Tree of thoughts: deliberate problem solving with large language models. In Advances in Neural Information Processing Systems. External Links: [Link](https://proceedings.neurips.cc/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract.html). Cited by: [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px1.p1.1).
- X. Zhang, S. Wu, Y. Zhu, H. Tan, S. Yu, Z. He, and J. Jia (2026). Scaf-grpo: scaffolded group relative policy optimization for enhancing llm reasoning. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=bOwVr0yr7r). Cited by: [§1](https://arxiv.org/html/2606.09883#S1.p3.1), [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px3.p1.1).
- Y. Zhang and Math-AI Team (2024). American invitational mathematics examination (AIME) 2024. Note: [https://huggingface.co/datasets/math-ai/aime24](https://huggingface.co/datasets/math-ai/aime24), Dataset. External Links: [Link](https://huggingface.co/datasets/math-ai/aime24). Cited by: [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p1.1).
- Y. Zhang and Math-AI Team (2025). American invitational mathematics examination (AIME) 2025. Note: [https://huggingface.co/datasets/math-ai/aime25](https://huggingface.co/datasets/math-ai/aime25), Dataset. External Links: [Link](https://huggingface.co/datasets/math-ai/aime25). Cited by: [§4.1](https://arxiv.org/html/2606.09883#S4.SS1.SSS0.Px1.p1.1).
- C. Zheng, Z. Zhang, B. Zhang, R. Lin, K. Lu, B. Yu, D. Liu, J. Zhou, and J. Lin (2025). ProcessBench: identifying process errors in mathematical reasoning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 1009–1024. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.50), [Link](https://aclanthology.org/2025.acl-long.50/). Cited by: [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px2.p1.1).
- D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi (2023). Least-to-most prompting enables complex reasoning in large language models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=WZH7099tgfM). Cited by: [§2](https://arxiv.org/html/2606.09883#S2.SS0.SSS0.Px1.p1.1).

## Appendix A Experimental Details

This appendix records the data construction, benchmark protocol, generation settings, decomposition prompt, and qualitative cases used in the experiments\. All evaluations are root\-prompt evaluations unless a diagnostic is explicitly described as a subproblem evaluation\. The model never receives a decomposition guide, subproblem list, or intermediate answer at evaluation time\.

### A.1 Training Corpus Accounting

The mathematical experiments start from the zero-reward slice of DeepScaleR-hard for Qwen3-1.7B. The initial decomposition pool contains 4,944 root questions whose root accuracy was zero under the calibration budget used during data construction. After decomposition, validation, and salvage, the final training corpus contains 3,788 aligned roots and 14,717 retained subproblems. Table [3](https://arxiv.org/html/2606.09883#A1.T3) gives the exact training views used by the main mathematical runs.

Table 3: Mathematical training corpus used by the reported H100 GRPO runs. The mixed view has an average of 3.89 retained subproblems per root.

The earlier full decomposition pass produced 3,989 usable roots before the solq95 filter. The paper reports the filtered corpus because it is the dataset actually used by the H100 root-only, sub-only, mixed, and dynamic runs. The relationship between the three views is controlled: every retained subproblem is tied to exactly one retained root, and every retained root appears once in root-only and mixed training.
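As a quick check on the corpus numbers above, the sketch below recomputes the subproblems-per-root average and two derived quantities. The mixed-view size is an inference from the stated one-root-once, one-subproblem-once relationship, not a figure quoted in the text, and the variable names are illustrative.

```python
# Counts quoted in Appendix A.1 (variable names are illustrative).
initial_pool = 4944            # zero-reward roots entering decomposition
retained_roots = 3788          # aligned roots in the final corpus
retained_subproblems = 14717   # retained subproblems in the final corpus

subs_per_root = retained_subproblems / retained_roots
root_retention = retained_roots / initial_pool
# Assumption: the mixed view is the union of the root-only and sub-only views,
# with every retained item appearing exactly once.
mixed_view_size = retained_roots + retained_subproblems

print(f"subproblems per root: {subs_per_root:.2f}")   # 3.89
print(f"root retention:       {root_retention:.1%}")  # 76.6%
print(f"mixed view size:      {mixed_view_size}")     # 18505
```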

### A.2 Formal RLVR Signal View

This section makes explicit the sparse-reward mechanism that motivates training-time decomposition. Let a verifiable training item be $z=(q_{z},a_{z})$, where $q_{z}$ is the prompt and $a_{z}$ is the reference answer. For rollout $o_{z,k}\sim\pi_{\theta}(\cdot\mid q_{z})$, let $E(o_{z,k})$ be the extracted final answer and let $V$ be the verifier. The binary RLVR reward is

$$r_{z,k}=\mathbf{1}\!\left[V\!\left(E(o_{z,k}),a_{z}\right)=1\right].$$

With $K$ sampled rollouts, the empirical item accuracy is

$$\widehat{p}_{K}(z;\theta)=\frac{1}{K}\sum_{k=1}^{K}r_{z,k}.$$

A root $x$ is zero-reward for the starting policy if $\widehat{p}_{K}(x;\theta_{0})=0$. This is not a statement that $x$ is impossible; it means that the sampled outcome rewards for $x$ contain no positive event under the current policy and sampling budget.

For GRPO-style group-relative optimization, the outcome advantage for a rollout from item $z$ is computed from rewards within the same prompt group:

$$\bar{r}_{z}=\frac{1}{G}\sum_{g=1}^{G}r_{z,g},\qquad s_{z}=\sqrt{\frac{1}{G}\sum_{g=1}^{G}(r_{z,g}-\bar{r}_{z})^{2}},$$

$$A_{z,g}=\frac{r_{z,g}-\bar{r}_{z}}{s_{z}+\epsilon}.$$

If all rewards in the group are zero, then $\bar{r}_{z}=0$ and $s_{z}=0$, so every advantage $A_{z,g}$ is zero and the group contains no outcome-level contrast. If an item has true success probability $p_{z}$, the probability that a $G$-rollout group contains at least one positive reward is

$$P_{\mathrm{info}}(z)=1-(1-p_{z})^{G}.$$

Thus a zero-reward root with $p_{x}\approx 0$ is unlikely to produce informative groups, while a derived subproblem $s$ with $p_{s}>0$ can produce positive reward events with probability $1-(1-p_{s})^{G}$. Decomposition densifies RLVR by moving training mass from roots with $P_{\mathrm{info}}\approx 0$ to root-conditioned nodes with nonzero $P_{\mathrm{info}}$, while mixed training keeps some pressure on the original root distribution.
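The quantities defined in this subsection are easy to compute directly. The following is a minimal sketch, not the training code: rewards are taken as given 0/1 values, the function names are hypothetical, and the value of `eps` is an assumption, since the source names $\epsilon$ without giving a number.

```python
import math


def empirical_accuracy(rewards):
    """p_hat_K: mean of the binary rewards over K rollouts."""
    return sum(rewards) / len(rewards)


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages A_g = (r_g - mean) / (std + eps).

    Uses the population standard deviation (divide by G), as in the formula
    above. The default eps is an assumed value.
    """
    group_size = len(rewards)
    mean = sum(rewards) / group_size
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / group_size)
    return [(r - mean) / (std + eps) for r in rewards]


def p_info(p, group_size):
    """Probability that a group of i.i.d. rollouts has at least one success."""
    return 1.0 - (1.0 - p) ** group_size


zero_reward_group = [0] * 8
print(empirical_accuracy(zero_reward_group))  # 0.0, so the item is zero-reward
print(group_advantages(zero_reward_group))    # every advantage is 0.0: no signal

mixed_group = [1, 0, 0, 0, 0, 0, 0, 0]
print(group_advantages(mixed_group))          # one positive, seven negative values

# A root with tiny success probability versus a subproblem with p = 0.1, G = 8.
print(round(p_info(0.001, 8), 4), round(p_info(0.1, 8), 4))
```

With eight rollouts, a subproblem solved 10% of the time yields an informative group more often than not, whereas a root solved 0.1% of the time almost never does.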

Formally, for each retained root $x$ we construct a one-level decomposition tree

$$T_{x}=\{x\}\cup\mathcal{D}(x),\qquad\mathcal{D}(x)=\{s_{x,1},\ldots,s_{x,m_{x}}\}.$$

Each node $z\in T_{x}$ has its own verifiable answer $a_{z}$. The three static training views are

$$\mathcal{S}_{\mathrm{root}}=\{x\},\qquad\mathcal{S}_{\mathrm{sub}}=\bigcup_{x}\mathcal{D}(x),\qquad\mathcal{S}_{\mathrm{mix}}=\mathcal{S}_{\mathrm{root}}\cup\mathcal{S}_{\mathrm{sub}}.$$

The dynamic variant maintains an active set $\mathcal{A}_{t}\subseteq\mathcal{S}_{\mathrm{mix}}$. In the reported implementation,

$$\widehat{p}_{8}(z;\theta_{t})=0\Rightarrow\text{activate children of }z,\qquad\widehat{p}_{8}(z;\theta_{t})\geq\frac{7}{8}\Rightarrow\text{retire }z.$$

These discrete thresholds instantiate the intended “below 10%” and “above 80%” active-set rules under an eight-rollout group.
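One way to read the activate/retire rule is as a per-step update of the active set. The sketch below is an illustration under stated assumptions rather than the reported implementation: the `Node` class and function name are hypothetical, a node with zero accuracy is assumed to stay active alongside its newly activated children, and retired nodes are assumed never to be reactivated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A root or subproblem in a decomposition tree (hypothetical structure)."""
    name: str
    children: list = field(default_factory=list)


def update_active_set(active, retired, successes, group_size=8):
    """Apply the activate/retire thresholds once.

    active:    dict name -> Node currently being trained on
    retired:   set of names already retired (updated in place)
    successes: dict name -> number of correct rollouts out of group_size
    """
    # Retire nodes with p_hat >= 7/8 (integer arithmetic avoids float compares).
    for name in active:
        if 8 * successes[name] >= 7 * group_size:
            retired.add(name)

    next_active = {}
    for name, node in active.items():
        if name in retired:
            continue
        next_active[name] = node  # assumption: unsolved nodes stay active
        if successes[name] == 0:  # p_hat == 0: activate the children
            for child in node.children:
                if child.name not in retired:
                    next_active.setdefault(child.name, child)
    return next_active


s1, s2 = Node("s1"), Node("s2")
root = Node("root", children=[s1, s2])
retired = set()

active = update_active_set({"root": root}, retired, {"root": 0})
print(sorted(active))  # ['root', 's1', 's2']

active = update_active_set(active, retired, {"root": 0, "s1": 7, "s2": 3})
print(sorted(active))  # ['root', 's2']
```

With eight rollouts, the only accuracy below 10% is 0/8 and the only accuracies above 80% are 7/8 and 8/8, which is why the rule reduces to the two integer comparisons used here.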

Finally, the local-to-global intuition can be expressed through a simple diagnostic approximation. If a root solution requires $m$ local requirements and requirement $j$ succeeds with probability $p_{j}$, then under an independence approximation,

$$p_{\mathrm{root}}\approx\prod_{j=1}^{m}p_{j},\qquad\Delta\log p_{\mathrm{root}}\approx\sum_{j=1}^{m}\Delta\log p_{j}.$$

This approximation is not an assumption used by the training algorithm; it is a useful interpretation of the case studies. Improving a weak local step can have a disproportionate effect on the probability that the full root reasoning chain closes successfully.
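A small numeric example makes the "weak step" effect concrete. The probabilities below are invented for illustration; only the product and log-sum relations come from the approximation above.

```python
import math


def root_success(local_probs):
    """Independence approximation: the root succeeds if every local step does."""
    return math.prod(local_probs)


before = [0.9, 0.9, 0.1]  # hypothetical: two strong steps, one weak step
after = [0.9, 0.9, 0.5]   # only the weak step improves

p_before = root_success(before)
p_after = root_success(after)
print(f"{p_before:.3f} -> {p_after:.3f}")  # 0.081 -> 0.405

# The change in log root probability equals the sum of per-step log changes.
delta_log_root = math.log(p_after) - math.log(p_before)
delta_log_steps = sum(math.log(a) - math.log(b) for a, b in zip(after, before))
print(round(delta_log_root, 3), round(delta_log_steps, 3))
```

Raising the weak step from 0.1 to 0.5 multiplies the approximate root success probability by five, while the same absolute gain on a step already at 0.9 is not even possible.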

### A.3 Decomposition Pipeline

Decomposition is a data\-construction step, not an inference\-time scaffold\. For each selected root problem, the pipeline obtains or constructs an answer\-consistent guide, segments the guide into local requirements, rewrites those requirements into standalone problem\-answer pairs, and validates the resulting candidates\. Retained candidates must be self\-contained and verifiable by the same final\-answer interface as the root problem\. The validator also screens out candidates that simply copy the full root problem or expose the full solution trace\.
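The validator's screen against candidates that copy the root or leak the solution could look roughly like the sketch below. The paper does not describe how the screen is implemented, so the string-similarity test, the `copy_threshold` value, and the function name are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def screen_candidate(question, root_question, guide, copy_threshold=0.9):
    """Return the reasons a candidate subproblem should be rejected.

    An empty list means the candidate passed both screens. The threshold is
    an assumed value, not one reported in the paper.
    """
    reasons = []
    q, root, trace = (s.strip().lower() for s in (question, root_question, guide))

    # Screen 1: the candidate is a near-verbatim restatement of the root.
    if SequenceMatcher(None, q, root).ratio() >= copy_threshold:
        reasons.append("copies the root problem")

    # Screen 2: the candidate question embeds the whole solution guide.
    if trace and trace in q:
        reasons.append("exposes the full solution trace")

    return reasons


root_q = "How many positive integers n <= 100 make n^2 + 1 divisible by 5?"
guide = "Squares mod 5 are 0, 1, 4, so n must be 2 or 3 mod 5, giving 40 values."

print(screen_candidate("Compute 7^2 + 1.", root_q, guide))  # []
print(screen_candidate(root_q, root_q, guide))              # ['copies the root problem']
```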

The production decomposition generator was DeepSeek-V3.2 through a chat API. The retry-focused production prompt version was v6\_text\_blocks\_retry\_focus. The important generation and validation settings are summarized in Table [4](https://arxiv.org/html/2606.09883#A1.T4).

Table 4: Decomposition generation and validation settings used to build the mathematical training corpus.

#### Core decomposition prompt.

Following the appendix style of full prompt templates, we include the exact static text of the production decomposition prompt. Problem-dependent slots are shown in angle brackets. The selected good and bad examples are inserted verbatim from the few-shot library according to the coarse tags inferred from the root problem. The Verification field is not used as a reward signal; it is a construction-time sanity check explaining why the subproblem is self-contained.

Prompt Template of TD\-Grokking Decomposition Problems

Listing 1: System message for decomposition generation.

You are a careful math curriculum designer.

Your job is to decompose one original math problem into a small set of useful training subproblems.

You must optimize for:

1\. self-contained subproblems,

2\. solvable subproblems,

3\. independence between subproblems,

4\. reusable mathematical skills,

5\. exact-answer compatibility with programmatic evaluation.

Output only the requested text blocks.

If a later subproblem needs a quantity that could have been computed earlier, restate that concrete quantity directly inside the new question instead of referring to an earlier subproblem.

Listing 2:User prompt template for decomposition generation\.Decomposethefollowingmathproblemintoasmallsetoftrainingsubproblems\.

Goal:

\-producesubproblemsthatareindependentlysolvable,

\-eachsubproblemmustrestateallconditionsitneeds,

\-eachsubproblemmusttrainareusablemathskillfromtheoriginalproblem,

\-thefinalsubproblemmusthavethesamefinalanswerastheoriginalanswer\.

Hardrequirements:

1\.Returnbetween3and8subproblems\.

2\.Everysubproblemmustbeself\-contained\.Donotrefertoothersubproblems,previousresults,orhiddencontext\.

3\.Everysubproblemmustrestatethenumericvalues,symbols,domains,constraints,anddefinitionsitneeds\.Thisrestatementcanbenaturalproseoranexplicit‘Given:‘clause\.Ifalaterstepreusesaquantity,restatetheconcretevalueorformuladirectlyinsteadofcitinganearlierstep\.

4\.Donotwriteproof,verification,or"showthat"stylesubproblems\.

5\.Donotcreatetrivialmetasubproblemssuchas"isthefinalanswercorrect?"or"choosethecorrectoption"unlesstheoriginalproblemisinherentlymultiple\-choiceandthatstepstillrequiresrealreasoning\.

6\.Preferconcretesubproblemsoverplaceholder\-onlyalgebra\.DonotinventabstractvariableslikeS,T,orRunlesstheyaredefinedinsidethesamesubproblemandgenuinelyuseful\.

7\.Each‘answer‘mustbeashortexactfinalanswerusableasgroundtruth\.Donotincludeexplanations,equations,ormultiplesentencesin‘answer‘\.

8\. Keep ‘reasoning‘, ‘solution‘, and ‘verification‘ concise but complete\. Never leave any field blank\. ‘verification‘ must explain why the subproblem is self\-contained and solvable on its own, not why it matches another subproblem\.

9\. The last subproblem must solve for the original target and must end with the original answer exactly\.

10\. Across the whole decomposition, prefer at least two genuinely different skill steps\. Avoid repeating the same shell sentence with only numbers changed\.

11\. Never return only 1 or 2 subproblems\. If the problem feels simple, split it into smaller concrete computations anyway\.

12\. Do not introduce new helper objects like new polynomials, functions, sequences, points, or variables unless the original problem already uses them or the helper is strictly necessary for a short computation\.

13\. Do not let the first subproblem absorb the whole task\. Each subproblem should ask for one concrete intermediate quantity, relation, or check\.

14\. Keep each field short, but prioritize completeness over rigid sentence counts: Question at most 3 sentences, Reasoning 1 to 2 short sentences, Solution 1 to 3 short sentences, Answer one line, Verification 1 to 2 short sentences\.

15\. If unsure, prefer simple arithmetic, algebraic, geometric, or probabilistic intermediate quantities over long theory summaries\.

16\. If a subproblem would otherwise be too long, shorten the decomposition or simplify the wording\. Do not leave ‘solution‘, ‘answer‘, or ‘verification‘ blank\.

17\. Each subproblem should target exactly one concrete intermediate quantity, one concrete relation, or one concrete counting/probability result\. Do not turn a subproblem into a mini\-lecture or a long multi\-part derivation\.

18\. Target 4 to 6 subproblems by default\. Use 7 or 8 only when the original problem clearly needs them\. Never exceed 8\.

19\. If a later subproblem needs an earlier result, rewrite the question in the form ‘Given <explicit value or formula\> \.\.\.‘ or restate the full quantity in plain language\. Never write phrases like ‘from the previous step‘, ‘from Subproblem 3‘, ‘using the result above‘, or ‘same as before‘\.

20\. If you are running out of space, reduce the number of subproblems instead of leaving the last subproblem incomplete\. The final subproblem must always contain a non\-empty ‘solution‘ and a non\-empty ‘answer‘\.

21\. The final ‘answer‘ should match the original answer’s target and outer form as closely as possible\. If the original answer is an expression, give only that expression\. If it is an equation, inequality, set, ordered pair, list, or named quantity, preserve that outer structure instead of answering with a different but related object\.

22\. Do not end with a verification\-only subproblem\. If the last step would only check correctness, merge it into the previous computational step and keep the final answer there\.

23\. In ‘answer‘, write only the target answer\. Do not write explanatory prefixes such as ‘Therefore‘, ‘So‘, ‘The answer is‘, ‘check‘, ‘because‘, or a full sentence\.

Style preference:

<STYLE\_INSTRUCTION\>

Return only text in this exact structure:

\#\#\# Subproblem 1

Question: \.\.\.

Reasoning: \.\.\.

Solution: \.\.\.

Answer: \.\.\.

Verification: \.\.\.

\#\#\# Subproblem 2

Question: \.\.\.

Reasoning: \.\.\.

Solution: \.\.\.

Answer: \.\.\.

Verification: \.\.\.

\#\#\# Subproblem 3

Question: \.\.\.

Reasoning: \.\.\.

Solution: \.\.\.

Answer: \.\.\.

Verification: \.\.\.

Use the same six labels for every subproblem block\. Do not use JSON\. Do not use code fences\.

Good decomposition examples to imitate in spirit, not in wording:

<SELECTED\_GOOD\_DECOMPOSITION\_EXAMPLES\>

Bad decomposition pattern to avoid:

<SELECTED\_BAD\_DECOMPOSITION\_EXAMPLES\>

Original problem:

<ORIGINAL\_PROBLEM\>

Original answer:

<ORIGINAL\_ANSWER\>

Reference solution:

<REFERENCE\_SOLUTION\_OR\_NO\_REFERENCE\_SOLUTION\_PROVIDED\>

The End of Prompt

For the final production retry pass, STYLE\_INSTRUCTION was the natural\-style instruction:

> Prefer natural question wording\. Restate necessary conditions inside each question, but do not force every subproblem into the same explicit template\. Use Given: only when it genuinely improves clarity\.
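The output format and several of the numbered rules above are mechanical enough to check automatically\. The following is a minimal sketch of such a check, not the authors' pipeline: the function names `parse_decomposition` and `validate`, the exact string comparison for the final answer, and the list of banned phrases are illustrative assumptions drawn from rules 8, 9, 11, 18, and 19\.

```python
import re

FIELDS = ("Question", "Reasoning", "Solution", "Answer", "Verification")
LABEL_ALTERNATION = "|".join(FIELDS)
# Phrases that rule 19 forbids in a subproblem question (illustrative subset).
BANNED = ("from the previous step", "using the result above", "same as before")


def parse_decomposition(text):
    """Split a response into one dict of fields per '### Subproblem N' block."""
    header = r"^###\s*Subproblem\s*\d+\s*$"
    blocks = re.split(header, text, flags=re.MULTILINE)[1:]
    subproblems = []
    for block in blocks:
        fields = {}
        for name in FIELDS:
            # Capture text after "Name:" up to the next label or the block end.
            pattern = rf"^{name}:(.*?)(?=^(?:{LABEL_ALTERNATION}):|\Z)"
            match = re.search(pattern, block, flags=re.MULTILINE | re.DOTALL)
            fields[name.lower()] = match.group(1).strip() if match else ""
        subproblems.append(fields)
    return subproblems


def validate(subproblems, original_answer):
    """Return a list of rule violations; an empty list means the checks passed."""
    errors = []
    if not 3 <= len(subproblems) <= 8:  # rules 11 and 18
        errors.append("expected 3 to 8 subproblems")
    for i, sp in enumerate(subproblems, start=1):
        for key, value in sp.items():  # rule 8: no blank fields
            if not value:
                errors.append(f"subproblem {i}: blank {key}")
        question = sp["question"].lower()
        if any(p in question for p in BANNED) or re.search(
            r"from subproblem \d+", question
        ):  # rule 19
            errors.append(f"subproblem {i}: refers to an earlier step")
    # Rule 9, assuming a plain string comparison is an acceptable proxy.
    if subproblems and subproblems[-1]["answer"] != original_answer.strip():
        errors.append("final answer does not match the original answer")
    return errors
```

A response that fails `validate` could be sent back for a retry pass; semantic checks such as whether a subproblem is genuinely self\-contained still need a separate verifier\.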

### A\.4 Subproblem Skill Annotation Prompt

The skill labels used in the diagnostic figures are produced after decomposition\. The annotation pass is separate from training and evaluation: the labels are used only for analysis, not as policy input, reward input, or test\-time scaffolding\. We use a frozen taxonomy with one primary label and up to two secondary labels per subproblem\. The exact prompt template used for the annotation pass is shown below\.

Listing 3: System message for atomic skill annotation\.

You are a careful math annotation assistant\.

Your task is to classify each subproblem into an atomic skill taxonomy\.

You must follow these rules strictly:

1\. Use only labels from the provided label set\.

2\. Output exactly one primary\_label\.

3\. Output zero to two secondary\_labels\.

4\. Treat verification as a secondary label unless the problem is almost entirely a consistency check\.

5\. Prefer the skill required to solve the problem, not superficial wording\.

6\. If the problem asks for an unknown value by solving an equation or system, prefer equation\_solving\.

7\. If the problem mainly asks to rewrite, simplify, or transform an expression, prefer algebraic\_manipulation\.

8\. If the problem is mainly about ratios, averages, unit rates, prices, speed, or percentages, prefer ratio\_rate\_percent\.

9\. If the problem is mainly about counting, probability, expectation, or combinatorial arrangements, prefer counting\_probability\.

10\. If the problem is mainly about divisibility, modular arithmetic, bases, remainders, gcd, lcm, or integer structure, prefer number\_theory\.

11\. If the problem is mainly about geometric quantities such as length, area, volume, perimeter, coordinates, or chord length, prefer geometry\_measurement\.

12\. If the problem is mainly about geometric theorems, angle relations, tangent properties, conics, similarity, congruence, or geometric structure, prefer geometry\_relation\.

13\. If the problem is mainly direct evaluation of a known numeric expression, prefer numeric\_computation\.

14\. If the problem is mainly about recurrence, progression, repeated updates, or iterative processes, prefer sequence\_recurrence\.

15\. If the problem is mainly about function definition recovery or substitution into a function form, prefer function\_substitution\.

16\. If the problem is mainly about exponent rules, logarithm rules, trigonometric identities, or standard trigonometric ratios/values, prefer exponent\_log\_trig\_rules\.

Output valid JSON only\.

Return:

\{

"results":\[

\{

"problem\_id":"\.\.\.",

"primary\_label":"\.\.\.",

"secondary\_labels":\["\.\.\.","\.\.\."\],

"rationale":"\.\.\."

\}

\]

\}

Listing 4: User prompt template for atomic skill annotation\.

Classify the following subproblems using the fixed atomic skill taxonomy\.

Allowed labels:

\- numeric\_computation

\- algebraic\_manipulation

\- equation\_solving

\- function\_substitution

\- exponent\_log\_trig\_rules

\- ratio\_rate\_percent

\- counting\_probability

\- number\_theory

\- sequence\_recurrence

\- geometry\_measurement

\- geometry\_relation

\- verification

Few\-shot examples:

<FEWSHOT\_JSON\_LINES\>

Target batch:

\{

"label\_set":\[

"numeric\_computation",

"algebraic\_manipulation",

"equation\_solving",

"function\_substitution",

"exponent\_log\_trig\_rules",

"ratio\_rate\_percent",

"counting\_probability",

"number\_theory",

"sequence\_recurrence",

"geometry\_measurement",

"geometry\_relation",

"verification"

\],

"subproblems":\[

\{

"problem\_id":"\.\.\.",

"question":"\.\.\."

\}

\]

\}

The final retained diagnostic set contains 2,487 labeled subproblems from 514 parent roots\. Figure [5](https://arxiv.org/html/2606.09883#A1.F5) visualizes the primary\-secondary co\-occurrence structure induced by the annotation prompt\. The heatmap is sparse by design: secondary labels are used only when a subproblem genuinely requires a second skill rather than as a generic topic tag\.
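As a rough illustration of how annotation output in the JSON shape above could be turned into primary\-secondary co\-occurrence counts, the sketch below validates each record against the frozen label set and tallies the pairs\. The helper names `is_valid` and `cooccurrence` are hypothetical, and silently skipping invalid records is an assumption rather than the authors' documented behavior\.

```python
import json
from collections import Counter

LABELS = {
    "numeric_computation", "algebraic_manipulation", "equation_solving",
    "function_substitution", "exponent_log_trig_rules", "ratio_rate_percent",
    "counting_probability", "number_theory", "sequence_recurrence",
    "geometry_measurement", "geometry_relation", "verification",
}


def is_valid(item):
    """Check one record: one allowed primary label, at most two allowed secondaries."""
    secondary = item.get("secondary_labels", [])
    return (
        item.get("primary_label") in LABELS
        and len(secondary) <= 2
        and all(label in LABELS for label in secondary)
    )


def cooccurrence(raw_json):
    """Count (primary, secondary) label pairs over all valid records."""
    counts = Counter()
    for item in json.loads(raw_json)["results"]:
        if not is_valid(item):
            continue  # assumption: invalid records are dropped, not repaired
        for secondary in item.get("secondary_labels", []):
            counts[(item["primary_label"], secondary)] += 1
    return counts
```

Pairs that never occur are simply absent from the counter, which corresponds to the blank cells of a sparse heatmap\.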

![Refer to caption](https://arxiv.org/html/2606.09883v1/x6.png)Figure 5: Primary\-secondary atomic skill co\-occurrence over the 2,487 retained subproblems in the paired diagnostic\. Rows are primary labels and columns are secondary labels; blank cells have zero count\. The same frozen taxonomy is used for the skill\-conditioned transfer analysis in Figure [4](https://arxiv.org/html/2606.09883#S5.F4) and Table [8](https://arxiv.org/html/2606.09883#A2.T8)\.

### A\.5 Dynamic Active\-Set Rule

The dynamic variant uses the same one\-level root\-subproblem graph as the mixed training view, but changes which rows are active\. Each root starts active\. A root with no successful rollout in the current 8\-sample group activates its first\-level subproblems\. A row with at least 7 successful rollouts out of 8 is treated as mastered and can leave the active set\. Thus the textual thresholds “below 10%” and “above 80%” are implemented as the discrete rules 0/8 and at least 7/8, respectively\. This keeps the reward function unchanged and only changes the training sampler\.
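The rule can be sketched as a small update function over row identifiers\. This is an illustrative reading of the paragraph above, not the released sampler: the name `update_active_set` and the data layout are hypothetical, and two behaviors are assumptions not stated in the text, namely that a zero\-success root stays active alongside its subproblems and that a mastered row is never re\-activated\.

```python
GROUP_SIZE = 8      # rollouts sampled per row in one group
MASTERED_MIN = 7    # at least 7/8 successes counts as mastered


def update_active_set(active, mastered, successes, children):
    """Apply the 0/8 and >=7/8 rules once.

    active:    set of row ids currently sampled for training
    mastered:  set of row ids already retired
    successes: row id -> successful rollouts in the current group
    children:  root id -> list of first-level subproblem ids
    """
    current = set(active)  # snapshot, so new rows are not judged this step
    active, mastered = set(active), set(mastered)
    for row, n_success in successes.items():
        if row not in current:
            continue
        if n_success >= MASTERED_MIN:
            mastered.add(row)
        elif n_success == 0:
            # Zero-reward root: bring its first-level subproblems into play.
            active.update(children.get(row, []))
    # Assumption: mastered rows leave the active set permanently.
    active -= mastered
    return active, mastered
```

Because only the set of sampled rows changes, the reward function needs no modification, which matches the closing remark of the paragraph\.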

### A\.6 Benchmark and Generation Protocols

All mathematical benchmark results in Table [1](https://arxiv.org/html/2606.09883#S4.T1) are produced through the same local lm\-eval/vLLM evaluation wrapper\. Unless stated otherwise, generation uses Qwen thinking mode with chat templating enabled, temperature 0\.6, top\-p 0\.95, top\-k 20, min\-p 0, and bfloat16 vLLM inference\. AIME results use repeated sampling; the other mathematical benchmarks report single\-run exact\-match accuracy\.

Table 5: Benchmark protocols used for the mathematical evaluation suite\.

For the medical\-domain instantiation, we use generative local tasks that ask the model to produce the final option or short answer directly\. The medical evaluation does not use the math boxed\-answer system instruction\. Reported medical scores are exact\-match percentages after task\-specific normalization\. The MMLU Medical row in Table [2](https://arxiv.org/html/2606.09883#S4.T2) is the macro\-average of Anatomy, Clinical Knowledge, College Medicine, Medical Genetics, and Professional Medicine\.

Table 6: Medical\-domain benchmark protocol\.

### A\.7 Training Hyperparameters

The H100 mathematical GRPO runs share the same training skeleton wherever possible\. The base model is Qwen3\-1\.7B, the reward is the existing final\-answer math verifier, and no decomposition\-specific reward shaping is introduced\. Unless otherwise noted, the reported mathematical RL experiments were run on a single server with 8 NVIDIA H100 GPUs\. Table [7](https://arxiv.org/html/2606.09883#A1.T7) lists the stable run parameters used by the root\-only, sub\-only, mixed, and dynamic families\.

Table 7: Main mathematical GRPO training settings\. The evaluation sampling parameters differ from training and are reported separately in Appendix [A\.6](https://arxiv.org/html/2606.09883#A1.SS6)\.

### A\.8 Anonymized Artifact

The anonymized code and data artifact is available at [https://anonymous\.4open\.science/r/TD\-Grokking\-6567](https://anonymous.4open.science/r/TD-Grokking-6567)\. It includes the data\-construction prompts, preprocessing scripts, training launchers, evaluation scripts, and configuration files needed to reproduce the main experimental comparisons\.

## Appendix B Subproblem Skill Diagnostic

Before the full H100 benchmark suite, we ran a paired diagnostic to test whether training only on subproblems can improve both subproblem accuracy and the corresponding root accuracy\. The diagnostic contains 514 root problems and 2,487 subproblems\. Each item is evaluated with 64 rollouts, temperature 0\.6, top\-p 0\.95, and a 16k token cap\. In this diagnostic, strict exact\-match accuracy rises from 1\.4% to 16\.3% on roots and from 52\.3% to 75\.2% on subproblems\. Table [8](https://arxiv.org/html/2606.09883#A2.T8) reports the full skill\-conditioned transfer breakdown\.

Table 8: Skill\-conditioned transfer after sub\-only RL\. All deltas are percentage points under strict exact\-match evaluation\. Subproblem Δ uses the same subproblem skill rows as Figure [4](https://arxiv.org/html/2606.09883#S5.F4)\. Root Δ is the mean root\-accuracy improvement for roots containing the skill; Gap compares that root delta with roots not containing the skill\. Skill groups overlap, so these conditional gains are not additive\.

![Refer to caption](https://arxiv.org/html/2606.09883v1/x7.png)Figure 6: Heatmap view of the skill\-conditioned transfer table\. The first two columns show strict subproblem accuracy for the base model and the Sub\-only GRPO checkpoint; the remaining columns show subproblem gain, root gain for roots containing the skill, and the conditional transfer gap\. The values are the same as Table [8](https://arxiv.org/html/2606.09883#A2.T8) and the source CSV is included with the figure artifacts\.

![Refer to caption](https://arxiv.org/html/2606.09883v1/x8.png)Figure 7: Subproblem skill radar for the paired diagnostic\. The subproblem\-only checkpoint improves strict accuracy across the seven core subproblem skill families shown here, with especially large gains on geometry, equation\-solving, counting/probability, and numeric computation\.

The radar is intended as a compact visual summary, while the main quantitative claim remains root\-prompt performance\. Several skill\-level patterns are useful for interpretation\. In the strict exact\-match transfer table, geometry\-measurement roots improve by \+23\.1 percentage points when that skill is present; geometry\-relation roots improve by \+22\.5 points; equation\-solving roots improve by \+21\.7 points; and counting/probability roots improve by \+18\.5 points\. By contrast, number\-theory roots improve less and have a negative transfer gap even though their subproblem scores improve\. This is consistent with the paper’s claim that subproblem training helps by stabilizing local steps, but does not guarantee uniform root\-level transfer for every skill family\.

This asymmetry argues against a simple “subproblem accuracy goes up, therefore root accuracy goes up” explanation\. Transfer is strongest when subproblems match local bottlenecks, while number\-theory roots remain difficult, likely because they require global organization or insights not captured by the current subproblems\.
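To make the Root Δ and Gap columns concrete, the sketch below computes both for one skill from per\-root accuracies\. It assumes Gap is the difference between the mean root delta for roots containing the skill and the mean root delta for roots without it, which is one natural reading of the Table 8 caption; the function name `skill_transfer` and the record layout are hypothetical, and any numbers passed in would be the reader's own data\.

```python
from statistics import mean


def skill_transfer(roots, skill):
    """Return (root_delta, gap) in percentage points for one skill.

    roots: list of dicts with keys
        'skills'      - set of skill labels found among the root's subproblems
        'base_acc'    - root accuracy (%) before sub-only RL
        'trained_acc' - root accuracy (%) after sub-only RL
    Returns None when either group is empty, since no comparison is possible.
    """
    with_skill = [
        r["trained_acc"] - r["base_acc"] for r in roots if skill in r["skills"]
    ]
    without_skill = [
        r["trained_acc"] - r["base_acc"] for r in roots if skill not in r["skills"]
    ]
    if not with_skill or not without_skill:
        return None
    root_delta = mean(with_skill)
    # Assumed definition: conditional gain minus the gain of the complement.
    gap = root_delta - mean(without_skill)
    return root_delta, gap
```

Because a root can contain several skills, the same root contributes to multiple groups, which is why the conditional gains are not additive\. A negative gap for a skill means roots containing it improved less than the remaining roots, even if their own delta is positive\.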

## Appendix C Case Studies

We selected case studies from a held\-out diagnostic subset with saved rollout text\. The goal is not to show isolated anecdotes as the main evidence, but to make the mechanism behind the aggregate gains inspectable\. The favorable cases in Table [9](https://arxiv.org/html/2606.09883#A3.T9) share the same pattern: after subproblem\-only training, the model more often commits to a viable solution path, keeps key intermediate values stable, recovers from local mistakes, and closes the reasoning chain to the final answer\.

Table 9: Representative positive\-transfer cases from the saved rollout\-text subset\. Scores are pass@1 estimates over 64 sampled rollouts\.

![Refer to caption](https://arxiv.org/html/2606.09883v1/x9.png)Figure 8: Geometry case\-study gains under subproblem\-only RL\. Each row is a root problem with saved rollout text; points show the base and Sub\-only GRPO root pass@1 estimates over 64 sampled rollouts\. These are the geometry rows from Table [9](https://arxiv.org/html/2606.09883#A3.T9), emphasizing that local geometry practice can make a previously unreliable full\-solution route much more repeatable\.

#### Geometry case\.

For deepscaler\_00000900, the base model is not uniformly incapable: some rollouts find the correct route\. The failure mode is instability: the model alternates between coordinate geometry, special\-point guesses, and unfinished circle reasoning\. After subproblem\-only training, the successful route appears much more often: place the triangle at coordinates, compute midpoints, derive the two circumcircles, find the second intersection, and use the distance formula\. This supports the interpretation that subproblem practice does not merely teach a final answer format; it makes a multi\-step route more repeatable\.

#### Probability cases\.

For deepscaler\_00000631, the main error is a wrong sample space\. The trained model is more likely to notice the inconsistency and return to the three\-left\-shoes by three\-right\-shoes sample space\. For deepscaler\_00000312, the key improvement is intermediate\-count stability: base rollouts often start with the right summation but later disagree with themselves about the favorable count, while the trained model more often keeps the count 2052 through the final simplification\.

#### Boundary of the case evidence\.

The same saved case\-study set also contains apparent negative cases, but several of them are contaminated by equivalent\-answer or ground\-truth issues\. For example, a time\-answer case alternates between seconds and minutes\-seconds forms, and two number\-theory cases have likely answer conflicts in the stored ground truth\. We therefore use the case\-study section to explain positive transfer mechanisms and avoid claiming a clean taxonomy of negative transfer from these few text\-inspectable examples\.

## Appendix D Evaluation and Data Quality Notes

The main experiments use exact or verifier\-based final\-answer scoring\. This is appropriate for RLVR but introduces several practical quality issues that we track explicitly\.

#### Answer extraction\.

For math benchmarks, outputs are reduced to final answers using boxed\-answer extraction plus local normalization and equivalence checks\. Benchmark\-specific scorer logs record any blank targets or normalization failures before aggregate reporting\.
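A boxed\-answer extractor of the kind described here can be sketched as follows\. This is a generic illustration, not the scorer used in the paper: `extract_boxed` takes the last `\boxed{...}` in the output with brace matching, and `normalize` applies a few simple rewrites that are assumptions standing in for the unspecified local normalization and equivalence checks\.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)  # matching close brace reached
        chars.append(ch)
    return None  # braces never balanced, e.g. a truncated generation


def normalize(answer):
    """Illustrative normalization: trim, drop spaces and a trailing period."""
    cleaned = answer.strip().rstrip(".").replace(" ", "")
    return cleaned.replace("\\dfrac", "\\frac")
```

Returning `None` for missing or unbalanced boxes lets a scorer log the item as an extraction failure instead of silently counting it as a wrong answer\.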

#### Generation mismatch\.

Training rollouts and benchmark evaluations intentionally use different sampling regimes\. Training in the mixed solq95 family uses temperature 1\.0, top\-p 0\.95, top\-k = −1, and a 16k response cap\. The main benchmark evaluations use Qwen thinking\-mode decoding with temperature 0\.6, top\-p 0\.95, top\-k 20, min\-p 0, and longer generation caps\. This mismatch is reported because it affects output length and stability, especially on harder contest\-style benchmarks\.

#### No test\-time decomposition\.

All benchmark results are evaluated on the original prompts\. Decomposition guides, subproblem prompts, subproblem answers, and validation metadata are strictly training\-time artifacts\.

#### Existing assets\.

We use Qwen3\-1\.7B, DeepScaleR, verl, vLLM, lm\-eval\-harness, and public mathematical reasoning benchmarks including MATH500, AIME, AMC, OlympiadBench, and Omni\-MATH\. For each existing asset, we cite the original source and report the corresponding version, URL, license, and terms of use where available\. We use these assets only for research and evaluation purposes and follow their stated licenses and usage terms\.
