# ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
Source: [https://arxiv.org/html/2605.11290](https://arxiv.org/html/2605.11290)
Xueqi Cheng¹, Xugui Zhou², Tyler Derr³, Yushun Dong¹
¹Florida State University ²Louisiana State University ³Vanderbilt University
{xc25, yushun.dong}@fsu.edu; xuguizhou@lsu.edu; tyler.derr@vanderbilt.edu
###### Abstract
Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student's broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget-dependent cross-capability transfer, and additional budget often brings limited task-relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement-guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task-essential capabilities, then generates capability-targeted supervision on the fly, and finally uses an uncertainty-aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at [https://github.com/LabRAI/ReAD](https://github.com/LabRAI/ReAD).
## 1 Introduction
Knowledge distillation (KD) [hinton2015distilling](https://arxiv.org/html/2605.11290#bib.bib12) for large language models (LLMs) [zhao2023survey](https://arxiv.org/html/2605.11290#bib.bib41) has become an essential research direction for enhancing the accessibility and efficiency of Machine-Learning-as-a-Service (MLaaS) [cai2024llmaas](https://arxiv.org/html/2605.11290#bib.bib3); [xu2024survey_kd_llms](https://arxiv.org/html/2605.11290#bib.bib35); [yang2024survey_kdllm](https://arxiv.org/html/2605.11290#bib.bib37). Through knowledge distillation, a small LLM learns to imitate a large teacher LLM's outputs, enabling effective deployment under limited computational or financial resources [sanh2019distilbert](https://arxiv.org/html/2605.11290#bib.bib25); [jiao2020tinybert](https://arxiv.org/html/2605.11290#bib.bib14); [wang2020minilm](https://arxiv.org/html/2605.11290#bib.bib31); [li2021dynamickd](https://arxiv.org/html/2605.11290#bib.bib16); [liang2023less](https://arxiv.org/html/2605.11290#bib.bib18). Recent studies show that capability distillation, a specialization of knowledge distillation that focuses supervision on a target capability (e.g., instruction-following, reasoning, mathematics, or coding), can substantially improve the student's performance on downstream tasks that primarily depend on that capability, while achieving these gains at lower serving cost than the large teacher model [magister2023teaching](https://arxiv.org/html/2605.11290#bib.bib20); [shridhar2022distilling](https://arxiv.org/html/2605.11290#bib.bib27); [xu2023wizardlm](https://arxiv.org/html/2605.11290#bib.bib33); [yue2024distilling](https://arxiv.org/html/2605.11290#bib.bib38).
Despite rapid progress in capability distillation, most existing methods still treat capabilities as if they can be improved independently [taori2023stanford](https://arxiv.org/html/2605.11290#bib.bib30); [chiang2023vicuna](https://arxiv.org/html/2605.11290#bib.bib6); [zhang2024knowledgeable](https://arxiv.org/html/2605.11290#bib.bib40), even though empirical evidence suggests that optimizing one capability often triggers broad, unintended shifts in the model's overall capability profile [zhong2024revisiting](https://arxiv.org/html/2605.11290#bib.bib42); [fang2025kddd_survey](https://arxiv.org/html/2605.11290#bib.bib10); [cloud2025subliminal](https://arxiv.org/html/2605.11290#bib.bib8). This mismatch is especially consequential in the budgeted setting: under a fixed token budget, the goal is to maximize downstream task utility, which depends not only on the targeted capability but also on other capabilities the task may implicitly require. If cross-capability interactions are ignored, allocating more tokens to a single target can become inefficient in two ways: the extra improvement on the target can shrink as the budget grows, and the same updates may reduce performance on other tasks that require non-target capabilities. In this case, additional tokens are effectively spent on updates that provide little task-relevant benefit and may even increase deployment risk in MLaaS, which we refer to as *budget waste*.

To systematically examine this gap, we take a first step toward measuring and understanding these interactions through a controlled empirical study. Concretely, we repeatedly distill the student under multiple token budgets, each time allocating the entire budget to one target capability drawn from a set of widely recognized core LLM capabilities: General Knowledge (General), Reasoning, Math, Code, Tool Use, Long-Context Understanding (LCU), Steerability, and Multilinguality. After each run, we evaluate the resulting student on the full benchmark suite in Table [3](https://arxiv.org/html/2605.11290#A1.T3) and record the score change for every capability. This yields a budget-dependent capability transfer matrix in which the diagonal entries capture on-target improvement and the off-diagonal entries capture how training toward one capability redistributes performance across the others. Across budgets, we observe two consistent patterns: (i) capability-specific distillation induces systematic, budget-dependent transfer to other capabilities rather than isolated improvements, and (ii) increasing the budget for a single target produces smaller additional target gains while making the average harm to non-target capabilities more pronounced, which together explain why naively scaling the budget for one capability can be inefficient.
Building on these observed insights, we propose ReAD, a Reinforcement-guided cApability Distillation framework for large language models. Overall, ReAD combines on-the-fly capability-targeted data generation with token-level knowledge distillation, and uses a lightweight contextual bandit to adaptively allocate a fixed budget across interdependent capabilities. Specifically, ReAD first infers a task requirement vector that identifies which capabilities are essential for improving the downstream utility and treats degradations on these dimensions as harmful spillover. It then allocates distillation effort to maximize a proxy reward that favors capability gains aligned with the task requirements while penalizing spillover and budget consumption. In each interval, ReAD samples capability-labeled prompts according to the current allocation, queries the teacher to obtain supervision, updates the student with a standard distillation loss, and uses the resulting capability-profile change to update an uncertainty-aware UCB allocation rule. Extensive experiments demonstrate that ReAD strengthens task-relevant capabilities and reduces wasted budget on low-utility capability updates under budget constraints.
To sum up, this paper makes the following key contributions:
- Empirical study of cross-capability interactions. We show that distilling toward a single capability consistently changes other capabilities, revealing systematic cross-capability interactions that are often overlooked.
- Budgeted capability distillation formulation. We formulate capability distillation as allocating a fixed budget across multiple interacting capabilities, providing an objective for improving utility while controlling side effects.
- ReAD: reinforcement-guided capability distillation. ReAD infers task-relevant capabilities, generates capability-targeted distillation data with controllable style and difficulty, and uses an uncertainty-aware contextual bandit to allocate the budget across capabilities.
- Theoretical analysis of capability interdependence. We explain cross-capability interactions and diminishing returns in capability distillation.
## 2 Preliminary
### 2.1 Notation
In this paper, we study budgeted capability distillation from a large teacher model $T$ to a smaller student model $S$, with total training-token budget $B$. Let $\mathcal{C}=\{c_1,\ldots,c_{|\mathcal{C}|}\}$ denote a set of measurable capabilities, and let $s_k(M)$ be model $M$'s benchmark score on capability $c_k$. A distillation strategy $\pi$ specifies how data are generated, how the teacher is queried, and how tokens are allocated across capabilities, assigning budgets $\{b_k\}$ with $\sum_k b_k \le B$ and producing a distilled student $S_\pi$.
### 2.2 Intuition and Formalization
Existing work on LLM capability distillation often assumes that targeted capabilities remain independent under compression. However, distilling for one capability can inadvertently shift others, leading to inefficient or even counterproductive use of a limited budget [zhong2024revisiting](https://arxiv.org/html/2605.11290#bib.bib42); [fang2025kddd_survey](https://arxiv.org/html/2605.11290#bib.bib10); [cloud2025subliminal](https://arxiv.org/html/2605.11290#bib.bib8). To study these interactions, we evaluate eight core LLM capabilities using commonly adopted benchmarks, with the full benchmark-to-metric mapping provided in Appendix [A](https://arxiv.org/html/2605.11290#A1), Table [3](https://arxiv.org/html/2605.11290#A1.T3). These capabilities span major dimensions of LLM behavior and motivate our unified *capability distillation* formulation.
###### Definition 2.1 (Capability distillation).
Given a teacher model $T$, an initial student model $S_0$, and capability-specific data distributions $\{\mathcal{D}_c\}_{c\in\mathcal{C}}$, capability distillation allocates a limited budget $B$ across capabilities to improve task-relevant performance under shared representations by optimizing a weighted mixture of distillation objectives:

$$\min_{S,\,\mathbf{w}} \; \mathbb{E}_{x\sim\mathcal{D}_{\mathbf{w}}}\big[\ell\big(S(x),T(x)\big)\big] \quad \text{s.t.} \quad \mathbf{w}\in\Delta_{|\mathcal{C}|},\ \mathrm{cost}(S;S_0)\le B, \tag{1}$$

where $\mathcal{D}_{\mathbf{w}} := \sum_{c\in\mathcal{C}} w_c \mathcal{D}_c$, and $\mathrm{cost}(S;S_0)$ measures the distillation token budget consumed to obtain $S$ from $S_0$. Here $\mathbf{w}=(w_c)_{c\in\mathcal{C}}$ lies in the probability simplex

$$\Delta_{|\mathcal{C}|} = \Big\{\mathbf{w}\in\mathbb{R}^{|\mathcal{C}|} \;\Big|\; w_c \ge 0,\ \sum_{c\in\mathcal{C}} w_c = 1\Big\},$$

which specifies the allocation of training effort across capabilities, and $\ell$ denotes a standard distillation loss, such as token-level cross-entropy or logit-matching between the student and teacher outputs.
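To make the objective concrete, here is a minimal Python sketch of sampling from the mixture $\mathcal{D}_{\mathbf{w}}$ and spending the token budget; `pools`, `teacher`, and `student.step` are hypothetical interfaces standing in for the capability-specific data distributions and training loop, not the paper's released code.

```python
import random

def budgeted_distill(pools, w, budget_tokens, teacher, student):
    """Minimal sketch of Eq. (1): sample x ~ D_w, distill on the
    teacher's output, and stop once the token budget B is consumed.

    `pools` maps capability names to prompt lists; `w` maps capability
    names to nonnegative mixture weights summing to 1 (a point on the
    simplex). `teacher(x)` and `student.step(x, y)` are illustrative
    interfaces; `student.step` returns the tokens consumed by the update.
    """
    caps = list(pools)
    weights = [w[c] for c in caps]
    spent = 0
    while spent < budget_tokens:
        cap = random.choices(caps, weights=weights, k=1)[0]
        x = random.choice(pools[cap])   # x ~ D_c for the sampled capability
        y = teacher(x)                  # teacher supervision T(x)
        spent += student.step(x, y)     # one distillation update
    return student
```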
We therefore conduct an empirical study to identify and quantify how allocating a limited distillation budget to specific capabilities induces systematic performance changes in other capabilities, thereby exposing the intrinsic interdependence structure among capabilities in LLMs.
###### Definition 2.2 (Capability interdependence).
Let $S_0$ be an initial student model, and let $S(\mathbf{w},B)$ denote the student obtained after capability distillation under allocation $\mathbf{w}\in\Delta_{|\mathcal{C}|}$ and budget $B$. For each capability $c_i$, define the one-hot allocation vector $\mathbf{w}^{(i)}\in\Delta_{|\mathcal{C}|}$ by $w^{(i)}_i = 1$ and $w^{(i)}_j = 0$ for all $j \neq i$. The *capability interdependence* under capability distillation is characterized by the capability transfer matrix

$$\mathbf{T}_{ij}(B) := s_j\big(S(\mathbf{w}^{(i)},B)\big) - s_j(S_0),$$

where $\mathbf{T}(B)\in\mathbb{R}^{|\mathcal{C}|\times|\mathcal{C}|}$ and $\mathbf{T}_{ij}(B)$ quantifies the effect of distilling toward capability $c_i$ on the performance of capability $c_j$. Non-zero off-diagonal entries indicate interdependence between capabilities under shared representations.
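The transfer matrix itself is straightforward to compute from benchmark scores; a minimal sketch follows, assuming `scores_after` stacks the post-distillation capability profiles, one row per one-hot allocation $\mathbf{w}^{(i)}$.

```python
import numpy as np

def transfer_matrix(scores_after, scores_before):
    """Capability transfer matrix T_ij(B) = s_j(S(w^(i), B)) - s_j(S_0).

    scores_after:  (|C|, |C|) array; row i is the full capability profile
                   of the student distilled with one-hot allocation w^(i).
    scores_before: (|C|,) array; the profile of the initial student S_0.
    """
    after = np.asarray(scores_after, dtype=float)
    before = np.asarray(scores_before, dtype=float)
    return after - before[None, :]   # broadcast baseline across rows
```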
### 2.3 Exploratory Study and Problem Formulation
In this section, we analyze empirical capability interdependence through controlled distillation experiments. Across all experiments, we distill Llama-3.3-70B-Instruct into Llama-3.1-8B-Instruct under fixed token budgets, using the target capabilities in Appendix Table [3](https://arxiv.org/html/2605.11290#A1.T3) and several representative capability-distillation strategies. For each strategy $\pi$ and budget $B$, we evaluate the student on all benchmarks to form a capability transfer matrix, capturing both target-capability gains and non-target changes. We normalize all benchmark scores to $[0,100]$; details are provided in Appendix [B](https://arxiv.org/html/2605.11290#A2).

Figure 1: Cross-capability transfer under single-capability distillation ((a) small token budget; (b) large token budget). Larger budgets sharpen target-capability gains but expose stronger negative transfer.

Observation 1: Distilling a specific capability redistributes performance across other capabilities in a budget-dependent manner. Figure [1](https://arxiv.org/html/2605.11290#S2.F1) visualizes the capability transfer matrices under different token budgets. Here, the off-diagonal mass is consistently non-zero, meaning that optimizing a single capability alters other capabilities rather than leaving them unchanged. Moreover, the structure varies with budget: at small budgets, changes are generally weak and diffuse, whereas at large budgets the diagonal strengthens but negative off-diagonal entries become more visible for specific non-targets. Together, these results indicate that capability distillation induces a structured redistribution of performance.
Figure 2: Budget waste in capability distillation ((a) diminishing returns; (b) rising spillover). Extra tokens yield smaller target gains while increasing harmful spillover to non-target capabilities.

Observation 2: Capability distillation exhibits substantial budget waste due to diminishing returns and negative spillover. Under a fixed distillation budget, additional tokens for a target capability should ideally provide meaningful target improvement without degrading other capabilities needed downstream. We test this using two diagnostics derived from the capability transfer matrix $\mathbf{T}(B)$. First, *diminishing returns* measures whether marginal target gains shrink as the budget increases, comparing the 20M$\rightarrow$80M gain against the 80M$\rightarrow$150M gain for the same target. Second, *negative spillover* measures the drop on non-target capabilities as more budget is assigned to the target.

Figure [2(a)](https://arxiv.org/html/2605.11290#S2.F2.sf1) shows that, for the strongest-improving targets, the 20M$\rightarrow$80M gain is consistently larger than the 80M$\rightarrow$150M gain, indicating target-capability saturation. Figure [2(b)](https://arxiv.org/html/2605.11290#S2.F2.sf2) further shows that collateral harm to non-target capabilities grows with budget. Together, these trends reveal budget waste: beyond a capability-specific knee point, extra tokens yield limited target benefit while amplifying cross-capability degradation, motivating allocation strategies that explicitly trade off target gains against spillover.
Overall, these observations suggest that effective capability distillation under a fixed budget requires jointly reasoning about task-relevant capabilities, cross-capability interactions, and budget allocation. We formalize this as a budget allocation problem across interdependent capabilities:
###### Problem 2.3 (Budget allocation for interdependent capability distillation).
Given a teacher model $T$, an initial student model $S_0$, a downstream task $\tau$, and a total distillation budget $B$, consider a sequential distillation process over steps $t=1,\ldots,T$ with cumulative cost not exceeding $B$. At each step $t$, a capability allocation vector $\mathbf{w}_t\in\Delta_{|\mathcal{C}|}$ is selected and applied to update the student, yielding $S_t$. Let $U_\tau(\cdot)$ be the task-dependent utility function; our goal is to determine an allocation policy $\{\mathbf{w}_t\}_{t=1}^{T}$ that maximizes the performance of the final student model:

$$\max_{\{\mathbf{w}_t\}_{t=1}^{T}} \; U_\tau(S_T) \quad \text{s.t.} \quad \sum_{t=1}^{T} \mathrm{cost}(S_{t-1}\rightarrow S_t) \le B. \tag{2}$$
## 3 Methodology
ReAD treats Problem [2.3](https://arxiv.org/html/2605.11290#S2.Thmtheorem3) as state-dependent budget allocation over capabilities. At each step, it estimates task requirements, generates capability-targeted teacher supervision, and uses an uncertainty-aware contextual bandit to choose the next token allocation. As the student profile changes, ReAD can shift budget away from saturated or high-spillover capabilities.
### 3.1 Identifying Task-Essential Capabilities
#### Task requirement vector.
For each downstream task $\tau$, ReAD constructs a task card $\mathcal{D}^{\mathrm{spec}}_\tau$ from non-test data, containing a short task description, input/output format, evaluation target, and a few representative exemplars, with the test split reserved only for final reporting. From this task card, ReAD estimates a requirement vector $\mathbf{r}_\tau\in\Delta_{|\mathcal{C}|}$, where $r_{\tau,c}$ indicates how useful capability $c$ is for improving the task utility $U_\tau(\cdot)$. We use $\mathbf{r}_\tau$ as an actionable allocation prior: high-mass capabilities are prioritized for improvement, and drops on these capabilities are treated as harmful spillover. Since token allocations cannot be negative, $\mathbf{r}_\tau$ is constrained to the simplex; negative sensitivities are captured instead by signed capability changes and the spillover penalty in Section [3.3](https://arxiv.org/html/2605.11290#S3.SS3).
#### Requirement identifier.
We learn a lightweight identifier $g_\phi$ that maps the task card to the requirement vector $\mathbf{r}_\tau = g_\phi(\mathcal{D}^{\mathrm{spec}}_\tau)\in\Delta_{|\mathcal{C}|}$. The identifier is a small Transformer encoder over the task card, followed by a two-layer MLP and a softmax head. Since $\mathbf{r}_\tau$ is not directly observed, we supervise it through low-budget interventions in capability space.
#### Local supervision signal.
Let $\mathbf{s}(S)\in\mathbb{R}^{|\mathcal{C}|}$ denote the measured capability profile of a student $S$, and write $U_\tau(S) = F_\tau(\mathbf{s}(S))$. For a small intervention that changes the profile from $\mathbf{s}$ to $\mathbf{s}+\Delta\mathbf{s}$,

$$F_\tau(\mathbf{s}+\Delta\mathbf{s}) = F_\tau(\mathbf{s}) + \nabla_{\mathbf{s}} F_\tau(\mathbf{s})^\top \Delta\mathbf{s} + o(\|\Delta\mathbf{s}\|), \tag{3}$$

so the utility change is locally approximated by

$$\Delta U_\tau = U_\tau(S') - U_\tau(S) \approx \nabla_{\mathbf{s}} F_\tau(\mathbf{s})^\top \Delta\mathbf{s}. \tag{4}$$

Thus, $\mathbf{r}_\tau$ is trained as a simplex-constrained surrogate for the beneficial part of the local utility sensitivity, e.g., $\mathbf{r}_\tau \approx \Pi_\Delta([\nabla_{\mathbf{s}} F_\tau(\mathbf{s})]_+)$. This does not assume that the true sensitivity has no negative coordinates; it only converts the sensitivity into a nonnegative budget-allocation prior.
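As an illustration of the projection $\Pi_\Delta([\nabla_{\mathbf{s}}F_\tau]_+)$, a simple clip-and-normalize suffices when only the direction of the prior matters; the paper does not pin down the exact projection, so this is one plausible instantiation.

```python
import numpy as np

def requirement_prior(sensitivity, eps=1e-8):
    """Convert a signed utility sensitivity grad_s F_tau into a
    nonnegative allocation prior on the simplex, one illustrative
    choice for r_tau ~= Pi_Delta([grad]_+).
    """
    pos = np.clip(np.asarray(sensitivity, dtype=float), 0.0, None)
    total = pos.sum()
    if total < eps:                                # no beneficial direction
        return np.full_like(pos, 1.0 / len(pos))   # fall back to uniform
    return pos / total
```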
#### Offline pretraining.
We construct an intervention set $\mathcal{Z} = \{(\mathcal{D}^{\mathrm{spec}}_\tau, \Delta\mathbf{s}^{(m)}_\tau, \Delta U^{(m)}_\tau)\}_{\tau,m}$ from auxiliary tasks and non-test splits. For each task $\tau$, we sample sparse allocation vectors $\mathbf{w}^{(m)}\in\Delta_{|\mathcal{C}|}$ by choosing a support set $\mathcal{I}^{(m)}\subseteq\mathcal{C}$, sampling nonzero weights from a Dirichlet distribution, and renormalizing them. A short probe distillation under a small budget $b_{\mathrm{probe}}$ produces an intervened student $S^{(m)}$, from which we measure

$$\Delta U^{(m)}_\tau := U_\tau(S^{(m)}) - U_\tau(S_0), \qquad \Delta\mathbf{s}^{(m)}_\tau := \mathbf{s}(S^{(m)}) - \mathbf{s}(S_0). \tag{5}$$

We pretrain $g_\phi$ by predicting the observed utility change from the induced capability change:

$$\min_\phi \sum_{(\tau,m)} \Big(\Delta U^{(m)}_\tau - g_\phi(\mathcal{D}^{\mathrm{spec}}_\tau)^\top \Delta\mathbf{s}^{(m)}_\tau\Big)^2 + \lambda_{\mathrm{ent}}\,\mathcal{H}\big(g_\phi(\mathcal{D}^{\mathrm{spec}}_\tau)\big). \tag{6}$$

The positive entropy penalty discourages near-uniform requirement vectors and makes the allocation prior more discriminative. At deployment time, ReAD computes $\mathbf{r}_\tau$ once and reuses it throughout the distillation run. Since the identifier is trained once and reused across tasks, it is not counted as part of the per-run student distillation budget.
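A compact PyTorch sketch of the pretraining objective in Eq. (6) follows; tensor shapes and the entropy weight `lam_ent` are illustrative assumptions, not values from the paper.

```python
import torch

def identifier_loss(r_pred, delta_s, delta_u, lam_ent=0.01):
    """Objective of Eq. (6) for one batch of interventions.

    r_pred:  (B, |C|) softmax outputs g_phi(D_tau^spec), on the simplex.
    delta_s: (B, |C|) measured capability changes from probe runs.
    delta_u: (B,)     observed utility changes.
    The added (positive) entropy term penalizes near-uniform predictions.
    """
    pred_gain = (r_pred * delta_s).sum(dim=-1)              # g_phi^T Δs
    sq_err = (delta_u - pred_gain).pow(2).mean()
    entropy = -(r_pred * (r_pred + 1e-12).log()).sum(dim=-1).mean()
    return sq_err + lam_ent * entropy
```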
### 3.2 On-the-Fly Capability-Targeted Data Generation
#### Capability-conditioned templates.
Given $\mathbf{r}_\tau$, ReAD generates training examples on the fly from a capability-labeled template library. For each capability $c\in\mathcal{C}$, a template set $\mathcal{P}_c$ specifies an instruction scaffold, typed slots, and output-format constraints. Slot values are filled using deterministic seeds. The generator controls the form of teacher supervision.
#### Difficulty scoring.
For a prompt $x$ in capability $c$, we compute a deterministic difficulty score

$$d_c(x) = \frac{1}{|\mathcal{F}_c|} \sum_{f\in\mathcal{F}_c} \frac{f(x) - \min_{x'\in\mathcal{Q}_c} f(x')}{\max_{x'\in\mathcal{Q}_c} f(x') - \min_{x'\in\mathcal{Q}_c} f(x') + \epsilon}, \tag{7}$$

where $\mathcal{F}_c$ is the set of capability-specific control factors and $\mathcal{Q}_c$ is a calibration prompt pool for capability $c$. For example, reasoning templates use factors such as the number of constraints, reasoning hops, and symbolic complexity; code templates use the number of required functions and tests; steerability templates use the number of rules and schema fields. We split each capability's calibration pool into easy, medium, and hard buckets by tertiles of $d_c(x)$.
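The sketch below implements the min-max normalized factor average of Eq. (7) and the tertile bucketing; the factor callables stand in for capability-specific counters such as reasoning hops or unit tests.

```python
import numpy as np

def difficulty_score(x, factors, calibration_pool, eps=1e-8):
    """Deterministic difficulty d_c(x) of Eq. (7): each control factor f
    is min-max normalized against the calibration pool Q_c, then the
    normalized values are averaged. `factors` is a list of callables
    f(prompt) -> float.
    """
    scores = []
    for f in factors:
        vals = np.array([f(xp) for xp in calibration_pool], dtype=float)
        lo, hi = vals.min(), vals.max()
        scores.append((f(x) - lo) / (hi - lo + eps))
    return float(np.mean(scores))

def difficulty_buckets(pool, factors):
    """Split a calibration pool into easy/medium/hard by tertiles of d_c."""
    d = np.array([difficulty_score(x, factors, pool) for x in pool])
    t1, t2 = np.quantile(d, [1 / 3, 2 / 3])
    return ([x for x, s in zip(pool, d) if s <= t1],
            [x for x, s in zip(pool, d) if t1 < s <= t2],
            [x for x, s in zip(pool, d) if s > t2])
```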
#### Curriculum and sampling.
At decision step $t$, ReAD samples a capability label according to the selected allocation $\mathbf{w}_t$, instantiates a prompt from the corresponding template set, queries the teacher $T$ for a completion $y$, and immediately distills on $(x,y)$. The curriculum only adjusts the within-capability difficulty mix: early steps emphasize easy and medium examples, while later steps increase the probability of hard examples. It does not choose the capability allocation. We also track recent template frequencies and cap each template's share so that no single scaffold dominates.
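A minimal sampler capturing this loop might look as follows; the linear easy-to-hard schedule is an illustrative choice rather than the paper's exact curriculum, and `buckets` is assumed to hold the per-capability easy/medium/hard pools from the difficulty scoring step.

```python
import random

def sample_prompt(step, total_steps, w_t, buckets):
    """Curriculum sampler sketch: pick a capability from the bandit
    allocation w_t, then pick a difficulty bucket with a mix that
    shifts from easy toward hard as training progresses.
    """
    caps = list(buckets)
    cap = random.choices(caps, weights=[w_t[c] for c in caps], k=1)[0]
    p_hard = step / max(total_steps - 1, 1)          # ramps 0 -> 1
    mix = [0.5 * (1 - p_hard), 0.5, 0.5 * p_hard]    # easy / medium / hard
    easy, med, hard = buckets[cap]
    pool = random.choices([easy, med, hard], weights=mix, k=1)[0]
    return cap, random.choice(pool)
```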
### 3.3 Contextual Bandit for Capability Allocation
#### Decision loop.
ReAD runs for $T_{\mathrm{step}}=20$ decision steps. Thus, the bandit update, data resampling, and proxy refresh occur every 1.0M tokens in the 20M setting and every 7.5M tokens in the 150M setting. At step $t$, the context summarizes the task requirement, current student profile, remaining budget, and recent allocation history:

$$\mathbf{x}_t = [\mathbf{r}_\tau;\ \mathbf{s}^{\mathrm{probe}}(S_t);\ b_t;\ \boldsymbol{\rho}_t], \tag{8}$$

where $\boldsymbol{\rho}_t$ stores a short record of recent allocations and observed gains. The profile $\mathbf{s}^{\mathrm{probe}}(S_t)$ is computed with a small fixed monitoring suite, while full held-out benchmark evaluation is used only for final reporting and coarse calibration.
#### Candidate allocations and distillation.
At each step, ReAD selects an allocation $\mathbf{w}_t$ from a finite candidate set $\mathcal{A}(\tau)$ consisting of local perturbations around the previous allocation and sparse actions concentrated on the top-$k$ capabilities under $\mathbf{r}_\tau$. Weights are chosen from a fixed grid and renormalized to sum to one. Given $\mathbf{w}_t$, ReAD samples capability-targeted data and updates the student by minimizing $\mathcal{L}_{\mathrm{distill}}(\theta_t) = -\mathbb{E}_{(x,y)} \sum_{j=1}^{|y|} \log p_{\theta_t}(y_j \mid x, y_{<j})$ to produce $S_{t+1}$.
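In code, this loss is a standard completion-only negative log-likelihood; the sketch below assumes the student produces logits over the concatenated prompt and teacher completion, with `target_ids` a 1-D LongTensor of token ids.

```python
import torch
import torch.nn.functional as F

def distill_loss(logits, target_ids, prompt_len):
    """Token-level distillation loss of Section 3.3:
    -sum_j log p_theta(y_j | x, y_<j), summed over completion tokens only.

    logits:     (seq_len, vocab) student logits for the sequence (x, y).
    target_ids: (seq_len,) token ids of the same sequence.
    prompt_len: number of prompt tokens x (excluded from the loss).
    """
    pred = logits[prompt_len - 1:-1]   # positions whose next token is in y
    tgt = target_ids[prompt_len:]      # the teacher completion tokens
    return F.cross_entropy(pred, tgt, reduction="sum")
```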
#### Proxy reward.
After the update, ReAD measures the probe-profile change $\Delta\mathbf{s}^{\mathrm{probe}}_t = \mathbf{s}^{\mathrm{probe}}(S_{t+1}) - \mathbf{s}^{\mathrm{probe}}(S_t)$ and forms

$$\widehat{R}_t = \mathbf{r}_\tau^\top \Delta\mathbf{s}^{\mathrm{probe}}_t - \beta\,\mathrm{Spill}_t - \lambda\,\mathrm{cost}_t, \qquad \mathrm{Spill}_t = \sum_{c\in\mathcal{C}_{\mathrm{ess}}} r_{\tau,c}\,\big[-\Delta s^{\mathrm{probe}}_{t,c}\big]_+. \tag{9}$$

Here $\mathcal{C}_{\mathrm{ess}}$ contains the top-$k$ capabilities under $\mathbf{r}_\tau$, and $\mathrm{cost}_t$ is the token budget consumed at the step. The first term rewards task-aligned capability gains, the second penalizes regressions on task-essential capabilities, and the third enforces budget awareness.
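Eq. (9) reduces to a few vector operations; in the sketch below, `beta` and `lam` are illustrative penalty weights rather than the paper's tuned values.

```python
import numpy as np

def proxy_reward(r_tau, delta_s, cost_tokens, ess_idx, beta=1.0, lam=1e-8):
    """Step-level proxy reward of Eq. (9).

    r_tau:       (|C|,) requirement vector.
    delta_s:     (|C|,) probe-profile change after the update.
    ess_idx:     indices of the top-k task-essential capabilities C_ess.
    cost_tokens: token budget consumed at this step.
    """
    r_tau = np.asarray(r_tau, dtype=float)
    delta_s = np.asarray(delta_s, dtype=float)
    gain = float(np.dot(r_tau, delta_s))            # task-aligned gain
    drop = np.clip(-delta_s[ess_idx], 0.0, None)    # [-Δs]_+ on C_ess
    spill = float(np.dot(r_tau[ess_idx], drop))     # weighted spillover
    return gain - beta * spill - lam * cost_tokens
```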
#### Reward model and UCB allocation.
We append $(\mathbf{x}_t,\mathbf{w}_t,\widehat{R}_t)$ to a history buffer and train an ensemble of $J$ two-layer MLP reward regressors $\{h_{\eta_j}\}_{j=1}^{J}$. Each regressor maps $(\mathbf{x},\mathbf{w})$ to a scalar next-step reward and is trained on a bootstrap resample of the 10% profiling split plus logged transitions from earlier steps, yielding about 3.2k–4.9k examples per task in our experiments. For a candidate allocation $\mathbf{w}$, the ensemble mean and uncertainty are

$$\mu(\mathbf{x}_t,\mathbf{w}) = \frac{1}{J}\sum_{j=1}^{J} h_{\eta_j}(\mathbf{x}_t,\mathbf{w}), \qquad \sigma(\mathbf{x}_t,\mathbf{w}) = \sqrt{\frac{1}{J-1}\sum_{j=1}^{J}\big(h_{\eta_j}(\mathbf{x}_t,\mathbf{w}) - \mu(\mathbf{x}_t,\mathbf{w})\big)^2}. \tag{10}$$

ReAD then selects the next allocation using an upper-confidence rule,

$$\mathbf{w}_{t+1} = \arg\max_{\mathbf{w}\in\mathcal{A}(\tau)} \mu(\mathbf{x}_{t+1},\mathbf{w}) + \kappa\,\sigma(\mathbf{x}_{t+1},\mathbf{w}), \tag{11}$$

with $b_{t+1} = b_t - \mathrm{cost}_t$. At coarse development checkpoints, we fit an affine calibration from cumulative proxy reward to observed utility change and reduce exploration if the development utility stagnates. Unlike static mixture-regression approaches such as RegMix [liu2024regmix](https://arxiv.org/html/2605.11290#bib.bib19), ReAD repeatedly re-estimates the best next allocation from the current student state and remaining budget.
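The selection rule of Eqs. (10)-(11) can be sketched as follows, assuming each ensemble member exposes a scikit-learn-style `predict` method; this interface is an assumption for illustration.

```python
import numpy as np

def ucb_select(candidates, context, ensemble, kappa=1.0):
    """Pick the allocation maximizing ensemble mean + kappa * std.

    candidates: iterable of (|C|,) allocation vectors from A(tau).
    context:    1-D context vector x_t.
    ensemble:   list of fitted regressors with .predict((1, d)) -> (1,).
    """
    best_w, best_score = None, -np.inf
    for w in candidates:
        feats = np.concatenate([context, w])[None, :]      # (1, d) input
        preds = np.array([h.predict(feats)[0] for h in ensemble])
        mu, sigma = preds.mean(), preds.std(ddof=1)        # Eq. (10)
        score = mu + kappa * sigma                         # Eq. (11)
        if score > best_score:
            best_w, best_score = w, score
    return best_w
```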
## 4 Theoretical Analysis
This section provides a local theoretical account of ReAD's allocation problem. We characterize cross-capability transfer under shared representations, show when diminishing returns make continued allocation wasteful, and motivate uncertainty-aware allocation under noisy reward estimates.
### 4.1 Local Cross-Capability Transfer
Let $\theta_t\in\mathbb{R}^p$ denote the parameters of the student $S_t$, and let $\Delta\theta_t = \theta_t - \theta_{t-1}$. We analyze one decision step, which is the regime in which ReAD re-estimates its allocation.
###### Assumption 4.1 (Mixture-structured update).
At interval $t$, given an allocation vector $\mathbf{w}_t\in\Delta_{|\mathcal{C}|}$, the expected update admits a local mixture approximation $\mathbb{E}[\Delta\theta_t \mid \mathbf{w}_t,\theta_{t-1}] \approx \sum_{c\in\mathcal{C}} w_{t,c}\, d_c(\theta_{t-1})$, where $d_c(\theta)$ is the capability-conditioned local update direction.
###### Assumption 4.2 (Local capability readout).
The measured capability profile $\mathbf{s}(S)$ is represented locally as a differentiable function $\tilde{\mathbf{s}}(\theta)$. Near $\theta_{t-1}$,

$$\Delta s_{t,c} = [\mathbf{s}(S_t)]_c - [\mathbf{s}(S_{t-1})]_c \approx \nabla_\theta[\tilde{\mathbf{s}}(\theta_{t-1})]_c^\top \Delta\theta_t. \tag{12}$$
###### Proposition 4.3 (Local transfer decomposition).
Under the two local approximations above, the expected change in capability $c$ satisfies

$$\Delta s_{t,c} \approx \sum_{c'\in\mathcal{C}} w_{t,c'}\,\Gamma_{c,c'}(\theta_{t-1}), \qquad \Gamma_{c,c'}(\theta) := \nabla_\theta[\tilde{\mathbf{s}}(\theta)]_c^\top d_{c'}(\theta). \tag{13}$$

Interpretation. The matrix $\Gamma$ is the local analogue of the empirical capability-transfer matrix. Diagonal terms capture intended gains, while off-diagonal terms capture transfer or interference. Off-diagonal terms are nonzero whenever the update direction for capability $c'$ is not orthogonal to the readout direction of capability $c$, which is expected under shared representations. This is the decision-step explanation for why allocating budget to one capability can alter others.
### 4.2 Budget Waste from Diminishing Returns
The previous subsection explains spillover. We now isolate diminishing returns with a simpler cumulative-budget proxy. Let $b_c \ge 0$ be the total number of distillation tokens assigned to capability $c$, with $\sum_c b_c \le B$, and let $\mathbf{r}_\tau\in\Delta_{|\mathcal{C}|}$ be the task requirement vector.
###### Assumption 4.4 (Concave task-aligned gains).
For task $\tau$ and capability $c$, there is a nondecreasing concave gain function $G_{\tau,c}:\mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}$ such that $\mathbb{E}[U_\tau(S) - U_\tau(S_0)] \approx \sum_{c\in\mathcal{C}} r_{\tau,c}\, G_{\tau,c}(b_c)$, where $\sum_{c\in\mathcal{C}} b_c \le B$.
Consider a restricted distillation strategy that allocates budget only to a subset $\mathcal{K}\subseteq\mathcal{C}$:

$$\max_{\{b_c\ge 0\}_{c\in\mathcal{K}}} \sum_{c\in\mathcal{K}} r_{\tau,c}\, G_{\tau,c}(b_c) \quad \text{s.t.} \quad \sum_{c\in\mathcal{K}} b_c \le B. \tag{14}$$
###### Proposition 4.5 (Weighted marginal gain criterion).
Any optimum $\{b_c^\star\}_{c\in\mathcal{K}}$ of Eq. ([14](https://arxiv.org/html/2605.11290#S4.E14)) equalizes weighted marginal gains among active capabilities: there exists $\nu^\star \ge 0$ such that

$$r_{\tau,c}\, G'_{\tau,c}(b_c^\star) \le \nu^\star \quad \forall c\in\mathcal{K}, \qquad r_{\tau,c}\, G'_{\tau,c}(b_c^\star) = \nu^\star \quad \text{if } b_c^\star > 0. \tag{15}$$

Moreover, in the single-capability case targeting $\bar{c}$, shifting an infinitesimal budget from $\bar{c}$ to another task-relevant capability $c'$ improves the proxy objective whenever $r_{\tau,c'}\, G'_{\tau,c'}(0) > r_{\tau,\bar{c}}\, G'_{\tau,\bar{c}}(b_{\bar{c}})$.
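For a concrete instance of Proposition 4.5, take illustrative gains $G_{\tau,c}(b) = a_c \log(1+b)$; the KKT condition then yields a closed form per capability, and a bisection on the multiplier $\nu$ enforces the budget. This water-filling sketch is an illustration under that assumed gain family, not part of the paper.

```python
import numpy as np

def waterfill(r, a, B, iters=60):
    """Solve Eq. (14) for illustrative gains G_c(b) = a_c * log(1 + b).

    The weighted marginal gain is r_c * a_c / (1 + b), so Eq. (15) gives
    b_c = max(0, r_c * a_c / nu - 1); we bisect on nu until the active
    allocations exactly exhaust the budget B.
    """
    r, a = np.asarray(r, float), np.asarray(a, float)
    lo, hi = 1e-9, float(np.max(r * a))        # bracket the multiplier
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        b = np.clip(r * a / nu - 1.0, 0.0, None)
        if b.sum() > B:
            lo = nu                            # spending too much: raise nu
        else:
            hi = nu
    return np.clip(r * a / hi - 1.0, 0.0, None)
```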
Implication for ReAD. This analysis formalizes budget waste: once the target capability saturates, additional tokens can have lower task-aligned marginal value than tokens assigned elsewhere. The actual ReAD reward in Eq. ([9](https://arxiv.org/html/2605.11290#S3.E9)) is richer than this proxy because it also penalizes observed spillover. Thus, the theory motivates adaptive allocation, while the allocator directly optimizes a step-level reward that accounts for both diminishing returns and cross-capability degradation.
### 4.3 Why Uncertainty-Aware Allocation Helps
At step $t$, ReAD chooses from a finite action set $\mathcal{A}(\tau)$ using the ensemble mean $\mu(\mathbf{x}_t,\mathbf{w})$ and uncertainty $\sigma(\mathbf{x}_t,\mathbf{w})$ in Eq. ([11](https://arxiv.org/html/2605.11290#S3.E11)). Let $\overline{R}(\mathbf{x},\mathbf{w}) := \mathbb{E}[\widehat{R}_t \mid \mathbf{x}_t=\mathbf{x},\ \mathbf{w}_t=\mathbf{w}]$ be the conditional expected proxy reward.
###### Assumption 4.6 (Calibrated ensemble uncertainty).
For all encountered $(\mathbf{x}_t,\mathbf{w})$, with high probability, $|\mu(\mathbf{x}_t,\mathbf{w}) - \overline{R}(\mathbf{x}_t,\mathbf{w})| \le \kappa\,\sigma(\mathbf{x}_t,\mathbf{w})$.
###### Proposition 4.7 (Step-level regret bound).
Let $\mathbf{w}_t^\star \in \arg\max_{\mathbf{w}\in\mathcal{A}(\tau)} \overline{R}(\mathbf{x}_t,\mathbf{w})$. Under Assumption [C.1](https://arxiv.org/html/2605.11290#A3.Thmtheorem1), the action selected by Eq. ([11](https://arxiv.org/html/2605.11290#S3.E11)) satisfies $\overline{R}(\mathbf{x}_t,\mathbf{w}_t^\star) - \overline{R}(\mathbf{x}_t,\mathbf{w}_t) \le 2\kappa\,\sigma(\mathbf{x}_t,\mathbf{w}_t)$.
Proof sketch. By calibration, the true reward of any action is at most its upper confidence score, and the selected action maximizes this score. Applying the same calibration inequality to the selected action yields the bound. This result does not imply a global optimum for nonconvex fine-tuning; it justifies the local ranking rule used by ReAD at each budget step.
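Spelling out the proof sketch, the bound follows from three inequalities chained together (a reconstruction under Assumption 4.6, not text from the paper's appendix):

```latex
% (1) calibration upper-bounds the optimal action's true reward;
% (2) w_t maximizes the UCB score over A(tau);
% (3) calibration lower-bounds the selected action's true reward.
\begin{align*}
\overline{R}(\mathbf{x}_t,\mathbf{w}_t^\star)
 &\le \mu(\mathbf{x}_t,\mathbf{w}_t^\star) + \kappa\,\sigma(\mathbf{x}_t,\mathbf{w}_t^\star) \\
 &\le \mu(\mathbf{x}_t,\mathbf{w}_t) + \kappa\,\sigma(\mathbf{x}_t,\mathbf{w}_t) \\
 &\le \overline{R}(\mathbf{x}_t,\mathbf{w}_t) + 2\kappa\,\sigma(\mathbf{x}_t,\mathbf{w}_t).
\end{align*}
```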
## 5 Experimental Setting
Models and budgets. We mainly distill Llama-3.3-70B-Instruct into Llama-3.1-8B-Instruct, and report Qwen2.5-72B-Instruct to Qwen2.5-14B-Instruct results in Appendix [D](https://arxiv.org/html/2605.11290#A4). All methods use the same student checkpoint, training recipe, and 20M/150M distillation-token budgets.
Evaluation protocol. We evaluate eight capabilities using Appendix Table [3](https://arxiv.org/html/2605.11290#A1.T3). For each benchmark, 10% of non-test data is used for profiling, requirement inference, and allocation updates; the held-out split is reserved for final reporting. Generation uses temperature 0 with the same prompt template across methods. We report mean $\pm$ standard deviation over three seeds.
Baselines. For each capability, we build a held-out-filtered distillation pool from its benchmark prompts. For multi-dataset capabilities, we use balanced dataset-level sampling. The six main baselines combine three teacher signals, final response (Resp), chain-of-thought plus response (CoT), and token-level logits (Logit), with two objectives, SFT and KD. Each baseline is one-hot at the capability level, spending the full budget on the target-capability pool.
ReAD implementation. ReAD runs for 20 decision steps, with bandit updates, proxy refreshes, and data resampling every 1.0M tokens in the 20M setting and every 7.5M tokens in the 150M setting. Reward regressors are two-layer MLPs trained on the 10% profiling split plus logged transitions, yielding 3.2k–4.9k examples per task. Proxy refresh adds 3.12% wall-clock overhead, the full online loop adds 5.63%, and the reusable requirement identifier costs 3.79 GPU-hours to train once.
## 6 Experiments
We study four questions: RQ1: utility gains under the same budget; RQ2: reduced waste and spillover; RQ3: the value of adaptive bandit scheduling over static or greedy alternatives; and RQ4: transfer beyond capability-aligned benchmarks.
Table 1: Capability profile under a 20M-token budget on Llama. ReAD achieves the strongest average profile while using the same student initialization and token budget as all baselines.

Figure 3: Performance versus token budget for four bottleneck capabilities ((a) Reasoning; (b) Math; (c) Tool Use; (d) LCU). Curves show mean $\pm$ standard deviation over three seeds.

### 6.1 Results and Analysis
#### RQ1: Utility under a fixed budget.
Table [1](https://arxiv.org/html/2605.11290#S6.T1) shows that ReAD achieves the strongest capability profile at 20M tokens, consistently outperforming all six baselines. The gains are especially pronounced on bottleneck capabilities such as Steerability, Math, Tool Use, and LCU. The 150M Llama results and Qwen results in Appendix [D](https://arxiv.org/html/2605.11290#A4) show the same ordering, suggesting that the improvement is not tied to a single teacher–student pair.
#### RQ2: Budget efficiency.
Figure [3](https://arxiv.org/html/2605.11290#S6.F3) reports budget-scaling curves for Reasoning, Math, Tool Use, and LCU. ReAD achieves higher performance at the same budget and maintains the strongest scaling trend, indicating that adaptive allocation reduces wasted tokens on low-marginal-return updates.
Table 2: Budget-allocation diagnostics ((a) harmful transfer at 150M; (b) allocation schedules). Panel (a) reports the 150M-token gain–spillover trade-off computed from the capability transfer matrix: ReAD preserves near-best on-target gain while substantially reducing off-target degradation and negative transfer. Panel (b) compares matched-budget allocation schedules under the same generator, showing that adaptive bandit allocation outperforms static and greedy schedules.
#### RQ2: Gain–spillover trade-off.
Table [2](https://arxiv.org/html/2605.11290#S6.T2)(a) shows that ReAD improves the gain–spillover trade-off. Single-capability KD achieves strong target gains but causes larger off-target degradation, while static mixing reduces spillover at the cost of weaker target improvement. ReAD better balances the two: it keeps strong on-target gains while suppressing negative transfer across non-target capabilities.
#### RQ3: Adaptive scheduling.
Table [2](https://arxiv.org/html/2605.11290#S6.T2)(b) compares ReAD with stronger allocation baselines under the same generator and token budgets. Static schedules cannot adapt once marginal returns and spillover change during training, while the greedy heuristic ignores uncertainty and remaining budget. ReAD outperforms both, with a larger margin at 150M where saturation and spillover are stronger.
Figure 4: Ablation of ReAD components.
#### Component ablation.
We ablate ReAD by removing one component at a time while fixing the teacher, student, prompt pools, training recipe, and budget. Removing requirement identification makes the allocation uniform; removing adaptation uses a fixed allocation; and removing interaction awareness drops the spillover penalty. Figure [4](https://arxiv.org/html/2605.11290#S6.F4) shows that identification and adaptation drive the largest gains, while interaction modeling provides additional improvement by reducing harmful transfer.
#### Local proxy validation.
The proxy validation supports ReAD's use of local action ranking. Predicted next-step gains align with observed gains, especially for small budget moves, which matches ReAD's stepwise allocation regime. The requirement vectors are stable across runs, and the reward regressors are accurate enough to guide the uncertainty-aware bandit updates.
#### RQ4: Transfer beyond aligned benchmarks.
Finally, we evaluate ReAD on held-out XSTest [rottger2024xstest](https://arxiv.org/html/2605.11290#bib.bib24) using the same requirement identifier and allocation pipeline without adding a new capability head. ReAD improves both safe-refusal and unsafe-refusal metrics over the best SFT/KD baseline; full results are reported in Appendix Table [7](https://arxiv.org/html/2605.11290#A4.T7).
## 7 Conclusions
We study capability distillation under a fixed token budget and find systematic cross-transfer and diminishing returns, often with increasing harmful spillover on task-critical behaviors. We propose ReAD, a reinforcement-guided framework that identifies task-essential capabilities and adaptively allocates budget across interdependent capabilities via on-the-fly supervision and an uncertainty-aware contextual bandit. Experiments show ReAD improves downstream utility under the same budget while reducing waste.
## References
- [1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [2] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
- [3] Zinuo Cai, Rongbo Ma, Yicheng Fu, Weishan Zhang, Ruhui Ma, and Haibing Guan. LLMaaS: Serving large language models on trusted serverless computing platforms. IEEE Transactions on Artificial Intelligence, 2024.
- [4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. 2021.
- [5] Daixuan Cheng, Shaohan Huang, and Furu Wei. Adapting large language models via reading comprehension. In The Twelfth International Conference on Learning Representations, 2023.
- [6] Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
- [7] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- [8] Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805, 2025.
- [9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
- [10] Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, and Ping Ma. Knowledge distillation and dataset distillation of large language models: Emerging trends, challenges, and future directions. arXiv preprint arXiv:2504.14772, 2025.
- [11] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.
- [12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [13] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, and Chen-Yu Lee. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.
- [14] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, 2020.
- [15] Yimin Jing, Renren Jin, Jiahao Hu, Huishi Qiu, Xiaohua Wang, Peng Wang, and Deyi Xiong. FollowEval: A multi-dimensional benchmark for assessing the instruction-following capability of large language models. arXiv preprint arXiv:2311.09829, 2023.
- [16] Lei Li, Yankai Lin, Shuhuai Ren, Peng Li, Jie Zhou, and Xu Sun. Dynamic knowledge distillation for pre-trained language models. 2021.
- [17] Raymond Li, Erik Nijkamp, Swaroop Mishra, et al. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- [18] Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, and Tuo Zhao. Less is more: Task-aware layer-wise distillation for language model compression. In International Conference on Machine Learning, pages 20852–20867. PMLR, 2023.
- [19] Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. RegMix: Data mixture as regression for language model pre-training. arXiv preprint arXiv:2407.01492, 2024.
- [20] Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1773–1781, 2023.
- [21] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023.
- [22] Edoardo M. Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
- [23] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
- [24] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024.
- [25] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
- [26] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022.
- [27] Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models. Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073, 2023.
- [28] Zhihong Sun, Chen Lyu, Bolun Li, Yao Wan, Hongyu Zhang, Ge Li, and Zhi Jin. Enhancing code generation performance of smaller models by distilling the reasoning ability of LLMs. arXiv preprint arXiv:2403.13271, 2024.
- [29] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- [30] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model, 2023.
- [31] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems, volume 33, pages 5776–5788, 2020.
- [32] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024.
- [33] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- [34] Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the tool manipulation capability of open-source large language models, 2023.
- [35] Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024.
- [36] Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley Function Calling Leaderboard. [https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html), 2024.
- [37] Chuanpeng Yang, Yao Zhu, Wang Lu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, and Yiqiang Chen. Survey on knowledge distillation for large language models: Methods, evaluation, and application. ACM Transactions on Intelligent Systems and Technology, 2024.
- [38] Yuanhao Yue, Chengyu Wang, Jun Huang, and Peng Wang. Distilling instruction-following abilities of large language models with task-aware curriculum planning. arXiv preprint arXiv:2405.13448, 2024.
- [39] Junjie Zhang, Ruobing Xie, Yupeng Hou, Xin Zhao, Leyu Lin, and Ji-Rong Wen. Recommendation as instruction following: A large language model empowered recommendation approach. ACM Transactions on Information Systems, 43(5):1–37, 2025.
- [40] Yichi Zhang, Zhuo Chen, Yin Fang, Yanxi Lu, Li Fangming, Wen Zhang, and Huajun Chen. Knowledgeable preference alignment for LLMs in domain-specific question answering. In Findings of the Association for Computational Linguistics: ACL 2024, pages 891–904, 2024.
- [41] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2), 2023.
- [42] Qihuang Zhong, Liang Ding, Li Shen, Juhua Liu, Bo Du, and Dacheng Tao. Revisiting knowledge distillation for autoregressive language models. arXiv preprint arXiv:2402.11890, 2024.
- [43] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
- [44] Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. Distilling mathematical reasoning capabilities into small language models. Neural Networks, 179:106594, 2024.
## Technical Appendices and Supplementary Material
## Appendix A: Benchmark Datasets
Table [3](https://arxiv.org/html/2605.11290#A1.T3) summarizes the benchmark suite used to evaluate each capability. For capabilities associated with multiple benchmarks, we report the average score across the listed benchmarks.
Table 3: Benchmark suite for evaluating LLM capabilities.

- MMLU-Pro [[32](https://arxiv.org/html/2605.11290#bib.bib32)] tests broad knowledge and problem-solving across many academic subjects using multiple-choice questions.
- BIG-bench (BIG) [[29](https://arxiv.org/html/2605.11290#bib.bib29)] is a large collection of diverse tasks designed to probe general reasoning and knowledge beyond standard exams.
- IFEval [[43](https://arxiv.org/html/2605.11290#bib.bib43)] measures instruction following by checking whether model outputs satisfy explicit, verifiable constraints in the prompt.
- FollowEval [[15](https://arxiv.org/html/2605.11290#bib.bib15)] evaluates how well a model follows user instructions under varied formats and constraint types, complementing IFEval with broader instruction patterns.
- GPQA [[23](https://arxiv.org/html/2605.11290#bib.bib23)] is a graduate-level question-answering benchmark that stresses hard scientific reasoning with high-quality, expert-written questions.
- ARC-Challenge [[7](https://arxiv.org/html/2605.11290#bib.bib7)] contains difficult grade-school science questions that require multi-step reasoning and commonsense beyond keyword matching.
- MATH Hard [[11](https://arxiv.org/html/2605.11290#bib.bib11)] is a challenging subset of competition-style math problems, scored by exact final-answer match.
- GSM8K [[9](https://arxiv.org/html/2605.11290#bib.bib9)] focuses on grade-school math word problems that require intermediate reasoning steps, also scored by exact final-answer match.
- HumanEval [[4](https://arxiv.org/html/2605.11290#bib.bib4)] evaluates code generation via functional correctness on hand-written programming problems, reported as Pass@1.
- MBPP [[1](https://arxiv.org/html/2605.11290#bib.bib1)] is a Python programming benchmark with short programming tasks and unit tests, also evaluated by Pass@1.
- BFCL V4 [[36](https://arxiv.org/html/2605.11290#bib.bib36)] measures tool-use ability by checking whether the model can produce correct function calls under realistic tool specifications.
- ToolBench [[34](https://arxiv.org/html/2605.11290#bib.bib34)] evaluates tool use in more complex, multi-tool scenarios, scored by tool-call success rate.
- LongBench [[2](https://arxiv.org/html/2605.11290#bib.bib2)] evaluates long-context understanding across multiple long-input tasks and reports an average normalized score.
- XCOPA [[22](https://arxiv.org/html/2605.11290#bib.bib22)] tests multilingual causal commonsense reasoning across languages, reported as macro accuracy.
- MGSM [[26](https://arxiv.org/html/2605.11290#bib.bib26)] is a multilingual version of grade-school math, evaluated by macro accuracy across languages.
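As a small illustration of the scoring protocol above, the sketch below averages benchmark scores into capability scores. The capability-to-benchmark groupings shown are inferred from the ordering and descriptions in the list (Table 3 itself is not reproduced here), so they should be treated as illustrative rather than definitive.

```python
# Illustrative sketch: each capability score is the mean over its
# associated benchmarks. Groupings are inferred from the list above.
CAPABILITY_BENCHMARKS = {
    "knowledge": ["MMLU-Pro", "BIG-bench"],
    "steerability": ["IFEval", "FollowEval"],
    "reasoning": ["GPQA", "ARC-Challenge"],
    "math": ["MATH Hard", "GSM8K"],
    "code": ["HumanEval", "MBPP"],
    "tool_use": ["BFCL V4", "ToolBench"],
    "lcu": ["LongBench"],
    "multilingual": ["XCOPA", "MGSM"],
}

def capability_scores(benchmark_scores: dict) -> dict:
    """Map {benchmark: score} to {capability: mean benchmark score}."""
    return {
        cap: sum(benchmark_scores[b] for b in benches) / len(benches)
        for cap, benches in CAPABILITY_BENCHMARKS.items()
    }
```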
## Appendix B Empirical Study Setting
We study capability interactions under a fixed distillation budget by distilling a student model toward one target capability at a time and then evaluating the resulting student on *all* capabilities in Table [3](https://arxiv.org/html/2605.11290#A1.T3). For each capability, we build a capability-specific distillation pool from the corresponding benchmark prompts, excluding any held-out evaluation examples to prevent leakage. We repeat this process under multiple token budgets (e.g., 20M, 80M, 150M tokens) to form a budget-dependent capability transfer matrix (sketched below), where each entry is the score change on a destination capability when distilling toward a source capability at budget B. We use deterministic decoding for generation benchmarks (temperature = 0) with a fixed prompt template across methods, and we compute each capability score as the average over its associated benchmarks in the table. We instantiate a simple but representative set of capability distillation baselines by combining three supervision forms that expose different levels of teacher knowledge: Resp (final answer only), CoT (rationale plus final answer), and Logit (token-level soft targets), paired with two standard objectives, SFT and KD, yielding six baselines in total.
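As a concrete illustration, the following is a minimal sketch of how such a budget-dependent transfer matrix can be assembled. The `distill` and `evaluate` callables are hypothetical stand-ins for the actual distillation and benchmark-evaluation pipelines, and the capability names follow the groupings inferred from Table 3.

```python
# Sketch of the budget-dependent capability transfer matrix described
# above. `distill` and `evaluate` are caller-supplied callables standing
# in for the real distillation and evaluation pipelines (hypothetical
# interfaces, not part of the released code).

CAPABILITIES = ["knowledge", "steerability", "reasoning", "math",
                "code", "tool_use", "lcu", "multilingual"]

def transfer_matrix(student, teacher, budget, distill, evaluate):
    """Entry [src][dst] = score change on capability `dst` after
    distilling `student` toward capability `src` at `budget` tokens."""
    # Baseline scores of the undistilled student on every capability.
    base = {c: evaluate(student, c) for c in CAPABILITIES}
    matrix = {}
    for src in CAPABILITIES:
        # Distill a fresh student copy toward one source capability,
        # using its capability-specific pool, under the fixed budget.
        distilled = distill(student, teacher, capability=src, budget=budget)
        matrix[src] = {dst: evaluate(distilled, dst) - base[dst]
                       for dst in CAPABILITIES}
    return matrix

# Repeating over budgets (e.g., 20M, 80M, 150M tokens) yields the
# budget-dependent family of matrices used in the empirical study.
```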
## Appendix C Additional Theoretical Analysis
**Uncertainty-aware UCB allocation.** Here, we conduct an uncertainty-aware UCB allocation analysis showing that action-level regret is controlled by predictive uncertainty. At interval $t$, ReAD maintains a bandit context

$$\mathbf{x}_t := [\,\mathbf{r}_\tau;\ \mathbf{s}(S_t);\ b_t\,], \tag{16}$$

where $\mathbf{r}_\tau \in \Delta$ is the task requirement vector, $\mathbf{s}(S_t) \in \mathbb{R}^{|\mathcal{C}|}$ is the probe-defined capability profile of the current student, and $b_t$ is the remaining budget. Let $\mathcal{A}(\tau)$ denote the discrete candidate action set of allocation vectors. Define the conditional expected proxy reward

$$\bar{R}(\mathbf{x}, \mathbf{w}) := \mathbb{E}\big[\widehat{R}_t \mid \mathbf{x}_t = \mathbf{x},\ \mathbf{w}_t = \mathbf{w}\big], \tag{17}$$

where $\widehat{R}_t$ is the proxy reward defined in Eq. ([9](https://arxiv.org/html/2605.11290#S3.E9)).
###### Assumption C.1 (Calibrated uncertainty).
For a confidence level $\delta \in (0,1)$, with probability at least $1-\delta$, for all encountered $(\mathbf{x}_t, \mathbf{w})$,

$$\big|\mu(\mathbf{x}_t, \mathbf{w}) - \bar{R}(\mathbf{x}_t, \mathbf{w})\big| \leq \sigma(\mathbf{x}_t, \mathbf{w}), \tag{18}$$

where $\mu$ and $\sigma$ are the ensemble mean and standard deviation defined from the reward regressors.
Since ReAD selects allocations by the UCB rule in Eq. ([11](https://arxiv.org/html/2605.11290#S3.E11)), we can compare its chosen action $\mathbf{w}_t$ to the optimal action $\mathbf{w}_t^\star$ under the same context and bound the resulting gap in conditional expected reward using the calibrated uncertainty condition in Assumption [C.1](https://arxiv.org/html/2605.11290#A3.Thmtheorem1), which leads to the following instantaneous regret bound.
###### Proposition C.2 (Instantaneous regret bound).
Let $\mathbf{w}_t^\star \in \arg\max_{\mathbf{w} \in \mathcal{A}(\tau)} \bar{R}(\mathbf{x}_t, \mathbf{w})$. Under Assumption [C.1](https://arxiv.org/html/2605.11290#A3.Thmtheorem1),

$$\bar{R}(\mathbf{x}_t, \mathbf{w}_t^\star) - \bar{R}(\mathbf{x}_t, \mathbf{w}_t) \;\leq\; 2\kappa\,\sigma(\mathbf{x}_t, \mathbf{w}_t), \tag{19}$$

and consequently,

$$\sum_{t=1}^{T} \Big( \bar{R}(\mathbf{x}_t, \mathbf{w}_t^\star) - \bar{R}(\mathbf{x}_t, \mathbf{w}_t) \Big) \;\leq\; 2\kappa \sum_{t=1}^{T} \sigma(\mathbf{x}_t, \mathbf{w}_t). \tag{20}$$
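To make the allocation step concrete, the following is a minimal sketch of the uncertainty-aware UCB selection analyzed above. It assumes a hypothetical ensemble of fitted reward regressors with scikit-learn-style `predict` methods, and it uses the standard optimism score $\mu + \kappa\sigma$, which is consistent with the $2\kappa\sigma$ bound in Proposition C.2 (the exact form of Eq. (11) is not reproduced in this appendix).

```python
import numpy as np

def select_allocation(regressors, context, candidate_actions, kappa):
    """Return the allocation w maximizing mu(x, w) + kappa * sigma(x, w),
    where mu/sigma are the ensemble mean/std over the reward regressors."""
    best_w, best_ucb = None, -np.inf
    for w in candidate_actions:
        xw = np.concatenate([context, w])  # bandit input [x_t; w]
        preds = np.array([r.predict(xw.reshape(1, -1))[0]
                          for r in regressors])
        mu, sigma = preds.mean(), preds.std()
        ucb = mu + kappa * sigma           # optimism under uncertainty
        if ucb > best_ucb:
            best_w, best_ucb = w, ucb
    return best_w
```

By Assumption C.1, the true conditional reward of every action lies within $\sigma$ of the ensemble mean, so the action chosen by this rule can underperform the optimal one by at most $2\kappa\sigma(\mathbf{x}_t, \mathbf{w}_t)$, matching Eq. (19).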
## Appendix D Additional Experiments
Table 4: Capability profile of different methods under a 150M-token budget (Llama). ReAD consistently outperforms all six baselines.

Table 5: Capability profile of different methods under a 20M-token budget (Qwen). ReAD consistently outperforms all six baselines.

Table 6: Capability profile of different methods under a 150M-token budget (Qwen). ReAD consistently outperforms all six baselines.

**Additional evidence that ReAD improves capability distillation under a fixed budget.** We compare ReAD with six capability distillation baselines. Tables [1](https://arxiv.org/html/2605.11290#S6.T1) (in the main text) and [4](https://arxiv.org/html/2605.11290#A4.T4) report the Llama results at B = 20M and B = 150M, and Tables [5](https://arxiv.org/html/2605.11290#A4.T5) and [6](https://arxiv.org/html/2605.11290#A4.T6) report the corresponding Qwen results. Under both budgets and for both model families, ReAD achieves the best overall capability profile and outperforms all six baselines on every capability, with the largest improvements on bottleneck capabilities such as Steerability and Math, while still maintaining strong gains on Code, Tool Use, and LCU. Scaling the budget from 20M to 150M strengthens all methods, but ReAD benefits more consistently, indicating that its allocation policy converts additional tokens into broader capability improvements rather than concentrating gains on a narrow subset of skills. Across model families, Qwen starts from a stronger student and teacher on most capabilities and reaches higher absolute performance after distillation; the relative ordering between methods is nevertheless unchanged, and ReAD remains the top performer, suggesting that its advantage is robust to the choice of teacher-student backbone and not tied to a specific LLM family.
Figure 5: ReAD consistently beats SOTA baselines that are specifically designed for reasoning, code, and math capabilities.

**ReAD beats SOTA baselines.** To further validate ReAD against SOTA capability distillation baselines designed for specific capabilities, we include three representative methods commonly used for specializing student models, one per capability: step-by-step distillation [[13](https://arxiv.org/html/2605.11290#bib.bib13)] for reasoning, Equation-of-Thought Distillation (EoTD) [[44](https://arxiv.org/html/2605.11290#bib.bib44)] for math, and CodePLAN [[28](https://arxiv.org/html/2605.11290#bib.bib28)] for code. For a fair comparison under our setting, we adapt each baseline to our Llama teacher-student pair and a 150M-token budget, and we distill the student using only the corresponding capability pool. We report the resulting capability profiles across all eight capabilities, using the same evaluation protocol as the main experiments, in Figure [5](https://arxiv.org/html/2605.11290#A4.F5). The results show that these capability-targeted baselines mainly concentrate gains on their target capability and exhibit weaker transfer to other capabilities, producing imbalanced profiles that match the interaction patterns observed in our empirical study. In contrast, ReAD achieves a consistently stronger and more balanced profile: it is never worse than these baselines on their respective target capabilities, while improving substantially on non-target capabilities.
**Held-out Downstream Transfer.** To test whether ReAD transfers beyond capability-aligned benchmark tasks, we evaluate on XSTest using the same requirement identifier and allocation pipeline, without adding a new capability head. Lower safe-refusal rates and higher unsafe-refusal rates are better.
Table 7: Held-out XSTest evaluation. "Best baseline" denotes the best of the six SFT/KD baselines for each metric.
## Appendix E Related Work
**Capability Distillation for Large Language Models.** Recent work on capability distillation has focused on transferring specific capabilities such as instruction following and context manipulation [[30](https://arxiv.org/html/2605.11290#bib.bib30), [6](https://arxiv.org/html/2605.11290#bib.bib6), [21](https://arxiv.org/html/2605.11290#bib.bib21), [33](https://arxiv.org/html/2605.11290#bib.bib33)], thinking patterns [[40](https://arxiv.org/html/2605.11290#bib.bib40), [5](https://arxiv.org/html/2605.11290#bib.bib5)], or text and code generation [[39](https://arxiv.org/html/2605.11290#bib.bib39), [17](https://arxiv.org/html/2605.11290#bib.bib17)]. These methods typically steer a teacher LLM via prompts or rationales, generate distillation data or logits, and fine-tune the student on that evidence. However, most of them assume that distilling a single target capability or domain is sufficient, and they rarely analyze how such focused training influences broader capability interactions within the student model. To the best of our knowledge, ReAD is the first framework that explicitly accounts for capability interdependence during distillation.
## Appendix F LLM Usage
Large language models were used only for writing assistance, including grammar checking, wording refinement, formatting suggestions, and readability editing\. They were not used to generate the core methodology, design experiments, produce experimental results, create evaluation labels, or make routing decisions\. All technical claims, mathematical formulations, experimental settings, and reported results were checked and finalized by the authors\.
## Appendix G Broader Impact
ReAD aims to make capability distillation more budget\-efficient by allocating limited training tokens across interdependent LLM capabilities\. A positive impact is that smaller student models can become more useful under fixed compute budgets, potentially reducing deployment cost and improving access to LLM\-based systems in resource\-constrained settings\. ReAD also explicitly monitors cross\-capability transfer, which can help practitioners diagnose when improving one capability degrades others\.
At the same time, more efficient distillation may lower the cost of producing specialized models with dual\-use capabilities, such as code generation, tool use, or instruction following\. Distilled models may also inherit biases, unsafe behaviors, privacy risks, or other limitations from teacher models and training data\. In addition, if the capability probes or task requirement vectors are misaligned with the intended deployment objective, ReAD may optimize benchmark utility while overlooking safety\-, fairness\-, or robustness\-relevant behaviors\.
We therefore view ReAD as a research framework rather than a deployment\-ready safety guarantee\. For sensitive applications, practitioners should combine ReAD with task\-specific safety evaluation, robustness and bias audits, data filtering, red\-team testing, and monitoring of off\-target capability changes\. Any release or deployment of distilled models should follow the usage restrictions of the underlying models and include appropriate documentation of intended use, limitations, and safety evaluations\.
## Appendix H Limitations and Future Work
While ReAD improves budget efficiency by using low-cost capability monitoring and adaptive allocation, it still relies on two practical components that can limit performance. First, the task requirement estimator and the probe suite may be imperfect: if the inferred essential capabilities or probe signals are misaligned with true task utility, ReAD can allocate budget suboptimally. Second, our capability set and data generator are necessarily finite and template-driven, which may not cover all behaviors a real task depends on, especially for highly domain-specific or long-horizon tasks. Future work can address these limitations by learning more robust and transferable task requirement models, designing probe suites with stronger coverage and better calibration to task utility, and expanding capability definitions and data generation beyond templates (e.g., automatic discovery of capability axes and open-ended curriculum generation). Another promising direction is to jointly optimize allocation and student training dynamics (e.g., optimizer schedules and parameter-efficient adapters) under the same budget, and to extend ReAD to settings with distribution shift, multi-task deployment, or continual updates, where requirements and transfer patterns evolve over time.