Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility
Summary
Introduces the Data-Model Compatibility (DMC) metric to evaluate how well a reasoning dataset aligns with a student model during distillation. Experiments show DMC strongly correlates with distillation performance and that dynamically selecting datasets based on DMC further improves reasoning capabilities.
Cached at: 05/29/26, 09:14 AM
# Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility
Source: [https://arxiv.org/html/2605.29229](https://arxiv.org/html/2605.29229)
Jiahao Huang¹, Fei Cheng²,³, Junfeng Jiang³, Akiko Aizawa¹,³; ¹University of Tokyo, ²Kyoto University, ³National Institute of Informatics; jiahao-huang@g.ecc.u-tokyo.ac.jp, feicheng@i.kyoto-u.ac.jp, {jiang, aizawa}@nii.ac.jp
###### Abstract
Reasoning distillation transfers complex reasoning abilities from large language models \(LLMs\) to smaller ones, yet its success depends on how well the training data align with the student model\. This paper introduces the Data–Model Compatibility \(DMC\) metric, which can be used to assess the suitability of a dataset for reasoning distillation on a student model\. DMC provides an assessment by jointly considering data quality, relative difficulty, and student capability\. We validated the effectiveness of DMC from two perspectives: \(1\) DMC exhibits a strong correlation with reasoning distillation performance; and \(2\) using DMC as the criterion for data selection leads to improved reasoning distillation performance\. Both findings are consistently demonstrated across multiple student models and tasks\. Moreover, since the DMC of each dataset dynamically changes during training, our experiments demonstrate that dynamically selecting datasets based on DMC can further enhance performance\.
## 1 Introduction
In recent years, a significant number of reasoning models, including OpenAI o1 (Jaech et al., [2024](https://arxiv.org/html/2605.29229#bib.bib11)), DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2605.29229#bib.bib12)), and QwQ (Team, [2025](https://arxiv.org/html/2605.29229#bib.bib13)), have emerged. These large models have demonstrated outstanding performance on reasoning-dependent tasks such as logic and mathematics. However, the reasoning capabilities of small and medium-sized models remain underdeveloped. Due to their lower resource consumption and higher flexibility, smaller models are more widely adopted in scenarios where both efficiency and effectiveness are required. Therefore, researchers aim to compress the reasoning ability of large models into small ones, which we refer to in this paper as *reasoning distillation*.
In reasoning distillation, the student models are finetuned on datasets comprising questions, answers, and corresponding reasoning processes generated by the teacher models. Prior work Zhang et al. ([2025b](https://arxiv.org/html/2605.29229#bib.bib1)); Xu et al. ([2025](https://arxiv.org/html/2605.29229#bib.bib3)); You et al. ([2017](https://arxiv.org/html/2605.29229#bib.bib24)); Li et al. ([2025c](https://arxiv.org/html/2605.29229#bib.bib29)) primarily focused on how to choose the combination of teacher models and reasoning process generation methods to improve the performance of reasoning distillation. However, we argue that teacher models and generation methods are just indirect factors affecting reasoning distillation, while the most direct factors are the features of the reasoning dataset and the student model. In this paper, we aim to investigate reasoning distillation from a new perspective, focusing on the selection, evaluation, and combination of the features of the dataset and the student model.
RQ1: Which features of the datasets and student models can effectively reflect the performance of reasoning distillation?
We analyze features of the dataset and the student model from three perspectives: Data Quality $Q$ (a feature of the data), Relative Difficulty $D$ (a joint feature of the data and the student model), and Student Capability $C$ (a feature of the student model). Their precise definitions and computation are given in Section [3.2](https://arxiv.org/html/2605.29229#S3.SS2).
Based on these features, we propose data-model compatibility (DMC), formulated as a function of $Q$, $D$, and $C$, to evaluate the suitability of a dataset for reasoning distillation on a student model. We demonstrate its effectiveness in two ways: (i) DMC values correlate strongly with reasoning-distillation performance across datasets and students; and (ii) constructing datasets from high-DMC data yields better-performing students.
The relative difficulty $D$ and student capability $C$ are naturally dynamic, as they depend on the evolving capacity of the model itself during training. Therefore, the data exhibiting high DMC values will also change dynamically over training. We thus pose the second research question:
RQ2: Can dynamic data selection according to the evolving DMC values further enhance the performance of reasoning distillation?
We address two research gaps through this research question. First, we propose a data selection approach for reasoning distillation that adaptively selects the most compatible training data, making the selection process responsive and tailored to the model's reasoning level. Second, compared with traditional dynamic data-selection methods that rely solely on perplexity Li et al. ([2024a](https://arxiv.org/html/2605.29229#bib.bib32)); Zhang et al. ([2025a](https://arxiv.org/html/2605.29229#bib.bib49)), DMC is empirically derived from extensive model-data experiments, which offers a more substantial data-driven basis.
In summary, the main contributions of this paper are as follows: \(1\) We propose data\-model compatibility \(DMC\), a metric jointly modeling three features: data quality, relative difficulty, and student capability, for effectively assessing whether a dataset is suitable for performing reasoning distillation on a student model\. \(2\) We propose a dynamic data selection approach based on DMC that adaptively re\-selects training data to match the student’s evolving capability throughout training, effectively enhancing reasoning distillation performance on the test set\.
## 2 Related Work
#### Reasoning Ability of LLMs
Studies have demonstrated that incorporating a reasoning process into LLMs in question-answering (QA) tasks can enhance model performance (Wei et al., [2022](https://arxiv.org/html/2605.29229#bib.bib18); Kojima et al., [2022](https://arxiv.org/html/2605.29229#bib.bib14)). Multiple methods have been proposed to generate reasoning processes with LLMs, such as vanilla chain-of-thought (CoT) (Kojima et al., [2022](https://arxiv.org/html/2605.29229#bib.bib14); Hsieh et al., [2023](https://arxiv.org/html/2605.29229#bib.bib2); Mukherjee et al., [2023](https://arxiv.org/html/2605.29229#bib.bib5); Mitra et al., [2023](https://arxiv.org/html/2605.29229#bib.bib6); Lewkowycz et al., [2022](https://arxiv.org/html/2605.29229#bib.bib4)), tree-of-thought (Yao et al., [2023](https://arxiv.org/html/2605.29229#bib.bib22)), reverse thinking (Chen et al., [2024b](https://arxiv.org/html/2605.29229#bib.bib19)), and self-reflection (Li et al., [2025b](https://arxiv.org/html/2605.29229#bib.bib20), [a](https://arxiv.org/html/2605.29229#bib.bib21)).
#### Data Selection
Recently, with the growing number of data generation methods, increasing attention has been devoted to data selection to further enhance the effectiveness of reasoning distillation. From the perspective of method, Zhang et al. ([2025b](https://arxiv.org/html/2605.29229#bib.bib1)), Tian et al. ([2025](https://arxiv.org/html/2605.29229#bib.bib9)) and Chen et al. ([2025](https://arxiv.org/html/2605.29229#bib.bib8)) identified the teacher model and the generation approach as significant factors in reasoning distillation. From the perspective of data, quality and difficulty are usually employed. Quality is usually evaluated by LLM evaluators (Chen et al., [2024a](https://arxiv.org/html/2605.29229#bib.bib25); Liu et al., [2024b](https://arxiv.org/html/2605.29229#bib.bib26); Lee et al., [2024](https://arxiv.org/html/2605.29229#bib.bib10)) or reward models (Xu et al., [2024](https://arxiv.org/html/2605.29229#bib.bib27)). Difficulty can be quantified by approaches such as perplexity (PPL), conditional PPL (Li et al., [2024b](https://arxiv.org/html/2605.29229#bib.bib28)), and IFD Li et al. ([2024a](https://arxiv.org/html/2605.29229#bib.bib32)). Metrics that integrate both, such as Compatibility-Adjusted Reward (CAR) (Xu et al., [2025](https://arxiv.org/html/2605.29229#bib.bib3)), have also been introduced recently. Unlike these methods, which rely on fixed, hand-designed assumptions (e.g., higher quality or difficulty is always better) and overlook the student model itself, our DMC is data-driven and explicitly conditions on the student's evolving capability, selecting data that best matches the student's current state throughout training.
## 3 Preliminaries
Figure 1: Pipeline of this paper. The DMC metric is formulated in Section [4](https://arxiv.org/html/2605.29229#S4); both research questions are empirically validated in Section [6](https://arxiv.org/html/2605.29229#S6).
### 3.1 Problem Formulation
Fig. [1](https://arxiv.org/html/2605.29229#S3.F1) illustrates the formulation of this research problem as well as some of its specific settings. Reasoning distillation starts from a raw question-answering (QA) dataset $\mathcal{D}_0=\{(q,a)\}$, where $q$ is the question and $a$ is the ground truth answer. For each question-answer pair, the teacher model $T$ employs an augmentation method $Aug$ to generate a reasoning process $r$. Consequently, the entire dataset can be expanded into a reasoning dataset $\mathcal{D}(T,Aug)=\{(q,r,a)\}$. To ensure diversity in the reasoning processes, multiple teacher models and augmentation methods can be employed. We aggregate all the reasoning datasets into a reasoning data pool: $U=\bigcup_{T,Aug}\mathcal{D}(T,Aug)$.
For comparative purposes, we sample several subsets $\mathcal{D}_i\subseteq U$ from the data pool. On one hand, we finetune the student model $S$ on $\mathcal{D}_i$ and denote its test set performance by $P_S(\mathcal{D}_i)$. On the other hand, we aim to identify a metric $M$ to evaluate the suitability of employing $\mathcal{D}_i$ for reasoning distillation on the student model $S$, denoted as $M_S(\mathcal{D}_i)$. High correlation between $[M_S(\mathcal{D}_i)]|_i$ and $[P_S(\mathcal{D}_i)]|_i$ indicates that $M$ serves as an effective metric.
### 3.2 Foundational Features
Whether a dataset benefits reasoning distillation is not an intrinsic property of the data alone; it depends on how well the data fits the particular student model. We therefore characterize this fit along three complementary features. Data quality ($Q$) captures the data side: a reasoning chain that is incorrect, incoherent, or fails to reach the answer teaches the student flawed patterns no matter which model is trained on it. Relative difficulty ($D$) captures the interaction between the data and the model: a chain that is too hard cannot be absorbed by the current student, whereas one that is too easy conveys little new signal, so what matters is the difficulty *relative* to the student. Student capability ($C$) captures the model side: since the same data suits different students to different degrees, a metric that ignores the student cannot express compatibility at all. Together, $Q$, $D$, and $C$ cover the data, the model, and their interaction, which makes them a natural basis for analyzing data-model compatibility (DMC). In this subsection we describe how each feature is computed; how they are integrated into the DMC metric is presented in Section [4](https://arxiv.org/html/2605.29229#S4).
#### Data Quality $Q$
Data quality is a metric for evaluating whether the reasoning process $r$ is correct, accurate, and coherent, and whether it leads to the ground truth answer. Following Xu et al. ([2025](https://arxiv.org/html/2605.29229#bib.bib3), [2024](https://arxiv.org/html/2605.29229#bib.bib27)), we employ state-of-the-art reward models to score the quality of each single data entry; the specific reward models are detailed in Section [5](https://arxiv.org/html/2605.29229#S5). For each subset $\mathcal{D}_i$, we define the quality of the dataset as the average quality of all data entries within it. Since data quality depends only on the data, the subscript for the student model $S$ is omitted and the quality is denoted as $Q(\mathcal{D}_i)$.
#### Relative Difficulty $D$
We emphasize that difficulty here is not an intrinsic property of the data but is always defined *relative to a given student model*: it measures to what extent that particular model finds the reasoning chain hard to comprehend, so the same chain can be easy for a strong student and hard for a weak one. Perplexity (PPL), conditional perplexity (CPPL) and instruction following difficulty (IFD) Li et al. ([2024a](https://arxiv.org/html/2605.29229#bib.bib32)) are the mainstream metrics for evaluating this relative difficulty. For student model $S$, the PPL of a reasoning process $r$ can be calculated by
$$\text{PPL}_S(r)=\exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_S\left(r^{i}\mid r^{1:i-1}\right)\right) \tag{1}$$
where $r^{i}$ represents the $i$-th token of $r$ and $N$ is the number of tokens in $r$. CPPL is calculated as $\text{PPL}_S(r\mid q)$, and IFD is defined as the ratio of CPPL to PPL. Similar to $Q$, the difficulty of dataset $\mathcal{D}_i$ for student model $S$ is defined as the average difficulty of all data samples in $\mathcal{D}_i$.
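The three difficulty measures can be illustrated with a minimal sketch. It assumes the per-token log-probabilities of a reasoning chain have already been obtained from the student model; the helper names and the numeric values below are hypothetical and chosen only for illustration, and IFD is taken as CPPL divided by PPL.

```python
import math


def perplexity(token_logprobs):
    """Eq. (1): exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd(cond_logprobs, uncond_logprobs):
    """IFD for one reasoning chain, taken here as CPPL / PPL."""
    return perplexity(cond_logprobs) / perplexity(uncond_logprobs)


# Hypothetical log-probabilities of the same 4-token chain r, scored
# without the question (for PPL) and with the question (for CPPL).
uncond = [-2.0, -1.0, -3.0, -2.0]  # log p_S(r^i | r^{1:i-1})
cond = [-1.0, -0.5, -1.5, -1.0]    # log p_S(r^i | q, r^{1:i-1})

ppl = perplexity(uncond)
cppl = perplexity(cond)
ifd_score = ifd(cond, uncond)
```

Under this reading, an IFD below 1 means the question makes the chain easier for the student to predict; the dataset-level difficulty is then the mean of the per-sample values.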
#### Student Capability $C$
Previous work mainly focused on the features of data while disregarding the model itself; therefore, we introduce student capability $C$ as another key factor in the formulation of DMC.
The concept of student capability is inspired by the placement test in human education, which first evaluates a student's capability and then designs a suitable training course. From the data pool $U$, we isolate a small subset $\mathcal{D}_p$ of high-quality samples to serve as a placement test for the preliminary evaluation of the student capability; these samples are held out from subsequent training, and the sampling details are given in Section [5](https://arxiv.org/html/2605.29229#S5). We consider student capability as the extent to which the model can comprehend the reasoning data from the placement test. Therefore, we define the absolute capability value of the student model $S$ as:
$$C_S^{abs}=\mathbb{E}_{(q,r)\in\mathcal{D}_p}\,\frac{1}{N}\sum_{i=1}^{N}\log p_S\left(r^{i}\mid q,r^{1:i-1}\right) \tag{2}$$
A higher value of $C_S^{abs}$ indicates greater confidence and familiarity with high-quality reasoning in the placement test, thereby reflecting stronger capability. By linearly mapping the absolute capability values of different student models on various raw datasets $\mathcal{D}_0$ to the range $[0,1]$, we obtain the corresponding relative capability value $C_S^{rel}$.
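A minimal sketch of the capability computation follows. It assumes the conditional token log-probabilities on the placement set are already available, and it assumes that the linear mapping is a min-max normalization over the observed absolute values; the exact mapping is not spelled out in the text, and the student names and numbers are hypothetical.

```python
def absolute_capability(placement_logprobs):
    """Eq. (2): mean over placement samples of the mean token log-prob."""
    per_sample = [sum(lp) / len(lp) for lp in placement_logprobs]
    return sum(per_sample) / len(per_sample)


def relative_capability(abs_values):
    """Map absolute capabilities onto [0, 1] (ASSUMED min-max scaling)."""
    lo, hi = min(abs_values.values()), max(abs_values.values())
    if hi == lo:
        # Degenerate case: all students identical; 0.5 is an arbitrary choice.
        return {name: 0.5 for name in abs_values}
    return {name: (v - lo) / (hi - lo) for name, v in abs_values.items()}


# Hypothetical log p_S(r^i | q, r^{1:i-1}) for two placement samples each.
placement = {
    "student_a": [[-2.0, -4.0], [-3.0, -3.0]],
    "student_b": [[-1.0, -1.0], [-2.0, -2.0]],
    "student_c": [[-0.5, -1.5], [-1.0, -1.0]],
}
c_abs = {name: absolute_capability(lp) for name, lp in placement.items()}
c_rel = relative_capability(c_abs)
```

Here the student with the least negative average log-probability receives a relative capability of 1 and the weakest receives 0.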
## 4 Method: Data-Model Compatibility
### 4.1 DMC Formulation
Instead of directly measuring the features of the data, student capability $C$ plays the role of a *modulator*: we adopt two evaluation metrics, $M_S^L(\cdot)$ and $M_S^H(\cdot)$, designed for student models with the lowest and highest capability respectively, and let the modulator smoothly transition between them. Providing different evaluation criteria to students of different capability resembles how a teacher tailors a curriculum to each student's level: what matters is not the sheer difficulty of the data, but its compatibility with the student.
Given a data entry $d$ and a student model $S$, to allow DMC to adapt its evaluation metric based on the capability of the model, we employ a nonlinear interpolation method to model DMC:
$$\text{DMC}_S\big(d;Q,D,C,M_S^L(\cdot),M_S^H(\cdot),f(\cdot)\big)=\big(1-f(C_S)\big)\,M_S^L\big(Q(d),D_S(d)\big)+f(C_S)\,M_S^H\big(Q(d),D_S(d)\big) \tag{3}$$
Here, $M_S^L(\cdot)$ and $M_S^H(\cdot)$ are the two limiting evaluation metrics that DMC interpolates between, corresponding respectively to the lowest- and highest-capability regimes of the student. Both are functions of the quality $Q(d)$ and difficulty $D_S(d)$ of the data entry $d$ with respect to student $S$. The interpolation function $f(C_S)\in[0,1]$ controls the relative influence of each term, enabling a smooth, nonlinear transition from $M_S^L$ to $M_S^H$ as the student's capability $C_S$ increases. The overall DMC for a dataset $\mathcal{D}_i$ with respect to model $S$ is computed as the average DMC across all entries $d\in\mathcal{D}_i$.
### 4.2 Optimization Procedure
Our objective is to determine the optimal configuration for Equation [3](https://arxiv.org/html/2605.29229#S4.E3), including the selection of evaluation metrics for $Q$, $D$, and $C$, the functional forms of $M_S^L(\cdot)$ and $M_S^H(\cdot)$, and the interpolation function $f(\cdot)$. To ensure generalizability, we aim to maximize the average correlation between DMC and test-set performance across different student models and initial datasets:
$$\text{DMC}^{\text{opt}}=\underset{Q(\cdot),\,D(\cdot),\,C_S,\,M_S^L(\cdot),\,M_S^H(\cdot),\,f(\cdot)}{\arg\max}\ \mathbb{E}_{S,\mathcal{D}_0}\,\text{Corr}\big([\text{DMC}_S(\mathcal{D}_i)]|_i,\,[P_S(\mathcal{D}_i)]|_i\big) \tag{4}$$
To obtain the optimal configuration for DMC, we enumerate all combinations of candidate evaluation metrics for $Q$, $D$, and $C$, and derive the functional forms of $M_S^L(\cdot)$, $M_S^H(\cdot)$ and $f(C_S)$ via symbolic regression. Concretely, the procedure consists of four steps: (1) enumerate all combinations of candidate metrics for $Q$, $D$, and $C$; (2) for each combination, run symbolic regression to obtain candidate functional forms on the Pareto frontier balancing correlation against complexity; (3) filter the candidates to retain those that are human-interpretable; and (4) apply grid search over the remaining constants and select the configuration with the highest correlation. The detailed pseudocode is given in Algorithm [1](https://arxiv.org/html/2605.29229#alg1), Appendix [C.1](https://arxiv.org/html/2605.29229#A3.SS1). The discovered configuration and its empirical analysis are reported in Section [6.1](https://arxiv.org/html/2605.29229#S6.SS1).
### 4.3 DMC-Based Data Selection
Once DMC is obtained, it can be used as a criterion for selecting training data. We consider two strategies, both evaluated in Section [6.3](https://arxiv.org/html/2605.29229#S6.SS3).
#### Static Selection
Datasets are fixed prior to training and remain unchanged throughout. For each student model $S$, we select the top $k\%$ of data samples from the data pool $U$ according to $\text{DMC}_S$, and use them for reasoning distillation.
#### Dynamic Selection
Datasets are adjusted throughout training according to the evolving model parameters. Before each training epoch, all data in $U$ are re-evaluated with $\text{DMC}_S$, and the top $k\%$ of samples are selected as the training data for that epoch. Since data quality $Q$ is static, the dynamic behavior is driven by the relative difficulty $D_S$ and student capability $C_S$, both of which evolve as the student learns.
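The difference between the two strategies can be sketched as follows. This is an illustrative outline only: `score_fn` and `train_one_epoch` are hypothetical callbacks standing in for the DMC computation and one epoch of finetuning, and the toy score at the bottom is a placeholder rather than the DMC metric.

```python
def select_top_k(pool, scores, k_percent):
    """Keep the top k% of pool entries, ranked by score (highest first)."""
    n_keep = max(1, round(len(pool) * k_percent / 100))
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:n_keep]]


def dynamic_selection(pool, score_fn, train_one_epoch, state, epochs, k_percent):
    """Re-score the whole pool before every epoch, then train on the top k%."""
    history = []
    for _ in range(epochs):
        scores = [score_fn(entry, state) for entry in pool]
        subset = select_top_k(pool, scores, k_percent)
        history.append(subset)
        state = train_one_epoch(state, subset)
    return state, history


# Toy stand-ins (NOT the paper's DMC): each entry is a number, the "student"
# is a single float, and the score prefers entries close to that float.
pool = [1.0, 2.0, 3.0, 4.0]
toy_score = lambda entry, state: -abs(entry - state)
toy_train = lambda state, subset: state + 1.0

# Static selection: score once with the initial student, then never re-select.
static_subset = select_top_k(pool, [toy_score(e, 1.2) for e in pool], 50)

# Dynamic selection: the selected subset shifts as the student state changes.
final_state, history = dynamic_selection(pool, toy_score, toy_train, 1.2, 3, 50)
```

Static selection calls `select_top_k` once before training, whereas dynamic selection repeats the scoring with the current student before every epoch, so the chosen subset can drift as training proceeds.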
## 5 Experimental Setup
#### Data Pool $U$
The data pool $U$ is constructed based on DC-CoT (Zhang et al., [2025b](https://arxiv.org/html/2605.29229#bib.bib1)). DC-CoT is a data-centric benchmark that collects reasoning processes generated with various teacher models and augmentation strategies. The teacher models $T$ include Gemini-1.5-Pro (Team et al., [2024a](https://arxiv.org/html/2605.29229#bib.bib34)) and GPT-4 (Achiam et al., [2023](https://arxiv.org/html/2605.29229#bib.bib35)). The augmentation strategies $Aug$ comprise vanilla CoT, question rephrasing (Yu et al., [2023](https://arxiv.org/html/2605.29229#bib.bib7)), reverse thinking (Chen et al., [2024b](https://arxiv.org/html/2605.29229#bib.bib19)), and answer augmentation (Yu et al., [2023](https://arxiv.org/html/2605.29229#bib.bib7)).
We focus on the augmentation of the following raw datasets $\mathcal{D}_0$ from DC-CoT: Commonsense Reasoning: StrategyQA (SQA) (Geva et al., [2021](https://arxiv.org/html/2605.29229#bib.bib36)), CommonsenseQA (CSQA) (Talmor et al., [2018](https://arxiv.org/html/2605.29229#bib.bib40)), ARC-challenge (ARC) (Clark et al., [2018](https://arxiv.org/html/2605.29229#bib.bib37)); Math Reasoning: GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.29229#bib.bib38)), MATH (Hendrycks et al., [2021](https://arxiv.org/html/2605.29229#bib.bib39)). These datasets differ in question format: SQA consists of yes/no questions, CSQA and ARC consist of multiple-choice questions, and GSM8K and MATH consist of arithmetic problems.
#### Quality Evaluation
To instantiate the data quality $Q$ (Section [3.2](https://arxiv.org/html/2605.29229#S3.SS2)), we follow Xu et al. ([2025](https://arxiv.org/html/2605.29229#bib.bib3), [2024](https://arxiv.org/html/2605.29229#bib.bib27)) and consider two state-of-the-art reward models as candidates: Skywork-Reward-V2-Llama-3.1-8B (SRL) (Liu et al., [2025](https://arxiv.org/html/2605.29229#bib.bib30)) and Skywork-Reward-Gemma-2-27B (SRG) (Liu et al., [2024a](https://arxiv.org/html/2605.29229#bib.bib31)).
#### Method for Sampling $\mathcal{D}_i$
From each data pool $U$, we sampled 17 representative datasets $\mathcal{D}_{0:16}$. First, the non-CoT datasets $\mathcal{D}_0=\{(q,\text{None},a)\}$ are included as baselines. The sampling method for datasets $\mathcal{D}_{1:8}$ follows previous work, where each dataset corresponds to a combination of a teacher model and a data-augmentation method. $\mathcal{D}_9$ is equivalent to the full data pool $U$, while $\mathcal{D}_{10}$ is a random sample from the full pool, with a sample size equal to that of $\mathcal{D}_0$. Then, we sorted the data in $U$ by quality score and partitioned them into three equally sized pools: high-quality, medium-quality, and low-quality. We independently sampled two subsets of the same size as $\mathcal{D}_0$ from each of the three pools, thereby constructing $\mathcal{D}_{11:16}$. These sampled datasets cover a range of teacher models, augmentation strategies, as well as diversity and quality levels, which ensures that they are representative.
#### Student Model $S$
For a comprehensive comparison, we finetune student models from four model families with different parameter sizes: Gemma-2B (Team et al., [2024b](https://arxiv.org/html/2605.29229#bib.bib41)); Mistral-7B (Jiang et al., [2023](https://arxiv.org/html/2605.29229#bib.bib42)); Qwen: Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B (Qwen et al., [2025](https://arxiv.org/html/2605.29229#bib.bib43)); Llama3: Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2605.29229#bib.bib44)).
#### Placement Test
To estimate the student capability $C$ (Section [3.2](https://arxiv.org/html/2605.29229#S3.SS2)), the placement set $\mathcal{D}_p$ consists of 100 samples drawn from the top 10% highest-quality data, which are held out from training. This small, high-quality probe keeps the estimate efficient and stable while preserving the diversity of the training pool.
#### Finetuning and Test Set Performance $P$
The finetuning is conducted on four A100 GPUs, with LoRA finetuning for the student models. The LoRA rank is set to 8 and the LoRA alpha to 32. The total number of training steps for all datasets is set to 30k, with a batch size of 32 and a learning rate of $1\times 10^{-5}$. The prompt for finetuning is shown in Table [5](https://arxiv.org/html/2605.29229#A1.T5), Appendix [A](https://arxiv.org/html/2605.29229#A1). After finetuning, we evaluate the finetuned model on the corresponding test set and adopt accuracy as the criterion for evaluating the performance.
#### Correlation Calculation
We follow previous work Xu et al. ([2025](https://arxiv.org/html/2605.29229#bib.bib3)) and use Spearman's rank correlation coefficient $\rho$ Zar ([2005](https://arxiv.org/html/2605.29229#bib.bib48)) to evaluate the consistency in relative ranking between a given metric and the test-set performance, which can be calculated as:
$$\rho=1-\frac{6\sum_{i}d_i^{2}}{n(n^{2}-1)} \tag{5}$$
where $d_i$ is the difference between the ranks of $M_S(\mathcal{D}_i)$ and $P_S(\mathcal{D}_i)$, and $n$ is the number of datasets being ranked. The value of $\rho$ ranges from -1 to 1, where 1 indicates a perfect positive correlation between the two variables. Our goal is to find the $M$ that maximizes $\rho\big([M_S(\mathcal{D}_i)]|_i,[P_S(\mathcal{D}_i)]|_i\big)$ across different raw datasets and student models.
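Equation (5) can be checked on a small example. The sketch below implements the formula directly and therefore assumes there are no tied values; the metric and accuracy numbers are hypothetical and do not come from the experiments.

```python
def ranks(values):
    """1-based ranks of the values in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(metric, performance):
    """Eq. (5): 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(metric)
    if n != len(performance) or n < 2:
        raise ValueError("need two equally long sequences with n >= 2")
    d_squared = sum(
        (a - b) ** 2 for a, b in zip(ranks(metric), ranks(performance))
    )
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical metric values and test accuracies for four sampled datasets.
metric_values = [0.2, 0.5, 0.1, 0.9]
accuracies = [60.0, 70.0, 65.0, 80.0]
rho = spearman_rho(metric_values, accuracies)
```

The metric ranks are [2, 3, 1, 4] and the accuracy ranks are [1, 3, 2, 4], so the squared rank differences sum to 2 and $\rho = 1 - 12/60 = 0.8$. When ties are possible, a library routine such as `scipy.stats.spearmanr` is likely the safer choice.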
#### Baselines
To evaluate the effectiveness of DMC, we use data quality and relative difficulty as baselines for comparison\. Moreover, we employ compatibility\-adjusted reward \(CAR\)\(Xuet al\.,[2025](https://arxiv.org/html/2605.29229#bib.bib3)\), a multi‑metric integration metric, as an additional baseline\.
## 6 Results and Analysis
### 6.1 Discovered DMC Configuration
Running the optimization procedure of Section [4.2](https://arxiv.org/html/2605.29229#S4.SS2) over all candidate metrics and functional forms yields the configuration shown in Table [1](https://arxiv.org/html/2605.29229#S6.T1). For brevity, subsequent mentions of DMC denote this optimal configuration. The visualizations of $M_S^L(\cdot)$, $M_S^H(\cdot)$ and $f(C_S)$ are provided in Figures [4](https://arxiv.org/html/2605.29229#A3.F4) and [5](https://arxiv.org/html/2605.29229#A3.F5), Appendix [C.2](https://arxiv.org/html/2605.29229#A3.SS2).

| Component | Symbol | Selection / form / value |
| --- | --- | --- |
| Metric | $Q$ | SRL |
| Metric | $D$ | $\log(\text{IFD})$ |
| Metric | $C$ | $C_S^{rel}$ |
| Function | $M_S^L(\cdot)$ | $\mathbb{I}(Q>Q_{th})\cdot\mu_1\cdot D+\mu_2\cdot\sqrt{Q})$ |
| Function | $M_S^H(\cdot)$ | $\mu_3\cdot Q\cdot e^{-\mu_4(D-D_{base})}$ |
| Function | $f(C_S)$ | $C_S^2/(C_S^2+(1-C_S)^2)$ |
| Param. | $Q_{th}$ | 1.1 |
| Param. | $D_{base}$ | 0.056 |
| Param. | $\mu_1$ | 2.1 |
| Param. | $\mu_2$ | 5.0 |
| Param. | $\mu_3$ | 5.0 |
| Param. | $\mu_4$ | 0.10 |

Table 1: Optimal configuration for DMC. SRL is an abbreviation for Skywork-Reward-V2-Llama-3.1-8B.
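To make the configuration concrete, the following sketch evaluates Equation (3) with the constants of Table 1. One point is an assumption rather than something the table settles: the expression for $M_S^L$ has an unmatched parenthesis, and the code reads it as the indicator gating the whole sum, i.e. $\mathbb{I}(Q>Q_{th})\cdot(\mu_1 D+\mu_2\sqrt{Q})$. The function names are hypothetical; `q` is the SRL score, `d` is $\log(\text{IFD})$, and `c` is $C_S^{rel}$.

```python
import math

# Constants from Table 1.
Q_TH, D_BASE = 1.1, 0.056
MU1, MU2, MU3, MU4 = 2.1, 5.0, 5.0, 0.10


def interp_weight(c):
    """f(C_S) = C^2 / (C^2 + (1 - C)^2); the denominator is at least 0.5."""
    return c ** 2 / (c ** 2 + (1 - c) ** 2)


def m_low(q, d):
    """Low-capability metric. ASSUMPTION: the indicator gates the whole sum."""
    if q <= Q_TH:
        return 0.0
    return MU1 * d + MU2 * math.sqrt(q)


def m_high(q, d):
    """High-capability metric: quality discounted as difficulty grows."""
    return MU3 * q * math.exp(-MU4 * (d - D_BASE))


def dmc(q, d, c):
    """Eq. (3): capability-modulated blend of the two limiting metrics."""
    w = interp_weight(c)
    return (1 - w) * m_low(q, d) + w * m_high(q, d)


# Hypothetical entry with quality 4.0 and difficulty equal to D_base.
weakest = dmc(4.0, D_BASE, 0.0)    # reduces to m_low
strongest = dmc(4.0, D_BASE, 1.0)  # reduces to m_high
midpoint = dmc(4.0, D_BASE, 0.5)   # equal blend, since f(0.5) = 0.5
```

Because $f(0)=0$ and $f(1)=1$, the score reduces to $M_S^L$ for the weakest student and to $M_S^H$ for the strongest; the dataset-level DMC is the mean of `dmc` over its entries.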
### 6.2 Correlation with Distillation Performance
#### Effectiveness of DMC
Table [2](https://arxiv.org/html/2605.29229#S6.T2) presents the average correlation between $[M_S(\mathcal{D}_i)]|_i$ and $[P_S(\mathcal{D}_i)]|_i$, while the complete data are provided in Table [7](https://arxiv.org/html/2605.29229#A3.T7), Appendix [C.3](https://arxiv.org/html/2605.29229#A3.SS3). Our proposed DMC exhibits a high correlation with $P_S(\mathcal{D})$ across various student models $S$ and multiple raw datasets $\mathcal{D}_0$, attaining the highest average correlation of 0.612 and ranking first on every one of the eight student models. This clearly surpasses the strongest single-metric baseline, data quality (SRL, 0.555), and the multi-metric baseline CAR (0.465), demonstrating that DMC is an effective metric for assessing whether a dataset $\mathcal{D}_i$ is suitable for reasoning distillation with a given student model $S$.
Data quality also serves as a strong metric, as many previous works have pointed out. Higher-quality data naturally enable the student model to perform reasoning of higher quality, thereby increasing the likelihood of producing correct answers. In contrast, relative difficulty is not a consistently effective metric. We find that its effectiveness depends on the student capability (see Table [6](https://arxiv.org/html/2605.29229#A2.T6) for reference). For student models with relatively low capability, such as Gemma-2B and Mistral-7B, both CPPL and IFD can serve as effective evaluation metrics. However, once the student capability increases, the relationship between performance and difficulty, particularly IFD, ceases to be purely positively correlated, as shown in Figure [6](https://arxiv.org/html/2605.29229#A3.F6), Appendix [C.4](https://arxiv.org/html/2605.29229#A3.SS4).

| Category | Metric $M$ | G-2b | M-7b | Q-1.5b | Q-3B | Q-7B | L-1b | L-3b | L-8b | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Quality | SRL | 0.452 | 0.440 | 0.650 | 0.656 | 0.763 | 0.481 | 0.662 | 0.339 | 0.555 |
| Quality | SRG | 0.232 | 0.309 | 0.538 | 0.612 | 0.568 | 0.538 | 0.474 | 0.330 | 0.450 |
| Difficulty | PPL | -0.513 | -0.463 | -0.677 | -0.512 | -0.445 | -0.469 | -0.722 | -0.590 | -0.549 |
| Difficulty | CPPL | 0.499 | 0.433 | 0.397 | 0.202 | 0.135 | 0.275 | 0.439 | 0.385 | 0.346 |
| Difficulty | IFD | 0.475 | 0.444 | 0.098 | -0.050 | -0.024 | 0.050 | -0.014 | -0.116 | 0.108 |
| Multi-Metric | CAR | 0.266 | 0.408 | 0.511 | 0.513 | 0.582 | 0.411 | 0.651 | 0.380 | 0.465 |
| Multi-Metric | DMC | **0.539** | **0.539** | **0.702** | **0.669** | **0.777** | **0.546** | **0.735** | **0.411** | 0.612 |

Table 2: Average correlation between $P_S(\mathcal{D}_i)$ and $M_S(\mathcal{D}_i)$ across different raw datasets, with one column per student model $S$. The highest correlation for each model is highlighted in bold, and the second-highest is highlighted with underline.
#### Interpretation of DMC Configuration
Beyond predicting performance, these results also justify*why*we model DMC with the three features and the interpolation form of Equation[3](https://arxiv.org/html/2605.29229#S4.E3)\. Figure[3](https://arxiv.org/html/2605.29229#A3.F3)\(Appendix[C\.2](https://arxiv.org/html/2605.29229#A3.SS2)\) shows, for representative settings, that as the student capabilityCCincreases, both the best\-performing subsets𝒟i\\mathcal\{D\}\_\{i\}and the high\-DMC region shift from high\-difficulty toward moderate\-difficulty data\. This shift supports our design in two ways\. \(i\)*Three features\.*Because the difficulty that benefits a student depends on*which*student, a metric defined on the data alone \(quality and difficulty\) cannot express this dependence, which makes the student capabilityCCindispensable alongsideQQandDD; meanwhile, the consistently high correlation of quality across models \(Table[2](https://arxiv.org/html/2605.29229#S6.T2)\) confirms thatQQshould remain a core term\. \(ii\)*Interpolation form\.*Because the preferred difficulty changes continuously withCCrather than being fixed, a single static scoring function is inadequate; this is exactly what the capability\-modulated interpolation of Equation[3](https://arxiv.org/html/2605.29229#S4.E3)captures, blending a low\-capability metricMSLM\_\{S\}^\{L\}\(favoring high\-quality, high\-difficulty data\) and a high\-capability metricMSHM\_\{S\}^\{H\}\(favoring moderate\-difficulty data\) throughf\(CS\)f\(C\_\{S\}\)\.
Concretely, low-capability models tend to benefit from high-quality and high-difficulty data, while high-capability models are better suited to data of moderate difficulty. This runs opposite to the curriculum-learning intuition that data difficulty should increase with capability (Bengio et al., [2009](https://arxiv.org/html/2605.29229#bib.bib50)). We attribute the reversal to the nature of reasoning distillation: since the student already possesses basic language understanding at initialization, low-capability models mainly lack the ability to *generate* coherent reasoning chains, so challenging examples quickly activate this ability; high-capability models, in contrast, have stable reasoning abilities that overly difficult data may destabilize, making medium-difficulty data the better trade-off between challenge and stability.
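The qualitative behavior described above can be illustrated with a toy capability-weighted blend. This is only a sketch under stated assumptions: the fitted form of Equation 3 is not reproduced in this section, so the logistic weight `blend_weight`, the two component scores, and every constant below are hypothetical stand-ins rather than the paper's formula.

```python
import math


def blend_weight(capability: float, midpoint: float = 0.5, sharpness: float = 10.0) -> float:
    """Hypothetical f(C_S): logistic weight in (0, 1) that grows with capability."""
    return 1.0 / (1.0 + math.exp(-sharpness * (capability - midpoint)))


def low_capability_score(quality: float, difficulty: float) -> float:
    """Hypothetical M_S^L: rewards high quality and high difficulty."""
    return quality * difficulty


def high_capability_score(quality: float, difficulty: float, preferred: float = 0.5) -> float:
    """Hypothetical M_S^H: rewards high quality and difficulty close to `preferred`."""
    return quality * (1.0 - abs(difficulty - preferred))


def interpolated_score(quality: float, difficulty: float, capability: float) -> float:
    """Convex blend of the two component scores, weighted by student capability."""
    w = blend_weight(capability)
    return (1.0 - w) * low_capability_score(quality, difficulty) + w * high_capability_score(
        quality, difficulty
    )


# Compare a hard sample (difficulty 0.9) with a moderate one (difficulty 0.5)
# for a weak student (capability 0.1) and a strong student (capability 0.9).
for capability in (0.1, 0.9):
    hard = interpolated_score(quality=1.0, difficulty=0.9, capability=capability)
    moderate = interpolated_score(quality=1.0, difficulty=0.5, capability=capability)
    print(f"capability={capability}: hard={hard:.3f}, moderate={moderate:.3f}")
```

With these toy definitions, the weak student ranks the hard sample above the moderate one, while the strong student ranks them the other way round, which is the direction of the shift reported in the text.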
The strong correlation observed between $[\text{DMC}_S(\mathcal{D}_i)]|_i$ and $[P_S(\mathcal{D}_i)]|_i$ confirms that DMC effectively predicts the suitability of a dataset for reasoning distillation, thereby answering RQ1. We next use DMC as a selection criterion to verify its effectiveness from the perspective of downstream distillation performance, addressing RQ2.
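A correlation of this kind can be computed as in the sketch below. It assumes the Spearman rank correlation (Zar, 2005, appears in the reference list) and uses made-up numbers: `metric_scores` and `performance` are hypothetical placeholders for $M_S(\mathcal{D}_i)$ and $P_S(\mathcal{D}_i)$ over four subsets, not values from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): one metric score and one accuracy per subset.
metric_scores = [0.2, 0.5, 0.9, 0.4]
performance = [31.0, 48.5, 66.4, 52.0]
print(f"Spearman correlation: {spearman(metric_scores, performance):.3f}")
```

Rank correlation only asks whether the metric orders the subsets the same way as downstream performance does, which is the property a selection criterion needs.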
### 6.3 Data Selection Results
We now examine to what extent employing DMC as a data-selection criterion improves reasoning distillation, under both the static and dynamic strategies defined in Section [4.3](https://arxiv.org/html/2605.29229#S4.SS3). For the dynamic setting we additionally compare against the dynamic curriculum-learning method ADCL (Zhang et al., [2025a](https://arxiv.org/html/2605.29229#bib.bib49)); all other training settings follow Section [5](https://arxiv.org/html/2605.29229#S5). In the main text, we use Qwen2.5-3B as a representative student and report its performance after reasoning distillation on the different datasets in Table [3](https://arxiv.org/html/2605.29229#S6.T3). Results for the remaining models are provided in Appendix [D](https://arxiv.org/html/2605.29229#A4).
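| Training Dataset | Selection Metric (Top $k\%$) | SQA | CSQA | ARC | MATH | GSM8K |
| --- | --- | --- | --- | --- | --- | --- |
| None | - | 53.9 | 41.0 | 63.2 | 10.2 | 16.4 |
| Static | Full (100%) | 59.8 | 68.2 | 75.8 | 11.4 | 71.0 |
| Static | Random (12.5%) | 57.2 | 70.8 | 72.2 | 10.6 | 64.8 |
| Static | Quality (12.5%) | 66.8 | 71.8 | 76.0 | 56.8 | 73.4 |
| Static | Difficulty (12.5%) | 52.8 | 69.4 | 64.0 | 21.2 | 50.6 |
| Static | CAR (12.5%) | 56.8 | 70.6 | 75.8 | 49.4 | 73.8 |
| Static | DMC (12.5%) | 67.0 | 72.1 | 77.4 | 66.4 | 75.6 |
| Static | DMC (10%) | 69.0 | 71.6 | 76.4 | 64.8 | 77.4 |
| Static | DMC (7.5%) | 67.5 | 71.4 | 77.0 | 58.6 | 73.8 |
| Dynamic | ADCL | 62.9 | 69.6 | 65.4 | 37.0 | 57.8 |
| Dynamic | Difficulty (12.5%) | 48.5 | 68.2 | 65.8 | 34.8 | 50.6 |
| Dynamic | CAR (12.5%) | 57.2 | 70.6 | 75.8 | 55.6 | 76.6 |
| Dynamic | DMC (12.5%) | 68.5 | 72.6 | 79.6 | 69.0 | 79.2 |
| Dynamic | DMC (10%) | 69.1 | 73.4 | 78.2 | 66.0 | 78.2 |
| Dynamic | DMC (7.5%) | 68.2 | 73.8 | 77.4 | 67.4 | 78.8 |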
Table 3: Performance of Qwen2.5-3B. For the static and dynamic settings, the dataset selection metrics achieving the best and second-best performance are indicated in bold and underline, respectively.

As shown, in reasoning distillation, using the full dataset does not yield the best performance, suggesting that the usual data scaling law no longer applies in this task; on MATH, for instance, the full dataset reaches only 11.4, barely above the 10.2 of the non-distilled model. The reason is that at this stage, the student model already possesses a certain level of capability in understanding and processing natural language, making the training process more sensitive to data-model compatibility. Consequently, effective data selection becomes crucial.
From the perspective of the metrics, DMC serves as a superior metric for data selection compared with others. This advantage arises because DMC not only takes into account the inherent quality of the data but also considers the difficulty of the data as perceived by the student model and the model’s capability, effectively embodying a “personalized learning” principle. This improvement is particularly evident in tasks such as GSM8K and MATH, where reasoning ability plays a more critical role: on MATH, dynamic DMC (top 12.5%) attains 69.0, far surpassing the full dataset (11.4), CAR (55.6), and ADCL (37.0). Consistent with Section [6.2](https://arxiv.org/html/2605.29229#S6.SS2), higher difficulty is not always better: especially for high-capability students, stabilizing already-acquired reasoning skills matters more than exploring harder data.
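For a single-number view of these comparisons, selected rows of Table 3 can be averaged over the five tasks. The unweighted mean is not a quantity reported in the paper; it is computed here only as a convenience, with the per-task values copied from Table 3.

```python
from statistics import mean
from typing import Dict, List

# Accuracy on SQA, CSQA, ARC, MATH, GSM8K for Qwen2.5-3B (copied from Table 3).
TABLE3_ROWS: Dict[str, List[float]] = {
    "None": [53.9, 41.0, 63.2, 10.2, 16.4],
    "Static / Full (100%)": [59.8, 68.2, 75.8, 11.4, 71.0],
    "Dynamic / ADCL": [62.9, 69.6, 65.4, 37.0, 57.8],
    "Dynamic / CAR (12.5%)": [57.2, 70.6, 75.8, 55.6, 76.6],
    "Static / DMC (12.5%)": [67.0, 72.1, 77.4, 66.4, 75.6],
    "Dynamic / DMC (12.5%)": [68.5, 72.6, 79.6, 69.0, 79.2],
}


def task_average(rows: Dict[str, List[float]]) -> Dict[str, float]:
    """Unweighted mean accuracy over the five tasks for each row."""
    return {name: mean(scores) for name, scores in rows.items()}


averages = task_average(TABLE3_ROWS)
# Print rows from lowest to highest average accuracy.
for name, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{name:24s} {avg:6.2f}")
```

Under this summary, dynamic DMC (12.5%) averages about 73.8 points against about 57.2 for the full dataset, with most of the gap coming from MATH.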
With dynamic data selection, employing DMC as the evaluation criterion yields additional improvement in model performance. Re-evaluating the data before each epoch allows the dataset to remain aligned with the student model’s current parameter state, thereby continuously enhancing performance. Furthermore, under the same $k$ value, such re-evaluation gives the model access to new data, increasing data diversity during training and further improving generalization ability. Other metrics, however, are comparatively simplistic, and therefore show limited benefit even when used with dynamic dataset strategies. The selection ratio $k$ is itself a trade-off: a larger $k$ adapts less promptly to DMC, whereas a smaller $k$ requires more frequent re-evaluation and yields a less stable selection.
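The re-evaluate-then-train loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `score_fn` stands in for a per-sample DMC evaluation, `train_fn` for one epoch of fine-tuning, and the toy difficulty values, preferred-difficulty rule, and capability increment are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def top_k_percent(scores: Dict[str, float], k_percent: float) -> List[str]:
    """Return the ids of the top k% samples by score (always at least one)."""
    n_keep = max(1, math.ceil(len(scores) * k_percent / 100.0))
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]


def dynamic_selection(
    sample_ids: Sequence[str],
    score_fn: Callable[[float, str], float],
    train_fn: Callable[[float, List[str]], float],
    state: float,
    k_percent: float,
    num_epochs: int,
) -> Tuple[float, List[List[str]]]:
    """Re-score every sample before each epoch, then train on the current top k%."""
    history: List[List[str]] = []
    for _ in range(num_epochs):
        scores = {sid: score_fn(state, sid) for sid in sample_ids}
        selected = top_k_percent(scores, k_percent)
        state = train_fn(state, selected)
        history.append(selected)
    return state, history


# Toy setup (hypothetical): 16 samples with difficulty evenly spread over [0, 1].
DIFFICULTY = {f"s{i}": i / 15 for i in range(16)}


def toy_score(capability: float, sample_id: str) -> float:
    # Preferred difficulty moves from 1.0 (hard) toward 0.5 (moderate) as capability grows.
    preferred = 1.0 - 0.5 * capability
    return -abs(DIFFICULTY[sample_id] - preferred)


def toy_train(capability: float, selected: List[str]) -> float:
    # Stand-in for one training epoch: capability rises by a fixed step.
    return min(1.0, capability + 0.25)


final_capability, history = dynamic_selection(
    list(DIFFICULTY), toy_score, toy_train, state=0.0, k_percent=12.5, num_epochs=3
)
for epoch, selected in enumerate(history):
    print(f"epoch {epoch}: {selected}")
```

Static selection corresponds to calling `top_k_percent` once before training; the dynamic variant differs only in recomputing the scores with the current student state at every epoch.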
In summary, dynamic DMC\-based selection effectively enhances reasoning distillation; we next analyze how the model and the selected data evolve during training\.
### 6.4 Analysis
Under the dynamic setting, student capability, the selected dataset, and relative difficulty all evolve together during training; we use MATH as $\mathcal{D}_0$ with Qwen2.5-3B (high-capacity) and Gemma-2B (low-capacity) as representatives. Two dynamics are intuitive and deferred to Appendix [E](https://arxiv.org/html/2605.29229#A5): the student capability rises steadily as training proceeds, and the relative difficulty of individual samples shifts over time so that DMC keeps a sample only while it lies within the effective compatibility range. We highlight here how the selected data is distributed.
#### Distribution of Selected Data
We rank all data by descending quality and observe the frequency distribution of data selection over the course of training, as shown in Figure [2](https://arxiv.org/html/2605.29229#S6.F2). The highest-quality 5% of data are nearly always included, acting as compulsory “core courses” that establish the foundation of the model’s primary reasoning ability and must be consistently retained. Data ranked between the top 5% and 20% are dynamically adjusted with the model’s training progress; this portion acts as “elective courses” that flexibly tune reasoning generalization and prevent overfitting on the core data, and its range can be controlled via the epoch size to trade off generalization against reasoning quality. Data beyond this range, being of relatively low quality, are seldom selected. This pattern mirrors human curriculum design and underlies the consistent gains of dynamic DMC selection.

Figure 2: Distribution of data selection frequency, with student model chosen as Gemma-2b, raw dataset $\mathcal{D}_0$ chosen as MATH.
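A frequency analysis of this kind can be reproduced from per-epoch selection logs roughly as follows. The sketch assumes the selections are available as lists of sample ids and that samples are already sorted by descending quality; the ids, the four-epoch history, and the helper names are hypothetical, while the 5% and 20% band boundaries follow the text.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def selection_frequency(history: Sequence[Sequence[str]], all_ids: Sequence[str]) -> Dict[str, float]:
    """Fraction of epochs in which each sample was selected."""
    counts = Counter(sid for epoch in history for sid in set(epoch))
    return {sid: counts[sid] / len(history) for sid in all_ids}


def band_means(
    freq: Dict[str, float],
    ids_by_quality: Sequence[str],
    bands: Sequence[Tuple[float, float]] = ((0.0, 0.05), (0.05, 0.20), (0.20, 1.0)),
) -> List[float]:
    """Mean selection frequency inside each quality-rank band (0.0 for an empty band)."""
    n = len(ids_by_quality)
    means = []
    for low, high in bands:
        chunk = ids_by_quality[round(low * n):round(high * n)]
        means.append(sum(freq[sid] for sid in chunk) / len(chunk) if chunk else 0.0)
    return means


# Toy log (hypothetical): 20 samples ranked q0 (best) to q19, four epochs of selections.
ids_by_quality = [f"q{i}" for i in range(20)]
history = [["q0", "q1"], ["q0", "q2"], ["q0", "q3"], ["q0", "q1"]]

freq = selection_frequency(history, ids_by_quality)
core, elective, rest = band_means(freq, ids_by_quality)
print(f"top 5%: {core:.2f}, 5-20%: {elective:.2f}, remainder: {rest:.2f}")
```

In this toy log the top-5% band is selected in every epoch, the 5% to 20% band is selected intermittently, and the remainder is never selected, mirroring the core and elective pattern described above.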
#### Out\-of\-Distribution Generalization
Finally, we test whether DMC captures a transferable notion of learnability rather than a heuristic specific to math and commonsense reasoning, by applying our selection method to two out-of-distribution (OOD) tasks that differ substantially from the data pool: ANLI (natural language inference; Nie et al., [2020](https://arxiv.org/html/2605.29229#bib.bib51)) and Date Understanding (temporal reasoning; Srivastava et al., [2023](https://arxiv.org/html/2605.29229#bib.bib52)), again with Gemma-2B as the representative student. As shown in Table [4](https://arxiv.org/html/2605.29229#S6.T4), dynamic DMC-based selection (top 12.5%) outperforms both the full-dataset baseline and the dynamic baselines (ADCL, Difficulty, CAR) on both OOD tasks, indicating that the data-model compatibility signal generalizes beyond the domains used to derive it.

| Selection Metric (Top $k\%$) | ANLI | Date |
| --- | --- | --- |
| Full (100%) | 35.42 | 61.80 |
| ADCL | 43.92 | 60.37 |
| Difficulty (12.5%) | 42.47 | 67.02 |
| CAR (12.5%) | 40.29 | 68.41 |
| DMC (12.5%) | 49.75 | 70.41 |

Table 4: Out-of-distribution accuracy (%) on ANLI and Date Understanding with Gemma-2B as the student. Selection methods are under the dynamic setting (top 12.5%) and compared against the full-dataset baseline; the best result per task is in bold.
## 7 Conclusion
In this paper, we reveal that effective reasoning distillation depends on a nuanced interplay between data quality, relative difficulty, and student capability. By conceptualizing and formulating Data-Model Compatibility (DMC), our work provides a quantitative and interpretable means to evaluate dataset suitability for specific student models. Empirical results confirm that DMC serves as a stronger predictor of downstream reasoning performance than conventional measures. The functional form of DMC indicates that, in reasoning distillation, higher data quality is naturally preferable; however, the level of data difficulty that best fits the model varies with the model’s capability. In addition, dynamic data selection based on DMC further improves performance. The dynamic data selection scheme is analogous to human curriculum design: a set of core courses represents essential knowledge that the model must continually learn and retain, while elective courses are dynamically adjusted to further refine the model’s reasoning ability. Furthermore, the gains transfer to out-of-distribution tasks, suggesting that DMC captures a generalizable signal of learnability rather than a domain-specific heuristic.
## Limitations
Although our proposed DMC demonstrates promising results in data selection for reasoning distillation, several limitations remain to be addressed in future work:
Computational efficiency. The dynamic data selection based on DMC requires an additional re-evaluation stage after each training epoch, resulting in an approximately 53% increase in training time, which may limit scalability when applied to massive datasets or large student models. However, this overhead is largely optimizable: since low-quality data are rarely selected (Figure [2](https://arxiv.org/html/2605.29229#S6.F2), Section [6.4](https://arxiv.org/html/2605.29229#S6.SS4)), the re-evaluation can be restricted to the top 20% highest-quality candidates, reducing the additional training-time overhead from $\sim$53% to $\sim$10.6% (see Appendix [F](https://arxiv.org/html/2605.29229#A6)). Given that distillation targets inference efficiency, this marginal one-time cost is an acceptable trade-off for a stronger student model.
Scope of focus. Our current work primarily concentrates on selecting compatible data instances from existing reasoning datasets rather than generating new ones. In future work, we plan to extend DMC as a guiding metric to assist teacher models in actively generating reasoning data that better aligns with the student model’s capabilities, thereby unifying data generation and selection under a single adaptive framework.
Domain generalization. Our primary experiments focus on the reasoning distillation domain. Beyond the in-domain benchmarks, we additionally verify cross-domain transfer on out-of-distribution tasks (Section [6.4](https://arxiv.org/html/2605.29229#S6.SS4.SSS0.Px2)); extending and validating DMC for other synthetic-data-centric settings, such as instruction tuning, dialogue generation, or data-centric LLM optimization, remains a promising direction.
Theoretical grounding. Rather than assuming a fixed heuristic (e.g., “harder is better”), DMC is *discovered* from data via symbolic regression, and a Pareto-frontier selection mitigates the risk of arbitrary or overfitted forms. Nonetheless, the resulting formula is empirically derived rather than analytically proven. We view its interpretable structure and consistent behavior across models and tasks as empirical evidence and a foundation for future theoretical investigations into capability-conditioned data evaluation.
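The overhead figures quoted under computational efficiency are consistent with a simple proportionality argument. Assuming (this is an assumption for illustration, not a derivation given in the text) that the re-evaluation cost scales linearly with the number of candidates scored and that the training cost itself is unchanged, restricting re-evaluation to a fraction $r = 0.20$ of the pool gives

$$
\text{overhead}_{\text{restricted}} \approx r \times \text{overhead}_{\text{full}} = 0.20 \times 53\% = 10.6\%.
$$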
## References
- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p1.3).
- Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: [§6.2](https://arxiv.org/html/2605.29229#S6.SS2.SSS0.Px2.p2.1).
- J. Chen, R. Qadri, Y. Wen, N. Jain, J. Kirchenbauer, T. Zhou, and T. Goldstein (2024a) Genqa: generating millions of instructions from a handful of prompts. arXiv preprint arXiv:2406.10323. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1).
- J. C. Chen, Z. Wang, H. Palangi, R. Han, S. Ebrahimi, L. Le, V. Perot, S. Mishra, M. Bansal, C. Lee, et al. (2024b) Reverse thinking makes llms stronger reasoners. arXiv preprint arXiv:2411.19865. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p1.3).
- X. Chen, Z. Sun, W. Guo, M. Zhang, Y. Chen, Y. Sun, H. Su, Y. Pan, D. Klakow, W. Li, et al. (2025) Unveiling the key factors for distilling chain-of-thought reasoning. arXiv preprint arXiv:2502.18001. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1).
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p2.1).
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p2.1).
- M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant (2021) Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics 9, pp. 346–361. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p2.1).
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. External Links: 2407.21783, [Link](https://arxiv.org/abs/2407.21783). Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px4.p1.1).
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: [§1](https://arxiv.org/html/2605.29229#S1.p1.1).
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p2.1).
- C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister (2023) Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8003–8017. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1).
- A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) Openai o1 system card. arXiv preprint arXiv:2412.16720. Cited by: [§1](https://arxiv.org/html/2605.29229#S1.p1.1).
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2023) Mistral 7b. External Links: 2310.06825, [Link](https://arxiv.org/abs/2310.06825). Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px4.p1.1).
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. Advances in neural information processing systems 35, pp. 22199–22213. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1).
- H. Lee, J. Kim, and S. Lee (2024) Mentor-kd: making small language models better multi-step reasoners. arXiv preprint arXiv:2410.09037. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1).
- A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. (2022) Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems 35, pp. 3843–3857. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1).
- J. Li, X. Dong, Y. Liu, Z. Yang, Q. Wang, X. Wang, S. Zhu, Z. Jia, and Z. Zheng (2025a) Reflectevo: improving meta introspection of small llms by learning self-reflection. arXiv preprint arXiv:2505.16475. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1).
- M. Li, Y. Zhang, S. He, Z. Li, H. Zhao, J. Wang, N. Cheng, and T. Zhou (2024a) Superfiltering: weak-to-strong data filtering for fast instruction-tuning. arXiv preprint arXiv:2402.00530. Cited by: [§1](https://arxiv.org/html/2605.29229#S1.p8.1), [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1), [§3.2](https://arxiv.org/html/2605.29229#S3.SS2.SSS0.Px2.p1.2).
- M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen, N. Cheng, J. Wang, T. Zhou, and J. Xiao (2024b) From quantity to quality: boosting LLM performance with self-guided data selection for instruction tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard (Eds.), Mexico City, Mexico, pp. 7602–7635. External Links: [Link](https://aclanthology.org/2024.naacl-long.421/), [Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.421). Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1).
- Y. Li, J. Xu, T. Liang, X. Chen, Z. He, Q. Liu, R. Wang, Z. Zhang, Z. Tu, H. Mi, et al. (2025b) Dancing with critiques: enhancing llm reasoning with stepwise natural language self-critique. arXiv preprint arXiv:2503.17363. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1).
- Y. Li, L. Yang, W. Shen, P. Zhou, Y. Wan, W. Lin, and D. Chen (2025c) CrowdSelect: synthetic instruction data selection with multi-llm wisdom. arXiv preprint arXiv:2503.01836. Cited by: [§1](https://arxiv.org/html/2605.29229#S1.p2.1).
- C. Y. Liu, L. Zeng, J. Liu, R. Yan, J. He, C. Wang, S. Yan, Y. Liu, and Y. Zhou (2024a) Skywork-reward: bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px2.p1.1).
- C. Y. Liu, L. Zeng, Y. Xiao, J. He, J. Liu, C. Wang, R. Yan, W. Shen, F. Zhang, J. Xu, Y. Liu, and Y. Zhou (2025) Skywork-reward-v2: scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px2.p1.1).
- L. Liu, X. Liu, D. F. Wong, D. Li, Z. Wang, B. Hu, and M. Zhang (2024b) Selectit: selective instruction tuning for large language models via uncertainty-aware self-reflection. arXiv preprint arXiv:2402.16705. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1).
- A. Mitra, L. Del Corro, S. Mahajan, A. Codas, C. Simoes, S. Agarwal, X. Chen, A. Razdaibiedina, E. Jones, K. Aggarwal, et al. (2023) Orca 2: teaching small language models how to reason. arXiv preprint arXiv:2311.11045. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1).
- S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah (2023) Orca: progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1).
- Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela (2020) Adversarial nli: a new benchmark for natural language understanding. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 4885–4901. Cited by: [§6.4](https://arxiv.org/html/2605.29229#S6.SS4.SSS0.Px2.p1.1).
- Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025) Qwen2.5 technical report. External Links: 2412.15115, [Link](https://arxiv.org/abs/2412.15115). Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px4.p1.1).
- A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. (2023) Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Transactions on machine learning research. Cited by: [§6.4](https://arxiv.org/html/2605.29229#S6.SS4.SSS0.Px2.p1.1).
- A. Talmor, J. Herzig, N. Lourie, and J. Berant (2018) Commonsenseqa: a question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p2.1).
- G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. (2024a) Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p1.3).
- G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. (2024b) Gemma: open models based on gemini research and technology. arXiv preprint arXiv:2403.08295. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px4.p1.1).
- Q. Team (2025) QwQ-32b: embracing the power of reinforcement learning. External Links: [Link](https://qwenlm.github.io/blog/qwq-32b/). Cited by: [§1](https://arxiv.org/html/2605.29229#S1.p1.1).
- X. Tian, Y. Ji, H. Wang, S. Chen, S. Zhao, Y. Peng, H. Zhao, and X. Li (2025) Not all correct answers are equal: why your distillation source matters. arXiv preprint arXiv:2505.14464. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1).
- Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin (2024) Magpie: alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1), [§3.2](https://arxiv.org/html/2605.29229#S3.SS2.SSS0.Px1.p1.4), [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px2.p1.1).
- Z. Xu, F. Jiang, L. Niu, B. Y. Lin, and R. Poovendran (2025) Stronger models are not always stronger teachers for instruction tuning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4392–4405. Cited by: [§1](https://arxiv.org/html/2605.29229#S1.p2.1), [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1), [§3.2](https://arxiv.org/html/2605.29229#S3.SS2.SSS0.Px1.p1.4), [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px7.p1.1), [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px8.p1.1).
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: deliberate problem solving with large language models. Advances in neural information processing systems 36, pp. 11809–11822. Cited by: [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px1.p1.1).
- S. You, C. Xu, C. Xu, and D. Tao (2017) Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1285–1294. Cited by: [§1](https://arxiv.org/html/2605.29229#S1.p2.1).
- L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu (2023) Metamath: bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p1.3).
- J. H. Zar (2005) Spearman rank correlation. Encyclopedia of biostatistics 7. Cited by: [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px7.p1.1).
- E. Zhang, X. Yan, W. Lin, T. Zhang, and L. Qianchun (2025a) Learning like humans: advancing LLM reasoning capabilities via adaptive difficulty curriculum learning and expert-guided self-reformulation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 6630–6644. External Links: [Link](https://aclanthology.org/2025.emnlp-main.336/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.336), ISBN 979-8-89176-332-6. Cited by: [§1](https://arxiv.org/html/2605.29229#S1.p8.1), [§6.3](https://arxiv.org/html/2605.29229#S6.SS3.p1.1).
- R. Zhang, R. M. S. Khan, Z. Tan, D. Li, S. Wang, and T. Chen (2025b) The quest for efficient reasoning: a data-centric benchmark to cot distillation. arXiv preprint arXiv:2505.18759. Cited by: [§1](https://arxiv.org/html/2605.29229#S1.p2.1), [§2](https://arxiv.org/html/2605.29229#S2.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.29229#S5.SS0.SSS0.Px1.p1.3).
## Appendix A Prompt Templates
Prompt templates used for finetuning are given in Table [5](https://arxiv.org/html/2605.29229#A1.T5).

| Prompt Template |
| --- |
| Answer the following question: ### Question: {question} ### Answer: {reasoning}. |

Table 5: Prompt templates used in finetuning. During training, the loss was computed only for the reasoning and the answer.
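A minimal sketch of how the template and the loss restriction described in Table 5 could be implemented. It assumes a whitespace tokenizer purely for illustration (a real pipeline would use the student model's tokenizer); the line breaks inside `PROMPT` and the names `PROMPT` and `build_example` are assumptions rather than details taken from the paper.

```python
# Assumed layout of the Table 5 template; the original line breaks are not recoverable.
PROMPT = "Answer the following question:\n### Question: {question}\n### Answer: "


def build_example(question: str, reasoning: str):
    """Return (tokens, loss_mask); the mask is 1 only on reasoning/answer tokens."""
    prompt_tokens = PROMPT.format(question=question).split()
    target_tokens = reasoning.split()
    tokens = prompt_tokens + target_tokens
    # Prompt positions contribute no loss; target positions are supervised.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(target_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_example("What is 2 + 3?", "2 + 3 = 5.")
supervised = [tok for tok, keep in zip(tokens, loss_mask) if keep]
print(supervised)
```

With a real tokenizer, the same mask is typically applied by setting the labels of the masked positions to the ignore index of the loss function.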
## Appendix B Student Capability
The relative capability of the student models on different tasks is given in Table [6](https://arxiv.org/html/2605.29229#A2.T6).
| Task | G-2b | M-7b | Q-1.5b | Q-3b | Q-7b | L-1b | L-3b | L-8b |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SQA | 0.075 | 0.157 | 0.622 | 0.651 | 0.603 | 0.675 | 0.695 | 0.683 |
| CSQA | 0.384 | 0.531 | 0.918 | 1.000 | 0.926 | 0.919 | 0.967 | 0.975 |
| ARC | 0.390 | 0.464 | 0.694 | 0.729 | 0.690 | 0.722 | 0.756 | 0.755 |
| MATH | 0.000 | 0.118 | 0.377 | 0.411 | 0.380 | 0.438 | 0.452 | 0.440 |
| GSM8K | 0.079 | 0.190 | 0.540 | 0.595 | 0.544 | 0.676 | 0.708 | 0.668 |
| Average | 0.186 | 0.292 | 0.630 | 0.677 | 0.629 | 0.686 | 0.716 | 0.704 |

Table 6: Capability of student models $S$ across different tasks.
## Appendix C Details for Data-Model Compatibility
### C.1 Pseudocode for DMC Optimization
To obtain the optimal configuration for DMC, we followed the steps below, as shown in Algorithm [1](https://arxiv.org/html/2605.29229#alg1): (1) enumerate all possible combinations of evaluation metrics for $Q$, $D$, and $C$; (2) for each combination, employ symbolic regression to derive the functional forms of $M_S^L(\cdot)$, $M_S^H(\cdot)$, and $f(C_S)$; (3) manually filter the resulting equations to retain those with stronger interpretability and better performance; (4) apply grid search to fine-tune the parameters within the candidates for $M_S^L(\cdot)$ and $M_S^H(\cdot)$ and identify the optimal configuration among all combinations.
**Algorithm 1** DMC Optimization and Evaluation Strategy

**Input:** Candidate metric sets $\mathbb{Q}, \mathbb{D}, \mathbb{C}$; student model $S$; datasets $[\mathcal{D}_i]|_i$; ground-truth performance $[P_S(\mathcal{D}_i)]|_i$.

**Output:** Optimal DMC configuration $(Q^*, D^*, C^*)$ and functional forms $M_S^L, M_S^H, f$.

1. // Step 1: Enumeration of Metric Combinations
2. $\mathcal{H} \leftarrow \emptyset$ // Candidate pool
3. **for all** $Q_j \in \mathbb{Q}, D_k \in \mathbb{D}, C_l \in \mathbb{C}$ **do**
4. $C_S \leftarrow \text{PlacementTest}(S, C_l)$
5. $Q(\mathcal{D}_i) \leftarrow \text{EvaluateQuality}(\mathcal{D}_i, Q_j)$
6. $D_S(\mathcal{D}_i) \leftarrow \text{EvaluateDifficulty}(\mathcal{D}_i, D_k, S)$
7. // Step 2: Symbolic Regression for Functional Forms
8. Objective: $\max \text{Corr}([\text{DMC}_S(\mathcal{D}_i)]|_i, [P_S(\mathcal{D}_i)]|_i)$
9. $\{M_S^L, M_S^H, f\} \leftarrow \text{SymbolicRegression}(\text{Inputs} = \{Q, D_S, C_S\})$
10. Store $(Q_j, D_k, C_l, M_S^L, M_S^H, f, \text{Score})$ in $\mathcal{H}$
11. **end for**
12. // Step 3: Interpretability Filtering
13. $\mathcal{H}_{filtered} \leftarrow \{h \in \mathcal{H} \mid h \text{ is human-interpretable and adheres to Occam's Razor}\}$
14. // Step 4: Fine-tuning via Grid Search
15. $\text{BestCorr} \leftarrow -1$
16. **for all** candidate $h \in \mathcal{H}_{filtered}$ **do**
17. $\Theta^* \leftarrow \arg\max_{\Theta} \text{Corr}(\text{DMC}_S(\mathcal{D}_i; \Theta, h), P_S(\mathcal{D}_i))$ // Optimize constants
18. **if** $\text{Corr}(h; \Theta^*) > \text{BestCorr}$ **then**
19. $\text{BestCorr} \leftarrow \text{Corr}(h; \Theta^*)$
20. $(Q^*, D^*, C^*, M_S^L, M_S^H, f) \leftarrow (h, \Theta^*)$
21. **end if**
22. **end for**
23. **return** Best DMC configuration
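The grid-search stage of Algorithm 1 (Step 4) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the symbolic regression step is not reproduced, Spearman rank correlation is assumed as the choice of Corr, and the two candidate forms, the toy data, and every function name are hypothetical.

```python
from itertools import product


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def grid_search(forms, quality, difficulty, capability, performance, grid):
    """Return (corr, form_name, theta) with the highest correlation to performance."""
    best = (-1.0, None, None)
    for (name, form), theta in product(forms.items(), grid):
        scores = [form(q, d, capability, theta) for q, d in zip(quality, difficulty)]
        corr = spearman(scores, performance)
        if corr > best[0]:  # strict '>' keeps the first configuration on ties
            best = (corr, name, theta)
    return best


# Hypothetical candidate forms standing in for the filtered pool.
forms = {
    "quality_only": lambda q, d, c, theta: q,
    "capability_matched": lambda q, d, c, theta: q - theta * abs(d - c),
}
quality = [0.9, 0.8, 0.5, 0.4]          # toy Q per subset
difficulty = [0.2, 0.9, 0.3, 0.8]       # toy D_S per subset
performance = [0.70, 0.40, 0.55, 0.30]  # toy P_S per subset

best_corr, best_form, best_theta = grid_search(
    forms, quality, difficulty, capability=0.3,
    performance=performance, grid=[0.0, 1.0, 2.0],
)
print(best_form, best_theta, round(best_corr, 3))
```

On this toy data the capability-matched form is preferred over the quality-only form, because penalizing the gap between difficulty and capability reorders the second and third subsets to match their observed performance.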
### C.2 Visualization of Optimal DMC Configuration
Figure [3](https://arxiv.org/html/2605.29229#A3.F3) presents representative examples of how the model's performance ranking across subsets $\mathcal{D}_i$ and the distribution of DMC values vary as student capability increases. Each point in the figure corresponds to a subset $\mathcal{D}_i$, where darker point colors indicate better distillation performance $P_S(\mathcal{D}_i)$ on it. The background color depicts the distribution of $\text{DMC}_S$, with darker shades signifying higher $\text{DMC}_S$ levels.
Figure 3: Model's performance ranking across subsets $\mathcal{D}_i$ and the distribution of DMC values as student capability varies. From top to bottom, the plots represent the performance of Gemma-2B on GSM8K, Qwen2.5-7B on MATH, and Llama3.2-1B on CSQA, respectively. $C$ adopts the relative student capability, taking values within the range $[0,1]$.

The visualizations of $M_S^L(\cdot)$, $M_S^H(\cdot)$, and $f(C_S)$ are given in Figures [4](https://arxiv.org/html/2605.29229#A3.F4) and [5](https://arxiv.org/html/2605.29229#A3.F5).

Figure 4: Visualization of $M_S^L$ and $M_S^H$.

Figure 5: Visualization of $f(C_S)$.
### C.3 Full Table of Correlation
Table [7](https://arxiv.org/html/2605.29229#A3.T7) gives the full set of correlations between reasoning distillation performance and evaluation metrics across different student models and tasks.
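| Task | Category | Metric $M$ | G-2b | M-7b | Q-1.5b | Q-3b | Q-7b | L-1b | L-3b | L-8b | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SQA | Quality | SRL | 0.357 | 0.119 | 0.741 | 0.370 | 0.444 | 0.427 | 0.509 | 0.484 | 0.431 |
| SQA | Quality | SRG | 0.268 | 0.083 | 0.635 | 0.536 | 0.257 | 0.390 | 0.330 | 0.429 | 0.366 |
| SQA | Difficulty | PPL | -0.404 | 0.186 | -0.542 | -0.355 | -0.307 | -0.678 | -0.671 | -0.309 | -0.385 |
| SQA | Difficulty | CPPL | 0.427 | -0.127 | 0.729 | 0.651 | 0.107 | 0.450 | 0.630 | 0.389 | 0.407 |
| SQA | Difficulty | IFD | 0.450 | -0.113 | 0.804 | 0.575 | 0.141 | 0.382 | 0.615 | 0.184 | 0.380 |
| SQA | Multi-Metric | CAR | 0.188 | 0.018 | 0.802 | 0.409 | 0.342 | 0.410 | 0.515 | 0.448 | 0.392 |
| SQA | Multi-Metric | DMC | 0.425 | 0.204 | 0.773 | 0.433 | 0.402 | 0.470 | 0.590 | 0.506 | 0.475 |
| CSQA | Quality | SRL | 0.385 | 0.673 | 0.860 | 0.694 | 0.840 | 0.747 | 0.877 | 0.064 | 0.642 |
| CSQA | Quality | SRG | 0.382 | 0.686 | 0.627 | 0.701 | 0.730 | 0.690 | 0.532 | 0.017 | 0.546 |
| CSQA | Difficulty | PPL | -0.233 | -0.396 | -0.583 | -0.110 | -0.387 | -0.261 | -0.629 | -0.262 | -0.358 |
| CSQA | Difficulty | CPPL | 0.176 | 0.392 | 0.733 | 0.195 | 0.506 | 0.362 | 0.655 | -0.027 | 0.374 |
| CSQA | Difficulty | IFD | 0.039 | 0.349 | 0.772 | 0.417 | 0.631 | 0.070 | 0.429 | -0.632 | 0.259 |
| CSQA | Multi-Metric | CAR | 0.145 | 0.446 | 0.868 | 0.430 | 0.701 | 0.565 | 0.876 | -0.010 | 0.503 |
| CSQA | Multi-Metric | DMC | 0.403 | 0.696 | 0.895 | 0.618 | 0.843 | 0.778 | 0.923 | 0.204 | 0.670 |
| ARC | Quality | SRL | 0.356 | 0.604 | 0.696 | 0.939 | 0.810 | 0.414 | 0.632 | 0.123 | 0.572 |
| ARC | Quality | SRG | 0.036 | 0.446 | 0.358 | 0.837 | 0.684 | 0.414 | 0.329 | 0.006 | 0.389 |
| ARC | Difficulty | PPL | -0.828 | -0.756 | -0.542 | -0.399 | -0.498 | -0.123 | -0.674 | -0.562 | -0.548 |
| ARC | Difficulty | CPPL | 0.836 | 0.800 | 0.453 | 0.295 | 0.338 | -0.704 | -0.487 | -0.205 | 0.166 |
| ARC | Difficulty | IFD | 0.797 | 0.714 | 0.233 | 0.269 | 0.275 | -0.399 | -0.755 | -0.364 | 0.096 |
| ARC | Multi-Metric | CAR | 0.761 | 0.788 | 0.566 | 0.718 | 0.727 | 0.188 | 0.517 | 0.218 | 0.560 |
| ARC | Multi-Metric | DMC | 0.598 | 0.752 | 0.780 | 0.901 | 0.828 | 0.476 | 0.718 | 0.234 | 0.661 |
| MATH | Quality | SRL | 0.363 | 0.121 | 0.602 | 0.527 | 0.900 | 0.831 | 0.600 | 0.556 | 0.562 |
| MATH | Quality | SRG | 0.321 | 0.101 | 0.608 | 0.507 | 0.890 | 0.833 | 0.627 | 0.607 | 0.562 |
| MATH | Difficulty | PPL | -0.773 | -0.682 | -0.878 | -0.931 | -0.656 | -0.696 | -0.934 | -0.823 | -0.797 |
| MATH | Difficulty | CPPL | 0.759 | 0.502 | 0.434 | 0.456 | 0.514 | 0.844 | 0.776 | 0.705 | 0.624 |
| MATH | Difficulty | IFD | 0.770 | 0.661 | -0.793 | -0.794 | -0.648 | 0.700 | 0.440 | 0.421 | 0.095 |
| MATH | Multi-Metric | CAR | -0.281 | 0.161 | 0.540 | 0.588 | 0.915 | 0.901 | 0.668 | 0.680 | 0.521 |
| MATH | Multi-Metric | DMC | 0.469 | 0.306 | 0.659 | 0.594 | 0.900 | 0.861 | 0.711 | 0.684 | 0.648 |
| GSM8K | Quality | SRL | 0.800 | 0.685 | 0.351 | 0.748 | 0.822 | -0.016 | 0.690 | 0.377 | 0.557 |
| GSM8K | Quality | SRG | 0.155 | 0.230 | 0.464 | 0.480 | 0.281 | 0.364 | 0.551 | 0.583 | 0.388 |
| GSM8K | Difficulty | PPL | -0.329 | -0.667 | -0.839 | -0.763 | -0.378 | -0.588 | -0.702 | -0.831 | -0.637 |
| GSM8K | Difficulty | CPPL | 0.299 | 0.599 | -0.363 | -0.585 | -0.791 | 0.424 | 0.620 | 0.839 | 0.130 |
| GSM8K | Difficulty | IFD | 0.321 | 0.609 | -0.528 | -0.716 | -0.521 | -0.501 | -0.797 | -0.659 | -0.349 |
| GSM8K | Multi-Metric | CAR | 0.515 | 0.626 | -0.221 | 0.418 | 0.227 | -0.010 | 0.680 | 0.382 | 0.327 |
| GSM8K | Multi-Metric | DMC | 0.800 | 0.735 | 0.401 | 0.798 | 0.822 | 0.143 | 0.735 | 0.425 | 0.607 |

Table 7: The correlation between reasoning distillation performance and evaluation metrics $M$ across different student models $S$ and tasks.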
### C.4 Relation between $P$, $Q$ and $D$
Figure [6](https://arxiv.org/html/2605.29229#A3.F6) presents the relation between $P$ and $Q$ as well as between $P$ and $D$, and displays the regression curves.

Figure 6: Scatter plots of $P$ versus $Q$ and $P$ versus $D$, along with the corresponding regression lines.
## Appendix D Reasoning Distillation Performance
Tables [8](https://arxiv.org/html/2605.29229#A4.T8) through [11](https://arxiv.org/html/2605.29229#A4.T11) show the performance of multiple data selection methods on the tasks, with the student model $S$ set to Gemma-2B, Mistral-7B, Qwen2.5-1.5B, and Qwen2.5-7B, respectively. Among them, Gemma-2B represents a student model with relatively low initial capability and limited potential; Mistral-7B denotes a model with low initial reasoning ability but high potential owing to its larger parameter count; and the Qwen family is primarily used to compare the performance of models within the same family across different parameter scales.
| Training Dataset | Selection Metric (Top $k\%$) | SQA | CSQA | ARC | MATH | GSM8K |
| --- | --- | --- | --- | --- | --- | --- |
| None | - | 48.2 | 38.5 | 40.1 | 8.4 | 12.6 |
| Static | Full (100%) | 54.6 | 45.2 | 44.8 | 18.2 | 23.4 |
| Static | Random (12.5%) | 51.4 | 43.8 | 42.6 | 14.5 | 19.8 |
| Static | Quality (12.5%) | 56.2 | 46.8 | 45.4 | 25.4 | 24.6 |
| Static | Difficulty (12.5%) | 49.8 | 44.5 | 41.2 | 16.8 | 18.2 |
| Static | CAR (12.5%) | 53.5 | 47.2 | 46.0 | 22.0 | 25.2 |
| Static | DMC (12.5%) | 58.4 | 49.6 | 47.5 | 26.8 | 28.4 |
| Static | DMC (10%) | 59.2 | 48.4 | 46.8 | 25.4 | 29.1 |
| Static | DMC (7.5%) | 57.8 | 47.9 | 46.2 | 23.6 | 27.5 |
| Dynamic | ADCL | 55.4 | 48.2 | 43.5 | 20.6 | 22.8 |
| Dynamic | Difficulty (12.5%) | 50.2 | 46.5 | 42.8 | 19.4 | 21.2 |
| Dynamic | CAR (12.5%) | 56.8 | 50.4 | 47.2 | 26.5 | 29.5 |
| Dynamic | DMC (12.5%) | 61.6 | 55.2 | 48.2 | 29.6 | 30.8 |
| Dynamic | DMC (10%) | 62.5 | 52.3 | 49.4 | 28.6 | 32.4 |
| Dynamic | DMC (7.5%) | 60.3 | 51.0 | 49.2 | 27.8 | 31.8 |

Table 8: Performance of Gemma-2B. For the static and dynamic settings, the dataset selection metrics achieving the best and second-best performance are indicated in bold and underline, respectively.

| Training Dataset | Selection Metric (Top $k\%$) | SQA | CSQA | ARC | MATH | GSM8K |
| --- | --- | --- | --- | --- | --- | --- |
| None | - | 65.2 | 64.8 | 69.4 | 10.5 | 35.2 |
| Static | Full (100%) | 69.4 | 71.6 | 75.0 | 23.0 | 47.8 |
| Static | Random (12.5%) | 67.8 | 70.2 | 73.5 | 19.8 | 45.4 |
| Static | Quality (12.5%) | 72.6 | 73.5 | 76.8 | 26.5 | 54.2 |
| Static | Difficulty (12.5%) | 66.2 | 68.4 | 70.2 | 14.2 | 41.5 |
| Static | CAR (12.5%) | 70.5 | 73.2 | 76.2 | 25.8 | 53.8 |
| Static | DMC (12.5%) | 72.5 | 75.4 | 78.2 | 29.4 | 58.6 |
| Static | DMC (10%) | 71.8 | 74.8 | 77.5 | 28.2 | 59.2 |
| Static | DMC (7.5%) | 70.2 | 73.6 | 76.8 | 26.0 | 56.4 |
| Dynamic | ADCL | 70.8 | 72.5 | 74.2 | 22.4 | 51.0 |
| Dynamic | Difficulty (12.5%) | 65.4 | 70.6 | 72.8 | 20.2 | 48.6 |
| Dynamic | CAR (12.5%) | 71.6 | 75.2 | 78.4 | 28.5 | 59.5 |
| Dynamic | DMC (12.5%) | 73.9 | 76.6 | 80.2 | 31.2 | 62.4 |
| Dynamic | DMC (10%) | 75.3 | 77.2 | 79.0 | 30.0 | 62.6 |
| Dynamic | DMC (7.5%) | 73.7 | 76.2 | 79.0 | 30.6 | 60.6 |

Table 9: Performance of Mistral-7B. For the static and dynamic settings, the dataset selection metrics achieving the best and second-best performance are indicated in bold and underline, respectively.

| Training Dataset | Selection Metric (Top $k\%$) | SQA | CSQA | ARC | MATH | GSM8K |
| --- | --- | --- | --- | --- | --- | --- |
| None | - | 45.2 | 38.4 | 52.6 | 9.4 | 14.2 |
| Static | Full (100%) | 51.5 | 59.6 | 61.6 | 31.4 | 41.6 |
| Static | Random (12.5%) | 50.1 | 58.2 | 59.4 | 28.6 | 38.5 |
| Static | Quality (12.5%) | 54.8 | 62.5 | 63.8 | 44.2 | 51.4 |
| Static | Difficulty (12.5%) | 48.6 | 57.0 | 55.2 | 18.4 | 32.0 |
| Static | CAR (12.5%) | 52.4 | 61.8 | 63.2 | 41.0 | 50.8 |
| Static | DMC (12.5%) | 54.2 | 65.4 | 65.8 | 51.2 | 55.6 |
| Static | DMC (10%) | 53.8 | 64.8 | 65.2 | 49.8 | 56.4 |
| Static | DMC (7.5%) | 52.6 | 63.2 | 64.0 | 46.5 | 53.2 |
| Dynamic | ADCL | 52.8 | 60.5 | 58.6 | 35.2 | 42.4 |
| Dynamic | Difficulty (12.5%) | 47.4 | 58.4 | 56.8 | 30.2 | 35.6 |
| Dynamic | CAR (12.5%) | 53.2 | 64.2 | 64.6 | 48.4 | 54.8 |
| Dynamic | DMC (12.5%) | 55.9 | 69.2 | 67.2 | 53.6 | 59.8 |
| Dynamic | DMC (10%) | 56.2 | 68.0 | 67.8 | 54.2 | 59.0 |
| Dynamic | DMC (7.5%) | 54.3 | 67.0 | 65.4 | 53.8 | 57.2 |

Table 10: Performance of Qwen2.5-1.5B. For the static and dynamic settings, the dataset selection metrics achieving the best and second-best performance are indicated in bold and underline, respectively.

| Training Dataset | Selection Metric (Top $k\%$) | SQA | CSQA | ARC | MATH | GSM8K |
| --- | --- | --- | --- | --- | --- | --- |
| None | - | 62.4 | 68.2 | 74.5 | 35.8 | 48.6 |
| Static | Full (100%) | 66.8 | 75.4 | 80.6 | 76.0 | 73.6 |
| Static | Random (12.5%) | 64.2 | 73.8 | 78.4 | 70.2 | 68.5 |
| Static | Quality (12.5%) | 70.8 | 78.8 | 82.4 | 77.8 | 75.4 |
| Static | Difficulty (12.5%) | 60.5 | 72.1 | 72.8 | 45.6 | 55.2 |
| Static | CAR (12.5%) | 65.8 | 76.8 | 81.6 | 76.4 | 74.8 |
| Static | DMC (12.5%) | 71.2 | 78.6 | 84.2 | 80.4 | 79.2 |
| Static | DMC (10%) | 70.8 | 78.2 | 83.5 | 79.6 | 80.6 |
| Static | DMC (7.5%) | 68.5 | 77.4 | 82.0 | 75.2 | 77.4 |
| Dynamic | ADCL | 68.2 | 76.4 | 75.2 | 62.8 | 70.4 |
| Dynamic | Difficulty (12.5%) | 62.4 | 74.6 | 73.5 | 58.2 | 62.8 |
| Dynamic | CAR (12.5%) | 69.4 | 78.5 | 82.8 | 79.5 | 81.2 |
| Dynamic | DMC (12.5%) | 73.6 | 81.2 | 87.6 | 82.8 | 84.2 |
| Dynamic | DMC (10%) | 73.2 | 80.6 | 88.2 | 84.0 | 86.2 |
| Dynamic | DMC (7.5%) | 71.6 | 79.8 | 86.8 | 81.6 | 82.8 |

Table 11: Performance of Qwen2.5-7B. For the static and dynamic settings, the dataset selection metrics achieving the best and second-best performance are indicated in bold and underline, respectively.
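The "Top $k\%$" selection used in the tables above can be sketched as a simple ranking step. This is a minimal illustration, assuming candidates are ranked by a single precomputed score and that the kept count is rounded up; the function name `select_top_k_percent`, the rounding rule, and the toy scores are assumptions, not details from the paper.

```python
import math


def select_top_k_percent(scores: dict, k_percent: float) -> list:
    """Return the candidate names in the top k% by score, keeping at least one."""
    n_keep = max(1, math.ceil(len(scores) * k_percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# Hypothetical DMC scores for eight candidate subsets.
dmc_scores = {
    "D1": 0.42, "D2": 0.77, "D3": 0.90, "D4": 0.15,
    "D5": 0.63, "D6": 0.58, "D7": 0.31, "D8": 0.84,
}

print(select_top_k_percent(dmc_scores, 12.5))  # one of eight candidates
print(select_top_k_percent(dmc_scores, 25))    # two of eight candidates
```

In the static setting the scores would be computed once before training; in the dynamic setting the same ranking step would be repeated with refreshed scores before each epoch.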
## Appendix E Training Dynamics
We analyze the training dynamics with MATH as $\mathcal{D}_0$, using Qwen2.5-3B as a representative high-capacity model and Gemma-2B as a representative low-capacity model.
#### Capability Improvement
As shown in Figure[7](https://arxiv.org/html/2605.29229#A5.F7), the student capability gradually improves as training progresses, consistent with our expectations\. For the low\-capability model Gemma\-2B, the model initially selects relatively difficult data, yielding a rapid capability increase early in training; as its capability grows, DMC places greater emphasis on data quality and its alignment with the model, enhancing the stability of the model’s reasoning capability\.
Figure 7:Student capability improvement curve under different parameter settings during reasoning distillation\.
#### Difficulty Shift
In order to analyze changes in specific data samples, we selected five representative samples and visualized their selection status and corresponding difficulty shifts in Figure[8](https://arxiv.org/html/2605.29229#A5.F8)\. The x\-axis corresponds to the training epoch progress and the y\-axis to the relative difficulty\. Data selected into the training set at each epoch are circled\.
Figure 8: Difficulty shift curve of Gemma-2B on MATH.

These samples illustrate several characteristic patterns. A core-course sample (Index 0) is consistently retained throughout training. A sample whose difficulty initially matches the student (Index 2202) is rapidly mastered and then dropped to avoid overfitting, whereas one that is initially too hard (Index 2355) is selected only once its difficulty falls into the effective DMC range. A sample near the threshold (Index 2308) alternates in and out, regulating generalization. Notably, some excluded samples (e.g., Index 1529 and 2202) keep decreasing in difficulty even without being trained on, indicating that they are not irreplaceable: the student acquires similar feature representations from other instances, which also explains why they are not selected.
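The selection patterns described above can be reproduced qualitatively with a toy simulation. It is purely illustrative: the paper selects data by DMC, whereas this sketch substitutes a fixed difficulty band as a stand-in for the effective DMC range, and the decay rates, band limits, sample names, and the function `simulate_selection` are all hypothetical.

```python
def simulate_selection(difficulty, band=(0.3, 0.7), trained_decay=0.7,
                       untrained_decay=0.95, epochs=5):
    """Record, per epoch, which samples fall inside the difficulty band.

    Selected samples become easier quickly (they are trained on); the others
    become easier slowly (the student improves on related data).
    """
    difficulty = dict(difficulty)  # copy so the caller's dict is untouched
    history = []
    for _ in range(epochs):
        selected = sorted(
            name for name, d in difficulty.items() if band[0] <= d <= band[1]
        )
        history.append(selected)
        for name in difficulty:
            decay = trained_decay if name in selected else untrained_decay
            difficulty[name] *= decay
    return history


# One sample that initially matches the student, one that is initially too hard.
history = simulate_selection({"matched": 0.5, "too_hard": 0.8})
for epoch, selected in enumerate(history):
    print(epoch, selected)
```

In this toy run the matched sample is selected early and then dropped once it becomes too easy, while the hard sample enters only after its difficulty drifts into the band, mirroring the behavior reported for Index 2202 and Index 2355.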
## Appendix F Computational Cost Analysis
Dynamic data selection re\-evaluates the data pool before every training epoch, which introduces additional overhead relative to a conventional one\-pass training pipeline\. In our experiments, this naive re\-evaluation increases the total training time by approximately 53%\.
This overhead can be substantially reduced without affecting the selected data. As shown in Figure [2](https://arxiv.org/html/2605.29229#S6.F2) (Section [6.4](https://arxiv.org/html/2605.29229#S6.SS4)), the dynamic selection process almost exclusively draws from the top-quality region of the pool: the highest-quality 5% of data are nearly always selected, and data beyond the top 20% are seldom chosen. Consequently, the expensive difficulty re-evaluation can be restricted to the top 20% of candidates ranked by the static quality score $Q$, while the remaining low-quality data are excluded a priori. This reduces the additional training-time overhead from $\sim$53% to $\sim$10.6%.
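As a rough check on these figures, assuming the re-evaluation cost scales linearly with the number of candidates that are re-scored (an assumption, not something stated above), restricting re-evaluation to the top 20% of the pool gives:

$$
\text{overhead}_{\text{top-}20\%} \approx 0.20 \times 53\% = 10.6\%
$$

which matches the reported reduction.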
We argue that this marginal one-time training cost is acceptable. The goal of distillation is to obtain an efficient student model that will serve a large number of inference calls; a $\sim$10.6% increase in one-time training cost is a worthwhile trade-off for a consistently stronger student. Moreover, compared with the compute wasted on training over incompatible data that yields no gain, DMC-guided selection is arguably more efficient in terms of performance per unit of compute.