
# GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training
Source: [https://arxiv.org/html/2605.13130](https://arxiv.org/html/2605.13130)
Junjie Li (Harbin Institute of Technology, Shenzhen, China; 22b351018@stu.hit.edu.cn), Ziao Wang (Hong Kong Baptist University, China; ziaowang@hkbu.edu.cn), NingXuan Ma (Harbin Institute of Technology, Shenzhen, China; 2023311G27@stu.hit.edu.cn), Jianghong Ma (Harbin Institute of Technology, Shenzhen, China; City University of Hong Kong, China; majianghong@hit.edu.cn), Xiaofeng Zhang (Harbin Institute of Technology, Shenzhen, China; zhangxiaofeng@hit.edu.cn)

###### Abstract

Existing reasoning data curation pipelines score whole samples, treating every intermediate step as equally valuable. In reality, steps within a trace contribute very unevenly, and selecting reasoning data well requires assessing them individually. We present GRACE, a gradient-aligned curation method that views each reasoning trace as a sequence of optimization events and scores every step by two complementary signals: its alignment with the answer-oriented gradient direction, and its consistency with the preceding reasoning trajectory. Step-level scores are aggregated into a sample-level value for subset selection, using only the model's internal optimization signals and no external reward models or step annotations. To make this scalable, GRACE introduces a representation-level gradient proxy that estimates step-level alignment from token-level upstream signals in a single forward pass. Post-training Qwen3-VL-2B-Instruct on MMathCoT-1M, GRACE reaches 108.8% of the full-data performance with 20% of the data and retains 100.2% with only 5%, with subsets that transfer effectively across model backbones.

## 1 Introduction

Large-scale reasoning datasets have become a cornerstone for post-training large language and vision-language models [[30](https://arxiv.org/html/2605.13130#bib.bib17), [35](https://arxiv.org/html/2605.13130#bib.bib18)]. The standard way to use them is to supervise the model on the entire reasoning trace, treating every step as an equally valuable target. In reality, the steps within a trace contribute very unevenly: some directly support the final answer, while others restate earlier content, explore irrelevant tangents, or introduce noise. Training on them uniformly wastes budget on low-value steps and dilutes the contribution of useful ones. This cost has become significant as reasoning corpora grow to millions of traces [[19](https://arxiv.org/html/2605.13130#bib.bib41)] and post-training takes hundreds of GPU-hours per run. Choosing what to train on therefore matters [[39](https://arxiv.org/html/2605.13130#bib.bib20)], and for reasoning data this means assessing steps individually rather than ranking whole traces.

Existing data curation methods improve training efficiency by selecting samples based on correctness [[29](https://arxiv.org/html/2605.13130#bib.bib21)], reward models [[20](https://arxiv.org/html/2605.13130#bib.bib22)], or sample-level influence [[32](https://arxiv.org/html/2605.13130#bib.bib5), [16](https://arxiv.org/html/2605.13130#bib.bib23), [10](https://arxiv.org/html/2605.13130#bib.bib24)], and all of them operate at the granularity of entire traces. As a result, a trace with correct final answers but poor intermediate steps [[27](https://arxiv.org/html/2605.13130#bib.bib25)] is treated as equally valuable as a tightly reasoned one. This highlights a fundamental limitation: current approaches lack a mechanism to assess how individual reasoning steps contribute to optimization.

![Refer to caption](https://arxiv.org/html/2605.13130v1/x1.png)

Figure 1: Motivation and empirical effect of GRACE. Left: Reasoning traces are viewed as sequences of optimization events, where each step induces an update direction whose utility depends on its alignment with the target objective and the evolving trajectory. Right: Radar chart comparing downstream performance across benchmarks. GRACE achieves full-data or better performance using only a fraction of the training data.

In this work, we revisit reasoning data from an optimization perspective. Instead of viewing reasoning traces as static supervision targets, we model them as sequences of optimization events, where each reasoning step induces a local training signal that affects the gradient direction toward the final answer. From this view, the utility of reasoning data depends not only on external attributes such as correctness or length, but also on whether its intermediate steps constructively support optimization.

Building on this view, we propose GRACE (Gradient-aligned Reasoning dAta Curation for Efficient post-training), a method that performs fine-grained data curation by estimating step-level optimization utility. Rather than pruning or rewriting reasoning traces, GRACE assigns each step a utility score based on two complementary criteria: (i) its alignment with the answer-oriented optimization direction, and (ii) its consistency with the accumulated reasoning trajectory. These signals capture both task-driven and trajectory-aware contributions of each step. The resulting step-level scores are then aggregated to produce a sample-level utility score, enabling effective subset selection while preserving the simplicity of sample-level training. Fig. [1](https://arxiv.org/html/2605.13130#S1.F1) illustrates the motivation of GRACE and provides empirical evidence that optimization-aware curation can retain strong performance with substantially fewer training samples.

A key challenge is that true step-level gradients are computationally intractable at scale. Naively computing gradients for each reasoning step would require decomposing traces into multiple training instances and performing repeated backward passes. To address this, GRACE introduces a representation-level gradient proxy that approximates step-induced optimization directions using token-level upstream signals. This proxy enables efficient estimation of step-level alignment from a single forward pass, making optimization-aware curation practical for large-scale CoT datasets.

We evaluate GRACE by post-training Qwen3-VL-2B-Instruct [[26](https://arxiv.org/html/2605.13130#bib.bib6)] on MMathCoT-1M [[19](https://arxiv.org/html/2605.13130#bib.bib41)] and assessing the resulting models on a diverse suite of multimodal benchmarks spanning mathematical reasoning and general visual question answering. GRACE consistently identifies high-value subsets: training on only 20% of the curated data surpasses full-data performance, reaching 108.8% of the full-data result averaged across benchmarks, while using only 5% retains 100.2%. Furthermore, the selected subsets transfer effectively across model backbones, suggesting that the proposed optimization-based signal captures intrinsic data value beyond a specific model configuration.

Our contributions are three-fold:

1. We introduce an optimization-based perspective on reasoning data, framing reasoning traces as sequences of optimization events and highlighting the role of step-level alignment in effective learning.
2. We propose GRACE, a reasoning data curation method that aggregates step-level optimization signals derived from answer-oriented alignment and trajectory consistency for sample-level subset selection.
3. We develop a representation-level gradient proxy that enables scalable estimation of step-level alignment without per-step parameter-space gradient computation.

## 2 Method

![Refer to caption](https://arxiv.org/html/2605.13130v1/x2.png)

Figure 2: The GRACE curation pipeline. (1) Given an input and its reasoning trace, GRACE identifies token sets for each step and the answer. (2) A fixed scoring model extracts token-level upstream signals in one forward pass and groups them by token sets. (3) Grouped signals are averaged into gradient proxies and scored by answer and trajectory alignment. (4) Step scores are aggregated into a sample value for ranking and top-$\rho$ subset selection.

In this section, we present GRACE. We first define the reasoning data curation problem, then derive the step-level optimization utility, introduce its scalable representation-level proxy, and describe sample-level subset selection. The overall pipeline is illustrated in Fig. [2](https://arxiv.org/html/2605.13130#S2.F2).

### 2.1 Problem Formulation

We consider a reasoning dataset $\mathcal{D}=\{z_i\}_{i=1}^{N}$, where each sample is

$$z_i=(x_i,\mathbf{s}_i,a_i),$$

with $x_i$ denoting the input, $\mathbf{s}_i=(s_{i,1},s_{i,2},\dots,s_{i,K_i})$ denoting a sequence of reasoning steps, and $a_i$ denoting the final answer. Let $f_\theta$ denote the model, and let $\mathcal{T}_{i,k}$ and $\mathcal{T}_{i,\mathrm{ans}}$ denote the token positions of step $s_{i,k}$ and the answer segment, respectively. For any token set $\mathcal{T}$, we define the average token-level loss

$$L(\theta;z_i,\mathcal{T})=\frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}L_t(\theta;z_i).$$

Accordingly, the step loss and answer loss are given by $L_{i,k}=L(\theta;z_i,\mathcal{T}_{i,k})$ and $L_i^{\mathrm{ans}}=L(\theta;z_i,\mathcal{T}_{i,\mathrm{ans}})$, and the full loss over the reasoning trace and answer is

$$\mathcal{T}_{i,\mathrm{full}}=\bigcup_{k=1}^{K_i}\mathcal{T}_{i,k}\cup\mathcal{T}_{i,\mathrm{ans}},\qquad L_{\mathrm{full}}(\theta;z_i)=L(\theta;z_i,\mathcal{T}_{i,\mathrm{full}}).$$

Standard post-training minimizes

$$\mathcal{L}_{\mathrm{SFT}}(\theta;\mathcal{D})=\frac{1}{|\mathcal{D}|}\sum_{z_i\in\mathcal{D}}L_{\mathrm{full}}(\theta;z_i). \tag{1}$$

Our goal is to select a compact subset $\mathcal{S}\subset\mathcal{D}$ with budget $|\mathcal{S}|=\lceil\rho|\mathcal{D}|\rceil$, where $\rho\in(0,1)$ is the selection ratio, such that post-training on $\mathcal{S}$ preserves or improves downstream performance compared with training on the full dataset. To this end, GRACE assigns each sample a scalar value score $V(z_i)$ and selects the top-ranked subset:

$$\mathcal{S}=\left\{z_i\in\mathcal{D}\mid\operatorname{rank}_V(z_i)\leq\lceil\rho|\mathcal{D}|\rceil\right\}. \tag{2}$$

The key question is how to define $V(z_i)$ for reasoning data. GRACE addresses this by estimating the optimization utility of each reasoning step and aggregating these step-level signals into a sample-level value. Since these values are computed from model-internal signals, the scoring model should provide stable representations. Following prior works [[33](https://arxiv.org/html/2605.13130#bib.bib2), [11](https://arxiv.org/html/2605.13130#bib.bib3)], we obtain the scoring model $f_\theta$ by warming up an initial model $f_{\theta_0}$ on a $\gamma$-ratio subset of $\mathcal{D}$, and keep it fixed during data scoring.
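To make the selection rule concrete, here is a minimal sketch of Eq. (2), assuming the per-sample values $V(z_i)$ have already been computed; the function name `select_top_rho` and the toy values are ours, not from the paper.

```python
import math

def select_top_rho(values, rho):
    """Keep the top-ceil(rho * N) samples by value, as in Eq. (2).

    values : list of per-sample scores V(z_i), one per candidate sample.
    rho    : selection ratio in (0, 1).
    """
    budget = math.ceil(rho * len(values))
    # Rank sample indices by descending value and keep the top `budget`.
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return ranked[:budget]

# Toy example: with rho = 0.4 and five candidates, ceil(0.4 * 5) = 2 are kept.
print(select_top_rho([0.31, 0.58, 0.12, 0.44, 0.27], rho=0.4))  # -> [1, 3]
```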

### 2.2 Step-level Optimization Utility

We define the utility of a reasoning step based on its contribution to optimizing a target objective. For clarity, we omit the sample index $i$ when discussing a single sample $z=(x,\mathbf{s},a)$. Let $\mathcal{T}_k$ denote the token set of step $s_k$, and write $L_k(\theta;z)=L(\theta;z,\mathcal{T}_k)$. We adopt a standard first-order influence perspective [[22](https://arxiv.org/html/2605.13130#bib.bib1), [9](https://arxiv.org/html/2605.13130#bib.bib26)]. Consider a small update induced by step $s_k$:

$$\theta'=\theta-\eta\nabla_\theta L_k(\theta;z), \tag{3}$$

where $\eta>0$ is the learning rate. Let $L^{\mathrm{tar}}(\theta;z)$ denote a target loss specifying the desired optimization direction; under Eq. (3), its first-order change is approximated as

$$L^{\mathrm{tar}}(\theta';z)-L^{\mathrm{tar}}(\theta;z)\approx-\eta\left\langle\nabla_\theta L_k,\nabla_\theta L^{\mathrm{tar}}\right\rangle=-\eta\,\|\nabla_\theta L_k\|\,\|\nabla_\theta L^{\mathrm{tar}}\|\cos\bigl(\nabla_\theta L_k,\nabla_\theta L^{\mathrm{tar}}\bigr). \tag{4}$$

This shows that a step is locally beneficial when its induced gradient is directionally aligned with the target gradient. Since step lengths and gradient scales can vary substantially across reasoning segments, we focus on the normalized directional component:

$$A_k^{\mathrm{tar}}\triangleq\cos\bigl(\nabla_\theta L_k,\nabla_\theta L^{\mathrm{tar}}\bigr). \tag{5}$$

See Appendix [B](https://arxiv.org/html/2605.13130#A2) for details. Different choices of target direction correspond to different notions of step utility. In GRACE, we consider two complementary objectives:

(1) Answer-oriented objective. We instantiate $L^{\mathrm{tar}}=L^{\mathrm{ans}}$, where $L^{\mathrm{ans}}(\theta;z)=L(\theta;z,\mathcal{T}_{\mathrm{ans}})$ is the loss on the answer segment and $\mathcal{T}_{\mathrm{ans}}$ denotes its token set. This gives

$$A_k^{\mathrm{ans}}\triangleq\cos\bigl(\nabla_\theta L_k,\nabla_\theta L^{\mathrm{ans}}\bigr), \tag{6}$$

which measures whether the step supports optimizing the final answer. While the answer-oriented objective captures whether a step supports the final answer, it does not characterize whether the step is coherent with the reasoning process that precedes it.

(2) Trajectory consistency objective. Reasoning steps form an ordered trajectory rather than independent supervision signals. For steps with preceding context, we define the historical reference direction and its corresponding alignment score jointly:

$$A_k^{\mathrm{hist}}\triangleq\cos\bigl(\nabla_\theta L_k,r_k\bigr),\qquad r_k\triangleq\mathrm{Normalize}\left(\sum_{j<k}\omega_{k,j}\nabla_\theta L_j\right), \tag{7}$$

where $\omega_{k,j}\geq 0$ and $\sum_{j<k}\omega_{k,j}=1$ control the contribution of each previous step to the historical reference direction. The score $A_k^{\mathrm{hist}}$ measures whether the current step continues the existing reasoning trajectory.

Combining these two criteria, the final step-level utility is defined as

$$\mathrm{Score}_k\triangleq\alpha A_k^{\mathrm{ans}}+(1-\alpha)A_k^{\mathrm{hist}},\qquad\alpha\in[0,1], \tag{8}$$

where the historical term is omitted for $k=1$, since no preceding context exists.
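As a concrete illustration of Eqs. (5)-(8), the sketch below scores the steps of a single trace given per-step gradient vectors; it is the naive parameter-space formulation that Sec. 2.3 replaces with a representation-level proxy, and the helper names (`cosine`, `step_scores`) are ours.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    # Cosine of the angle between two gradient vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def step_scores(step_grads, ans_grad, alpha=0.7):
    """Combine answer alignment (Eq. 6) and trajectory consistency (Eq. 7)
    into per-step utilities (Eq. 8), using uniform history weights w_{k,j}."""
    scores = []
    for k, g_k in enumerate(step_grads):
        a_ans = cosine(g_k, ans_grad)             # A_k^ans
        if k == 0:
            scores.append(a_ans)                  # first step: no history term
            continue
        r_k = np.mean(step_grads[:k], axis=0)     # uniform historical reference
        r_k = r_k / (np.linalg.norm(r_k) + 1e-8)  # Normalize(.)
        a_hist = cosine(g_k, r_k)                 # A_k^hist
        scores.append(alpha * a_ans + (1 - alpha) * a_hist)
    return scores
```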

### 2.3 Representation-level Gradient Proxy

The step utility in Eq. (8) is defined through parameter-space gradients. Directly evaluating such gradients for every reasoning step would require isolating each step as a separate loss and performing repeated backward passes over decomposed traces. To obtain a scalable estimate of step-level alignment, we project the optimization signal to the final representation interface and compute alignment in this lower-dimensional space.

Since the full gradient can be decomposed as $\nabla_\theta L=\left[\nabla_{W_{\mathrm{out}}}L;\nabla_{\theta_{\mathrm{rep}}}L\right]$, we estimate step-induced directions in the representation-producing subspace $\theta_{\mathrm{rep}}$, where all reasoning tokens interact through the model's internal features. Let $h_t\in\mathbb{R}^d$ denote the final-layer hidden state at token position $t$, where $d$ is the hidden dimension, and let $W_{\mathrm{out}}\in\mathbb{R}^{d\times V}$ be the output projection matrix, where $V$ is the vocabulary size. The pre-softmax logit is $\ell_t=W_{\mathrm{out}}^\top h_t\in\mathbb{R}^V$, and the next-token probability is $p_t=\mathrm{softmax}(\ell_t)$. For the token-level cross-entropy loss $L_t$,

$$\frac{\partial L_t}{\partial\ell_t}=p_t-y_t,$$

where $y_t$ is the one-hot target token. The corresponding gradient at the representation interface is

$$u_t\triangleq\frac{\partial L_t}{\partial h_t}=W_{\mathrm{out}}(p_t-y_t). \tag{9}$$

Let $\theta_{\mathrm{rep}}$ denote the parameters that produce the final-layer representations, and define $J_t=\partial h_t/\partial\theta_{\mathrm{rep}}$. By the chain rule, the representation-parameter gradient induced by token $t$ is $\nabla_{\theta_{\mathrm{rep}}}L_t=J_t^\top u_t$. Accordingly, for a token set $\mathcal{T}$, the corresponding segment gradient is $\nabla_{\theta_{\mathrm{rep}}}L(\mathcal{T})=|\mathcal{T}|^{-1}\sum_{t\in\mathcal{T}}J_t^\top u_t$. Thus, $\{u_t\}$ are the common upstream optimization signals that drive updates of the representation-producing parameters through the Jacobian mapping.

Eq. (5) requires the cosine alignment between the update directions induced by two token segments. For two token sets $\mathcal{T}_1$ and $\mathcal{T}_2$, its representation-parameter form is

$$\cos\!\left(\nabla_{\theta_{\mathrm{rep}}}L(\mathcal{T}_1),\nabla_{\theta_{\mathrm{rep}}}L(\mathcal{T}_2)\right)=\frac{\sum_{t\in\mathcal{T}_1}\sum_{t'\in\mathcal{T}_2}u_t^\top\left(J_tJ_{t'}^\top\right)u_{t'}}{\left\|\sum_{t\in\mathcal{T}_1}J_t^\top u_t\right\|\left\|\sum_{t'\in\mathcal{T}_2}J_{t'}^\top u_{t'}\right\|}. \tag{10}$$

Eq. (10) shows that exact representation-parameter alignment depends on both token-level upstream signals and Jacobian-induced interactions $J_tJ_{t'}^\top$.

Exact step-level evaluation of these interactions would require isolating each step loss and backpropagating it through $\theta_{\mathrm{rep}}$, leading to repeated per-step gradient computation. To make step-level scoring scalable, we introduce an interface-level surrogate that preserves upstream optimization signals while avoiding explicit construction of the Jacobian-induced geometry. This design follows scalable data valuation methods that approximate gradient information in proxy spaces [[33](https://arxiv.org/html/2605.13130#bib.bib2), [1](https://arxiv.org/html/2605.13130#bib.bib4), [21](https://arxiv.org/html/2605.13130#bib.bib27)]. Under this surrogate, the segment-level update direction is represented by the aggregated upstream signal:

$$g(\mathcal{T})\triangleq\frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}u_t. \tag{11}$$

The averaging follows the definition of the segment loss $L(\theta;z,\mathcal{T})$ as a token-level mean, ensuring that proxy directions are not biased by segment length.

Using the token sets $\mathcal{T}_k$ and $\mathcal{T}_{\mathrm{ans}}$, we write $g_k\triangleq g(\mathcal{T}_k)$ and $g_{\mathrm{ans}}\triangleq g(\mathcal{T}_{\mathrm{ans}})$. The step-level utility is then computed in the proxy space as

$$\widehat{\mathrm{Score}}_k=\begin{cases}\widehat{A}_k^{\mathrm{ans}},&k=1,\\ \alpha\widehat{A}_k^{\mathrm{ans}}+(1-\alpha)\widehat{A}_k^{\mathrm{hist}},&k>1,\end{cases} \tag{12}$$

where

$$\widehat{A}_k^{\mathrm{ans}}\triangleq\cos(g_k,g_{\mathrm{ans}}),\qquad\widehat{A}_k^{\mathrm{hist}}\triangleq\cos(g_k,\widehat{r}_k),\qquad\widehat{r}_k\triangleq\mathrm{Normalize}\left(\sum_{j<k}\omega_{k,j}g_j\right).$$

Importantly, for each sample, the fixed output projection $W_{\mathrm{out}}$, the forward-pass probabilities $\{p_t\}$, and the ground-truth tokens $\{y_t\}$ are sufficient to obtain directional gradient proxies for all reasoning steps in a single forward pass, without constructing per-step training instances or performing backward passes. Detailed derivation is provided in Appendix [C](https://arxiv.org/html/2605.13130#A3).
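A minimal PyTorch sketch of this one-pass proxy scoring (Eqs. 9, 11, 12) is shown below; it assumes access to the fixed scoring model's next-token probabilities, target token ids, and output-projection weight, and the argument names (`probs`, `lm_head_weight`, `step_token_sets`) are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def grace_proxy_scores(probs, labels, lm_head_weight, step_token_sets,
                       ans_token_set, alpha=0.7):
    """Step scores from one forward pass (Eqs. 9, 11, 12).

    probs           : (T, V) next-token probabilities p_t from the fixed scoring model.
    labels          : (T,)   ground-truth token ids y_t.
    lm_head_weight  : (V, d) output projection weight (i.e., W_out^T in the paper's notation).
    step_token_sets : list of K index tensors, token positions of each step.
    ans_token_set   : index tensor with the answer-segment token positions.
    """
    # Upstream signal u_t = W_out (p_t - y_t), computed for every token at once (Eq. 9).
    residual = probs - F.one_hot(labels, num_classes=probs.shape[-1]).to(probs.dtype)
    u = residual @ lm_head_weight                      # (T, d)

    g = lambda idx: u[idx].mean(dim=0)                 # Eq. 11: length-normalized proxy
    g_ans = g(ans_token_set)

    cos = lambda a, b: F.cosine_similarity(a, b, dim=0)
    scores, history = [], []
    for k, idx in enumerate(step_token_sets):
        g_k = g(idx)
        a_ans = cos(g_k, g_ans)                        # \hat{A}_k^ans
        if k == 0:
            scores.append(a_ans)                       # no history for the first step
        else:
            r_k = torch.stack(history).mean(dim=0)     # uniform history aggregation
            r_k = r_k / (r_k.norm() + 1e-8)
            a_hist = cos(g_k, r_k)                     # \hat{A}_k^hist
            scores.append(alpha * a_ans + (1 - alpha) * a_hist)
        history.append(g_k)
    return torch.stack(scores)                         # one \hat{Score}_k per step
```

Only one $d$-dimensional proxy per step plus one for the answer needs to be kept per sample, which is consistent with the storage cost reported in Table [5](https://arxiv.org/html/2605.13130#S3.T5).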

### 2.4 Sample-level Aggregation

Given the step-level proxy utility $\widehat{\mathrm{Score}}_{i,k}$ in Eq. (12), GRACE instantiates the sample value $V(z_i)$ by averaging step-level utilities:

$$V(z_i)=\frac{1}{K_i}\sum_{k=1}^{K_i}\widehat{\mathrm{Score}}_{i,k}. \tag{13}$$

This aggregation treats the reasoning trace as a sequence of optimization events and measures its overall training value by the average step-level utility.

The samples are ranked by $V(z_i)$ and selected according to the top-budget rule in Eq. (2). The selected subset $\mathcal{S}$ is used for post-training with the original reasoning traces. See Appendix [D](https://arxiv.org/html/2605.13130#A4) for details.
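Putting Eq. (13) and the top-budget rule of Eq. (2) together, a hypothetical driver over a pre-scored candidate pool could look like the sketch below, which reuses the illustrative `grace_proxy_scores` and `select_top_rho` helpers from the earlier sketches.

```python
def curate(candidate_pool, lm_head_weight, rho=0.2):
    """Average step utilities into V(z_i) (Eq. 13), then select the top-rho subset (Eq. 2)."""
    values = []
    for sample in candidate_pool:  # each entry carries cached forward-pass signals
        step_scores = grace_proxy_scores(sample["probs"], sample["labels"],
                                         lm_head_weight,
                                         sample["step_token_sets"],
                                         sample["ans_token_set"])
        values.append(step_scores.mean().item())   # V(z_i)
    return select_top_rho(values, rho)             # indices of the curated subset
```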

## 3 Experiments

We empirically validate GRACE through a series of experiments designed to answer five questions:

- Does GRACE curate reasoning data more effectively than existing methods (Sec. [3.1](https://arxiv.org/html/2605.13130#S3.SS1))?
- Do the curated subsets transfer across model backbones (Sec. [3.2](https://arxiv.org/html/2605.13130#S3.SS2))?
- Which components of GRACE drive its effectiveness (Sec. [3.3](https://arxiv.org/html/2605.13130#S3.SS3))?
- How robust is GRACE to its design hyperparameters (Sec. [3.4](https://arxiv.org/html/2605.13130#S3.SS4))?
- What is the computational cost of GRACE relative to gradient-based alternatives (Sec. [3.5](https://arxiv.org/html/2605.13130#S3.SS5))?

#### Experimental Setup.

We evaluate GRACE by post-training Qwen3-VL-2B-Instruct on the reasoning-rich candidate pool of MMathCoT-1M and comparing against heuristic selectors (Random, Longest, Stepmax) and data-curation baselines (LESS, ICONS, CADC) under the same selection budget and training recipe. Evaluation covers general VQA/perception, multi-task and multi-image reasoning, and mathematical reasoning benchmarks, with Rel. Avg. denoting performance normalized by the full-data baseline. Unless otherwise stated, GRACE uses $\rho=0.2$, $\gamma=0.05$, uniform history aggregation, and $\alpha=0.7$. Full experimental details, including datasets, backbones, benchmarks, training recipes, and hardware, are provided in Appendix [E](https://arxiv.org/html/2605.13130#A5).

Table 1: Performance of Qwen3-VL-2B under the data selection methods. Data % denotes the proportion of training data used, and Rel. Avg. is the average relative performance over benchmarks. ↑ indicates larger is better. Bold and italic values denote the best and second-best results among the 20% data selection methods, respectively.

| Method | Data % | Hallusion [7] ↑ | SQA [18] ↑ | MMBench [15] ↑ | MME [4] Perc. ↑ | MME [4] Cog. ↑ | MMT [37] SI ↑ | MMT [37] MI ↑ | MathVista [17] ↑ | MathVision [28] MINI ↑ | MathVision [28] Full ↑ | Rel. Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full | 100% | 43.7 | 80.2 | 69.3 | 1517.3 | 656.1 | 55.0 | 53.4 | 52.5 | 16.4 | 14.7 | – |
| Random | 20% | 45.5 | *84.5* | 72.6 | 1495.5 | 653.2 | 57.4 | 54.8 | 52.3 | 16.4 | 13.9 | 101.4 |
| Longest | 20% | 46.3 | 83.0 | 73.6 | 1494.5 | **687.5** | 56.4 | 54.6 | 51.3 | 15.8 | 14.5 | 101.7 |
| Stepmax | 20% | **46.9** | 83.8 | *73.7* | 1477.7 | 637.5 | 56.7 | 54.5 | 52.6 | 15.5 | 15.1 | 101.5 |
| LESS [33] | 20% | 44.6 | 83.1 | 70.2 | *1511.4* | 668.6 | 56.0 | 53.7 | 49.9 | *18.8* | 14.3 | 101.7 |
| ICONS [32] | 20% | 46.3 | 82.6 | 71.5 | 1497.0 | 675.0 | 55.5 | 53.8 | 51.0 | 13.2 | 14.4 | 99.1 |
| CADC [11] | 20% | 44.6 | 84.4 | 73.1 | 1498.9 | 630.7 | *57.5* | *55.2* | *54.0* | 16.1 | *16.1* | *102.6* |
| GRACE (Ours) | 5% | 48.5 | 83.0 | 71.1 | 1509.3 | 617.5 | 56.5 | 55.1 | 51.9 | 14.0 | 14.8 | 100.2 |
| GRACE (Ours) | 10% | 45.7 | 82.5 | 69.5 | 1500.7 | 660.4 | 57.7 | 55.1 | 53.0 | 18.1 | 15.5 | 103.2 |
| GRACE (Ours) | 15% | 48.5 | 82.6 | 71.1 | 1500.5 | 679.3 | 57.6 | 54.9 | 54.3 | 17.8 | 16.7 | 105.2 |
| GRACE (Ours) | 20% | *46.8* | **85.0** | **73.8** | **1512.3** | *682.9* | **58.5** | **56.3** | **54.2** | **21.7** | **17.2** | **108.8** |

### 3.1 Main Results

Table [1](https://arxiv.org/html/2605.13130#S3.T1) reports per-benchmark performance on Qwen3-VL-2B, comparing all baselines at a 20% selection ratio against GRACE at 5%, 10%, 15%, and 20%. GRACE consistently outperforms all heuristic and gradient-based baselines: at 20% data, it reaches 108.8% relative average, surpassing the strongest baseline (CADC, 102.6%) by a substantial margin and exceeding the full-data baseline by 8.8 points. With only 5% of the data, GRACE retains 100.2% of full-data performance, demonstrating that step-level optimization-aware scoring can identify highly compact yet informative subsets. This gain over full-data training is partly due to continued SFT on an instruction-tuned backbone: the math-centric candidate pool may over-specialize the model toward mathematical reasoning and cause partial forgetting on general VQA and perception abilities. Thus, reduced subsets can sometimes outperform full-data SFT, while GRACE further improves this effect by selecting traces with more favorable step-level optimization signals.

Table [2](https://arxiv.org/html/2605.13130#S3.T2) compares relative average performance across 5%-15% selection ratios. GRACE is the only method that exceeds full-data performance at every ratio, while heuristic baselines fluctuate around the full-data line and gradient-based baselines are unstable at low ratios.

Table 2: Relative average performance (%) of Qwen3-VL-2B under different data selection ratios. Data % denotes the proportion of data used. Values are normalized to the full-data training baseline.

| Data % | Full | Random | Longest | Stepmax | LESS | ICONS | CADC | GRACE (Ours) |
|---|---|---|---|---|---|---|---|---|
| 5% | 100.0 | 96.4 | 98.3 | 95.6 | 90.4 | 95.9 | 99.8 | 100.2 |
| 10% | 100.0 | 98.5 | 100.5 | 98.8 | 95.5 | 102.4 | 101.3 | 103.2 |
| 15% | 100.0 | 100.0 | 98.2 | 100.6 | 98.9 | 101.7 | 100.7 | 105.2 |
### 3.2 Transfer Across Backbones

To test whether the value assigned by GRACE reflects data properties that transfer beyond the scoring backbone, we post-train other backbones on the subset selected using Qwen3-VL-2B, *without* re-running data selection. We consider Qwen2.5-VL-3B [[25](https://arxiv.org/html/2605.13130#bib.bib40)], LLaVA-1.5-7B [[13](https://arxiv.org/html/2605.13130#bib.bib7)], and Qwen3-VL-8B [[26](https://arxiv.org/html/2605.13130#bib.bib6)], covering different model families and scales, and compare them with their corresponding full-data baselines under the same training recipe.

![Refer to caption](https://arxiv.org/html/2605.13130v1/transfer.png)

Figure 3: Transferability and training efficiency across backbones.

Figure [3](https://arxiv.org/html/2605.13130#S3.F3) further examines whether GRACE-selected data transfer across backbones. The line reports the relative average performance of the 20% GRACE subset, normalized by each backbone's full-data baseline. The dashed horizontal line marks the 100% full-data level, and the bars compare single-GPU wall-clock training time under 100% and 20% data. We select the 20% subset once using Qwen3-VL-2B and reuse it to post-train Qwen2.5-VL-3B, LLaVA-1.5-7B, and Qwen3-VL-8B without re-running data selection. The curated subset consistently surpasses the corresponding full-data baseline across all backbones, indicating that the value captured by GRACE is not tied to a single scoring model. This suggests that step-level optimization utility reflects transferable properties of reasoning data rather than model-specific artifacts.

In addition to transferability, the selected subset substantially reduces training cost, with larger savings on larger backbones where full-data post-training is more expensive. Overall, GRACE improves the efficiency-performance trade-off: a compact reasoning subset can match or surpass full-data performance while requiring only a fraction of the training time.

### 3.3 Ablation Studies

We ablate the two utility components and the gradient proxy on Qwen3-VL-2B at $\rho=0.2$. We consider five variants: (a) w/o historical score: setting $\alpha=1$, scoring each step only by answer-oriented alignment; (b) w/o answer score: removing the answer-oriented term and scoring steps only by historical alignment, with the first step omitted since no history exists; (c) Target = CoT + Ans: using the union of all reasoning steps and the answer as the optimization target; (d) Target = Suffix: using the trailing suffix after the current step, testing whether local future context suffices as an answer surrogate; (e) Proxy = projected gradient: replacing $g(\mathcal{T})$ with LESS-style projected parameter-space gradients computed per step, while keeping the rest of GRACE unchanged.

Table 3: Ablation study of GRACE on Qwen3-VL-2B at $\rho=0.2$.

| Variant | Hallusion ↑ | SQA ↑ | MMBench ↑ | MME Perc. ↑ | MME Cog. ↑ | MMT SI ↑ | MMT MI ↑ | MathVista ↑ | MathVision MINI ↑ | MathVision Full ↑ | Rel. Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRACE | 46.8 | 85.0 | 73.8 | 1512.3 | 682.9 | 58.5 | 56.3 | 54.2 | 21.7 | 17.2 | 108.8 |
| w/o Hist. | 45.9 | 83.3 | 71.0 | 1502.6 | 686.8 | 56.6 | 54.8 | 55.2 | 17.2 | 18.4 | 105.5 |
| w/o Ans. | 44.5 | 80.8 | 69.7 | 1495.1 | 655.4 | 55.2 | 53.3 | 50.7 | 13.6 | 13.8 | 97.5 |
| Target = CoT + Ans | 46.8 | 81.8 | 69.2 | 1508.1 | 655.0 | 56.1 | 54.5 | 50.7 | 15.3 | 14.5 | 100.2 |
| Target = Suffix | 44.4 | 82.1 | 68.5 | 1515.9 | 656.1 | 55.2 | 52.7 | 51.3 | 14.4 | 15.1 | 99.0 |
| Proxy = Proj. Grad. | 45.7 | 82.9 | 71.7 | 1501.0 | 661.8 | 56.7 | 54.8 | 49.0 | 12.7 | 14.1 | 98.3 |

Table [3](https://arxiv.org/html/2605.13130#S3.T3) reports the ablation results across all benchmarks. Both utility components contribute to GRACE. Using only answer-oriented alignment remains competitive, reaching 105.5% relative average performance, but still trails full GRACE by 3.3 points, showing that trajectory consistency provides complementary information beyond answer alignment. In contrast, removing the answer-oriented term drops the relative average to 97.5%, suggesting that historical consistency alone may favor internally coherent traces that are not necessarily aligned with the final optimization objective.

The target direction also matters. Replacing the answer segment with the full trace (CoT + Ans) or the local suffix reduces the relative average to 100.2% and 99.0%, respectively, indicating that the final answer is a more reliable target for step-level utility estimation. Finally, replacing our representation-level proxy with projected parameter-space gradients yields only 98.3%, confirming that our proxy is not merely an efficiency approximation but also provides a stable signal for reasoning-step valuation.

### 3.4 Hyperparameter Analysis

We study the sensitivity of GRACE to its main hyperparameters on Qwen3-VL-2B at $\rho=0.2$. We examine: (i) the candidate pool (default $\geq 8$-step subset of MMathCoT-1M vs. full MMathCoT-1M); (ii) the warm-up strategy, including the warm-up ratio $\gamma$ and scoring checkpoints; (iii) the history aggregation scheme $\omega_{k,j}$ (uniform / sliding window with size $W$ / EMA with decay $\beta$); and (iv) the balance coefficient $\alpha\in[0,1]$ for answer-oriented and historical alignment.

Table 4: Hyperparameter analysis on Qwen3-VL-2B at $\rho=0.2$. $\gamma$ denotes the warm-up ratio; $W$ and $\beta$ are used for window and EMA history aggregation, respectively. For four-checkpoint scoring, 0.25-1.0 denotes checkpoints at 25%, 50%, 75%, and 100% warm-up progress.

| Study | Variant | $\gamma$ | Hist. | $W$ | $\beta$ | $\alpha$ | Rel. Avg. ↑ |
|---|---|---|---|---|---|---|---|
| Pool | $\geq 8$-step | – | – | – | – | – | 100.0 |
| Pool | Full w/o filter | – | – | – | – | – | 99.3 |
| Warm-up | No warm-up | 0 | uniform | – | – | 0.5 | 89.6 |
| Warm-up | Larger warm-up | 0.25 | uniform | – | – | 0.5 | 99.6 |
| Warm-up | Four checkpoints | 0.25-1.0 | uniform | – | – | 0.5 | 101.7 |
| Warm-up | Default | 0.05 | uniform | – | – | 0.5 | 106.6 |
| History | Window | 0.05 | window | 3 | – | 0.5 | 104.6 |
| History | EMA | 0.05 | EMA | – | 0.8 | 0.5 | 105.7 |
| History | Uniform | 0.05 | uniform | – | – | 0.5 | 106.6 |
| Balance | $\alpha=0.1$ | 0.05 | uniform | – | – | 0.1 | 100.8 |
| Balance | $\alpha=0.5$ | 0.05 | uniform | – | – | 0.5 | 106.6 |
| Balance | $\alpha=0.7$ (default) | 0.05 | uniform | – | – | 0.7 | 108.8 |
| Balance | $\alpha=0.9$ | 0.05 | uniform | – | – | 0.9 | 107.7 |

Table [4](https://arxiv.org/html/2605.13130#S3.T4) summarizes representative hyperparameter variants. These results lead to four main observations. Candidate pool: the $\geq 8$-step pool slightly outperforms the unfiltered MMathCoT-1M pool, supporting the use of reasoning-rich traces for stable step-level scoring. Warm-up: without warm-up, performance drops to 89.6%, while the lightweight default warm-up with $\gamma=0.05$ reaches 106.6%, outperforming larger warm-up or multi-checkpoint scoring. History aggregation: uniform cumulative averaging performs best among the tested strategies, suggesting that broader reasoning history provides a stable reference direction. Balance coefficient: $\alpha=0.7$ achieves the best result (108.8%), showing that answer alignment should dominate while still retaining trajectory consistency. Full per-benchmark results are provided in Appendix [F](https://arxiv.org/html/2605.13130#A6).

### 3.5 Computational Cost

We focus on the dominant offline cost of feature/signal collection, as the subsequent ranking and top-$\rho$ selection costs are negligible in comparison. Let $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{target}}$ denote the candidate pool and the target/validation set, respectively; let $M$ be the number of scoring checkpoints, $K$ the average number of reasoning steps, $d_{\mathrm{proj}}$ the projected-gradient dimension, and $d$ the hidden dimension defined in Sec. [2.3](https://arxiv.org/html/2605.13130#S2.SS3). We use $C_{\mathrm{fwd}}$ and $C_{\mathrm{bwd}}$ for the cost of one forward and backward pass. Table [5](https://arxiv.org/html/2605.13130#S3.T5) summarizes the time and storage complexity of this collection stage.

Table 5: Cost complexity of feature/signal collection.

| Method | Time Complexity | Storage Complexity |
|---|---|---|
| Gradient projection | $\mathcal{O}\bigl(M(\lvert\mathcal{D}_{\mathrm{train}}\rvert+\lvert\mathcal{D}_{\mathrm{target}}\rvert)(C_{\mathrm{fwd}}+C_{\mathrm{bwd}})\bigr)$ | $\mathcal{O}\bigl(M(\lvert\mathcal{D}_{\mathrm{train}}\rvert+\lvert\mathcal{D}_{\mathrm{target}}\rvert)\,d_{\mathrm{proj}}\bigr)$ |
| GRACE gradient proxy | $\mathcal{O}\bigl(M\lvert\mathcal{D}_{\mathrm{train}}\rvert\,C_{\mathrm{fwd}}\bigr)$ | $\mathcal{O}\bigl(M\lvert\mathcal{D}_{\mathrm{train}}\rvert(K+1)\,d\bigr)$ |

Gradient-projection methods collect projected gradients for both training and target samples, and each feature requires a forward-backward pass. This is expensive because it scales with both $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{target}}$ and stores $d_{\mathrm{proj}}$-dimensional gradients for all samples. GRACE simplifies this collection stage. It requires no target set or backward computation; instead, it only performs forward passes over candidate training samples. Token-level upstream signals are then grouped into $K$ step proxies and one answer proxy in the hidden dimension $d$, enabling step-aware valuation in a single pass per sample. In practice, this proxy collection remains lightweight, taking 6.5 single-node hours and 5.7 GB of storage on the candidate pool.

Overall, GRACE replaces backward-based gradient extraction with forward-only proxy collection, making step-level reasoning data valuation substantially more efficient while preserving the fine-grained structure needed for our scoring objective.

## 4 Related Work

#### Reasoning supervision and step-level evaluation.

Chain-of-thought prompting and supervision improve reasoning by exposing intermediate rationales [[30](https://arxiv.org/html/2605.13130#bib.bib17), [35](https://arxiv.org/html/2605.13130#bib.bib18)]. As reasoning datasets scale, post-training pipelines often curate data using quality indicators, such as final-answer correctness, self-consistency [[29](https://arxiv.org/html/2605.13130#bib.bib21)], reward-model scores, or preference signals [[20](https://arxiv.org/html/2605.13130#bib.bib22), [23](https://arxiv.org/html/2605.13130#bib.bib29)]. Process-supervision methods further evaluate intermediate steps with human annotations or verifiers [[19](https://arxiv.org/html/2605.13130#bib.bib41), [27](https://arxiv.org/html/2605.13130#bib.bib25), [12](https://arxiv.org/html/2605.13130#bib.bib31), [5](https://arxiv.org/html/2605.13130#bib.bib30)]. GRACE instead scores steps with internal optimization signals, without external rewards or step annotations.

#### Data valuation and efficient post-training.

Influence functions and TracIn-style estimators measure training-sample effects through optimization dynamics [[9](https://arxiv.org/html/2605.13130#bib.bib26), [22](https://arxiv.org/html/2605.13130#bib.bib1), [36](https://arxiv.org/html/2605.13130#bib.bib32), [6](https://arxiv.org/html/2605.13130#bib.bib34)], and scalable variants such as LESS approximate these effects with projected gradients [[8](https://arxiv.org/html/2605.13130#bib.bib33), [33](https://arxiv.org/html/2605.13130#bib.bib2)]. Vision-language data selection methods further explore influence consensus or capability-aware curation [[32](https://arxiv.org/html/2605.13130#bib.bib5), [11](https://arxiv.org/html/2605.13130#bib.bib3)], while efficient instruction tuning often relies on diversity [[31](https://arxiv.org/html/2605.13130#bib.bib38)], difficulty [[34](https://arxiv.org/html/2605.13130#bib.bib39)], task coverage [[32](https://arxiv.org/html/2605.13130#bib.bib5)], data quality [[14](https://arxiv.org/html/2605.13130#bib.bib28), [16](https://arxiv.org/html/2605.13130#bib.bib23), [24](https://arxiv.org/html/2605.13130#bib.bib35), [2](https://arxiv.org/html/2605.13130#bib.bib37)], or sample-level influence [[33](https://arxiv.org/html/2605.13130#bib.bib2), [32](https://arxiv.org/html/2605.13130#bib.bib5), [11](https://arxiv.org/html/2605.13130#bib.bib3)]. These methods mainly operate at the sample level, whereas GRACE models the ordered internal structure of reasoning traces and aggregates step-level directional utilities for subset selection.

## 5 Conclusion

We present GRACE, a gradient-aligned reasoning data curation method for efficient post-training. Instead of treating a reasoning trace as an indivisible supervision unit, GRACE views it as a sequence of optimization events and evaluates each step through answer-oriented alignment and trajectory consistency. To make this fine-grained valuation scalable, we introduce a representation-level gradient proxy that estimates step-induced update directions from forward-pass upstream signals, avoiding per-step backward computation. Experiments on multimodal reasoning post-training show that GRACE selects compact subsets that match or surpass full-data training, with 20% curated data reaching 108.8% of full-data performance and 5% retaining 100.2%. These results suggest that the value of reasoning data depends not only on external quality indicators such as correctness, preference, or length, but also on whether its intermediate steps constructively support the optimization trajectory.

## References

- [1] J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal (2020). Deep batch active learning by diverse, uncertain gradient lower bounds. In ICLR 2020.
- [2] Y. Cao, Y. Kang, C. Wang, and L. Sun (2024). Instruction mining: instruction data selection for tuning large language models. arXiv:2307.06290.
- [3] H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang, D. Lin, and K. Chen (2024). VLMEvalKit: an open-source toolkit for evaluating large multi-modality models. In ACM Multimedia (MM) 2024, pp. 11198-11201.
- [4] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, Z. Qiu, W. Lin, J. Yang, X. Zheng, K. Li, X. Sun, and R. Ji (2023). MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394.
- [5] M. Gao, X. Liu, Z. Yue, Y. Wu, S. Chen, J. Li, S. Tang, F. Wu, T. Chua, and Y. Zhuang (2025). Benchmarking multimodal CoT reward model stepwise by visual program. In ICCV 2025, pp. 1718-1728.
- [6] A. Ghorbani and J. Y. Zou (2019). Data Shapley: equitable valuation of data for machine learning. In ICML 2019, pp. 2242-2251.
- [7] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, D. Manocha, and T. Zhou (2024). HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR 2024, pp. 14375-14385.
- [8] A. Ilyas, S. M. Park, L. Engstrom, G. Leclerc, and A. Madry (2022). Datamodels: predicting predictions from training data. In ICML 2022.
- [9] P. W. Koh and P. Liang (2017). Understanding black-box predictions via influence functions. In ICML 2017, pp. 1885-1894.
- [10] J. Lee, B. Li, and S. J. Hwang (2024). Concept-skill transferability-based data selection for large vision-language models. In EMNLP 2024, pp. 5060-5080.
- [11] J. Li, Z. Wang, J. Ma, and X. Zhang (2025). Uncovering intrinsic capabilities: a paradigm for data curation in vision-language models. arXiv:2510.00040.
- [12] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023). Let's verify step by step. arXiv:2305.20050.
- [13] H. Liu, C. Li, Y. Li, and Y. J. Lee (2024). Improved baselines with visual instruction tuning. In CVPR 2024, pp. 26286-26296.
- [14] W. Liu, W. Zeng, K. He, Y. Jiang, and J. He (2024). What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning. In ICLR 2024.
- [15] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin (2024). MMBench: is your multi-modal model an all-around player? In ECCV 2024, pp. 216-233.
- [16] Z. Liu, K. Zhou, W. X. Zhao, D. Gao, Y. Li, and J. Wen (2024). Less is more: high-value data selection for visual instruction tuning. arXiv:2403.09559.
- [17] P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K. Chang, M. Galley, and J. Gao (2024). MathVista: evaluating mathematical reasoning of foundation models in visual contexts. In ICLR 2024.
- [18] P. Lu, S. Mishra, T. Xia, L. Qiu, K. Chang, S. Zhu, O. Tafjord, P. Clark, and A. Kalyan (2022). Learn to explain: multimodal reasoning via thought chains for science question answering. In NeurIPS 2022.
- [19] R. Luo, Z. Zheng, L. Wang, Y. Wang, X. Ni, Z. Lin, S. Jiang, Y. Yu, C. Shi, R. Chu, J. Zeng, and Y. Yang (2025). Unlocking multimodal mathematical reasoning via process reward model. In NeurIPS 2025, pp. 49851-49899.
- [20] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe (2022). Training language models to follow instructions with human feedback. In NeurIPS 2022.
- [21] S. M. Park, K. Georgiev, A. Ilyas, G. Leclerc, and A. Madry (2023). TRAK: attributing model behavior at scale. In ICML 2023, pp. 27074-27113.
- [22] G. Pruthi, F. Liu, S. Kale, and M. Sundararajan (2020). Estimating training data influence by tracing gradient descent. In NeurIPS 2020.
- [23] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. In NeurIPS 2023.
- [24] B. Safaei, F. Siddiqui, J. Xu, V. M. Patel, and S. Lo (2025). Filter images first, generate instructions later: pre-instruction data selection for visual instruction tuning. In CVPR 2025, pp. 14247-14256.
- [25] Qwen Team (2025). Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/.
- [26] Qwen Team (2025). Qwen3-VL technical report. arXiv:2511.21631.
- [27] J. Uesato, N. Kushman, R. Kumar, H. F. Song, N. Y. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins (2022). Solving math word problems with process- and outcome-based feedback. arXiv:2211.14275.
- [28] K. Wang, J. Pan, W. Shi, Z. Lu, H. Ren, A. Zhou, M. Zhan, and H. Li (2024). Measuring multimodal mathematical reasoning with the MATH-Vision dataset. In NeurIPS 2024 Datasets and Benchmarks Track.
- [29] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023). Self-consistency improves chain of thought reasoning in language models. In ICLR 2023.
- [30] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou (2022). Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS 2022, pp. 24824-24837.
- [31] S. Wu, K. Lu, B. Xu, J. Lin, Q. Su, and C. Zhou (2023). Self-evolved diverse data sampling for efficient instruction tuning. arXiv:2311.08182.
- [32] X. Wu, M. Xia, R. Shao, Z. Deng, P. W. Koh, and O. Russakovsky (2025). ICONS: influence consensus for vision-language data selection. arXiv:2501.00654.
- [33] M. Xia, S. Malladi, S. Gururangan, S. Arora, and D. Chen (2024). LESS: selecting influential data for targeted instruction tuning. In ICML 2024.
- [34] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, Q. Lin, and D. Jiang (2024). WizardLM: empowering large pre-trained language models to follow complex instructions. In ICLR 2024, pp. 30745-30766.
- [35] G. Xu, P. Jin, H. Li, Y. Song, L. Sun, and L. Yuan (2024). LLaVA-CoT: let vision language models reason step-by-step. arXiv:2411.10440.
- [36] C. Yeh, J. Kim, I. E. Yen, and P. K. Ravikumar (2018). Representer point selection for explaining deep neural networks. In NeurIPS 2018.
- [37] K. Ying, F. Meng, J. Wang, Z. Li, H. Lin, Y. Yang, H. Zhang, W. Zhang, Y. Lin, S. Liu, J. Lei, Q. Lu, R. Chen, P. Xu, R. Zhang, H. Zhang, P. Gao, Y. Wang, Y. Qiao, P. Luo, K. Zhang, and W. Shao (2024). MMT-Bench: a comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI. In ICML 2024, pp. 57116-57198.
- \[38\]Y\. Zhao, J\. Huang, J\. Hu, X\. Wang, Y\. Mao, D\. Zhang, Z\. Jiang, Z\. Wu, B\. Ai, A\. Wang, W\. Zhou, and Y\. Chen\(2025\)SWIFT: A scalable lightweight infrastructure for fine\-tuning\.InThirty\-Ninth AAAI Conference on Artificial Intelligence, Thirty\-Seventh Conference on Innovative Applications of Artificial Intelligence, Fifteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2025, Philadelphia, PA, USA, February 25 \- March 4, 2025,T\. Walsh, J\. Shah, and Z\. Kolter \(Eds\.\),pp\. 29733–29735\.External Links:[Link](https://doi.org/10.1609/aaai.v39i28.35383),[Document](https://dx.doi.org/10.1609/AAAI.V39I28.35383)Cited by:[Appendix E](https://arxiv.org/html/2605.13130#A5.SS0.SSS0.Px2.p2.6)\.
- \[39\]C\. Zhou, P\. Liu, P\. Xu, S\. Iyer, J\. Sun, Y\. Mao, X\. Ma, A\. Efrat, P\. Yu, L\. Yu, S\. Zhang, G\. Ghosh, M\. Lewis, L\. Zettlemoyer, and O\. Levy\(2023\)LIMA: less is more for alignment\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/ac662d74829e4407ce1d126477f4a03a-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2605.13130#S1.p1.1)\.

## Appendix ALimitations

GRACE has several limitations\. First, GRACE estimates data utility from optimization signals observed under a fixed scoring model and warm\-up configuration\. Although our experiments show consistent gains across selection ratios, hyperparameter variants, and transferred backbones, applying GRACE to substantially different optimizers, training objectives, or model families may require additional empirical validation\. Second, our empirical evaluation focuses on multimodal reasoning post\-training with chain\-of\-thought data\. While the results cover mathematical reasoning, general visual question answering, multi\-task reasoning, and multi\-image reasoning benchmarks, extending GRACE to pure language reasoning, non\-CoT instruction data, or reinforcement\-learning\-based post\-training remains an important direction for future work\. Third, while improving data efficiency can reduce post\-training cost and make reasoning model development more accessible, more effective curation may also lower the barrier to training stronger models for unintended or harmful downstream uses\. This highlights the importance of responsible dataset governance, safety evaluation, and deployment when applying GRACE to capability\-enhancing post\-training pipelines\.

## Appendix BFirst\-order Motivation for Directional Step Utility

For a sample $z$ and a target loss $L^{\mathrm{tar}}(\theta;z)$, consider the step-induced update

$$\theta' = \theta - \eta\,\nabla_{\theta}L_{k}(\theta;z), \tag{14}$$

where $\eta > 0$. Applying a first-order Taylor expansion at $\theta$ gives

$$L^{\mathrm{tar}}(\theta';z) = L^{\mathrm{tar}}(\theta;z) + \left\langle \nabla_{\theta}L^{\mathrm{tar}}(\theta;z),\, \theta' - \theta \right\rangle + O\!\left(\|\theta' - \theta\|^{2}\right) \tag{15}$$

$$= L^{\mathrm{tar}}(\theta;z) - \eta\left\langle \nabla_{\theta}L^{\mathrm{tar}}(\theta;z),\, \nabla_{\theta}L_{k}(\theta;z) \right\rangle + O(\eta^{2}). \tag{16}$$

Therefore,

$$L^{\mathrm{tar}}(\theta';z) - L^{\mathrm{tar}}(\theta;z) = -\eta\left\langle \nabla_{\theta}L_{k}(\theta;z),\, \nabla_{\theta}L^{\mathrm{tar}}(\theta;z) \right\rangle + O(\eta^{2}). \tag{17}$$

Under a small step size, ignoring the higher-order term yields Eq. [4](https://arxiv.org/html/2605.13130#S2.E4).

The first-order term shows that the local effect of step $s_{k}$ is governed by the inner product between the step gradient and the target gradient. This inner product can be written as

$$\left\langle \nabla_{\theta}L_{k},\, \nabla_{\theta}L^{\mathrm{tar}} \right\rangle = \|\nabla_{\theta}L_{k}\|\,\|\nabla_{\theta}L^{\mathrm{tar}}\|\,A_{k}^{\mathrm{tar}}, \tag{18}$$

where

$$A_{k}^{\mathrm{tar}} \triangleq \cos\!\bigl(\nabla_{\theta}L_{k},\, \nabla_{\theta}L^{\mathrm{tar}}\bigr). \tag{19}$$

Thus, $A_{k}^{\mathrm{tar}}$ is the normalized directional component of the first-order utility. GRACE uses this normalized form to compare step directions while reducing the effect of gradient scale.
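
The relation above can be checked numerically. The following NumPy sketch uses a toy quadratic loss as a stand-in for $L^{\mathrm{tar}}$ (the toy loss, variable names, and step size are assumptions made purely for this illustration): the observed change of the target loss after a step-induced update matches $-\eta\,\|\nabla_{\theta}L_{k}\|\,\|\nabla_{\theta}L^{\mathrm{tar}}\|\,A_{k}^{\mathrm{tar}}$ up to $O(\eta^{2})$.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, eta = 16, 1e-3

# Toy quadratic losses standing in for L_k and L^tar; their gradients at theta
# are A_k (theta - b_k) and A_tar (theta - b_tar). Purely illustrative.
A_k = np.diag(rng.uniform(0.5, 2.0, dim))
A_tar = np.diag(rng.uniform(0.5, 2.0, dim))
b_k, b_tar = rng.normal(size=dim), rng.normal(size=dim)
theta = rng.normal(size=dim)

def L_tar(th):
    return 0.5 * (th - b_tar) @ A_tar @ (th - b_tar)

g_k = A_k @ (theta - b_k)        # gradient of the step loss L_k at theta
g_tar = A_tar @ (theta - b_tar)  # gradient of the target loss L^tar at theta

# Step-induced update (Eq. 14) and the actual change of the target loss
theta_new = theta - eta * g_k
actual = L_tar(theta_new) - L_tar(theta)

# First-order prediction (Eqs. 17-19): -eta * ||g_k|| * ||g_tar|| * A_k^tar
cos_align = g_k @ g_tar / (np.linalg.norm(g_k) * np.linalg.norm(g_tar))
predicted = -eta * np.linalg.norm(g_k) * np.linalg.norm(g_tar) * cos_align

print(actual, predicted)  # the two values agree up to O(eta^2)
```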

## Appendix CDerivation of Representation\-level Gradient Proxy

We derive the representation-level proxy in Sec. [2.3](https://arxiv.org/html/2605.13130#S2.SS3) from the parameter-space formulation in Sec. [2.2](https://arxiv.org/html/2605.13130#S2.SS2).

Let $\theta_{\mathrm{rep}}$ denote the parameters that produce the final hidden representations. For a token-level loss $L_{t}$, the chain rule gives

$$\nabla_{\theta_{\mathrm{rep}}}L_{t} = \left(\frac{\partial h_{t}}{\partial\theta_{\mathrm{rep}}}\right)^{\top}\frac{\partial L_{t}}{\partial h_{t}} = J_{t}^{\top}u_{t}, \tag{20}$$

where

$$J_{t} \triangleq \frac{\partial h_{t}}{\partial\theta_{\mathrm{rep}}},\qquad u_{t} \triangleq \frac{\partial L_{t}}{\partial h_{t}}.$$

For softmax cross-entropy, let $\ell_{t} = W_{\mathrm{out}}^{\top}h_{t}$, $p_{t} = \mathrm{softmax}(\ell_{t})$, and let $y_{t}$ be the one-hot label. The token-level loss is

$$L_{t} = -\sum_{v=1}^{V}(y_{t})_{v}\log p_{t,v},\qquad p_{t,v} = \frac{\exp(\ell_{t,v})}{\sum_{r=1}^{V}\exp(\ell_{t,r})}.$$

The derivative of the softmax probability with respect to the logit is

$$\frac{\partial p_{t,v}}{\partial\ell_{t,r}} = p_{t,v}\left(\mathbf{1}_{v=r} - p_{t,r}\right).$$

Applying the chain rule gives

$$\frac{\partial L_{t}}{\partial\ell_{t,r}} = -\sum_{v=1}^{V}\frac{(y_{t})_{v}}{p_{t,v}}\frac{\partial p_{t,v}}{\partial\ell_{t,r}} = -\sum_{v=1}^{V}(y_{t})_{v}\left(\mathbf{1}_{v=r} - p_{t,r}\right) = p_{t,r} - (y_{t})_{r}. \tag{21}$$

Thus, in vector form,

$$\frac{\partial L_{t}}{\partial\ell_{t}} = p_{t} - y_{t}. \tag{22}$$

Since the Jacobian of $\ell_{t}$ with respect to $h_{t}$ is $W_{\mathrm{out}}^{\top}$, the upstream gradient at the representation interface is

$$u_{t} \triangleq \frac{\partial L_{t}}{\partial h_{t}} = \left(\frac{\partial\ell_{t}}{\partial h_{t}}\right)^{\top}\frac{\partial L_{t}}{\partial\ell_{t}} = W_{\mathrm{out}}\left(p_{t} - y_{t}\right). \tag{23}$$

For a token set $\mathcal{T}$, the corresponding representation-parameter gradient is

$$\nabla_{\theta_{\mathrm{rep}}}L(\mathcal{T}) = \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}J_{t}^{\top}u_{t}. \tag{24}$$

Thus, $\{u_{t}\}$ are the common upstream optimization signals that drive updates of the representation-producing parameters through the Jacobian mapping.

Consider two token sets $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$. Their representation-parameter gradient inner product is

$$\left\langle\nabla_{\theta_{\mathrm{rep}}}L(\mathcal{T}_{1}),\,\nabla_{\theta_{\mathrm{rep}}}L(\mathcal{T}_{2})\right\rangle = \left\langle\frac{1}{|\mathcal{T}_{1}|}\sum_{t\in\mathcal{T}_{1}}J_{t}^{\top}u_{t},\;\frac{1}{|\mathcal{T}_{2}|}\sum_{t'\in\mathcal{T}_{2}}J_{t'}^{\top}u_{t'}\right\rangle = \frac{1}{|\mathcal{T}_{1}|\,|\mathcal{T}_{2}|}\sum_{t\in\mathcal{T}_{1}}\sum_{t'\in\mathcal{T}_{2}}u_{t}^{\top}\left(J_{t}J_{t'}^{\top}\right)u_{t'}. \tag{25}$$

Accordingly, their exact representation-parameter cosine alignment is

$$\cos\!\left(\nabla_{\theta_{\mathrm{rep}}}L(\mathcal{T}_{1}),\,\nabla_{\theta_{\mathrm{rep}}}L(\mathcal{T}_{2})\right) = \frac{\sum_{t\in\mathcal{T}_{1}}\sum_{t'\in\mathcal{T}_{2}}u_{t}^{\top}\left(J_{t}J_{t'}^{\top}\right)u_{t'}}{\left\|\sum_{t\in\mathcal{T}_{1}}J_{t}^{\top}u_{t}\right\|\left\|\sum_{t'\in\mathcal{T}_{2}}J_{t'}^{\top}u_{t'}\right\|}, \tag{26}$$

where the segment-length normalization factors cancel in the cosine.

Eq. [26](https://arxiv.org/html/2605.13130#A3.E26) shows that exact alignment depends on both upstream gradients and Jacobian-induced interactions. Exact step-level evaluation of these interactions would require isolating each step loss and explicitly backpropagating it through $\theta_{\mathrm{rep}}$, leading to repeated per-step gradient computation. We therefore use an interface-level surrogate that preserves the upstream optimization signals while avoiding explicit construction of the Jacobian-induced geometry, in line with scalable data valuation methods that approximate gradient information in proxy spaces. This yields the representation-level proxy

$$g(\mathcal{T}) \triangleq \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}u_{t}. \tag{27}$$

For the $k$-th reasoning step and the answer segment, we define

$$g_{k} \triangleq g(\mathcal{T}_{k}),\qquad g_{\mathrm{ans}} \triangleq g(\mathcal{T}_{\mathrm{ans}}).$$

Using this proxy, the gradient-space quantities in Sec. [2.2](https://arxiv.org/html/2605.13130#S2.SS2) are instantiated as

$$\widehat{\mathrm{Score}}_{k} = \begin{cases}\widehat{A}_{k}^{\mathrm{ans}}, & k=1,\\[4pt] \alpha\widehat{A}_{k}^{\mathrm{ans}} + (1-\alpha)\widehat{A}_{k}^{\mathrm{hist}}, & k>1,\end{cases} \tag{28}$$

where

$$\widehat{A}_{k}^{\mathrm{ans}} \triangleq \cos(g_{k},\, g_{\mathrm{ans}}),\qquad \widehat{A}_{k}^{\mathrm{hist}} \triangleq \cos(g_{k},\, \widehat{r}_{k}),\qquad \widehat{r}_{k} \triangleq \mathrm{Normalize}\!\left(\sum_{j<k}\omega_{k,j}\,g_{j}\right).$$

Thus, $\widehat{\mathrm{Score}}_{k}$ is the computable representation-level approximation of the gradient-space utility $\mathrm{Score}_{k}$. The coefficients $\omega_{k,j}$ can instantiate several common history aggregation strategies:

$$\omega_{k,j} = \begin{cases}\dfrac{1}{k-1}, & \text{Uniform},\\[6pt] \dfrac{\mathbf{1}\{\max(1,\,k-W)\leq j<k\}}{\min(W,\,k-1)}, & \text{Sliding window},\\[6pt] \dfrac{\beta^{\,k-1-j}}{\sum_{r=1}^{k-1}\beta^{\,k-1-r}}, & \text{EMA},\end{cases}$$

where $W\in\mathbb{N}$ denotes the window size, $\beta\in[0,1)$ is the decay factor, and $\mathbf{1}\{\cdot\}$ is the indicator function.
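
As a rough illustration of how the upstream signals $u_{t}$ and the step-level proxies $g_{k}$, $g_{\mathrm{ans}}$ could be obtained from a single forward pass, the PyTorch sketch below assumes next-token logits and step/answer token spans are already available; the helper name, tensor layout, and the convention that `W_out` maps the vocabulary back to the hidden size (i.e., the transpose of a typical `lm_head.weight`) are assumptions of this sketch, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def step_proxies(logits, labels, W_out, step_spans, answer_span):
    """Representation-level gradient proxies g(T) = |T|^-1 * sum_t W_out (p_t - y_t).

    Assumed shapes (illustrative only):
      logits      (T, V)  next-token logits from a single forward pass
      labels      (T,)    ground-truth token ids y_t
      W_out       (H, V)  output projection, i.e. the transpose of a typical lm_head.weight
      step_spans  list of (start, end) token index ranges, one per reasoning step
      answer_span (start, end) token index range of the answer segment
    """
    probs = F.softmax(logits, dim=-1)                                   # p_t
    y = F.one_hot(labels, num_classes=logits.size(-1)).to(probs.dtype)  # one-hot y_t
    u = (probs - y) @ W_out.T                                           # u_t = W_out (p_t - y_t), shape (T, H)

    g_steps = torch.stack([u[s:e].mean(dim=0) for s, e in step_spans])  # g_k for each reasoning step
    g_ans = u[answer_span[0]:answer_span[1]].mean(dim=0)                # g_ans for the answer segment
    return g_steps, g_ans
```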

## Appendix DDetailed Algorithm

Algorithm [1](https://arxiv.org/html/2605.13130#alg1) provides the complete curation procedure of GRACE, including model warm-up, forward-only proxy extraction, step-level scoring, sample-level aggregation, and top-$\rho$ subset selection.

Algorithm 1: GRACE reasoning data curation

Input: reasoning dataset $\mathcal{D}$, initial model $f_{\theta_{0}}$, warm-up ratio $\gamma$, selection ratio $\rho$, balance coefficient $\alpha$
Output: curated subset $\mathcal{S}$

1: Warm up $f_{\theta_{0}}$ on a subset of $\mathcal{D}$ with ratio $\gamma$ to obtain $f_{\theta}$.
2: Keep $f_{\theta}$ fixed during scoring and let $W_{\mathrm{out}}$ be its output projection.
3: for each sample $z_{i}=(x_{i},\mathbf{s}_{i},a_{i})\in\mathcal{D}$ do
4: &emsp; Run a forward pass with model $f_{\theta}$ to obtain token probabilities $p_{t}$.
5: &emsp; Compute upstream signals $u_{t}\leftarrow W_{\mathrm{out}}(p_{t}-y_{t})$, where $y_{t}$ is the one-hot ground-truth token.
6: &emsp; for each reasoning step $s_{i,k}$ do
7: &emsp;&emsp; $g_{i,k}\leftarrow|\mathcal{T}_{i,k}|^{-1}\sum_{t\in\mathcal{T}_{i,k}}u_{t}$.
8: &emsp; end for
9: &emsp; $g_{i,\mathrm{ans}}\leftarrow|\mathcal{T}_{i,\mathrm{ans}}|^{-1}\sum_{t\in\mathcal{T}_{i,\mathrm{ans}}}u_{t}$.
10: &emsp; for each reasoning step $s_{i,k}$ do
11: &emsp;&emsp; if $k=1$ then
12: &emsp;&emsp;&emsp; $\widehat{\mathrm{Score}}_{i,1}\leftarrow\cos(g_{i,1},\,g_{i,\mathrm{ans}})$.
13: &emsp;&emsp; else
14: &emsp;&emsp;&emsp; $\widehat{r}_{i,k}\leftarrow\mathrm{Normalize}\!\left(\sum_{j<k}\omega_{k,j}\,g_{i,j}\right)$.
15: &emsp;&emsp;&emsp; $\widehat{\mathrm{Score}}_{i,k}\leftarrow\alpha\cos(g_{i,k},\,g_{i,\mathrm{ans}})+(1-\alpha)\cos(g_{i,k},\,\widehat{r}_{i,k})$.
16: &emsp;&emsp; end if
17: &emsp; end for
18: &emsp; $V(z_{i})\leftarrow K_{i}^{-1}\sum_{k=1}^{K_{i}}\widehat{\mathrm{Score}}_{i,k}$.
19: end for
20: Select $\mathcal{S}$ as the top $\lceil\rho|\mathcal{D}|\rceil$ samples ranked by $V(z_{i})$.
21: return $\mathcal{S}$
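
A compact sketch of how the scoring and selection stages of Algorithm 1 might be wired together in PyTorch is shown below, using uniform history weights; it reuses the output of the hypothetical `step_proxies` helper from the previous sketch, and all names are illustrative rather than the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F

def grace_score_sample(g_steps: torch.Tensor, g_ans: torch.Tensor, alpha: float = 0.7) -> torch.Tensor:
    """Step-level scores aggregated into the sample-level value V(z), with uniform history weights."""
    scores = []
    for k in range(g_steps.size(0)):
        a_ans = F.cosine_similarity(g_steps[k], g_ans, dim=0)       # alignment with the answer direction
        if k == 0:
            scores.append(a_ans)
        else:
            r_hat = F.normalize(g_steps[:k].mean(dim=0), dim=0)      # normalized historical average
            a_hist = F.cosine_similarity(g_steps[k], r_hat, dim=0)   # consistency with preceding trajectory
            scores.append(alpha * a_ans + (1 - alpha) * a_hist)
    return torch.stack(scores).mean()                                # V(z_i): mean over the K_i step scores

def grace_select(values: list, rho: float = 0.2) -> torch.Tensor:
    """Indices of the top ceil(rho * |D|) samples ranked by V(z_i)."""
    v = torch.as_tensor(values)
    k = math.ceil(rho * v.numel())
    return torch.topk(v, k).indices
```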

## Appendix EExperimental Details

#### Training data\.

We post-train on MMathCoT-1M\[[19](https://arxiv.org/html/2605.13130#bib.bib41)\], a large-scale multimodal mathematical reasoning corpus with chain-of-thought supervision. Since GRACE evaluates step-level optimization signals, we use reasoning traces with at least eight reasoning steps as the default candidate pool, which provides sufficient granularity for stable step-level utility estimation. Preliminary experiments show that training on the full MMathCoT-1M does not improve downstream performance over this reasoning-rich pool, suggesting that this filtering does not compromise training effectiveness. Unless explicitly stated otherwise, all reported full-data baselines and selection ratios are defined with respect to this ≥8-step candidate pool.
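
A hypothetical filtering pass over the corpus might look as follows; the record layout, the field name `reasoning`, and the blank-line step delimiter are assumptions made only for illustration, since the exact step segmentation of MMathCoT-1M traces is not specified here.

```python
# Keep only traces with at least eight reasoning steps (illustrative stand-in records).
dataset = [
    {"reasoning": "Step 1 ...\n\nStep 2 ...\n\nStep 3 ..."},
    # ... more records ...
]

def count_steps(trace: str) -> int:
    # Assumption: steps are separated by blank lines.
    return sum(1 for s in trace.split("\n\n") if s.strip())

candidate_pool = [ex for ex in dataset if count_steps(ex["reasoning"]) >= 8]
```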

#### Backbone models\.

Our default backbone is Qwen3\-VL\-2B\-Instruct\[[26](https://arxiv.org/html/2605.13130#bib.bib6)\]\. To examine the transferability of GRACE\-selected subsets, we additionally post\-train Qwen2\.5\-VL\-3B\[[25](https://arxiv.org/html/2605.13130#bib.bib40)\], LLaVA\-1\.5\-7B\[[13](https://arxiv.org/html/2605.13130#bib.bib7)\], and Qwen3\-VL\-8B\-Instruct on the same curated subset, without re\-running data selection\.

For data scoring, we warm up the initial model on a $\gamma=0.05$ subset of the candidate pool to obtain the fixed scoring model $f_{\theta}$, and use a single warm-up checkpoint by default. The warm-up subset is used only to obtain the fixed scoring model and is not counted as post-training data. For fairness, all gradient- and proxy-based baselines use the same warm-up checkpoint when applicable. The default selection ratio is $\rho=0.2$. The historical reference direction $\widehat{r}_{k}$ uses uniform aggregation ($\omega_{k,j}=1/(k-1)$, i.e., the historical average), and the balance coefficient is fixed to $\alpha=0.7$. Post-training and evaluation are conducted with the ms-swift framework\[[38](https://arxiv.org/html/2605.13130#bib.bib8)\].

#### Baselines\.

We compare GRACE against two families of baselines: (i) *heuristic selectors*—Random (uniform sampling), Longest (longest reasoning traces by token length), and Stepmax (traces with the largest number of steps); (ii) *state-of-the-art data curation methods*—LESS\[[33](https://arxiv.org/html/2605.13130#bib.bib2)\], a gradient-projection-based influence method; ICONS\[[32](https://arxiv.org/html/2605.13130#bib.bib5)\], a cross-task influence-consensus selector; and CADC\[[11](https://arxiv.org/html/2605.13130#bib.bib3)\], a recent curriculum-aware data curator. For fair comparison, all baselines select subsets at the same ratio $\rho$ from the same candidate pool and are post-trained with identical training recipes.

#### Benchmarks\.

We evaluate the post-trained models on a diverse suite of multimodal benchmarks using the VLMEvalKit\[[3](https://arxiv.org/html/2605.13130#bib.bib9)\] backend integrated in ms-swift, covering three categories:

- •*general visual question answering and perception*—HallusionBench\[[7](https://arxiv.org/html/2605.13130#bib.bib10)\], ScienceQA\[[18](https://arxiv.org/html/2605.13130#bib.bib11)\], MMBench\[[15](https://arxiv.org/html/2605.13130#bib.bib12)\], MME \(Perception and Cognition\)\[[4](https://arxiv.org/html/2605.13130#bib.bib14)\];
- •*multi\-task and multi\-image reasoning*—MMT\-Bench \(single\-image\) and MMT\-Bench\_MI \(multi\-image\)\[[37](https://arxiv.org/html/2605.13130#bib.bib13)\];
- •*mathematical reasoning*—MathVista\[[17](https://arxiv.org/html/2605.13130#bib.bib15)\], MathVision\_MINI, and MathVision \(full\)\[[28](https://arxiv.org/html/2605.13130#bib.bib16)\]\.

We report task\-specific metrics for each benchmark, and additionally report relative average performance \(Rel\. Avg\.\) normalized by the full\-data training baseline\. For compactness, tables abbreviate HallusionBench as Hallusion, ScienceQA as SQA, MME Perception/Cognition as Perc\./Cog\., and MMT\-Bench single\-/multi\-image settings as SI/MI\.

Given $B$ evaluation entries, including benchmark sub-scores reported separately, we compute the relative score of a selected subset $\mathcal{S}$ on entry $b$ as

$$R_{b}(\mathcal{S}) = \frac{m_{b}(\mathcal{S})}{m_{b}(\mathcal{D}_{\mathrm{full}})} \times 100,$$

where $m_{b}(\cdot)$ denotes the metric of entry $b$, and $\mathcal{D}_{\mathrm{full}}$ denotes the full ≥8-step candidate pool. The relative average is then

$$\mathrm{Rel.\ Avg.} = \frac{1}{B}\sum_{b=1}^{B} R_{b}(\mathcal{S}).$$
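
For concreteness, the relative-average metric can be computed as in the short sketch below; the dictionary-based interface and the toy numbers are assumptions for illustration only.

```python
def relative_average(subset_metrics: dict, full_metrics: dict) -> float:
    """Rel. Avg.: mean over entries b of m_b(S) / m_b(D_full), scaled to 100."""
    rel = [100.0 * subset_metrics[b] / full_metrics[b] for b in full_metrics]
    return sum(rel) / len(rel)

# Toy usage with made-up numbers (not results from the paper):
print(relative_average({"entry_1": 55.0, "entry_2": 84.0},
                       {"entry_1": 52.5, "entry_2": 80.0}))
```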

#### Implementation framework\.

All post\-training experiments are implemented with ms\-swift v3\.12\.1, following the official recommended SFT recipe for vision\-language models\. We use PyTorch 2\.9\.0 with CUDA 12\.8 and cuDNN 9\.10\.2, and use DeepSpeed v0\.17\.6 with ZeRO\-2 optimization for distributed training\.

#### Post\-training configuration\.

Unless otherwise specified, all models are fine-tuned for one epoch with LoRA. We use bfloat16 precision, FlashAttention, padding-free training, sequence packing, and gradient checkpointing. LoRA is applied to all linear layers with rank 8 and scaling factor 32, while the vision encoder and multimodal aligner are frozen. The learning rate is set to $1\times 10^{-4}$ with a warm-up ratio of 0.05. The per-device batch size is 1, the gradient accumulation step is 2, and the maximum sequence length is 4096. Optimizer and scheduler settings follow the default ms-swift SFT configuration unless otherwise stated.
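
The paper trains through ms-swift's recommended SFT recipe; as one way to express the stated LoRA settings, a peft-style configuration sketch is given below (peft and the exact argument names are assumptions of this sketch, not the configuration mechanism used by the authors).

```python
from peft import LoraConfig

# Rank-8 LoRA with scaling factor 32 on all linear layers, mirroring the stated settings;
# freezing the vision encoder and multimodal aligner is left to the training framework.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
```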

#### Hardware\.

Experiments are conducted on a server with eight NVIDIA A800\-SXM4\-80GB GPUs and approximately 1\.0 TiB system memory, running Ubuntu 22\.04\.5 LTS\.

Table 6: Full hyperparameter results for candidate pool and warm-up strategy on Qwen3-VL-2B at $\rho=0.2$. For four-checkpoint scoring, 0.25–1.0 denotes checkpoints at 25%, 50%, 75%, and 100% warm-up progress.

| Variant | $\gamma$ | Hist. | $\alpha$ | Hallusion | SQA | MMBench | MME Perc. | MME Cog. | MMT SI | MMT MI | MathVista | MathVision MINI | MathVision Full | Rel. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ≥8-step pool | – | – | – | 43.7 | 80.2 | 69.3 | 1517.3 | 656.1 | 55.0 | 53.4 | 52.5 | 14.7 | 16.4 | 100.0 |
| Full w/o step filter | – | – | – | 43.2 | 83.4 | 73.5 | 1526.0 | 593.6 | 56.0 | 54.6 | 50.1 | 14.9 | 15.1 | 99.3 |
| Warm-up 25% | 0.25 | uniform | 0.5 | 46.2 | 83.5 | 72.5 | 1506.7 | 622.9 | 56.7 | 54.4 | 50.3 | 15.0 | 13.8 | 99.6 |
| Four checkpoints | 0.25–1.0 | uniform | 0.5 | 45.6 | 85.2 | 72.4 | 1508.5 | 650.0 | 57.2 | 56.2 | 50.7 | 14.8 | 15.8 | 101.7 |
| No warm-up | 0 | uniform | 0.5 | 12.5 | 72.2 | 70.4 | 1486.7 | 465.7 | 54.6 | 53.1 | 51.4 | 15.8 | 16.8 | 89.6 |

Table 7: Full hyperparameter results for history aggregation on Qwen3-VL-2B at $\rho=0.2$. All variants use $\gamma=0.05$ and $\alpha=0.5$. Bold and italic values denote the best and second-best Rel. Avg., respectively.

| Hist. | Param. | $\gamma$ | $\alpha$ | Hallusion | SQA | MMBench | MME Perc. | MME Cog. | MMT SI | MMT MI | MathVista | MathVision MINI | MathVision Full | Rel. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Window | W=2 | 0.05 | 0.5 | 46.1 | 83.0 | 72.7 | 1508.8 | 665.0 | 59.3 | 57.0 | 52.9 | 14.5 | 17.2 | 103.5 |
| Window | W=3 | 0.05 | 0.5 | 45.8 | 84.6 | 72.8 | 1504.5 | 692.1 | 57.2 | 55.8 | 52.0 | 17.1 | 16.7 | 104.6 |
| Window | W=4 | 0.05 | 0.5 | 44.5 | 83.7 | 73.3 | 1504.0 | 610.4 | 57.0 | 54.9 | 54.7 | 17.8 | 16.5 | 103.6 |
| Window | W=5 | 0.05 | 0.5 | 44.3 | 85.8 | 74.4 | 1494.6 | 577.5 | 57.7 | 56.5 | 53.3 | 16.4 | 16.8 | 103.0 |
| Window | W=6 | 0.05 | 0.5 | 45.1 | 82.8 | 71.4 | 1501.8 | 693.9 | 57.3 | 55.3 | 44.2 | 17.4 | 15.8 | 102.0 |
| Window | W=8 | 0.05 | 0.5 | 47.5 | 84.9 | 73.3 | 1486.8 | 652.9 | 57.5 | 55.0 | 53.8 | 17.1 | 16.1 | 104.2 |
| EMA | β=0.70 | 0.05 | 0.5 | 46.4 | 84.1 | 73.3 | 1505.0 | 671.8 | 56.7 | 55.9 | 51.1 | 18.4 | 16.3 | 104.7 |
| EMA | β=0.75 | 0.05 | 0.5 | 45.6 | 84.7 | 73.5 | 1506.2 | 658.6 | 57.3 | 55.8 | 49.8 | 19.4 | 16.1 | 104.7 |
| EMA | β=0.80 | 0.05 | 0.5 | 46.6 | 84.3 | 72.7 | 1496.4 | 669.6 | 57.7 | 55.5 | 51.0 | 20.1 | 16.4 | *105.7* |
| EMA | β=0.85 | 0.05 | 0.5 | 46.0 | 85.5 | 74.3 | 1493.9 | 588.9 | 57.2 | 55.8 | 52.9 | 18.8 | 16.1 | 104.1 |
| EMA | β=0.90 | 0.05 | 0.5 | 46.1 | 83.8 | 72.7 | 1508.2 | 666.1 | 57.1 | 54.5 | 54.1 | 17.1 | 16.4 | 104.1 |
| EMA | β=0.95 | 0.05 | 0.5 | 45.9 | 84.5 | 72.4 | 1498.1 | 651.1 | 57.1 | 56.0 | 53.8 | 17.4 | 15.3 | 103.4 |
| EMA | β=0.99 | 0.05 | 0.5 | 45.9 | 85.5 | 73.1 | 1483.4 | 610.7 | 57.3 | 55.8 | 51.7 | 15.8 | 15.5 | 101.7 |
| Uniform | – | 0.05 | 0.5 | 47.3 | 85.5 | 75.1 | 1509.6 | 678.2 | 57.5 | 56.1 | 52.2 | 19.4 | 16.5 | **106.6** |

Table 8: Full hyperparameter results for the balance coefficient $\alpha$ on Qwen3-VL-2B at $\rho=0.2$. All variants use $\gamma=0.05$. Bold and italic values denote the best and second-best Rel. Avg., respectively.

| Hist. | Param. | $\alpha$ | Hallusion | SQA | MMBench | MME Perc. | MME Cog. | MMT SI | MMT MI | MathVista | MathVision MINI | MathVision Full | Rel. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Uniform | – | 0.9 | 46.6 | 84.7 | 72.4 | 1506.0 | 683.9 | 56.3 | 54.2 | 52.9 | 21.4 | 17.9 | *107.7* |
| Uniform | – | 0.8 | 46.2 | 82.5 | 71.5 | 1496.1 | 678.2 | 56.3 | 54.0 | 53.4 | 18.4 | 17.1 | 104.7 |
| Uniform | – | 0.7 | 46.8 | 85.0 | 73.8 | 1512.3 | 682.9 | 58.5 | 56.3 | 54.2 | 21.7 | 17.2 | **108.8** |
| Uniform | – | 0.6 | 45.7 | 83.5 | 71.1 | 1501.3 | 682.1 | 56.9 | 54.1 | 52.7 | 16.4 | 16.5 | 103.2 |
| Uniform | – | 0.5 | 47.3 | 85.5 | 75.1 | 1509.6 | 678.2 | 57.5 | 56.1 | 52.2 | 19.4 | 16.5 | 106.6 |
| Uniform | – | 0.4 | 44.0 | 81.6 | 70.0 | 1494.5 | 686.1 | 54.7 | 53.2 | 52.5 | 17.1 | 15.7 | 101.7 |
| Uniform | – | 0.3 | 47.1 | 82.7 | 70.0 | 1496.6 | 667.9 | 55.2 | 53.8 | 51.1 | 18.4 | 16.5 | 103.6 |
| Uniform | – | 0.2 | 45.7 | 85.1 | 70.8 | 1500.8 | 665.0 | 56.9 | 54.8 | 50.3 | 17.1 | 16.0 | 102.8 |
| Uniform | – | 0.1 | 43.9 | 83.8 | 71.4 | 1516.0 | 647.5 | 56.6 | 55.1 | 53.6 | 13.8 | 15.9 | 100.8 |
| Window | W=3 | 0.9 | 46.3 | 83.7 | 71.5 | 1493.9 | 686.4 | 57.0 | 54.7 | 54.6 | 21.7 | 16.6 | 107.2 |
| Window | W=3 | 0.8 | 45.1 | 81.9 | 71.8 | 1510.2 | 677.5 | 57.2 | 54.7 | 52.5 | 18.4 | 17.0 | 104.6 |
| Window | W=3 | 0.7 | 44.5 | 84.0 | 72.5 | 1496.6 | 673.6 | 56.8 | 54.3 | 52.3 | 17.8 | 16.4 | 103.7 |
| Window | W=3 | 0.6 | 44.7 | 82.6 | 71.1 | 1503.7 | 695.0 | 56.3 | 54.2 | 52.1 | 15.5 | 16.2 | 102.0 |
| Window | W=3 | 0.5 | 45.8 | 84.6 | 72.8 | 1504.5 | 692.1 | 57.2 | 55.8 | 52.0 | 17.1 | 16.7 | 104.6 |
| Window | W=3 | 0.4 | 46.1 | 83.7 | 71.1 | 1496.1 | 690.0 | 57.1 | 54.7 | 53.1 | 17.4 | 15.6 | 103.6 |
| Window | W=3 | 0.3 | 46.4 | 83.7 | 71.8 | 1505.3 | 670.0 | 56.4 | 53.9 | 53.0 | 18.8 | 15.5 | 104.0 |
| Window | W=3 | 0.2 | 44.1 | 83.1 | 72.0 | 1513.2 | 677.1 | 56.1 | 53.7 | 52.9 | 17.8 | 15.5 | 102.8 |
| Window | W=3 | 0.1 | 45.6 | 83.6 | 71.6 | 1503.0 | 651.8 | 56.0 | 55.2 | 51.7 | 18.4 | 14.7 | 102.6 |
| EMA | β=0.8 | 0.9 | 46.1 | 83.6 | 72.5 | 1491.7 | 661.8 | 57.2 | 54.7 | 53.5 | 17.4 | 17.0 | 104.4 |
| EMA | β=0.8 | 0.8 | 45.2 | 82.9 | 70.2 | 1498.7 | 672.5 | 56.7 | 54.5 | 51.7 | 16.4 | 17.2 | 103.0 |
| EMA | β=0.8 | 0.7 | 44.1 | 83.7 | 71.7 | 1511.6 | 689.3 | 56.2 | 54.7 | 52.4 | 15.8 | 16.7 | 102.8 |
| EMA | β=0.8 | 0.6 | 47.4 | 83.9 | 70.5 | 1501.0 | 692.9 | 55.3 | 53.7 | 53.6 | 18.1 | 17.4 | 105.1 |
| EMA | β=0.8 | 0.5 | 46.6 | 84.3 | 72.7 | 1496.4 | 669.6 | 57.7 | 55.5 | 51.0 | 20.1 | 16.4 | 105.7 |
| EMA | β=0.8 | 0.4 | 45.5 | 82.7 | 71.6 | 1498.3 | 677.9 | 56.4 | 54.6 | 53.1 | 17.4 | 16.9 | 104.0 |
| EMA | β=0.8 | 0.3 | 45.0 | 84.5 | 71.4 | 1506.5 | 659.6 | 55.8 | 53.9 | 52.5 | 17.4 | 16.8 | 103.4 |
| EMA | β=0.8 | 0.2 | 44.9 | 85.1 | 72.3 | 1512.8 | 635.4 | 56.4 | 54.5 | 54.2 | 16.1 | 16.3 | 102.7 |
| EMA | β=0.8 | 0.1 | 43.9 | 82.4 | 69.7 | 1502.4 | 656.1 | 55.9 | 53.7 | 53.6 | 16.4 | 16.7 | 102.1 |

## Appendix FFull Hyperparameter Results

We provide the full per-benchmark hyperparameter results in Tables [6](https://arxiv.org/html/2605.13130#A5.T6)–[8](https://arxiv.org/html/2605.13130#A5.T8).
