LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

arXiv cs.LG Papers

Summary

LARK proposes a learnability-grounded method for selecting reasoning trajectories in LLM distillation, employing a learnability factor and χ²-regularized selection policy that balances efficiency and generalization, consistently outperforming baselines across models and tasks.

arXiv:2605.30651v1 Announce Type: new Abstract: We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or model confidence, but they often overlook whether a trajectory is learnable by the student. In this paper, we present LARK, a learnability-grounded method for reasoning trajectory selection. LARK selects trajectories that the student can learn efficiently while preserving the generalization of the full training distribution. At the core of LARK is a learnability factor $\rho$, which characterizes the rate at which the student's training loss decreases. To estimate this rate efficiently and maintain generalization, we introduce a learnability proxy and a $\chi^2$-regularized selection policy that balances learnability and distributional coverage, both with strong theoretical guarantees on their estimation error. Empirically, LARK consistently outperforms data selection baselines across multiple base models and reasoning tasks. Diagnostic analyses show that the LARK score predicts downstream training utility and that LARK-selected trajectories induce faster supervised fine-tuning loss reduction. Our code is available at https://github.com/Tianrun-Yu/LARK.


# LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation
Source: [https://arxiv.org/html/2605.30651](https://arxiv.org/html/2605.30651)
Tianrun Yu¹, Kaixiang Zhao¹, Chih-Chun Chen¹, Amanda Hughes¹, Taylor W. Killian¹, Fenglong Ma², Weitong Zhang³, Porter Jenkins¹,\*

¹Brigham Young University, ²The Pennsylvania State University, ³University of North Carolina at Chapel Hill

\*Corresponding author: pjenkins@cs.byu.edu

###### Abstract

We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or model confidence, but they often overlook whether a trajectory is learnable by the student. In this paper, we present LARK (code available at [https://github.com/Tianrun-Yu/LARK](https://github.com/Tianrun-Yu/LARK)), a learnability-grounded method for reasoning trajectory selection. LARK selects trajectories that the student can learn efficiently while preserving the generalization of the full training distribution. At the core of LARK is a learnability factor $\rho$, which characterizes the rate at which the student's training loss decreases. To estimate this rate efficiently and maintain generalization, we introduce a learnability proxy and a $\chi^2$-regularized selection policy that balances learnability and distributional coverage, both with strong theoretical guarantees on their estimation error. Empirically, LARK consistently outperforms data selection baselines across multiple base models and reasoning tasks. Diagnostic analyses show that the LARK score predicts downstream training utility and that LARK-selected trajectories induce faster supervised fine-tuning loss reduction.

## 1 Introduction

Reasoning distillation has emerged as an effective paradigm for transferring chain-of-thought reasoning abilities from larger teacher Large Language Models (LLMs) to smaller student models (Hsieh et al., [2023](https://arxiv.org/html/2605.30651#bib.bib37); Yuan et al., [2023](https://arxiv.org/html/2605.30651#bib.bib36)). In this setting, a teacher model generates reasoning trajectories, and a student model is fine-tuned to imitate them. Recent studies have shown that *a small set of carefully selected reasoning examples* can yield substantial performance gains (Ye et al., [2025](https://arxiv.org/html/2605.30651#bib.bib1); Muennighoff et al., [2025](https://arxiv.org/html/2605.30651#bib.bib2)), often rivaling the use of much larger training sets. These findings suggest that reasoning distillation requires both *generating* sufficient reasoning data and *identifying* the supervision that is most useful for the student.

Many existing data selection methods for reasoning distillation still inherit the classical formulation of data selection in LLM fine-tuning (Xia et al., [2024](https://arxiv.org/html/2605.30651#bib.bib10); Xiao and Chen, [2025](https://arxiv.org/html/2605.30651#bib.bib9)), where selection is performed at the question or sample level (Yu et al., [2023](https://arxiv.org/html/2605.30651#bib.bib38); Zhang et al., [2025b](https://arxiv.org/html/2605.30651#bib.bib6); Liu et al., [2024](https://arxiv.org/html/2605.30651#bib.bib7)). However, reasoning distillation presents a finer-grained selection problem. For a single question, we often have multiple candidate reasoning trajectories, generated by different teacher models, sampling runs, or reasoning styles. Even when several trajectories lead to the same correct final answer, they may provide very different training signals to the current student model. A trajectory that appears natural, high-quality, or well aligned with the student model is not necessarily the one that the student can learn from most efficiently.

Existing trajectory selection methods largely rely on heuristic criteria. Some methods use external verifiers or LLM-as-a-judge scores to assess reasoning quality (Zheng et al., [2023](https://arxiv.org/html/2605.30651#bib.bib39); Yang et al., [2025](https://arxiv.org/html/2605.30651#bib.bib24)), while GRAPE, Local Naturalness, and RSR score candidates using the student model itself (Zhang et al., [2025a](https://arxiv.org/html/2605.30651#bib.bib4); Just et al., [2025](https://arxiv.org/html/2605.30651#bib.bib8); Yang et al., [2026](https://arxiv.org/html/2605.30651#bib.bib3)). Although these methods are useful, they do not explicitly measure whether a trajectory is *learnable* by the student. This gap motivates the central question:

*Can we design a principled trajectory selection criterion that ensures distilled reasoning trajectories are learnable by the student model?*

We answer this question by proposing LARK (Learnability-grounded Anchor-time Ranking), a learnability-grounded method for reasoning trajectory selection, as illustrated in Figure [1](https://arxiv.org/html/2605.30651#S1.F1). LARK identifies the subset of trajectories from which the student model can learn most efficiently. At its core, LARK introduces a principled selection criterion that characterizes trajectory learnability from an optimization perspective, while preserving generalization through $\chi^2$-regularization. Our contributions are as follows:

- We formulate reasoning trajectory selection as a learnability-grounded policy optimization problem. We introduce the anchor-time learnability rate $\rho$, which characterizes the decay rate of the post-training loss and thereby turns learnability into a principled optimization objective.
- To estimate the learnability rate $\rho$ efficiently while preventing the selector from hacking the learnability criterion, we use a first-order Taylor expansion around the unselected data distribution and derive a $\chi^2$-regularized policy optimization problem. Theoretically, we show that this policy optimization implicitly increases data learnability and can be solved in closed form under fixed-budget trajectory selection.
- Empirically, we show that LARK outperforms existing baselines across multiple base models and tasks. Diagnostic analyses further support our theoretical claims and validate the learnability-grounded perspective for trajectory selection.

![Refer to caption](https://arxiv.org/html/2605.30651v1/x1.png)

Figure 1: Overview of the LARK pipeline. For each question, multiple teacher-generated reasoning trajectories form a candidate pool. LARK uses the student model to score these trajectories, ranks them by the score, and selects the top-$B$ trajectories for weighted SFT. In the figure, $a_k$ denotes the $k$-th trajectory after ranking, $g_k$ denotes the practical LARK score $\hat{g}_k$, and $q_k$ denotes its training weight. The indices in the ranking panel are re-numbered after sorting.

##### Notation.

Vectors are denoted by lowercase boldface letters, e.g., $\mathbf{x}$. For a sequential response, $\mathbf{y}_{<t}$ denotes the sequence of tokens preceding the $t$-th token. We denote by $[n]$ the set $\{1, \cdots, n\}$. For two positive sequences $\{a_n\}_{n \geq 1}$ and $\{b_n\}_{n \geq 1}$, we write $a_n = o(b_n)$ if $a_n / b_n \to 0$ as $n \to \infty$. We denote the $K$-dimensional simplex by $\Delta^K$, and $\|\cdot\|$ denotes the Euclidean norm throughout the paper. For two distributions $p$ and $q$, the $\chi^2$ divergence is defined as $\chi^2(q \parallel p) = \int_x (q(x) - p(x))^2 / p(x)\,\mathrm{d}x$, with the integral replaced by a summation in the discrete case.

## 2 Related Work

Reasoning distillation and data selection. Reasoning distillation transfers chain-of-thought traces from stronger teacher models to smaller student models (Hsieh et al., [2023](https://arxiv.org/html/2605.30651#bib.bib37); Yuan et al., [2023](https://arxiv.org/html/2605.30651#bib.bib36)). Recent work has shown that a small but carefully chosen set of reasoning examples can rival much larger training sets (Yu et al., [2023](https://arxiv.org/html/2605.30651#bib.bib38); Ye et al., [2025](https://arxiv.org/html/2605.30651#bib.bib1); Muennighoff et al., [2025](https://arxiv.org/html/2605.30651#bib.bib2)). This has motivated a growing body of work on data-selection methods that filter examples by diversity, difficulty, uncertainty, or learning impact (Zhang et al., [2025b](https://arxiv.org/html/2605.30651#bib.bib6); Liu et al., [2024](https://arxiv.org/html/2605.30651#bib.bib7); Li et al., [2025](https://arxiv.org/html/2605.30651#bib.bib40)).

Most of these methods operate at the question or example level. In contrast, we study a finer-grained problem that arises in reasoning distillation. For a given question, multiple candidate reasoning trajectories may be available. These trajectories can all lead to correct answers while providing substantially different training signals to the student. LARK therefore asks which trajectory, among candidates for the same question, is most learnable for the current student model.

Trajectory selection and optimization-grounded weighting. Existing trajectory-selection methods typically score candidates by either external quality signals or student-side fit. External methods rely on correctness checks, verifiers, or LLM-as-a-judge evaluation to estimate the quality of a reasoning trajectory (Cobbe et al., [2021](https://arxiv.org/html/2605.30651#bib.bib13)). Student-side methods instead use the student model's own behavior, such as likelihood (GRAPE; Zhang et al., [2025a](https://arxiv.org/html/2605.30651#bib.bib4)), local-context likelihood (Local Naturalness; Just et al., [2025](https://arxiv.org/html/2605.30651#bib.bib8)), or rank-surprisal alignment (RSR; Yang et al., [2026](https://arxiv.org/html/2605.30651#bib.bib3)), to identify trajectories that appear better aligned with the student.

LARK differs from both families by deriving its selection criterion from a local learnability objective. Specifically, it asks whether a trajectory yields a favorable *optimization signal* for the current student. This perspective connects our work to gradient-based data selection rooted in the Polyak–Łojasiewicz framework (Lojasiewicz, [1963](https://arxiv.org/html/2605.30651#bib.bib29); Karimi et al., [2016](https://arxiv.org/html/2605.30651#bib.bib30); Xia et al., [2024](https://arxiv.org/html/2605.30651#bib.bib10)), though unlike those methods we avoid per-trajectory backward passes by using a forward-pass proxy. For multi-trajectory selection, we adopt a $\chi^2$-regularized weighting that is related in spirit to regularized alignment and offline optimization (Rafailov et al., [2023](https://arxiv.org/html/2605.30651#bib.bib15); Huang et al., [2024](https://arxiv.org/html/2605.30651#bib.bib17); Xie et al., [2022](https://arxiv.org/html/2605.30651#bib.bib27)), yielding a closed-form, budgeted rule.

## 3 Preliminaries

### 3.1 Supervised Fine-Tuning for Reasoning Distillation

We consider the supervised fine-tuning (SFT) of the student model. Let $\mathbf{x}$ be the question and $\mathbf{y}_k = (y_1^k, \cdots, y_{|a_k|}^k)^\top$ denote the $k$-th trajectory from the teacher with length $|a_k|$. The SFT loss is defined by

$$\ell(\boldsymbol{\theta}, \mathbf{y}_k) = \tfrac{1}{|a_k|} \sum_{t=1}^{|a_k|} -\log \pi_{\boldsymbol{\theta}}(y_t \mid \mathbf{x},\, \mathbf{y}_{<t}),$$

where $\pi_{\boldsymbol{\theta}}$ denotes the student language model with parameter $\boldsymbol{\theta}$. Over a set of response candidates $\{\mathbf{y}_k\}_{k=1}^{K}$, we consider the finite-sum loss with weights $\mathbf{q} = (q_1, \cdots, q_K)^\top$, written as

$$\mathcal{L}(\boldsymbol{\theta}, \mathbf{q}) = \sum_{k=1}^{K} q_k \cdot \ell(\boldsymbol{\theta}, \mathbf{y}_k), \quad \text{where } \sum_{k=1}^{K} q_k = 1,\ q_k \geq 0,\ \forall k \in [K], \tag{1}$$

where $\mathbf{q}$ encodes which trajectories should be emphasized during SFT. For instance, when $q_k = \tfrac{1}{K}$ for all $k \in [K]$, Eq. ([1](https://arxiv.org/html/2605.30651#S3.E1)) reduces to the standard SFT loss over $K$ trajectories. Selecting $B$ trajectories for training is equivalent to finding a weight vector $\mathbf{q}$ on the simplex $\Delta^K$ with exactly $B$ non-zero elements.
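The weighted objective in Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the per-token log-probabilities are made-up numbers, and the "selected" weight vector simply puts equal mass on two trajectories to show what a budget of $B = 2$ looks like (the weights LARK actually assigns are derived later in the paper).

```python
import numpy as np

def trajectory_loss(token_logprobs):
    """Length-normalized negative log-likelihood ell(theta, y_k) of one trajectory."""
    token_logprobs = np.asarray(token_logprobs, dtype=float)
    return float(-token_logprobs.mean())

def weighted_sft_loss(per_trajectory_logprobs, q):
    """Eq. (1): sum_k q_k * ell_k, with q required to lie on the simplex."""
    q = np.asarray(q, dtype=float)
    if np.any(q < 0) or not np.isclose(q.sum(), 1.0):
        raise ValueError("q must lie on the probability simplex")
    losses = np.array([trajectory_loss(lp) for lp in per_trajectory_logprobs])
    return float(q @ losses)

# Hypothetical student log-probabilities for K = 3 trajectories of different lengths.
logprobs = [
    np.log([0.5, 0.5]),
    np.log([0.25, 0.25, 0.25, 0.25]),
    np.log([0.125]),
]

uniform = np.full(3, 1 / 3)            # standard SFT over all K trajectories
selected = np.array([0.5, 0.5, 0.0])   # illustrative budget B = 2 (two non-zero weights)

print("uniform  :", weighted_sft_loss(logprobs, uniform))
print("selected :", weighted_sft_loss(logprobs, selected))
```

Dropping the third trajectory lowers the weighted loss here only because that trajectory has the highest per-token loss in this toy data; a lower loss is not the selection criterion the paper proposes.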

### 3.2 Gradient Flow Loss Decay Identity

We characterize learnability through the lens of optimization theory, where the Polyak–Łojasiewicz (PL) condition for over-parameterized optimization (Bassily et al., [2018](https://arxiv.org/html/2605.30651#bib.bib41); Liu et al., [2022](https://arxiv.org/html/2605.30651#bib.bib34); Chatterjee, [2022](https://arxiv.org/html/2605.30651#bib.bib42)) centers on the ratio of the squared gradient norm to the loss. We write this ratio as

$$\rho(\boldsymbol{\theta}; \mathbf{q}) \triangleq \|\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}; \mathbf{q})\|^2 / \mathcal{L}(\boldsymbol{\theta}; \mathbf{q}), \tag{2}$$

which we refer to as the *loss decay rate function*. $\rho(\boldsymbol{\theta}, \mathbf{q})$ characterizes the landscape of the optimization process. For instance, when $\rho(\boldsymbol{\theta}, \mathbf{q}) \geq \mu$, Eq. ([2](https://arxiv.org/html/2605.30651#S3.E2)) implies $\|\nabla_{\boldsymbol{\theta}} \mathcal{L}\|^2 \geq \mu \mathcal{L}$, which is commonly used in the analysis of over-parameterized optimization (Liu et al., [2022](https://arxiv.org/html/2605.30651#bib.bib34)), where the optimal loss is $\mathcal{L}^* = 0$.

Decay rate of gradient flow. Let $T > 0$ denote the total fine-tuning time. For a fixed weight vector $\mathbf{q}$, the *gradient flow* $\phi_{\mathbf{q}} : [0, T] \to \mathbb{R}^{|\boldsymbol{\theta}|}$ describes the evolution of the parameters when the student is fine-tuned on the weighted objective $\mathcal{L}(\cdot\,; \mathbf{q})$, defined as the solution to the ODE

$$\dot{\phi}_{\mathbf{q}}(s) = -\nabla_{\boldsymbol{\theta}} \mathcal{L}(\phi_{\mathbf{q}}(s); \mathbf{q}), \qquad \phi_{\mathbf{q}}(0) = \boldsymbol{\theta}_{\mathrm{ref}}, \qquad s \in [0, T].$$

Here $\phi_{\mathbf{q}}(0) = \boldsymbol{\theta}_{\mathrm{ref}}$ is the starting point of fine-tuning, $\phi_{\mathbf{q}}(T)$ is the parameter at time $T$, and $s$ is the continuous gradient-flow time variable. By the chain rule and Eq. ([2](https://arxiv.org/html/2605.30651#S3.E2)), we have

$$\tfrac{\mathrm{d}}{\mathrm{d}s} \mathcal{L}(\phi_{\mathbf{q}}(s); \mathbf{q}) = -\|\nabla_{\boldsymbol{\theta}} \mathcal{L}(\phi_{\mathbf{q}}(s); \mathbf{q})\|^2 = -\rho(\phi_{\mathbf{q}}(s); \mathbf{q})\, \mathcal{L}(\phi_{\mathbf{q}}(s); \mathbf{q}), \quad \phi_{\mathbf{q}}(0) = \boldsymbol{\theta}_{\mathrm{ref}}.$$

Solving this ordinary differential equation yields

$$\mathcal{L}(\phi_{\mathbf{q}}(T); \mathbf{q}) = \mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q}) \cdot \exp\left(-\int_0^T \rho(\phi_{\mathbf{q}}(s); \mathbf{q})\,\mathrm{d}s\right), \tag{3}$$

which implies that the integral $\int_0^T \rho(\phi_{\mathbf{q}}(s); \mathbf{q})\,\mathrm{d}s$ plays an important role in controlling the loss decay.
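The decay identity in Eq. (3) can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: it replaces the SFT loss with a hypothetical quadratic $\tfrac{1}{2}\boldsymbol{\theta}^\top A \boldsymbol{\theta}$ (minimum value 0), approximates gradient flow with small explicit Euler steps, and accumulates $\int_0^T \rho\,\mathrm{d}s$ as a Riemann sum. The matrix `A`, the start point, and the step count are arbitrary choices.

```python
import numpy as np

def simulate_gradient_flow(A, theta0, T=1.0, steps=20000):
    """Euler discretization of d(theta)/ds = -grad L for L = 0.5 * theta^T A theta.

    Returns the final loss and a Riemann-sum estimate of the integral of
    rho = ||grad L||^2 / L along the path.
    """
    dt = T / steps
    theta = theta0.astype(float).copy()
    integral_rho = 0.0
    for _ in range(steps):
        grad = A @ theta
        loss = 0.5 * theta @ grad
        integral_rho += (grad @ grad) / loss * dt
        theta = theta - dt * grad
    final_loss = 0.5 * theta @ A @ theta
    return float(final_loss), float(integral_rho)

A = np.diag([1.0, 2.0, 5.0])   # hypothetical curvature, not from the paper
theta0 = np.ones(3)
initial_loss = 0.5 * theta0 @ A @ theta0

final_loss, integral_rho = simulate_gradient_flow(A, theta0)
predicted_loss = initial_loss * np.exp(-integral_rho)

print("simulated final loss  :", final_loss)
print("Eq. (3) prediction    :", predicted_loss)
```

The two numbers agree up to discretization error, which shrinks as `steps` grows. In this toy problem $\rho$ is not constant along the path, which is why the integral, and not a single value of $\rho$, appears in the exponent.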

## 4 Proposed Method

In this section, we present the methodology of LARK. Specifically, §[4.1](https://arxiv.org/html/2605.30651#S4.SS1) establishes that the time-integrated learnability objective in Eq. ([3](https://arxiv.org/html/2605.30651#S3.E3)) can be controlled by the anchor-time learnability rate $\rho(\cdot)$ evaluated at the initial student model. §[4.2](https://arxiv.org/html/2605.30651#S4.SS2) then approximately optimizes $\rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})$ through local linearization around the uniform prior, develops a forward-pass proxy $\hat{g}_k$ that estimates the local gradient of $\rho$ without backward propagation, and bounds the residual errors by a $\chi^2$-regularization term to preserve generalization and prevent reward hacking of the learnability criterion. §[4.3](https://arxiv.org/html/2605.30651#S4.SS3) combines these pieces into a closed-form, budget-parameterized selection rule with no hyperparameter to tune.

### 4.1 Estimating the Learnability Objective with Anchor-Relative Condition

As shown in Eq. ([3](https://arxiv.org/html/2605.30651#S3.E3)), selecting trajectories that are *learnable* by the student model can be formulated as optimizing the time-integrated learnability objective $\max_{\mathbf{q} \in \Delta^K} \int_0^T \rho(\phi_{\mathbf{q}}(s); \mathbf{q})\,\mathrm{d}s$, which favors trajectory distributions under which the post-training loss decays rapidly during SFT. However, directly optimizing this objective is infeasible, since it requires tracking the factor $\rho$ along the entire SFT optimization trajectory. To address this challenge, we introduce the following structural condition, which allows the learnability objective to be approximated from a fixed anchor model.

###### Condition 1.

We say the model satisfies the anchor-relative condition if there exists an absolute constant $\kappa > 0$ such that, for every $\mathbf{q} \in \Delta^K$ and every $s \in [0, T]$, $\rho(\phi_{\mathbf{q}}(s); \mathbf{q}) \geq \kappa\, \rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})$.

Under Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1), the learnability rate evaluated at the initial student parameters $\boldsymbol{\theta}_{\mathrm{ref}}$ provides a tractable proxy for the full time-integrated objective. This anchor-relative condition is consistent with the lazy fine-tuning behavior often observed in over-parameterized LLMs during SFT. In this regime, the model parameters remain close to the initialization $\boldsymbol{\theta}_{\mathrm{ref}}$, and consequently the learnability rate $\rho$ remains close to its anchor-time value $\rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})$. We provide a theoretical justification of this condition under the neural tangent kernel regime (NTK; Jacot et al., [2018](https://arxiv.org/html/2605.30651#bib.bib43)) in Appendix [A.2](https://arxiv.org/html/2605.30651#A1.SS2), together with empirical validation in Appendix [A.3](https://arxiv.org/html/2605.30651#A1.SS3).

The following proposition then follows naturally; we defer its proof to Appendix [A.1](https://arxiv.org/html/2605.30651#A1.SS1).

###### Proposition 1.

Under Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1), for every $\mathbf{q} \in \Delta^K$, the post-training loss of the student model satisfies $\mathcal{L}(\phi_{\mathbf{q}}(T); \mathbf{q}) \leq \mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q}) \cdot \exp\bigl(-\kappa T \cdot \rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})\bigr)$.

Proposition [1](https://arxiv.org/html/2605.30651#Thmproposition1) shows that, under the anchor-relative condition, the decay of the SFT loss can be controlled by the learnability rate evaluated at the anchor model $\boldsymbol{\theta}_{\mathrm{ref}}$. Therefore, instead of directly optimizing the intractable trajectory-wide objective $\int_0^T \rho(\phi_{\mathbf{q}}(s); \mathbf{q})\,\mathrm{d}s$, we can optimize the anchor-time surrogate $\max_{\mathbf{q} \in \Delta^K} \rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})$ to encourage faster loss decay for the student model.
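For readers who want the intermediate step, the bound appears to follow directly from Eq. (3) and Condition 1. The chain below is our reading of the argument and is not copied from Appendix A.1; it uses only that the loss is non-negative and that $\exp(-x)$ is decreasing in $x$.

```latex
\begin{aligned}
\mathcal{L}(\phi_{\mathbf{q}}(T); \mathbf{q})
  &= \mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})
     \cdot \exp\Bigl(-\int_0^T \rho(\phi_{\mathbf{q}}(s); \mathbf{q})\,\mathrm{d}s\Bigr)
     && \text{(Eq. (3))} \\
  &\leq \mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})
     \cdot \exp\Bigl(-\int_0^T \kappa\,\rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})\,\mathrm{d}s\Bigr)
     && \text{(Condition 1, pointwise in } s\text{)} \\
  &= \mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})
     \cdot \exp\bigl(-\kappa T \cdot \rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})\bigr).
\end{aligned}
```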

### 4.2 Approximately Optimizing Learnability $\rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})$ through Local Linearization

In practice, directly optimizing $\rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})$ is problematic. As shown in Appendix [B.1](https://arxiv.org/html/2605.30651#A2.SS1), this objective induces a quasi-convex maximization problem over the simplex $\Delta^K$, whose optimum is attained at a vertex of the simplex. Consequently, exact maximization would lead to a degenerate solution that places all probability mass on a single trajectory, thereby “hacking” the learnability criterion instead of producing a useful training distribution.
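The vertex degeneracy is easy to observe numerically. The sketch below is illustrative only: it uses random vectors as stand-ins for the per-trajectory gradients and random positive numbers as stand-ins for the losses, evaluates $\rho$ at every vertex of the simplex, and compares against many randomly sampled interior weight vectors.

```python
import numpy as np

def rho(q, G, losses):
    """rho = ||sum_k q_k g_k||^2 / sum_k q_k ell_k, with gradients stacked in G (K x d)."""
    mixed_grad = q @ G
    return float(mixed_grad @ mixed_grad / (q @ losses))

rng = np.random.default_rng(0)
K, d = 6, 32
G = rng.normal(size=(K, d))                 # hypothetical per-trajectory gradients
losses = rng.uniform(0.5, 2.0, size=K)      # hypothetical per-trajectory losses

# Each vertex puts all mass on a single trajectory.
vertex_values = np.array([rho(np.eye(K)[k], G, losses) for k in range(K)])

# Random weight vectors drawn from the interior of the simplex.
random_q = rng.dirichlet(np.ones(K), size=2000)
interior_values = np.array([rho(q, G, losses) for q in random_q])

print("best vertex           :", vertex_values.max())
print("best sampled interior :", interior_values.max())
print("uniform weights       :", rho(np.full(K, 1 / K), G, losses))
```

No sampled interior point exceeds the best vertex, which matches the quasi-convexity argument: a ratio of a convex quadratic to a positive linear function is maximized over a polytope at one of its vertices.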

This observation motivates treating the learnability criterion $\rho$ as a *local* indicator around the uniform distribution $\mathbf{p} = (\tfrac{1}{K}, \cdots, \tfrac{1}{K}) \in \mathbb{R}^K$, which corresponds to standard SFT without trajectory selection. Rather than globally maximizing $\rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})$, we restrict attention to distributions $\mathbf{q}$ in a neighborhood of $\mathbf{p}$ and locally linearize the objective around $\mathbf{p}$. The first-order Taylor expansion gives

$$\begin{aligned}
\rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q}) - \rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{p})
&= \bigl\langle \nabla_{\mathbf{q}} \rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})\big|_{\mathbf{q} = \mathbf{p}},\ \mathbf{q} - \mathbf{p} \bigr\rangle + o(\|\mathbf{q} - \mathbf{p}\|_2) \\
&= \sum_{k=1}^{K} \underbrace{\tfrac{\partial}{\partial q_k} \rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q})\big|_{\mathbf{q} = \mathbf{p}}}_{g_k^*} \cdot (q_k - p_k) + \underbrace{o(\|\mathbf{q} - \mathbf{p}\|_2)}_{R_2(\mathbf{p}, \mathbf{q})}.
\end{aligned} \tag{4}$$

Two steps remain for optimizing Eq. ([4](https://arxiv.org/html/2605.30651#S4.E4)). First, we need to estimate the first-order derivative $g_k^*$ in a computationally efficient manner. Second, we need to bound the residual $R_2(\mathbf{p}, \mathbf{q})$ carefully.

#### 4.2.1 Estimating gradient $g_k^*$

According to the definition of $\rho$, $g_k^* = \tfrac{\partial}{\partial q_k} \rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q}) = \tfrac{\partial}{\partial q_k} \tfrac{\|\sum_i q_i \mathbf{g}_i\|^2}{\sum_i q_i \ell_i}$, where $\mathbf{g}_i = \nabla_{\boldsymbol{\theta}} \ell_i$ and $\ell_i = \ell(\boldsymbol{\theta}, \mathbf{y}_i)$. With detailed derivations deferred to Appendix [B.3](https://arxiv.org/html/2605.30651#A2.SS3), we can write

$$g_k^* = \frac{2 \mathbf{g}_k^\top \sum_i q_i \mathbf{g}_i}{\sum_i q_i \ell_i} - \frac{\|\sum_i q_i \mathbf{g}_i\|^2}{\sum_j q_j \ell_j} \cdot \frac{\ell_k}{\sum_i q_i \ell_i} \Bigg|_{\mathbf{q} = \mathbf{p}} = \frac{2 \mathbf{g}_k^\top \sum_i \mathbf{g}_i}{\sum_i \ell_i} - \frac{\|\sum_i \mathbf{g}_i\|^2}{\sum_i \ell_i} \cdot \frac{\ell_k}{\sum_i \ell_i}, \tag{5}$$

where the second equality follows from $q_k = p_k = \tfrac{1}{K}$, since the factors of $\tfrac{1}{K}$ cancel between numerators and denominators. Based on Eq. ([5](https://arxiv.org/html/2605.30651#S4.E5)), we make two observations. First, in many practical overparameterized networks, the per-trajectory gradients $\mathbf{g}_k$ are nearly orthogonal in high-dimensional parameter space, i.e., $\mathbf{g}_k^\top \mathbf{g}_j \approx \delta_{jk} \|\mathbf{g}_k\|^2$. Second, for the single-trajectory $\rho$-factor defined in Eq. ([2](https://arxiv.org/html/2605.30651#S3.E2)), denoted by $\rho_k^* = \|\mathbf{g}_k\|^2 / \ell_k$, the gradient norm can be written as $\|\mathbf{g}_k\|^2 = \rho_k^* \ell_k$. Plugging these observations into Eq. ([5](https://arxiv.org/html/2605.30651#S4.E5)), $g_k^*$ can be approximated by

$$g_k^* \approx \frac{2 \|\mathbf{g}_k\|^2}{\sum_i \ell_i} - \frac{\sum_j \|\mathbf{g}_j\|^2}{\sum_i \ell_i} \cdot \frac{\ell_k}{\sum_i \ell_i} = \frac{\ell_k}{\sum_i \ell_i} \left( 2 \rho_k^* - \frac{\sum_i \rho_i^* \ell_i}{\sum_i \ell_i} \right). \tag{6}$$

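Both formulas can be sanity-checked numerically. The sketch below is a toy check under stated assumptions, not the paper's code: per-trajectory gradients and losses are random stand-ins, Eq. (5) is compared against central finite differences of $\rho$ at the uniform weights, and Eq. (6) is compared against Eq. (5) on a hand-built example whose gradients are exactly orthogonal.

```python
import numpy as np

def rho(q, G, losses):
    """rho(q) = ||sum_k q_k g_k||^2 / sum_k q_k ell_k."""
    mixed_grad = q @ G
    return float(mixed_grad @ mixed_grad / (q @ losses))

def exact_local_gradient(G, losses):
    """Eq. (5) evaluated at the uniform weights p = 1/K."""
    grad_sum = G.sum(axis=0)
    loss_sum = losses.sum()
    first = 2 * (G @ grad_sum) / loss_sum
    second = (grad_sum @ grad_sum) / loss_sum * losses / loss_sum
    return first - second

def orthogonal_approximation(G, losses):
    """Eq. (6): needs only rho_k = ||g_k||^2 / ell_k and ell_k."""
    rho_k = (G ** 2).sum(axis=1) / losses
    loss_sum = losses.sum()
    return losses / loss_sum * (2 * rho_k - (rho_k @ losses) / loss_sum)

def finite_difference_gradient(G, losses, eps=1e-6):
    """Central differences of rho with respect to each q_k at q = p."""
    K = len(losses)
    p = np.full(K, 1 / K)
    grads = np.zeros(K)
    for k in range(K):
        step = np.zeros(K)
        step[k] = eps
        grads[k] = (rho(p + step, G, losses) - rho(p - step, G, losses)) / (2 * eps)
    return grads

rng = np.random.default_rng(0)
G_random = rng.normal(size=(5, 16))            # hypothetical gradients
losses_random = rng.uniform(0.5, 2.0, size=5)  # hypothetical losses

G_orth = np.diag([1.0, 2.0, 3.0])              # exactly orthogonal gradients
losses_orth = np.array([0.7, 1.1, 1.9])

print(exact_local_gradient(G_random, losses_random))
print(finite_difference_gradient(G_random, losses_random))
print(orthogonal_approximation(G_orth, losses_orth))
```

With random gradients in a modest dimension the orthogonality assumption holds only roughly, so Eq. (6) should be read as an approximation there; it is exact only in the orthogonal example.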
In practice, calculating $\rho_k^*$ requires backward propagation to compute $\|\mathbf{g}_k\|^2$, which is computationally expensive at selection time. Fortunately, we can avoid this gradient computation by leveraging the softmax structure of the last layer in next-token prediction, as stated in the following lemma:

###### Lemma 1 (Forward-pass $\rho_k^*$ estimation, informal).

For any $k \in [K]$, let $\boldsymbol{\pi}_t^k \in \Delta^{|\mathcal{V}|}$ be the probability vector produced by the student model for the $t$-th token, conditioned on the question $\mathbf{x}$ and the previous response tokens $\mathbf{y}_{<t}^k$, and let $\boldsymbol{\delta}(y_t^k) \in \Delta^{|\mathcal{V}|}$ be the one-hot vector corresponding to the ground-truth token from the teacher model. Let $\hat{\rho}_k \triangleq \frac{\sum_t \|\boldsymbol{\pi}_t^k - \boldsymbol{\delta}(y_t^k)\|^2}{\ell_k}$. Then, under standard structural conditions, there exist absolute positive constants $C_1, C_2$ such that $\hat{\rho}_k \in [C_1 \rho_k^*, C_2 \rho_k^*]$.

Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1) suggests that $\rho_k^*$ can be estimated in a well-controlled manner by the forward-pass quantity $\hat{\rho}_k$, without explicitly computing gradients. We defer the detailed proof to Appendix [B.4](https://arxiv.org/html/2605.30651#A2.SS4). Substituting $\hat{\rho}_k$ for $\rho_k^*$ in the approximation of Eq. (6) yields our practical estimator of the local gradient:

$$\hat{g}_k = \frac{\ell_k}{\sum_i \ell_i} \left( 2 \hat{\rho}_k - \frac{\sum_i \hat{\rho}_i \ell_i}{\sum_i \ell_i} \right), \quad \text{where } \hat{\rho}_k \triangleq \frac{\sum_t \|\boldsymbol{\pi}_t^k - \boldsymbol{\delta}(y_t^k)\|^2}{\ell_k}. \tag{7}$$
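Eq. (7) needs only the student's output distribution at each position, so it can be computed from a single forward pass. The sketch below follows the formula literally and is not taken from the official repository: the logits are random placeholders standing in for a real student model, the function names are our own, and any normalization choices beyond what Eq. (7) states (for instance per-token averaging of the numerator) are deliberately left out.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def forward_pass_stats(logits, targets):
    """Return (ell_k, rho_hat_k) for one trajectory.

    logits:  (T, V) student logits at each response position.
    targets: (T,) teacher token ids.
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    positions = np.arange(len(targets))
    probs = softmax(logits)
    loss = -np.log(probs[positions, targets]).mean()   # length-normalized NLL
    residual = probs.copy()
    residual[positions, targets] -= 1.0                # pi_t - one_hot(y_t)
    rho_hat = (residual ** 2).sum() / loss
    return float(loss), float(rho_hat)

def lark_scores(losses, rho_hats):
    """Eq. (7): forward-pass estimate of the local gradient for each trajectory."""
    losses = np.asarray(losses, dtype=float)
    rho_hats = np.asarray(rho_hats, dtype=float)
    loss_sum = losses.sum()
    weighted_mean_rho = (rho_hats @ losses) / loss_sum
    return losses / loss_sum * (2 * rho_hats - weighted_mean_rho)

# Placeholder candidate pool: three trajectories of different lengths.
rng = np.random.default_rng(0)
vocab_size = 50
candidates = []
for length in (8, 12, 20):
    logits = rng.normal(size=(length, vocab_size))
    targets = rng.integers(0, vocab_size, size=length)
    candidates.append(forward_pass_stats(logits, targets))

losses, rho_hats = map(np.array, zip(*candidates))
scores = lark_scores(losses, rho_hats)
ranking = np.argsort(-scores)   # highest score first

print("scores :", scores)
print("ranking:", ranking)
```

With a real model, `logits` would come from one teacher-forced forward pass of the student over the question and trajectory, restricted to the response positions.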

#### 4.2.2 Bounding residuals $R(\mathbf{p}, \mathbf{q})$

We next control the residual errors introduced by the local linearization and the gradient estimation. More precisely, substituting $g_k^*$ with $\hat{g}_k$ in Eq. ([4](https://arxiv.org/html/2605.30651#S4.E4)) yields

$$\rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{q}) - \rho(\boldsymbol{\theta}_{\mathrm{ref}}; \mathbf{p}) = \sum_{k=1}^{K} \hat{g}_k \cdot (q_k - p_k) + \underbrace{\sum_{k=1}^{K} (g_k^* - \hat{g}_k) \cdot (q_k - p_k)}_{R_1(\mathbf{p}, \mathbf{q})} + \underbrace{o(\|\mathbf{q} - \mathbf{p}\|_2)}_{R_2(\mathbf{p}, \mathbf{q})},$$

where $R_1$ captures the error from estimating the true local gradient $g_k^*$ by $\hat{g}_k$, and $R_2$ captures the higher-order error from the Taylor expansion. The following lemma shows that both terms can be controlled by the $\chi^2$ distance from the uniform distribution.

###### Lemma 2\(χ2\\chi^\{2\}controls both error terms, informal\)\.

Under standard structural conditions, there exist constantsα1,α2\>0\\alpha\_\{1\},\\alpha\_\{2\}\>0such that for all𝐪∈ΔK\\mathbf\{q\}\\in\\Delta^\{K\},

\|R1​\(𝐩,𝐪\)\|≤α1​χ2​\(𝐪∥𝐩\)\+1,\|R2​\(𝐩,𝐪\)\|≤α2​χ2​\(𝐪∥𝐩\)\.\\displaystyle\|R\_\{1\}\(\\mathbf\{p\},\\mathbf\{q\}\)\|\\leq\\alpha\_\{1\}\\chi^\{2\}\(\\mathbf\{q\}\\\|\\mathbf\{p\}\)\+1,\\qquad\|R\_\{2\}\(\\mathbf\{p\},\\mathbf\{q\}\)\|\\leq\\alpha\_\{2\}\\chi^\{2\}\(\\mathbf\{q\}\\\|\\mathbf\{p\}\)\.\(8\)

Lemma [2](https://arxiv.org/html/2605.30651#Thmlemma2) shows that the residual errors are controlled by the $\chi^{2}$ distance between the selected distribution $\mathbf{q}$ and the uniform distribution $\mathbf{p}$, with explicit forward-computable constants $\alpha_{1},\alpha_{2}$ given in Appendix [B.5](https://arxiv.org/html/2605.30651#A2.SS5). This further supports our intuition that optimization should be performed in a local neighborhood around the original, non-selected data distribution $\mathbf{p}$, to preserve generalization and prevent the selector from drifting toward degenerate solutions. Combining the linearized objective with the residual bounds gives the following lower bound on the original learnability improvement; we defer its proof to Appendix [B.6](https://arxiv.org/html/2605.30651#A2.SS6).

###### Theorem 1 (LARK objective, informal).

There exist absolute constants $\alpha>0$ and $C\in\mathbb{R}$ such that

$$
\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})-\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{p})\geq\sum_{k=1}^{K}\hat{g}_{k}\cdot(q_{k}-p_{k})-\alpha\chi^{2}(\mathbf{q}\parallel\mathbf{p})+C.\tag{9}
$$

Theorem [1](https://arxiv.org/html/2605.30651#Thmtheorem1) reduces the intractable learnability improvement to a tractable surrogate that depends on $\mathbf{q}$ only through (i) the inner product $\langle\mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle$ between the trajectory deviation and the forward-pass score vector, and (ii) the $\chi^{2}$ distance to the uniform prior. The constant $C$ is independent of $\mathbf{q}$ and therefore does not affect the maximizer over $\Delta^{K}$. This reduction sets the stage for the closed-form selection rule developed in §[4.3](https://arxiv.org/html/2605.30651#S4.SS3).
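One way to see how the two residual bounds combine into a lower bound of this form is to apply the triangle inequality to the decomposition of §4.2.2. This is a sketch based only on the informal statements of Lemma 2 and Theorem 1; the formal proof in Appendix B.6 may use different constants.

$$
\begin{aligned}
\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})-\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{p})
&=\sum_{k=1}^{K}\hat{g}_{k}(q_{k}-p_{k})+R_{1}(\mathbf{p},\mathbf{q})+R_{2}(\mathbf{p},\mathbf{q})\\
&\geq\sum_{k=1}^{K}\hat{g}_{k}(q_{k}-p_{k})-|R_{1}(\mathbf{p},\mathbf{q})|-|R_{2}(\mathbf{p},\mathbf{q})|\\
&\geq\sum_{k=1}^{K}\hat{g}_{k}(q_{k}-p_{k})-(\alpha_{1}+\alpha_{2})\,\chi^{2}(\mathbf{q}\|\mathbf{p})-1.
\end{aligned}
$$

Under this reading, $\alpha=\alpha_{1}+\alpha_{2}$ and $C=-1$ would be one admissible choice of constants in Eq. (9).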

### 4.3 LARK: Closed-form Solution for Fixed-Budget Trajectory Selection

Theorem [1](https://arxiv.org/html/2605.30651#Thmtheorem1) reduces the intractable learnability improvement to a tractable $\chi^{2}$-regularized optimization based on $\{\hat{g}_{k}\}$. In practice, many data selection algorithms operate in the fixed-budget setting, where the agent needs to select $B\leq K$ trajectories per question, meaning that the support size of $\mathbf{q}$ is $B$. This fixed-budget optimization admits a closed-form solution, as stated in the following lemma.

###### Lemma 3 ($B$-parameterized closed form).

Sort $\hat{g}_{1}\geq\cdots\geq\hat{g}_{K}$ and assume $1\leq B<K$ with $\hat{g}_{B}>\hat{g}_{B+1}$. Under the uniform prior $p_{k}=1/K$, the maximizer of the optimization objective

$$
\hat{\mathbf{q}}=\arg\max_{\mathbf{q}\in\Delta^{K},\,\mathrm{supp}(\mathbf{q})=B}\left\{\sum_{k=1}^{K}\hat{g}_{k}(q_{k}-p_{k})-\frac{\tau(B)}{2}\chi^{2}(\mathbf{q}\|\mathbf{p})\right\},\tag{10}
$$

with regularizer $\tau(B)=\tfrac{1}{K}\sum_{j=1}^{B}(\hat{g}_{j}-\hat{g}_{B+1})$, takes the closed form $\hat{q}_{i}=(\hat{g}_{i}-\hat{g}_{B+1})/\sum_{j=1}^{B}(\hat{g}_{j}-\hat{g}_{B+1})$ for $i\leq B$, and $\hat{q}_{i}=0$ otherwise.
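A short stationarity argument makes the closed form plausible. It is a sketch under the lemma's assumptions (uniform prior, scores sorted in decreasing order, strict gap $\hat{g}_{B}>\hat{g}_{B+1}$), with $\lambda$ a Lagrange multiplier introduced here for illustration; it is not the authors' proof. For uniform $\mathbf{p}$ the penalty is quadratic, so on the active support the first-order condition is linear in $q_k$:

$$
\begin{aligned}
\chi^{2}(\mathbf{q}\|\mathbf{p})&=K\sum_{k=1}^{K}\Bigl(q_{k}-\tfrac{1}{K}\Bigr)^{2}=K\sum_{k=1}^{K}q_{k}^{2}-1,\\
\hat{g}_{k}-\tau K q_{k}-\lambda&=0\;\;\Longrightarrow\;\;q_{k}=\frac{\hat{g}_{k}-\lambda}{\tau K}\quad(k\leq B),\\
\lambda=\hat{g}_{B+1},\quad\sum_{k\leq B}q_{k}=1&\;\;\Longrightarrow\;\;\tau K=\sum_{j=1}^{B}(\hat{g}_{j}-\hat{g}_{B+1}).
\end{aligned}
$$

With $\lambda=\hat{g}_{B+1}$, every $k>B$ has $\hat{g}_{k}-\lambda\leq 0$, so setting $q_{k}=0$ there is consistent with the complementary-slackness conditions, and the strict gap keeps all $B$ selected weights positive.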

Algorithm 1: LARK: Learnability-grounded Anchor-time Ranking

Input: Candidates $\{\mathbf{y}_{k}\}_{k=1}^{K}$; reference $\pi_{\mathrm{ref}}$ with $\boldsymbol{\theta}_{\mathrm{ref}}$; budget $B$.

Output: $\mathcal{L}(\boldsymbol{\theta};\hat{\mathbf{q}})=\sum_{k=1}^{K}\hat{q}_{k}\,\ell(\boldsymbol{\theta};\mathbf{y}_{k})$.

- for $k=1,\ldots,K$ do
  - Forward pass of $\pi_{\mathrm{ref}}$ on $\mathbf{y}_{k}$; record $\boldsymbol{\pi}_{t}^{k},p_{t}$ for $t\in\{1,\ldots,|a_{k}|\}$
  - $\ell_{k}\leftarrow\tfrac{1}{|a_{k}|}\sum_{t}-\log p_{t}$
  - $\hat{\rho}_{k}\leftarrow\dfrac{\sum_{t}\|\boldsymbol{\pi}_{t}^{k}-\boldsymbol{\delta}(y_{t}^{k})\|^{2}}{\sum_{t}-\log p_{t}}$
- for $k=1,\ldots,K$ do
  - $\hat{g}_{k}\leftarrow\dfrac{\ell_{k}}{\sum_{i}\ell_{i}}\left(2\hat{\rho}_{k}-\dfrac{\sum_{i}\hat{\rho}_{i}\,\ell_{i}}{\sum_{i}\ell_{i}}\right)$
- Sort and relabel: $\hat{g}_{1}\geq\cdots\geq\hat{g}_{K}$
- for $i=1,\ldots,K$ do
  - if $i\leq B$ then $\hat{q}_{i}\leftarrow\dfrac{\hat{g}_{i}-\hat{g}_{B+1}}{\sum_{j=1}^{B}(\hat{g}_{j}-\hat{g}_{B+1})}$
  - else $\hat{q}_{i}\leftarrow 0$
- return $\mathcal{L}(\boldsymbol{\theta};\hat{\mathbf{q}})=\sum_{k=1}^{K}\hat{q}_{k}\,\ell(\boldsymbol{\theta};\mathbf{y}_{k})$

Lemma [3](https://arxiv.org/html/2605.30651#Thmlemma3) gives a closed-form top-$B$ rule: each selected trajectory is weighted by its score margin above $\hat{g}_{B+1}$, and $\tau^{*}(B)$ is fully determined by $B$ with no hyperparameter to tune. Algorithm [1](https://arxiv.org/html/2605.30651#algorithm1) summarizes the full procedure.
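The weighting step of Lemma 3 can be sketched in a few lines of NumPy. The names `lark_weights` and `chi2_to_uniform` are hypothetical helpers introduced for illustration, and the error handling is an assumption about how one might enforce the lemma's preconditions ($1\leq B<K$ and a strict gap at the cutoff); this is not the released implementation.

```python
import numpy as np

def chi2_to_uniform(q):
    """chi^2(q || p) for the uniform prior p_k = 1/K."""
    q = np.asarray(q, dtype=float)
    K = q.size
    return K * np.sum((q - 1.0 / K) ** 2)

def lark_weights(g_hat, B):
    """Closed-form weights of Lemma 3 (illustrative sketch).

    g_hat : (K,) scores in their original (unsorted) order
    B     : selection budget, 1 <= B < K
    Returns (q_hat, tau) with q_hat in the original candidate order.
    """
    g = np.asarray(g_hat, dtype=float)
    K = g.size
    if not 1 <= B < K:
        raise ValueError("Lemma 3 assumes 1 <= B < K")
    order = np.argsort(-g, kind="stable")     # indices sorted by decreasing score
    threshold = g[order[B]]                   # g_{B+1} after sorting
    margins = g[order[:B]] - threshold        # score margins of the top-B
    if margins[-1] <= 0:
        raise ValueError("Lemma 3 assumes a strict gap g_B > g_{B+1}")
    total = margins.sum()
    q_hat = np.zeros(K)
    q_hat[order[:B]] = margins / total        # weights sum to 1 on the top-B
    tau = total / K                           # regularizer tau(B)
    return q_hat, tau
```

For example, with scores `[0.5, 0.1, 0.3, 0.2]` and `B = 2`, the cutoff score is `0.2`, the margins are `0.3` and `0.1`, and the resulting weights are `0.75` and `0.25` on the first and third candidates. At `B = 1` the rule reduces to a one-hot choice of the top-scoring trajectory.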

## 5 Experiment

We evaluate LARK on reasoning distillation through three research questions. RQ1 asks whether LARK outperforms heuristic, quality-based, and student-side baselines under single-trajectory ($B=1$) and multi-trajectory ($B=3$) supervision. RQ2 isolates the contribution of the trajectory score $\hat{g}_{k}$ and the $\chi^{2}$-$B$ weighting rule by holding one fixed and varying the other. RQ3 asks *why* LARK works, by linking $\hat{g}_{k}$ to the anchor-time learnability rate $\rho$, to the SFT loss trajectory, and to downstream training utility, and by analyzing the token-level structure of the trajectories LARK selects. Section [5.1](https://arxiv.org/html/2605.30651#S5.SS1) describes the shared experimental setup, and Sections [5.2](https://arxiv.org/html/2605.30651#S5.SS2)–[5.4](https://arxiv.org/html/2605.30651#S5.SS4) answer each question in turn.

### 5.1 Experimental Setup

Candidate pool. We use a fixed set of $5{,}000$ math problems from NuminaMath (Li et al., [2024](https://arxiv.org/html/2605.30651#bib.bib18)), each paired with $33$ trajectories from $11$ teacher models ($3$ rollouts each); see Appendix [C.1](https://arxiv.org/html/2605.30651#A3.SS1). All methods share this pool, so any performance difference reflects the selection rule alone.

Selection budgets. We evaluate each method at $B=1$ (single trajectory per problem) and $B=3$ (three trajectories). Baselines take the top-$B$ under their score with uniform weights $1/B$; LARK applies the $\chi^{2}$-$B$ rule of Lemma [3](https://arxiv.org/html/2605.30651#Thmlemma3) to obtain weighted $\hat{\mathbf{q}}$ from $\hat{g}_{k}$ computed at the reference student $\boldsymbol{\theta}_{\mathrm{ref}}$.

Baselines. We compare LARK against seven methods (Appendix [C.2](https://arxiv.org/html/2605.30651#A3.SS2)) spanning heuristics, quality-based selection, LLM-based evaluation, and student-side forward-pass scoring: Random, Token Length$_{\max}$, Rule-based Quality$_{\max}$, LLM-judged Quality$_{\max}$, GRAPE$_{\max}$ (Zhang et al., [2025a](https://arxiv.org/html/2605.30651#bib.bib4)), Local Naturalness$_{\max}$ (Just et al., [2025](https://arxiv.org/html/2605.30651#bib.bib8)), and RSR$_{\min}$ (Yang et al., [2026](https://arxiv.org/html/2605.30651#bib.bib3)).

Table 1: Performance comparison under single-trajectory and multi-trajectory selection. Values are Acc@5 percentages, with mean $\pm$ std taken across three independent decoding seeds at inference time.

Students, training, evaluation. We fine-tune Qwen-2.5-7B, Qwen-2.5-1.5B (Yang et al., [2024](https://arxiv.org/html/2605.30651#bib.bib25)), and Llama-3.2-3B (Grattafiori et al., [2024](https://arxiv.org/html/2605.30651#bib.bib26)) via SFT, and evaluate on AIME-2024 ([1](https://arxiv.org/html/2605.30651#bib.bib20)), AMC (Mathematical Association of America, [2025](https://arxiv.org/html/2605.30651#bib.bib19)), GPQA-Diamond (Rein et al., [2024](https://arxiv.org/html/2605.30651#bib.bib21)), and MATH-500 (Hendrycks et al., [2021](https://arxiv.org/html/2605.30651#bib.bib22)) using Acc@5 ($5$ samples per problem, marked correct if any contains the gold answer). For each method, we run inference under three independent decoding seeds using the corresponding trained checkpoint and evaluation protocol for that method; we report mean $\pm$ std across these three decoding seeds, so the reported variability reflects inference-time sampling rather than training-seed noise. The Avg column in Table [1](https://arxiv.org/html/2605.30651#S5.T1) is the unweighted mean across the four benchmarks; per-benchmark numbers are the primary basis of comparison. A train/eval contamination audit appears in Appendix [C.5](https://arxiv.org/html/2605.30651#A3.SS5). Training and decoding details are in Appendices [C.3](https://arxiv.org/html/2605.30651#A3.SS3)–[C.4](https://arxiv.org/html/2605.30651#A3.SS4).
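The Acc@5 protocol described above can be illustrated with a small sketch. The helper names `acc_at_n` and `summarize_seeds` and the boolean input layout are assumptions made for illustration; in particular, the paper does not say whether the reported std is the population or sample standard deviation, so the population form (`ddof=0`, NumPy's default) is used here as a labeled assumption.

```python
import numpy as np

def acc_at_n(correct):
    """Percentage of problems with at least one correct sample.

    correct : (num_problems, n_samples) boolean array; entry [i, j] is True
              if sample j for problem i contains the gold answer.
    """
    correct = np.asarray(correct, dtype=bool)
    return 100.0 * correct.any(axis=1).mean()

def summarize_seeds(per_seed_correct):
    """Mean and std of Acc@n across decoding seeds.

    per_seed_correct : list of (num_problems, n_samples) boolean arrays,
                       one per decoding seed.
    """
    accs = np.array([acc_at_n(c) for c in per_seed_correct])
    # Assumption: population std (ddof=0); the source does not specify.
    return accs.mean(), accs.std()
```

Because every seed reuses the same trained checkpoint, the spread returned by `summarize_seeds` measures decoding randomness only, which matches the paper's note that the variability is not training-seed noise.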

### 5.2 RQ1: Does LARK Outperform Existing Trajectory Selection Methods?

![Refer to caption](https://arxiv.org/html/2605.30651v1/x2.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.30651v1/x3.png)\(b\)

Figure 2: Ablation of the two LARK components on Qwen-2.5-7B ($B=3$). Values are Acc@5 percentages. (a) Score ablation. (b) Weighting ablation.

Table [1](https://arxiv.org/html/2605.30651#S5.T1) compares LARK with all baselines on three students. LARK achieves the best average Acc@5 in all six model-budget settings. On Qwen-2.5-7B, it improves over the strongest baseline RSR by $7.16$ points at $B=1$ and $7.07$ points at $B=3$, with large gains on AMC and GPQA. The consistent gains under both budgets suggest that LARK improves trajectory quality, not only supervision size. The same trend holds for smaller students: LARK reaches $41.09\%$/$43.14\%$ on Qwen-2.5-1.5B and $26.29\%$/$29.13\%$ on Llama-3.2-3B under $B=1$/$B=3$, suggesting that the score is not tied to one student family or scale. Appendix [D.2](https://arxiv.org/html/2605.30651#A4.SS2) provides a budget-scaling analysis for LARK across $B\in\{1,3,5,10,20\}$.

### 5.3 RQ2: How Much Does Each Component Contribute?

We ablate the two design choices in LARK: the trajectory score and the weighting rule\.

Trajectory score. Fixing the weighting to $\chi^{2}$-$B$, we compare four scores (Figure [2(a)](https://arxiv.org/html/2605.30651#S5.F2.sf1)): the inverse loss $1/\ell_{k}$, the Brier residual $\sum_{t}\|\boldsymbol{\pi}_{t}^{k}-\boldsymbol{\delta}(y_{t}^{k})\|^{2}$, the proxy $\hat{\rho}_{k}$ of Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1), and the induced score $\hat{g}_{k}$. The inverse loss is the weakest, confirming that likelihood alone does not characterize learnability; $\hat{\rho}_{k}$ already improves substantially by combining residual error with loss; and $\hat{g}_{k}$ is the best across all four benchmarks, since it captures the marginal contribution of each trajectory around $\mathbf{p}$ rather than scoring trajectories in isolation.

Weighting rule. Fixing the score to $\hat{g}_{k}$, we vary the weighting at $B=3$ (Figure [2(b)](https://arxiv.org/html/2605.30651#S5.F2.sf2)) over uniform, score-proportional, softmax, and $\chi^{2}$-$B$ weights. The $\chi^{2}$-$B$ rule consistently outperforms the alternatives, with the largest gains on AIME (where score noise matters most) and on MATH-500 (where the benefit accumulates over many problems). LARK's improvement therefore comes not only from selecting high-$\hat{g}_{k}$ trajectories but also from assigning robust weights within the selected support.

Efficiency. LARK is forward-pass only and therefore does not require the per-trajectory backward passes needed to compute the exact directional derivative $g_{k}^{*}$. Its scoring cost matches the lowest-cost student-side baselines, with full complexity analysis and wall-clock measurements reported in Appendix [D.1](https://arxiv.org/html/2605.30651#A4.SS1).

### 5.4 RQ3: Why Does LARK Work?

We provide three pieces of evidence connecting $\hat{g}_{k}$ to the quantities it is designed to control, then characterize what makes the trajectories LARK selects informative.

![Refer to caption](https://arxiv.org/html/2605.30651v1/x4.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.30651v1/x5.png)\(b\)
![Refer to caption](https://arxiv.org/html/2605.30651v1/x6.png)\(c\)

Figure 3: Three pieces of evidence linking $\hat{g}_{k}$ to learnability. (a) $\Delta\rho(\mathbf{q})$ landscape on the simplex (PCA); black dots are $\hat{\mathbf{q}}(B)$ for $B=1,\ldots,32$. LARK selections trace the $\Delta\rho\geq 0$ region (green) and avoid $\Delta\rho<0$ (red). (b) SFT loss reduction $\Delta\mathcal{L}_{t}=\mathcal{L}_{t}-\mathcal{L}_{0}$ on Qwen-2.5-7B ($B=3$); LARK induces the largest SFT loss decay. (c) Per-teacher mean of $\hat{g}_{k}$ vs. AMC Acc@5.

$\hat{g}_{k}$ controls the anchor-time learnability rate. We first verify that maximizing $\hat{g}_{k}$ actually raises $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})$, the quantity LARK is designed to optimize (§[4.1](https://arxiv.org/html/2605.30651#S4.SS1)). For a representative problem we evaluate $\Delta\rho(\mathbf{q})=\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})-\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{p})$ densely over $\Delta^{K}$ and project onto two PCA components (Figure [3](https://arxiv.org/html/2605.30651#S5.F3)a). The LARK selections $\hat{\mathbf{q}}(B)$ for $B\in\{1,\ldots,32\}$ trace a path through the $\Delta\rho>0$ region (green) and avoid the $\Delta\rho<0$ region (red): every $\hat{\mathbf{q}}(B)$ satisfies $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\hat{\mathbf{q}}(B))\geq\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{p})$. The forward-pass surrogate therefore acts as an effective control on the underlying anchor-time learnability rate on real student models.

High $\hat{g}_{k}$ yields faster SFT loss decay. By Proposition [1](https://arxiv.org/html/2605.30651#Thmproposition1), a higher $\rho$ should translate into faster SFT loss decay. We verify this by tracking $\Delta\mathcal{L}_{t}\triangleq\mathcal{L}_{t}-\mathcal{L}_{0}$ on Qwen-2.5-7B at $B=3$. Figure [3](https://arxiv.org/html/2605.30651#S5.F3)b shows that LARK achieves the largest loss reduction throughout training, with the gap visible already in the early stage; heuristic and naturalness baselines reduce the loss more slowly and converge to a shallower plateau, even when starting from a lower initial loss.

$\hat{g}_{k}$ predicts downstream training utility. We finally check that $\hat{g}_{k}$ tracks downstream Acc@5 in practice. We group the $33$ candidates by their teacher model ($11$ teachers), fine-tune Qwen-2.5-7B separately on each teacher's trajectories, and evaluate on AMC. The per-teacher mean of $\hat{g}_{k}$ is strongly correlated with downstream AMC accuracy (Figure [3](https://arxiv.org/html/2605.30651#S5.F3)c, Pearson $r=0.69$, $p=0.018$): high-$\hat{g}_{k}$ teachers (GPT-OSS-20B/120B, Phi-4-R+, QwQ-32B) yield students above $65\%$, while low-$\hat{g}_{k}$ teachers (Magistral-S, Nemotron) yield students below $55\%$. The score is predictively valid at the teacher level, capturing a student-side training signal that surface quality and likelihood do not.
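As a rough plausibility check on the reported correlation, one can compute the usual $t$ statistic for a Pearson coefficient. This assumes the sample size is the $n=11$ teachers and that significance is assessed with the standard two-sided $t$-test on $n-2$ degrees of freedom; the paper does not state which test was used.

$$
t=\frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}=\frac{0.69\times\sqrt{9}}{\sqrt{1-0.69^{2}}}=\frac{2.07}{\sqrt{0.5239}}\approx 2.86 .
$$

This exceeds the two-sided $5\%$ critical value of $2.262$ for $9$ degrees of freedom, which is in line with a reported $p$-value below $0.05$. With only $11$ points the estimate remains sensitive to individual teachers.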

![Refer to caption](https://arxiv.org/html/2605.30651v1/x7.png)

Figure 4: Token-level behavior on Qwen-2.5-7B. LARK-selected trajectories have lower correct-token probability $\bar{p}(y_{t})$ (top) but higher wrong-token concentration $\sum_{v\neq y_{t}}p(v)^{2}$ (bottom).

Mechanism: confidently wrong, in a structured way. The three results above establish that maximizing $\hat{g}_{k}$ raises $\rho$, accelerates SFT loss decay, and predicts utility, but do not explain *why* the selected trajectories carry strong signal. We compute under $\boldsymbol{\theta}_{\mathrm{ref}}$ two token-level statistics for each candidate: the mean correct-token probability $\bar{p}(y_{t})$, and the wrong-token concentration $\sum_{v\neq y_{t}}p(v)^{2}$, which is large when the residual mass falls on a few wrong tokens. Comparing all $33\times 500$ candidates with the LARK-selected top-$1$ (Figure [4](https://arxiv.org/html/2605.30651#S5.F4)), the selected trajectories have substantially *lower* $\bar{p}(y_{t})$ ($0.282$ vs. $0.449$) and *higher* wrong-token concentration ($0.107$ vs. $0.085$, a $26\%$ increase). LARK therefore avoids trajectories the student already predicts confidently and routes supervision toward trajectories on which the student is wrong but in a *structured* way: residual mass concentrates on near-miss tokens close to the gold answer, providing a clear local correction signal. This signature echoes the rank-based view of RSR (Yang et al., [2026](https://arxiv.org/html/2605.30651#bib.bib3)), which detects token positions where the gold answer is a near-miss, and clarifies why LARK and RSR end up close in Table [1](https://arxiv.org/html/2605.30651#S5.T1) despite arising from distinct theoretical objectives. A single-problem walkthrough is given in Appendix [D.3](https://arxiv.org/html/2605.30651#A4.SS3).
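The two token-level statistics can be sketched as follows. The function name `token_stats` and the choice to average the wrong-token concentration over token positions are assumptions for illustration; the paper defines the per-token quantities but the exact aggregation is not given in this section.

```python
import numpy as np

def token_stats(pi, y):
    """Token-level diagnostics for one trajectory (illustrative sketch).

    pi : (T, V) array of student next-token distributions
    y  : (T,) int array of gold token ids
    Returns (mean correct-token probability, mean wrong-token concentration).
    """
    pi = np.asarray(pi, dtype=float)
    y = np.asarray(y)
    rows = np.arange(len(y))
    p_gold = pi[rows, y]                                   # p(y_t)
    # sum over v != y_t of p(v)^2: total squared mass minus the gold term
    wrong_conc = np.sum(pi ** 2, axis=1) - p_gold ** 2
    # Assumption: both statistics are averaged over token positions.
    return p_gold.mean(), wrong_conc.mean()
```

For a fixed correct-token probability of `0.2`, a distribution `[0.2, 0.7, 0.1]` gives a concentration of `0.5`, while the flatter `[0.2, 0.4, 0.4]` gives `0.32`, so the statistic grows as the residual mass piles onto fewer wrong tokens. The same quantity appears inside the Brier residual, since $\|\boldsymbol{\pi}_{t}-\boldsymbol{\delta}(y_{t})\|^{2}=(1-p(y_{t}))^{2}+\sum_{v\neq y_{t}}p(v)^{2}$.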

## 6 Conclusion

We presented LARK, a learnability-grounded method for reasoning trajectory selection in distillation. LARK estimates trajectory learnability with a forward-pass surrogate $\hat{g}_{k}$ and uses a closed-form $\chi^{2}$-regularized soft top-$B$ rule to construct a weighted SFT objective, avoiding expensive per-trajectory backward passes while preserving distributional coverage. Across three student models and four reasoning benchmarks, LARK consistently outperforms strong trajectory selection baselines under both single-trajectory and multi-trajectory budgets. Ablation and diagnostic analyses show that both the learnability score and the weighting rule are important, and that LARK-selected trajectories accelerate SFT loss decay while providing structured local correction signals. Overall, our results support learnability as an effective principle for efficient reasoning distillation, with future extensions to noisier candidate pools and broader reasoning domains.

## References

- \[1\]\(2024\)AIME problems and solutions\.Note:Accessed: 2025\-09\-13Cited by:[§C\.4](https://arxiv.org/html/2605.30651#A3.SS4.SSS0.Px1.p1.5),[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p4.2)\.
- On exponential convergence of sgd in non\-convex over\-parametrized learning\.arXiv preprint arXiv:1811\.02564\.Cited by:[§3\.2](https://arxiv.org/html/2605.30651#S3.SS2.p1.5)\.
- S\. Chatterjee \(2022\)Convergence of gradient descent for deep neural networks\.arXiv preprint arXiv:2203\.16462\.Cited by:[§3\.2](https://arxiv.org/html/2605.30651#S3.SS2.p1.5)\.
- L\. Chizat, E\. Oyallon, and F\. Bach \(2019\)On lazy training in differentiable programming\.InNeurIPS,Cited by:[§A\.2](https://arxiv.org/html/2605.30651#A1.SS2.SSS0.Px3.p2.4)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§2](https://arxiv.org/html/2605.30651#S2.p3.1)\.
- S\. S\. Du, X\. Zhai, B\. Poczos, and A\. Singh \(2019\)Gradient descent provably optimizes over\-parameterized neural networks\.InICLR,Cited by:[§A\.2](https://arxiv.org/html/2605.30651#A1.SS2.SSS0.Px6.p1.9)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p4.2)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.arXiv preprint arXiv:2103\.03874\.Cited by:[§C\.4](https://arxiv.org/html/2605.30651#A3.SS4.SSS0.Px1.p1.5),[§C\.4](https://arxiv.org/html/2605.30651#A3.SS4.SSS0.Px5.p1.2),[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p4.2)\.
- C\. Hsieh, C\. Li, C\. Yeh, H\. Nakhost, Y\. Fujii, A\. Ratner, R\. Krishna, C\. Lee, and T\. Pfister \(2023\)Distilling step\-by\-step\! outperforming larger language models with less training data and smaller model sizes\.InFindings of the Association for Computational Linguistics: ACL 2023,pp\. 8003–8017\.Cited by:[§1](https://arxiv.org/html/2605.30651#S1.p1.1),[§2](https://arxiv.org/html/2605.30651#S2.p1.1)\.
- A\. Huang, W\. Zhan, T\. Xie, J\. D\. Lee, W\. Sun, A\. Krishnamurthy, and D\. J\. Foster \(2024\)Correcting the mythos of kl\-regularization: direct alignment without overoptimization via chi\-squared preference optimization\.arXiv preprint arXiv:2407\.13399\.Cited by:[§2](https://arxiv.org/html/2605.30651#S2.p4.1)\.
- A\. Jacot, F\. Gabriel, and C\. Hongler \(2018\)Neural tangent kernel: convergence and generalization in neural networks\.InNeurIPS,Cited by:[§A\.2](https://arxiv.org/html/2605.30651#A1.SS2.SSS0.Px3.p2.4),[§A\.2](https://arxiv.org/html/2605.30651#A1.SS2.SSS0.Px6.p1.9),[§4\.1](https://arxiv.org/html/2605.30651#S4.SS1.p2.4)\.
- H\. A\. Just, M\. Ko, and R\. Jia \(2025\)Distilling reasoning into student llms: local naturalness for selecting teacher data\.arXiv preprint arXiv:2510\.03988\.Cited by:[§C\.2](https://arxiv.org/html/2605.30651#A3.SS2.SSS0.Px6),[§1](https://arxiv.org/html/2605.30651#S1.p3.1),[§2](https://arxiv.org/html/2605.30651#S2.p3.1),[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p3.6)\.
- H\. Karimi, J\. Nutini, and M\. Schmidt \(2016\)Linear convergence of gradient and proximal\-gradient methods under the polyak\-łojasiewicz condition\.InJoint European conference on machine learning and knowledge discovery in databases,pp\. 795–811\.Cited by:[§2](https://arxiv.org/html/2605.30651#S2.p4.1)\.
- J\. Lee, L\. Xiao, S\. Schoenholz, Y\. Bahri, R\. Novak, J\. Sohl\-Dickstein, and J\. Pennington \(2019\)Wide neural networks of any depth evolve as linear models under gradient descent\.InNeurIPS,Cited by:[§A\.2](https://arxiv.org/html/2605.30651#A1.SS2.SSS0.Px3.p2.4)\.
- J\. Li, E\. Beeching, L\. Tunstall, B\. Lipkin, R\. Soletskyi, S\. Huang, K\. Rasul, L\. Yu, A\. Q\. Jiang, Z\. Shen,et al\.\(2024\)Numinamath: the largest public dataset in ai4maths with 860k pairs of competition math problems and solutions\.Hugging Face repository13\(9\),pp\. 9\.Cited by:[§C\.1](https://arxiv.org/html/2605.30651#A3.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p1.4)\.
- X\. Li, H\. Zou, and P\. Liu \(2025\)Limr: less is more for rl scaling\.arXiv preprint arXiv:2502\.11886\.Cited by:[§2](https://arxiv.org/html/2605.30651#S2.p1.1)\.
- C\. Liu, L\. Zhu, and M\. Belkin \(2022\)Loss landscapes and optimization in over\-parameterized non\-linear systems and neural networks\.Applied and Computational Harmonic Analysis59,pp\. 85–116\.Cited by:[§3\.2](https://arxiv.org/html/2605.30651#S3.SS2.p1.4),[§3\.2](https://arxiv.org/html/2605.30651#S3.SS2.p1.5),[Remark 4](https://arxiv.org/html/2605.30651#Thmremark4.p1.1)\.
- L\. Liu, X\. Liu, D\. F\. Wong, D\. Li, Z\. Wang, B\. Hu, and M\. Zhang \(2024\)Selectit: selective instruction tuning for large language models via uncertainty\-aware self\-reflection\.arXiv preprint arXiv:2402\.16705\.Cited by:[§1](https://arxiv.org/html/2605.30651#S1.p2.1),[§2](https://arxiv.org/html/2605.30651#S2.p1.1)\.
- S\. Lojasiewicz \(1963\)A topological property of real analytic subsets\.Coll\. du CNRS, Les équations aux dérivées partielles117\(87\-89\),pp\. 2\.Cited by:[§2](https://arxiv.org/html/2605.30651#S2.p4.1)\.
- S\. Malladi, A\. Wettig, D\. Yu, D\. Chen, and S\. Arora \(2023\)A kernel\-based view of language model fine\-tuning\.InICML,Cited by:[§A\.2](https://arxiv.org/html/2605.30651#A1.SS2.SSS0.Px3.p2.4)\.
- Mathematical Association of America \(2025\)American mathematics competitions\.Note:Accessed: 2025\-09\-13Cited by:[§C\.4](https://arxiv.org/html/2605.30651#A3.SS4.SSS0.Px1.p1.5),[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p4.2)\.
- N\. Muennighoff, Z\. Yang, W\. Shi, X\. L\. Li, L\. Fei\-Fei, H\. Hajishirzi, L\. Zettlemoyer, P\. Liang, E\. Candès, and T\. B\. Hashimoto \(2025\)S1: simple test\-time scaling\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 20286–20332\.Cited by:[§1](https://arxiv.org/html/2605.30651#S1.p1.1),[§2](https://arxiv.org/html/2605.30651#S2.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[§2](https://arxiv.org/html/2605.30651#S2.p4.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\)Gpqa: a graduate\-level google\-proof q&a benchmark\.InFirst conference on language modeling,Cited by:[§C\.4](https://arxiv.org/html/2605.30651#A3.SS4.SSS0.Px1.p1.5),[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p4.2)\.
- M\. Xia, S\. Malladi, S\. Gururangan, S\. Arora, and D\. Chen \(2024\)Less: selecting influential data for targeted instruction tuning\.arXiv preprint arXiv:2402\.04333\.Cited by:[§1](https://arxiv.org/html/2605.30651#S1.p2.1),[§2](https://arxiv.org/html/2605.30651#S2.p4.1)\.
- Q\. Xiao and T\. Chen \(2025\)A unified understanding of offline data selection and online self\-refining generation for post\-training llms\.arXiv preprint arXiv:2511\.21056\.Cited by:[§1](https://arxiv.org/html/2605.30651#S1.p2.1)\.
- T\. Xie, D\. J\. Foster, Y\. Bai, N\. Jiang, and S\. M\. Kakade \(2022\)The role of coverage in online reinforcement learning\.arXiv preprint arXiv:2210\.04157\.Cited by:[§2](https://arxiv.org/html/2605.30651#S2.p4.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§C\.2](https://arxiv.org/html/2605.30651#A3.SS2.SSS0.Px4.p1.3),[§1](https://arxiv.org/html/2605.30651#S1.p3.1)\.
- A\. Yang, B\. Zhang, B\. Hui, B\. Gao, B\. Yu, C\. Li, D\. Liu, J\. Tu, J\. Zhou, J\. Lin,et al\.\(2024\)Qwen2\. 5\-math technical report: toward mathematical expert model via self\-improvement\.arXiv preprint arXiv:2409\.12122\.Cited by:[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p4.2)\.
- Y\. Yang, M\. Lai, W\. Zhao, X\. Fan, Z\. Xi, M\. Wu, C\. Huang, J\. Zhao, H\. Lv, J\. Tong,et al\.\(2026\)Which reasoning trajectories teach students to reason better? a simple metric of informative alignment\.arXiv preprint arXiv:2601\.14249\.Cited by:[§C\.1](https://arxiv.org/html/2605.30651#A3.SS1.SSS0.Px1.p1.2),[§C\.1](https://arxiv.org/html/2605.30651#A3.SS1.SSS0.Px2.p1.1),[§C\.1](https://arxiv.org/html/2605.30651#A3.SS1.p1.1),[§C\.2](https://arxiv.org/html/2605.30651#A3.SS2.SSS0.Px7),[§1](https://arxiv.org/html/2605.30651#S1.p3.1),[§2](https://arxiv.org/html/2605.30651#S2.p3.1),[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p3.6),[§5\.4](https://arxiv.org/html/2605.30651#S5.SS4.p5.13)\.
- Y\. Ye, Z\. Huang, Y\. Xiao, E\. Chern, S\. Xia, and P\. Liu \(2025\)Limo: less is more for reasoning\.arXiv preprint arXiv:2502\.03387\.Cited by:[§C\.2](https://arxiv.org/html/2605.30651#A3.SS2.SSS0.Px3.p1.3),[§1](https://arxiv.org/html/2605.30651#S1.p1.1),[§2](https://arxiv.org/html/2605.30651#S2.p1.1)\.
- L\. Yu, W\. Jiang, H\. Shi, J\. Yu, Z\. Liu, Y\. Zhang, J\. T\. Kwok, Z\. Li, A\. Weller, and W\. Liu \(2023\)Metamath: bootstrap your own mathematical questions for large language models\.arXiv preprint arXiv:2309\.12284\.Cited by:[§1](https://arxiv.org/html/2605.30651#S1.p2.1),[§2](https://arxiv.org/html/2605.30651#S2.p1.1)\.
- Z\. Yuan, H\. Yuan, C\. Li, G\. Dong, K\. Lu, C\. Tan, C\. Zhou, and J\. Zhou \(2023\)Scaling relationship on learning mathematical reasoning with large language models\.arXiv preprint arXiv:2308\.01825\.Cited by:[§1](https://arxiv.org/html/2605.30651#S1.p1.1),[§2](https://arxiv.org/html/2605.30651#S2.p1.1)\.
- B\. Zhang and R\. Sennrich \(2019\)Root mean square layer normalization\.Advances in neural information processing systems32\.Cited by:[Remark 3](https://arxiv.org/html/2605.30651#Thmremark3.p1.1)\.
- D\. Zhang, Q\. Dai, and H\. Peng \(2025a\)The best instruction\-tuning data are those that fit\.arXiv preprint arXiv:2502\.04194\.Cited by:[§C\.2](https://arxiv.org/html/2605.30651#A3.SS2.SSS0.Px5),[§1](https://arxiv.org/html/2605.30651#S1.p3.1),[§2](https://arxiv.org/html/2605.30651#S2.p3.1),[§5\.1](https://arxiv.org/html/2605.30651#S5.SS1.p3.6)\.
- J\. Zhang, C\. Zhang, Y\. Liu, Y\. Jin, X\. Yang, B\. Zheng, Y\. Liu, and L\. Guo \(2025b\)D3: diversity, difficulty, and dependability\-aware data selection for sample\-efficient llm instruction tuning\.arXiv preprint arXiv:2503\.11441\.Cited by:[§1](https://arxiv.org/html/2605.30651#S1.p2.1),[§2](https://arxiv.org/html/2605.30651#S2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[§1](https://arxiv.org/html/2605.30651#S1.p3.1)\.

## Appendix A Anchor-Time Analysis of the Learnability Objective

This appendix supports the anchor-time reduction in Section [4.1](https://arxiv.org/html/2605.30651#S4.SS1). We first prove the post-training loss bound (Appendix [A.1](https://arxiv.org/html/2605.30651#A1.SS1)). We then provide a theoretical justification of the anchor-relative trajectory bound through the NTK lazy-training framework (Appendix [A.2](https://arxiv.org/html/2605.30651#A1.SS2)), and verify it empirically on real SFT trajectories (Appendix [A.3](https://arxiv.org/html/2605.30651#A1.SS3)).

### A\.1Proof of the Anchor\-Time Loss Decay Bound

###### Proof of Proposition[1](https://arxiv.org/html/2605.30651#Thmproposition1)\.

Fix any $\mathbf{q}\in\Delta^{K}$. By the gradient-flow decay identity ([3](https://arxiv.org/html/2605.30651#S3.E3)) and Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1), which gives $\rho(\phi_{\mathbf{q}}(s);\mathbf{q})\geq\kappa\cdot\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})$ for every $s\in[0,T]$, we have

$$\int_{0}^{T}\rho(\phi_{\mathbf{q}}(s);\mathbf{q})\,ds\;\geq\;\int_{0}^{T}\kappa\cdot\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\,ds\;=\;\kappa T\cdot\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q}).$$

Substituting into ([3](https://arxiv.org/html/2605.30651#S3.E3)) and using the non-negativity of $\mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})$ together with the monotonicity of $x\mapsto\exp(-x)$ on $\mathbb{R}$,

$$\mathcal{L}(\phi_{\mathbf{q}}(T);\mathbf{q})\;=\;\mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\cdot\exp\!\left(-\int_{0}^{T}\rho(\phi_{\mathbf{q}}(s);\mathbf{q})\,ds\right)\;\leq\;\mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\cdot\exp\!\bigl(-\kappa T\cdot\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\bigr).\tag{11}$$

Since $\mathbf{q}\in\Delta^{K}$ was arbitrary, the bound holds for every $\mathbf{q}\in\Delta^{K}$. ∎
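
As a sanity check of the bound in Eq. (11), the following minimal sketch evaluates it on a toy diagonal quadratic loss whose gradient flow has a closed form. The toy loss, the curvatures `a`, the starting point `theta0`, and the helper names are illustrative assumptions and not part of LARK; $\kappa$ is taken as the smallest observed ratio $\rho(s)/\rho(0)$ on a time grid.

```python
import math

# Toy diagonal quadratic L(theta) = 0.5 * sum_i a_i * theta_i^2 (an illustrative
# assumption, not the SFT loss). Its gradient flow has the closed form
# theta_i(s) = theta_i(0) * exp(-a_i * s).
a = [1.0, 4.0]
theta0 = [1.0, 1.0]
T = 1.0

def theta_at(s):
    return [th * math.exp(-ai * s) for ai, th in zip(a, theta0)]

def loss(theta):
    return 0.5 * sum(ai * th * th for ai, th in zip(a, theta))

def rho(theta):
    # Learnability factor: squared gradient norm divided by the loss.
    grad_sq = sum((ai * th) ** 2 for ai, th in zip(a, theta))
    return grad_sq / loss(theta)

# Anchor-relative constant: smallest ratio rho(s) / rho(0) on a grid over [0, T].
grid = [T * i / 1000 for i in range(1001)]
rho_ref = rho(theta0)
kappa = min(rho(theta_at(s)) for s in grid) / rho_ref

loss_T = loss(theta_at(T))
bound = loss(theta0) * math.exp(-kappa * T * rho_ref)
print(f"kappa={kappa:.4f}  L(T)={loss_T:.4f}  bound={bound:.4f}")
```

On this toy problem $\rho$ decreases monotonically along the flow, so the grid minimum is attained at $s=T$ and the final loss sits below the anchor-time bound.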

### A.2 NTK-Based Justification of the Trajectory Bound

We provide a theoretical justification of Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) through the Neural Tangent Kernel (NTK) framework. Under standard NTK lazy-training conditions for pretrained language model fine-tuning, combined with a local non-degeneracy condition on the ground-truth probability, we derive a strictly positive constant $\kappa$ for which Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) provably holds.

##### Note on terminology.

Throughout this subsection, we distinguish between two kinds of structural premises. *Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1)* is the anchor-relative condition stated in the main text, which we aim to *justify*. *Assumptions [A.1](https://arxiv.org/html/2605.30651#A1.Thmappassumption1)–[A.4](https://arxiv.org/html/2605.30651#A1.Thmappassumption4)* below are standard NTK lazy-training premises, which we use to *derive* Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) as a theorem (Proposition [2](https://arxiv.org/html/2605.30651#Thmproposition2)).

##### NTK setup.

We adopt the token-level objects from Section [4.2](https://arxiv.org/html/2605.30651#S4.SS2). For each candidate trajectory $\mathbf{y}_{k}$, recall the per-token residual $\delta_{t}=\pi_{t}-\mathbf{e}_{y_{t}}\in\mathbb{R}^{V}$ and the stacked residual matrix $\Delta_{k}=[\delta_{1},\ldots,\delta_{|a_{k}|}]^{\top}\in\mathbb{R}^{|a_{k}|\times V}$, with $p_{t}=\pi_{t}(y_{t})$ the probability assigned to the ground-truth token. We additionally introduce the following appendix-only objects.

Logits Jacobian. For trajectory $\mathbf{y}_{k}$, define

$$J_{k}(\boldsymbol{\theta})\triangleq\frac{\partial}{\partial\boldsymbol{\theta}}\,\mathrm{vec}(\mathbf{z}_{1},\ldots,\mathbf{z}_{|a_{k}|})\;\in\;\mathbb{R}^{|a_{k}|V\times|\boldsymbol{\theta}|}.\tag{12}$$

Vectorized residual. Let $\boldsymbol{\Delta}_{k}(\boldsymbol{\theta})\triangleq\mathrm{vec}(\Delta_{k}(\boldsymbol{\theta}))\in\mathbb{R}^{|a_{k}|V}$.

Weighted NTK matrix and weighted residual. Given $\mathbf{q}\in\Delta^{K}$, let

$$\tilde{J}(\boldsymbol{\theta};\mathbf{q})\triangleq\begin{pmatrix}\sqrt{q_{1}/|a_{1}|}\,J_{1}(\boldsymbol{\theta})\\ \vdots\\ \sqrt{q_{K}/|a_{K}|}\,J_{K}(\boldsymbol{\theta})\end{pmatrix},\qquad\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})\triangleq\begin{pmatrix}\sqrt{q_{1}/|a_{1}|}\,\boldsymbol{\Delta}_{1}(\boldsymbol{\theta})\\ \vdots\\ \sqrt{q_{K}/|a_{K}|}\,\boldsymbol{\Delta}_{K}(\boldsymbol{\theta})\end{pmatrix},\tag{13}$$

and define the weighted empirical NTK matrix

$$\Theta(\boldsymbol{\theta};\mathbf{q})\triangleq\tilde{J}(\boldsymbol{\theta};\mathbf{q})\,\tilde{J}(\boldsymbol{\theta};\mathbf{q})^{\top}.\tag{14}$$

NTK identity. By the standard chain rule for softmax cross-entropy ($\nabla_{\mathbf{z}_{t}}\ell_{t}=\delta_{t}$, see Section [4.2](https://arxiv.org/html/2605.30651#S4.SS2)), the weighted gradient admits the factorization

$$\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta};\mathbf{q})=\tilde{J}(\boldsymbol{\theta};\mathbf{q})^{\top}\,\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q}),$$

which yields the NTK identity

$$\|\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta};\mathbf{q})\|^{2}\;=\;\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})^{\top}\,\Theta(\boldsymbol{\theta};\mathbf{q})\,\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q}).\tag{15}$$

When some $q_{k}=0$, the corresponding trajectory block contributes neither to the weighted loss nor to its gradient. The zero-weight blocks are therefore harmless in the identity above, but they should not be used to impose full-matrix positive definiteness. This is why Assumption [A.3](https://arxiv.org/html/2605.30651#A1.Thmappassumption3) below conditions the NTK only along the residual directions that enter the weighted objective.
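
The NTK identity in Eq. (15) is purely algebraic, so it can be checked on random stand-ins for the Jacobians and residuals. The sketch below is a minimal illustration under stated assumptions: the sizes `P`, `V`, `lengths`, the Gaussian entries, and the weights `q` are all hypothetical, and the residuals are random vectors rather than true softmax residuals. Setting one weight to zero also shows that the corresponding block drops out of $\Theta$.

```python
import math
import random

random.seed(0)
P = 3  # number of parameters (toy value)
V = 2  # vocabulary size (toy value)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Two toy trajectories with token counts |a_k|; each J_k has |a_k| * V rows.
lengths = [1, 2]
jacobians = [rand_matrix(n * V, P) for n in lengths]
residuals = [[random.gauss(0.0, 1.0) for _ in range(n * V)] for n in lengths]
q = [1.0, 0.0]  # the second trajectory carries zero weight

# Stack the blocks scaled by sqrt(q_k / |a_k|), as in Eq. (13).
J_tilde, D_tilde = [], []
for qk, n, Jk, Dk in zip(q, lengths, jacobians, residuals):
    scale = math.sqrt(qk / n)
    J_tilde += [[scale * x for x in row] for row in Jk]
    D_tilde += [scale * d for d in Dk]

# Left-hand side: squared norm of the gradient J_tilde^T D_tilde.
rows = len(J_tilde)
grad = [sum(J_tilde[r][c] * D_tilde[r] for r in range(rows)) for c in range(P)]
lhs = sum(g * g for g in grad)

# Right-hand side: quadratic form with Theta = J_tilde J_tilde^T, Eq. (14).
theta_ntk = [[sum(J_tilde[i][c] * J_tilde[j][c] for c in range(P)) for j in range(rows)]
             for i in range(rows)]
rhs = sum(D_tilde[i] * theta_ntk[i][j] * D_tilde[j]
          for i in range(rows) for j in range(rows))

print(math.isclose(lhs, rhs, rel_tol=1e-9))
```

The rows of `theta_ntk` belonging to the zero-weight trajectory are identically zero, which is why a full-matrix positive-definiteness requirement would be too strong here.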

##### Assumptions.

We adopt four conditions for the NTK analysis of language model fine-tuning.

###### Assumption A.1 (Lazy training).

There exists a constant $R>0$ such that, for every $\mathbf{q}\in\Delta^{K}$,

$$\sup_{s\in[0,T]}\|\phi_{\mathbf{q}}(s)-\boldsymbol{\theta}_{\mathrm{ref}}\|\;\leq\;R.\tag{16}$$

###### Assumption A.2 (NTK stability).

There exists $\epsilon\in(0,1)$ such that, for every $\mathbf{q}\in\Delta^{K}$ and every $\boldsymbol{\theta}\in B(\boldsymbol{\theta}_{\mathrm{ref}},R)$,

$$(1-\epsilon)\,\Theta(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\;\preceq\;\Theta(\boldsymbol{\theta};\mathbf{q})\;\preceq\;(1+\epsilon)\,\Theta(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q}).\tag{17}$$

###### Assumption A.3 (Residual-direction NTK conditioning at anchor).

There exist constants $0<\lambda_{-}\leq\lambda_{+}<\infty$ such that, for every $\mathbf{q}\in\Delta^{K}$ and every $\boldsymbol{\theta}\in B(\boldsymbol{\theta}_{\mathrm{ref}},R)$,

$$\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})^{\top}\Theta(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})\;\geq\;\lambda_{-}\,\|\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})\|^{2},\tag{18}$$

and

$$\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})^{\top}\Theta(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\;\leq\;\lambda_{+}\,\|\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\|^{2}.\tag{19}$$

###### Assumption A.4 (Bounded ground-truth probability).

There exist constants $0<p_{-}\leq p_{+}<1$ such that, for every $\boldsymbol{\theta}\in B(\boldsymbol{\theta}_{\mathrm{ref}},R)$, every $\mathbf{q}\in\Delta^{K}$, every $k\in[K]$, and every token position $t$,

$$p_{-}\;\leq\;p_{t}(\boldsymbol{\theta})\;\leq\;p_{+}.\tag{20}$$

Assumption [A.1](https://arxiv.org/html/2605.30651#A1.Thmappassumption1) encodes lazy fine-tuning behavior, classical in NTK theory [Jacot et al., [2018](https://arxiv.org/html/2605.30651#bib.bib43), Chizat et al., [2019](https://arxiv.org/html/2605.30651#bib.bib46)] and supported empirically for pre-trained transformer fine-tuning by Malladi et al. [[2023](https://arxiv.org/html/2605.30651#bib.bib47)]. Assumption [A.2](https://arxiv.org/html/2605.30651#A1.Thmappassumption2) requires the empirical NTK matrix to remain near-constant within the lazy ball, with rate $O(1/\sqrt{m})$ at width $m$ [Lee et al., [2019](https://arxiv.org/html/2605.30651#bib.bib44), Jacot et al., [2018](https://arxiv.org/html/2605.30651#bib.bib43)], and is supported empirically by Malladi et al. [[2023](https://arxiv.org/html/2605.30651#bib.bib47)] for fine-tuning. Assumption [A.3](https://arxiv.org/html/2605.30651#A1.Thmappassumption3) is a residual-direction conditioning condition: it does not require the full weighted NTK matrix to be positive definite in directions corresponding to zero-weight trajectories, but only requires non-degenerate curvature along the residual directions that determine the weighted loss and its gradient. Assumption [A.4](https://arxiv.org/html/2605.30651#A1.Thmappassumption4) is a local non-degeneracy condition on teacher-forced token probabilities: $p_{t}>0$ holds strictly under finite logits, while $p_{t}<1$ excludes already-converged tokens for which the residual signal vanishes.

##### Loss-residual relation.

Before establishing the main result, we relate the loss $\mathcal{L}(\boldsymbol{\theta};\mathbf{q})$ to the squared residual norm $\|\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})\|^{2}$. The relation depends only on Assumption [A.4](https://arxiv.org/html/2605.30651#A1.Thmappassumption4), and the constants $\nu_{-},\nu_{+}$ are derived explicitly from $p_{-},p_{+}$.

###### Lemma 4 (Loss–residual two-sided bound).

Under Assumption [A.4](https://arxiv.org/html/2605.30651#A1.Thmappassumption4), for every $\boldsymbol{\theta}\in B(\boldsymbol{\theta}_{\mathrm{ref}},R)$ and every $\mathbf{q}\in\Delta^{K}$,

$$\nu_{-}\,\|\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})\|^{2}\;\leq\;\mathcal{L}(\boldsymbol{\theta};\mathbf{q})\;\leq\;\nu_{+}\,\|\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})\|^{2},\tag{21}$$

where

$$\nu_{-}\;=\;\tfrac{1}{2},\qquad\nu_{+}\;=\;\frac{\log(1/p_{-})}{(1-p_{+})^{2}}.\tag{22}$$

###### Proof.

We first establish corresponding pointwise bounds at the token level: for every token position $t$, with $p\triangleq p_{t}\in[p_{-},p_{+}]$,

$$\tfrac{1}{2}\,\|\delta_{t}\|^{2}\;\leq\;\ell_{t}\;\leq\;\frac{\log(1/p_{-})}{(1-p_{+})^{2}}\,\|\delta_{t}\|^{2}.\tag{23}$$
Lower bound.

$$\begin{aligned}
\|\delta_{t}\|^{2} &\stackrel{(a)}{=} (1-p)^{2}+\sum_{v\neq y_{t}}\pi_{t}(v)^{2}\\
&\stackrel{(b)}{\leq} (1-p)^{2}+\Bigl(\max_{v\neq y_{t}}\pi_{t}(v)\Bigr)\sum_{v\neq y_{t}}\pi_{t}(v)\\
&\stackrel{(c)}{\leq} (1-p)^{2}+(1-p)\cdot(1-p)\;=\;2(1-p)^{2}\\
&\stackrel{(d)}{\leq} 2(1-p)\stackrel{(e)}{\leq}2\ell_{t},
\end{aligned}\tag{24}$$

where (a) is the definition $\delta_{t}=\pi_{t}-\mathbf{e}_{y_{t}}$ together with $\pi_{t}(y_{t})=p$; (b) uses $\sum_{v}a_{v}^{2}\leq\max_{v}a_{v}\cdot\sum_{v}a_{v}$ for non-negative $\{a_{v}\}$; (c) uses both $\max_{v\neq y_{t}}\pi_{t}(v)\leq\sum_{v\neq y_{t}}\pi_{t}(v)=1-p$ and $\sum_{v\neq y_{t}}\pi_{t}(v)=1-p$; (d) uses $1-p\leq 1$; and (e) uses the elementary inequality $-\log p\geq 1-p$ for $p\in(0,1]$.

Upper bound.

$$\begin{aligned}
\ell_{t} &\stackrel{(f)}{=} -\log p\stackrel{(g)}{\leq}\log(1/p_{-})\\
&\stackrel{(h)}{=} \frac{\log(1/p_{-})}{(1-p_{+})^{2}}\,(1-p_{+})^{2}\stackrel{(i)}{\leq}\frac{\log(1/p_{-})}{(1-p_{+})^{2}}\,(1-p)^{2}\stackrel{(j)}{\leq}\frac{\log(1/p_{-})}{(1-p_{+})^{2}}\,\|\delta_{t}\|^{2},
\end{aligned}\tag{25}$$

where (f) is the definition of $\ell_{t}$; (g) uses $p\geq p_{-}$ and the monotonicity of $-\log$; (h) is multiplication and division by $(1-p_{+})^{2}>0$; (i) uses $(1-p_{+})^{2}\leq(1-p)^{2}$ from $p\leq p_{+}<1$; and (j) uses $\|\delta_{t}\|^{2}\geq(1-p)^{2}$ from the $y_{t}$-th coordinate.

Aggregation. Summing ([23](https://arxiv.org/html/2605.30651#A1.E23)) over $t=1,\ldots,|a_{k}|$ and dividing by $|a_{k}|$ yields

$$\nu_{-}\,\|\boldsymbol{\Delta}_{k}\|^{2}/|a_{k}|\leq\mathcal{L}(\boldsymbol{\theta};\mathbf{y}_{k})\leq\nu_{+}\,\|\boldsymbol{\Delta}_{k}\|^{2}/|a_{k}|,$$

where $\|\boldsymbol{\Delta}_{k}\|^{2}=\sum_{t}\|\delta_{t}\|^{2}$. Multiplying by $q_{k}$ and summing over $k$ gives

$$\nu_{-}\,\sum_{k=1}^{K}\frac{q_{k}}{|a_{k}|}\,\|\boldsymbol{\Delta}_{k}\|^{2}\;\leq\;\mathcal{L}(\boldsymbol{\theta};\mathbf{q})\;\leq\;\nu_{+}\,\sum_{k=1}^{K}\frac{q_{k}}{|a_{k}|}\,\|\boldsymbol{\Delta}_{k}\|^{2},\tag{26}$$

and the result follows from

$$\sum_{k}(q_{k}/|a_{k}|)\,\|\boldsymbol{\Delta}_{k}\|^{2}=\|\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})\|^{2}.$$

∎
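
The token-level sandwich in Eq. (23) can be checked numerically on synthetic next-token distributions. The sketch below is only an illustration under stated assumptions: the vocabulary size `V`, the Gaussian logits, and the use of the sample minimum and maximum of $p_t$ as stand-ins for $p_-$ and $p_+$ are hypothetical choices, not quantities from the paper.

```python
import math
import random

random.seed(0)
V = 8  # toy vocabulary size (assumption)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token distributions; index 0 plays the role of the ground-truth token y_t.
tokens = [softmax([random.gauss(0.0, 1.0) for _ in range(V)]) for _ in range(200)]
p_values = [pi[0] for pi in tokens]
p_minus, p_plus = min(p_values), max(p_values)  # empirical stand-ins for Assumption A.4

nu_minus = 0.5
nu_plus = math.log(1.0 / p_minus) / (1.0 - p_plus) ** 2

for pi in tokens:
    p = pi[0]
    loss_t = -math.log(p)                                     # token cross-entropy
    delta_sq = (p - 1.0) ** 2 + sum(x * x for x in pi[1:])    # ||pi - e_y||^2
    assert nu_minus * delta_sq <= loss_t <= nu_plus * delta_sq

print(f"nu_minus={nu_minus}, nu_plus={nu_plus:.3f}")
```

Because $\nu_+$ grows as $p_+$ approaches 1 or $p_-$ approaches 0, the upper constant is typically far looser than the lower one on such samples.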

##### Main result.

We now establish the main result.

###### Proposition 2 (NTK-based justification of Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1)).

Under Assumptions [A.1](https://arxiv.org/html/2605.30651#A1.Thmappassumption1)–[A.4](https://arxiv.org/html/2605.30651#A1.Thmappassumption4), Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) holds with

$$\kappa\;=\;\frac{(1-\epsilon)\,\lambda_{-}\,\nu_{-}}{\lambda_{+}\,\nu_{+}}\;=\;\frac{(1-\epsilon)\,\lambda_{-}\,(1-p_{+})^{2}}{2\,\lambda_{+}\,\log(1/p_{-})}\;>\;0.\tag{27}$$

###### Proof.

Fix any $\mathbf{q}\in\Delta^{K}$ and $s\in[0,T]$, and set $\boldsymbol{\theta}=\phi_{\mathbf{q}}(s)$. Assumption [A.1](https://arxiv.org/html/2605.30651#A1.Thmappassumption1) gives $\boldsymbol{\theta}\in B(\boldsymbol{\theta}_{\mathrm{ref}},R)$.

Writing $\tilde{\boldsymbol{\Delta}}\triangleq\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta};\mathbf{q})$,

$$\begin{aligned}
\|\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta};\mathbf{q})\|^{2} &\stackrel{(a)}{=} \tilde{\boldsymbol{\Delta}}^{\top}\Theta(\boldsymbol{\theta};\mathbf{q})\tilde{\boldsymbol{\Delta}}\\
&\stackrel{(b)}{\geq} (1-\epsilon)\,\tilde{\boldsymbol{\Delta}}^{\top}\Theta(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\tilde{\boldsymbol{\Delta}}\\
&\stackrel{(c)}{\geq} (1-\epsilon)\,\lambda_{-}\,\|\tilde{\boldsymbol{\Delta}}\|^{2},
\end{aligned}\tag{28}$$

where (a) is the NTK identity ([15](https://arxiv.org/html/2605.30651#A1.E15)); (b) applies the lower-bound part of Assumption [A.2](https://arxiv.org/html/2605.30651#A1.Thmappassumption2); and (c) applies the residual-direction lower bound ([18](https://arxiv.org/html/2605.30651#A1.E18)) from Assumption [A.3](https://arxiv.org/html/2605.30651#A1.Thmappassumption3). Combined with the upper bound $\mathcal{L}(\boldsymbol{\theta};\mathbf{q})\leq\nu_{+}\,\|\tilde{\boldsymbol{\Delta}}\|^{2}$ from Lemma [4](https://arxiv.org/html/2605.30651#Thmlemma4),

$$\rho(\boldsymbol{\theta};\mathbf{q})\;=\;\frac{\|\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta};\mathbf{q})\|^{2}}{\mathcal{L}(\boldsymbol{\theta};\mathbf{q})}\;\geq\;\frac{(1-\epsilon)\,\lambda_{-}\,\|\tilde{\boldsymbol{\Delta}}\|^{2}}{\nu_{+}\,\|\tilde{\boldsymbol{\Delta}}\|^{2}}\;=\;\frac{(1-\epsilon)\,\lambda_{-}}{\nu_{+}}.\tag{29}$$

Writing $\tilde{\boldsymbol{\Delta}}_{\mathrm{ref}}\triangleq\tilde{\boldsymbol{\Delta}}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})$,

$$\begin{aligned}
\|\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\|^{2} &\stackrel{(d)}{=} \tilde{\boldsymbol{\Delta}}_{\mathrm{ref}}^{\top}\Theta(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\tilde{\boldsymbol{\Delta}}_{\mathrm{ref}}\\
&\stackrel{(e)}{\leq} \lambda_{+}\,\|\tilde{\boldsymbol{\Delta}}_{\mathrm{ref}}\|^{2},
\end{aligned}\tag{30}$$

where (d) is ([15](https://arxiv.org/html/2605.30651#A1.E15)) at $\boldsymbol{\theta}_{\mathrm{ref}}$ and (e) applies the residual-direction upper bound ([19](https://arxiv.org/html/2605.30651#A1.E19)) from Assumption [A.3](https://arxiv.org/html/2605.30651#A1.Thmappassumption3). Combined with the lower bound $\mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\geq\nu_{-}\,\|\tilde{\boldsymbol{\Delta}}_{\mathrm{ref}}\|^{2}$ from Lemma [4](https://arxiv.org/html/2605.30651#Thmlemma4),

$$\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\;=\;\frac{\|\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})\|^{2}}{\mathcal{L}(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})}\;\leq\;\frac{\lambda_{+}\,\|\tilde{\boldsymbol{\Delta}}_{\mathrm{ref}}\|^{2}}{\nu_{-}\,\|\tilde{\boldsymbol{\Delta}}_{\mathrm{ref}}\|^{2}}\;=\;\frac{\lambda_{+}}{\nu_{-}}.\tag{31}$$

Combining ([29](https://arxiv.org/html/2605.30651#A1.E29)) and ([31](https://arxiv.org/html/2605.30651#A1.E31)), both being positive,

$$\frac{\rho(\phi_{\mathbf{q}}(s);\mathbf{q})}{\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})}\;\stackrel{(f)}{\geq}\;\frac{(1-\epsilon)\,\lambda_{-}/\nu_{+}}{\lambda_{+}/\nu_{-}}\;=\;\frac{(1-\epsilon)\,\lambda_{-}\,\nu_{-}}{\lambda_{+}\,\nu_{+}},\tag{32}$$

where (f) uses the lower bound on the numerator from ([29](https://arxiv.org/html/2605.30651#A1.E29)) and the upper bound on the denominator from ([31](https://arxiv.org/html/2605.30651#A1.E31)). Since $\mathbf{q}$ and $s$ were arbitrary,

$$\inf_{s\in[0,T]}\rho(\phi_{\mathbf{q}}(s);\mathbf{q})\;\geq\;\frac{(1-\epsilon)\,\lambda_{-}\,\nu_{-}}{\lambda_{+}\,\nu_{+}}\cdot\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q}),\tag{33}$$

which is Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) with

$$\kappa=\frac{(1-\epsilon)\lambda_{-}\nu_{-}}{\lambda_{+}\nu_{+}}.$$

Substituting $\nu_{-}=1/2$ and $\nu_{+}=\log(1/p_{-})/(1-p_{+})^{2}$ from Lemma [4](https://arxiv.org/html/2605.30651#Thmlemma4) yields ([27](https://arxiv.org/html/2605.30651#A1.E27)). ∎

##### Remark on $\kappa$.

The expression

$$\kappa=\frac{(1-\epsilon)\lambda_{-}(1-p_{+})^{2}}{2\lambda_{+}\log(1/p_{-})}$$

admits a clean interpretation. The ratio $\lambda_{-}/\lambda_{+}$ is the reciprocal of the effective condition number of the empirical NTK along the residual directions that determine the weighted loss; $1-\epsilon$ captures NTK stability along the trajectory; and $(1-p_{+})^{2}/\log(1/p_{-})$ is a geometric correction accounting for the cross-entropy loss, arising from the unbounded growth of $-\log p$ near $p=0$. Under squared loss, this last factor reduces to a constant and $\kappa$ collapses to $(1-\epsilon)\lambda_{-}/\lambda_{+}$, recovering the classical NTK convergence-rate dependence on the kernel condition number [Jacot et al., [2018](https://arxiv.org/html/2605.30651#bib.bib43), Du et al., [2019](https://arxiv.org/html/2605.30651#bib.bib45)]. The post-training loss bound in Proposition [1](https://arxiv.org/html/2605.30651#Thmproposition1) thus holds with this $\kappa$, providing theoretical justification for using $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})$ as the anchor-time selection objective in Section [4.1](https://arxiv.org/html/2605.30651#S4.SS1).
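
To make the dependence of $\kappa$ on its ingredients concrete, the following minimal sketch evaluates the closed form of Eq. (27). The function name `kappa_ntk` and every numeric constant below are hypothetical values chosen for illustration; none of them are measured in the paper.

```python
import math

def kappa_ntk(eps, lam_minus, lam_plus, p_minus, p_plus):
    """Anchor-relative constant of Eq. (27) from the NTK and probability bounds."""
    nu_minus = 0.5
    nu_plus = math.log(1.0 / p_minus) / (1.0 - p_plus) ** 2
    return (1.0 - eps) * lam_minus * nu_minus / (lam_plus * nu_plus)

# Hypothetical constants chosen only for illustration.
base = kappa_ntk(eps=0.1, lam_minus=1.0, lam_plus=10.0, p_minus=1e-3, p_plus=0.9)
tighter_p = kappa_ntk(eps=0.1, lam_minus=1.0, lam_plus=10.0, p_minus=1e-2, p_plus=0.9)
print(base, tighter_p)
```

Raising the floor $p_-$ on the ground-truth probability shrinks $\log(1/p_-)$ and therefore enlarges $\kappa$, while pushing $p_+$ toward 1 drives $\kappa$ toward zero through the $(1-p_+)^2$ factor.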

### A.3 Empirical Verification of the Trajectory Bound

##### Experimental setup.

Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) quantifies over all $\mathbf{q}\in\Delta^{K}$, with no restriction on the candidate pool of the weighted SFT loss in Eq. ([1](https://arxiv.org/html/2605.30651#S3.E1)): the pool may consist of the $K$ trajectories of a single question, or of (question, trajectory) pairs aggregated across multiple questions. In actual deployment, LARK is not applied to a single question in isolation but to mini-batches drawn from a corpus of questions, where each batch induces a weight vector over the (question, trajectory) pairs it contains. We therefore verify Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) directly in the corpus-level regime in which LARK is deployed.

Model and anchor. We verify on Qwen-2.5-7B, the main student model used in Section [5.1](https://arxiv.org/html/2605.30651#S5.SS1). The pre-trained checkpoint serves as the anchor parameter $\boldsymbol{\theta}_{\mathrm{ref}}$.

Verification corpus. From the 5,000 NuminaMath training questions used in Section [5.1](https://arxiv.org/html/2605.30651#S5.SS1), we randomly sample 500 questions $\{\mathbf{x}_{n}\}_{n=1}^{500}$ as the verification corpus. Each $\mathbf{x}_{n}$ retains its full pool of 33 candidate trajectories (11 teachers $\times$ 3 rollouts). For each $\mathbf{x}_{n}$ and each budget $B\in\{1,3,5,10,20\}$, we apply the LARK selection rule (Lemma [3](https://arxiv.org/html/2605.30651#Thmlemma3)) to obtain a weight vector $\hat{\mathbf{q}}_{n}^{(B)}$. Stacking and rescaling the $500$ per-question weights yields a corpus-level weight vector $\mathbf{Q}^{(B)}\in\Delta^{500\times 33}$, with the corresponding weighted SFT loss

$$\mathcal{L}_{\mathrm{total}}(\boldsymbol{\theta};B)\;\triangleq\;\mathcal{L}\bigl(\boldsymbol{\theta};\mathbf{Q}^{(B)}\bigr)\;=\;\frac{1}{500}\sum_{n=1}^{500}\mathcal{L}\bigl(\boldsymbol{\theta};\hat{\mathbf{q}}_{n}^{(B)}\bigr).\tag{34}$$

We treat $\mathbf{Q}^{(B)}$ directly as the weight vector $\mathbf{q}$ in Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1), so the corpus-level quantities $\rho_{t}$ and $\hat{\kappa}^{(B)}$ defined below are instantiations of Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) at $\mathbf{Q}^{(B)}$.

SFT training. For each budget $B\in\{1,3,5,10,20\}$, we run mini-batch SFT on $\mathcal{L}_{\mathrm{total}}(\cdot;B)$ starting from $\boldsymbol{\theta}_{\mathrm{ref}}$, using the same hyperparameters as the main experiments. Full training details are deferred to Appendix [C.3](https://arxiv.org/html/2605.30651#A3.SS3).

Recorded quantities. At each training step $t\in\{0,1,\ldots,T_{\max}\}$, we record the corpus-level decay rate

$$\rho_{t}\;\triangleq\;\frac{\|\nabla_{\boldsymbol{\theta}}\mathcal{L}_{\mathrm{total}}(\boldsymbol{\theta}_{t};B)\|^{2}}{\mathcal{L}_{\mathrm{total}}(\boldsymbol{\theta}_{t};B)},\tag{35}$$

and the anchor-relative ratio $\hat{\kappa}_{t}\triangleq\rho_{t}/\rho_{0}$, which is the discrete-time analog of $\rho(\phi_{\mathbf{Q}^{(B)}}(s);\mathbf{Q}^{(B)})/\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{Q}^{(B)})$ in Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1). After training, for each $B$ we report the trajectory-wise minimum

$$\hat{\kappa}^{(B)}\;\triangleq\;\min_{t\in[0,T_{\max}]}\rho_{t}/\rho_{0},\tag{36}$$

as the empirical estimate of $\kappa$ in Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) under budget $B$.
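
The recorded quantities in Eqs. (35) and (36) reduce to simple bookkeeping once the squared gradient norm and the loss are logged at every step. The sketch below illustrates that bookkeeping; the function name `empirical_kappa` and the logged values are hypothetical and are not results from the paper, and how the corpus-level gradient norm is obtained in practice is outside its scope.

```python
def empirical_kappa(grad_sq_norms, losses):
    """Return (kappa_hat, argmin step), with kappa_hat = min_t rho_t / rho_0."""
    rhos = [g / l for g, l in zip(grad_sq_norms, losses)]   # Eq. (35) per step
    ratios = [r / rhos[0] for r in rhos]                    # anchor-relative ratios
    step = min(range(len(ratios)), key=ratios.__getitem__)
    return ratios[step], step

# Hypothetical logged values (not results from the paper); step 1 mimics an early spike.
grad_sq = [4.0, 30.0, 1.2, 0.9, 1.0]
loss = [2.0, 1.5, 1.2, 1.0, 0.8]
kappa_hat, step = empirical_kappa(grad_sq, loss)
print(kappa_hat, step)
```

Because the estimate is a minimum over the whole run, an upward spike in $\rho_t$ early in training leaves it unchanged; only the lowest point of the curve matters.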

##### Findings.

Figure [5](https://arxiv.org/html/2605.30651#A1.F5) reports the verification results for $B\in\{1,3,5,10,20\}$. Each panel plots the corpus-level decay rate $\rho_{t}$ (blue solid, left axis) and the anchor-relative ratio $\hat{\kappa}_{t}=\rho_{t}/\rho_{0}$ (red dashed, right axis) against the training step $t$. The trajectory-wise minimum $\hat{\kappa}^{(B)}$ and the step at which it is attained are annotated in each panel title.

![Refer to caption](https://arxiv.org/html/2605.30651v1/x8.png)

Figure 5: Empirical verification of Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) on Qwen-2.5-7B for $B\in\{1,3,5,10,20\}$. Blue: $\rho_{t}$ (left axis); red dashed: $\hat{\kappa}_{t}=\rho_{t}/\rho_{0}$ (right axis). Panel titles report $\hat{\kappa}^{(B)}=\min_{t}\hat{\kappa}_{t}$ and the step at which it is attained.

We highlight three observations from the results.

- *Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) holds across all configurations.* For every $B$, the trajectory-wise minimum $\hat{\kappa}^{(B)}$ is strictly positive ($\hat{\kappa}^{(B)}\in[0.012,0.279]$), confirming that $\rho$ along the SFT trajectory does not collapse to zero relative to its anchor-time value. This empirically validates the existence of a positive $\kappa$ in Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1).
- *Smallest budget yields the largest $\hat{\kappa}$.* The minimum ratio at $B=1$ ($\hat{\kappa}^{(1)}=0.279$) substantially exceeds the values for $B\geq 3$ (which fall in the range $0.012$–$0.015$). Single-trajectory training produces a more stable $\rho$ trajectory and a tighter anchor-relative bound, while multi-trajectory training induces more oscillatory $\rho_{t}$ but still maintains a positive lower bound.
- *$\hat{\kappa}^{(B)}$ stabilizes for moderate $B$.* For $B\in\{3,5,10,20\}$, the minimum ratio remains within a narrow range ($0.012$–$0.015$), indicating that beyond a small budget, the corpus-level anchor-relative bound depends only weakly on $B$.

##### On the early-step transient in $\rho_{t}$.

A closer look at Figure [5](https://arxiv.org/html/2605.30651#A1.F5) reveals that, for $B\geq 3$, $\rho_{t}$ is non-monotone: it exhibits a sharp transient spike during the first ${\sim}10\%$ of training steps (reaching values an order of magnitude above $\rho_{0}$) before settling to a positive plateau. Two factors are consistent with this behavior. First, the empirical NTK $\Theta(\boldsymbol{\theta}_{t};\mathbf{q})$ adapts most rapidly during the warmup phase of SFT, before the lazy-training regime of Assumption [A.1](https://arxiv.org/html/2605.30651#A1.Thmappassumption1) fully takes effect; the looser NTK stability in this early window admits larger fluctuations of $\rho_{t}$ around its anchor-time value. Second, weighted multi-trajectory gradients $\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_{t};\mathbf{q})$ admit cancellation across trajectory blocks with different residual directions, which can produce non-monotone $\|\nabla_{\boldsymbol{\theta}}\mathcal{L}\|^{2}$ in early steps. Crucially, the empirical $\hat{\kappa}^{(B)}$ reported in the panel titles is the trajectory-wise *minimum* of $\rho_{t}/\rho_{0}$ over the entire training run, which already absorbs this transient: even at the lowest point of the trajectory, $\rho_{t}/\rho_{0}$ remains bounded away from zero, which is exactly what Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) requires. The transient therefore tightens (rather than relaxes) the empirical estimate of $\kappa$.

## Appendix BTheoretical Analysis of LARK

This appendix collects the formal statements, proofs, and supporting derivations for the theoretical claims in Section[4\.2](https://arxiv.org/html/2605.30651#S4.SS2)\(the local linearization analysis of LARK\) and Section[4\.3](https://arxiv.org/html/2605.30651#S4.SS3)\(the closed\-form selection rule\)\.

We begin in Appendix [B.1](https://arxiv.org/html/2605.30651#A2.SS1) by establishing that $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})$ is quasiconvex on $\Delta^K$ and that its (constrained or unconstrained) maximum collapses to a one-hot solution; this is the formal justification for the local-linearization strategy adopted in Section [4.2](https://arxiv.org/html/2605.30651#S4.SS2). Appendix [B.2](https://arxiv.org/html/2605.30651#A2.SS2) fixes the additional notation and structural assumptions used by the rest of the appendix. Appendix [B.3](https://arxiv.org/html/2605.30651#A2.SS3) collects auxiliary lemmas (gradient identities and bounds at the LM head and backbone, together with the Taylor decomposition of $\rho$ around $\mathbf{p}$) that are reused by the subsequent proofs. Appendices [B.4](https://arxiv.org/html/2605.30651#A2.SS4), [B.5](https://arxiv.org/html/2605.30651#A2.SS5), [B.6](https://arxiv.org/html/2605.30651#A2.SS6), and [B.7](https://arxiv.org/html/2605.30651#A2.SS7) respectively give the formal versions and full proofs of Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1) (forward-pass $\rho_k^{\ast}$ estimation), Lemma [2](https://arxiv.org/html/2605.30651#Thmlemma2) ($\chi^2$ controls both $R_1$ and $R_2$), Theorem [1](https://arxiv.org/html/2605.30651#Thmtheorem1) (the LARK objective lower bound), and Lemma [3](https://arxiv.org/html/2605.30651#Thmlemma3) (the $B$-parameterized closed form). Finally, Appendix [B.8](https://arxiv.org/html/2605.30651#A2.SS8) provides a quantitative error bound on the surrogate $\hat{g}_k$ relative to the exact directional derivative $g_k^{\ast}$, together with empirical evidence that the two are tightly correlated on real student models.

### B.1 Quasiconvexity of $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})$ and Collapse to One-Hot

Throughout this subsection, fix $\boldsymbol{\theta} = \boldsymbol{\theta}_{\mathrm{ref}}$ and recall from Section [4.2](https://arxiv.org/html/2605.30651#S4.SS2) that $\mathbf{g}_k = \nabla_{\boldsymbol{\theta}}\ell_k$ with $\ell_k = \ell(\boldsymbol{\theta}_{\mathrm{ref}}, \mathbf{y}_k) > 0$. Define

$$N(\mathbf{q}) \;\triangleq\; \Bigl\|\textstyle\sum_k q_k \mathbf{g}_k\Bigr\|^2 \;=\; \mathbf{q}^\top G\,\mathbf{q}, \qquad D(\mathbf{q}) \;\triangleq\; \textstyle\sum_k q_k \ell_k, \tag{37}$$

where $G \in \mathbb{R}^{K\times K}$ with $G_{ij} = \langle \mathbf{g}_i,\,\mathbf{g}_j\rangle$ is the gradient Gram matrix. Since $G \succeq 0$, $N(\mathbf{q})$ is a convex quadratic in $\mathbf{q}$, and $D(\mathbf{q}) > 0$ is affine, so $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) = N(\mathbf{q})/D(\mathbf{q})$ is the ratio of a convex quadratic to a strictly positive affine function on $\Delta^K$.

###### Proposition 3 (Quasiconvexity of $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})$).

$\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})$ is quasiconvex in $\mathbf{q} \in \Delta^K$.

###### Proof\.

It suffices to show that for every $c \geq 0$, the sublevel set $\mathcal{S}_c \triangleq \{\mathbf{q} \in \Delta^K : \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) \leq c\}$ is convex (for $c < 0$ the set is empty, since $N(\mathbf{q}) \geq 0$ and $D(\mathbf{q}) > 0$ give $\rho \geq 0$). We have

$$\begin{aligned}
\mathbf{q} \in \mathcal{S}_c
&\;\stackrel{(a)}{\iff}\; \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) \leq c \\
&\;\stackrel{(b)}{\iff}\; \frac{N(\mathbf{q})}{D(\mathbf{q})} \leq c \\
&\;\stackrel{(c)}{\iff}\; N(\mathbf{q}) \leq c \cdot D(\mathbf{q}) \\
&\;\stackrel{(d)}{\iff}\; \mathbf{q}^\top G\,\mathbf{q} - c\textstyle\sum_k q_k \ell_k \leq 0,
\end{aligned} \tag{38}$$

where (a) is the definition of $\mathcal{S}_c$; (b) substitutes $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) = N(\mathbf{q})/D(\mathbf{q})$; (c) multiplies both sides by $D(\mathbf{q}) > 0$; and (d) substitutes the definitions of $N$ and $D$ from ([37](https://arxiv.org/html/2605.30651#A2.E37)). The left-hand side of the final inequality in ([38](https://arxiv.org/html/2605.30651#A2.E38)) is the sum of the convex quadratic $\mathbf{q}^\top G\,\mathbf{q}$ (since $G \succeq 0$) and the affine term $-c\sum_k q_k \ell_k$, hence convex in $\mathbf{q}$. Therefore $\mathcal{S}_c$ is the intersection of a convex sublevel set with the convex set $\Delta^K$, which is convex. ∎

###### Corollary 1 (Maximum at a vertex).

$\max_{\mathbf{q} \in \Delta^K} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})$ is attained at an extreme point of $\Delta^K$.

###### Proof\.

We have

$$\begin{aligned}
\max_{\mathbf{q} \in \Delta^K} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})
&\;\stackrel{(a)}{=}\; \max_{\mathbf{q} \in \Delta^K} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})\big|_{\text{extreme point}} \\
&\;\stackrel{(b)}{=}\; \max_{k \in [K]} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{e}_k),
\end{aligned} \tag{39}$$

where (a) follows because every $\mathbf{q} \in \Delta^K$ can be written as a convex combination of the extreme points $\{\mathbf{e}_1, \ldots, \mathbf{e}_K\}$, namely $\mathbf{q} = \sum_{k=1}^{K} q_k \mathbf{e}_k$, and quasiconvexity implies

$$\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) \;\leq\; \max_{k \in [K]:\, q_k > 0} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{e}_k) \;\leq\; \max_{k \in [K]} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{e}_k).$$

The reverse inequality is immediate since each $\mathbf{e}_k \in \Delta^K$. Hence the maximum over $\Delta^K$ is attained at an extreme point. Step (b) identifies the extreme points of $\Delta^K$ as the standard basis vectors $\{\mathbf{e}_1, \ldots, \mathbf{e}_K\}$. ∎

###### Corollary 2 (One-hot optimum).

$$\max_{\mathbf{q} \in \Delta^K} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) \;=\; \max_{k \in [K]} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{e}_k) \;=\; \max_{k \in [K]} \frac{\|\mathbf{g}_k\|^2}{\ell_k} \;=\; \max_{k \in [K]} \rho_k^{\ast}. \tag{40}$$

###### Proof\.

We have

$$\begin{aligned}
\max_{\mathbf{q} \in \Delta^K} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})
&\;\stackrel{(a)}{=}\; \max_{k \in [K]} \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{e}_k) \\
&\;\stackrel{(b)}{=}\; \max_{k \in [K]} \frac{N(\mathbf{e}_k)}{D(\mathbf{e}_k)} \;=\; \max_{k \in [K]} \frac{\|\mathbf{g}_k\|^2}{\ell_k} \;\stackrel{(c)}{=}\; \max_{k \in [K]} \rho_k^{\ast},
\end{aligned} \tag{41}$$

where (a) applies Corollary [1](https://arxiv.org/html/2605.30651#Thmcorollary1); (b) substitutes $\mathbf{q} = \mathbf{e}_k$ into the definitions in ([37](https://arxiv.org/html/2605.30651#A2.E37)), giving $N(\mathbf{e}_k) = \|\mathbf{g}_k\|^2$ and $D(\mathbf{e}_k) = \ell_k$; and (c) uses the definition $\rho_k^{\ast} = \|\mathbf{g}_k\|^2/\ell_k$ from Section [4.2](https://arxiv.org/html/2605.30651#S4.SS2). ∎
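The vertex-maximum property in Corollaries 1 and 2 can be sanity-checked numerically. The sketch below is an illustration only, not the authors' code: synthetic Gaussian vectors stand in for the gradients $\mathbf{g}_k$, the losses $\ell_k$ are drawn at random, and the names `rho`, `grads`, `losses`, `K`, and `P` are hypothetical. It samples mixtures from $\Delta^K$ and compares the largest observed $\rho$ against $\max_k \rho_k^{\ast}$.

```python
import numpy as np

def rho(q, grads, losses):
    """rho(q) = ||sum_k q_k g_k||^2 / sum_k q_k l_k, as in Eq. (37)."""
    mixed = q @ grads                      # weighted gradient, shape (P,)
    return float(mixed @ mixed) / float(q @ losses)

rng = np.random.default_rng(0)
K, P = 6, 40                               # hypothetical pool size and parameter count
grads = rng.normal(size=(K, P))            # row k stands in for g_k
losses = rng.uniform(0.5, 2.0, size=K)     # strictly positive losses l_k

# Per-trajectory rates rho_k^* = ||g_k||^2 / l_k, i.e. rho at the vertices e_k.
rho_star = (grads ** 2).sum(axis=1) / losses
vertex_max = float(rho_star.max())

# Sample mixtures from the simplex and record the largest rho observed.
samples = rng.dirichlet(np.ones(K), size=2000)
rho_mixed_max = max(rho(q, grads, losses) for q in samples)

print("largest rho over sampled mixtures:", rho_mixed_max)
print("largest rho over vertices:        ", vertex_max)
```

If the corollaries hold, no sampled mixture should exceed the vertex value, which is why the unregularized maximizer collapses to a single trajectory.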

### B.2 Notation and Structural Assumptions

This subsection fixes the additional notation and structural assumptions used throughout Appendices [B.3](https://arxiv.org/html/2605.30651#A2.SS3)–[B.8](https://arxiv.org/html/2605.30651#A2.SS8).

#### B.2.1 Notation

##### Symbols inherited from the main text.

We use the same symbols as Section [4.2](https://arxiv.org/html/2605.30651#S4.SS2): $\mathbf{x}$ is the question; $\mathbf{y}_k = (y_1^k, \ldots, y_{|a_k|}^k)$ is the $k$-th candidate trajectory, of length $|a_k|$; $\boldsymbol{\pi}_t^k \in \Delta^{|\mathcal{V}|}$ is the predictive distribution of the reference student $\pi_{\mathrm{ref}}$ at position $t$ under teacher forcing; $\boldsymbol{\delta}(y_t^k) \in \Delta^{|\mathcal{V}|}$ is the one-hot vector for the ground-truth token; $\ell_k = \ell(\boldsymbol{\theta}_{\mathrm{ref}}, \mathbf{y}_k)$ is the length-normalized cross-entropy of trajectory $\mathbf{y}_k$; $\mathbf{g}_k = \nabla_{\boldsymbol{\theta}}\ell_k$ is its gradient; $\rho_k^{\ast} = \|\mathbf{g}_k\|^2/\ell_k$ is the per-trajectory rate; $\hat{\rho}_k$ is its forward-pass proxy (see Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1)); $\hat{g}_k$ and $g_k^{\ast}$ are the surrogate and exact directional derivatives at $\mathbf{p}$; $R_1, R_2$ are the residual terms in ([4](https://arxiv.org/html/2605.30651#S4.E4)); $\mathbf{p} = (1/K, \ldots, 1/K) \in \Delta^K$ is the uniform prior.

##### Parameter decomposition.

We split the student parameters as $\boldsymbol{\theta} = (\mathbf{W}_{\mathrm{out}}, \boldsymbol{\theta}_{\mathrm{rest}})$, where $\mathbf{W}_{\mathrm{out}} \in \mathbb{R}^{|\mathcal{V}| \times d}$ is the LM head matrix and $\boldsymbol{\theta}_{\mathrm{rest}}$ collects all backbone parameters. For each trajectory $\mathbf{y}_k$ and position $t$, the backbone produces a hidden state $\mathbf{h}_t \in \mathbb{R}^d$, the LM head produces logits $\mathbf{z}_t = \mathbf{W}_{\mathrm{out}}\mathbf{h}_t \in \mathbb{R}^{|\mathcal{V}|}$, and $\boldsymbol{\pi}_t^k = \mathrm{softmax}(\mathbf{z}_t)$. The token-level loss is $\ell_t = -\log \pi_t^k(y_t^k)$, so that $\ell_k = \tfrac{1}{|a_k|}\sum_t \ell_t$.

##### Token-level residual.

We write the per-token prediction residual as $\boldsymbol{\delta}_t \triangleq \boldsymbol{\pi}_t^k - \boldsymbol{\delta}(y_t^k) \in \mathbb{R}^{|\mathcal{V}|}$, which coincides with $\nabla_{\mathbf{z}_t}\ell_t$ by the standard softmax–cross-entropy identity (see Lemma [5](https://arxiv.org/html/2605.30651#Thmlemma5) below).

##### Stacked matrices.

For each trajectory $\mathbf{y}_k$ we collect the per-token residuals and hidden states into matrices

$$\boldsymbol{\Delta} \;\triangleq\; [\boldsymbol{\delta}_1, \ldots, \boldsymbol{\delta}_{|a_k|}]^\top \;\in\; \mathbb{R}^{|a_k| \times |\mathcal{V}|}, \qquad \mathbf{H} \;\triangleq\; [\mathbf{h}_1, \ldots, \mathbf{h}_{|a_k|}]^\top \;\in\; \mathbb{R}^{|a_k| \times d}. \tag{42}$$

We also write $\mathbf{J}_H(\mathbf{y}_k) \triangleq [\partial\mathbf{h}_t/\partial\boldsymbol{\theta}_{\mathrm{rest}}]_{t=1}^{|a_k|} \in \mathbb{R}^{|a_k|\,d \times |\boldsymbol{\theta}_{\mathrm{rest}}|}$ for the sequence-level Jacobian of the hidden states with respect to the backbone parameters. The dependence on $\mathbf{y}_k$ is suppressed when the trajectory is clear from context.

##### Aggregated quantities.

Two aggregations of the per-token residual norms appear repeatedly. Define the (length-normalized) Brier score

$$\mathrm{Brier}(\mathbf{y}_k) \;\triangleq\; \tfrac{1}{|a_k|}\sum_{t=1}^{|a_k|} \|\boldsymbol{\delta}_t\|^2 \;=\; \tfrac{\|\boldsymbol{\Delta}\|_F^2}{|a_k|}. \tag{43}$$

With this notation, the forward-pass proxy of Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1) can be written as

$$\hat{\rho}_k \;=\; \frac{\sum_t \|\boldsymbol{\delta}_t\|^2}{\sum_t \ell_t} \;=\; \frac{\mathrm{Brier}(\mathbf{y}_k)}{\ell_k}, \tag{44}$$

so that the $1/|a_k|$ factors cancel exactly in numerator and denominator.
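To make Eqs. (43) and (44) concrete, the following NumPy sketch computes the Brier score, the length-normalized loss, and the proxy $\hat{\rho}_k$ from teacher-forced logits. It is a minimal illustration under stated assumptions: the function name `forward_pass_proxy`, its argument layout, and the random inputs are hypothetical and are not taken from the LARK codebase.

```python
import numpy as np

def forward_pass_proxy(logits, targets):
    """Return (rho_hat, brier, loss) for one trajectory.

    logits:  float array of shape (T, V), teacher-forced logits z_t
    targets: int array of shape (T,), ground-truth token ids y_t
    """
    # Numerically stable softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    steps = np.arange(logits.shape[0])
    token_loss = -np.log(probs[steps, targets])   # l_t = -log pi_t(y_t)

    residual = probs.copy()                        # delta_t = pi_t - onehot(y_t)
    residual[steps, targets] -= 1.0
    sq_norm = (residual ** 2).sum(axis=1)          # ||delta_t||^2

    brier = float(sq_norm.mean())                  # Eq. (43)
    loss = float(token_loss.mean())                # length-normalized loss l_k
    rho_hat = float(sq_norm.sum() / token_loss.sum())  # Eq. (44)
    return rho_hat, brier, loss

# Illustrative call on random inputs (8 tokens, vocabulary of 11).
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 11))
targets = rng.integers(0, 11, size=8)
rho_hat, brier, loss = forward_pass_proxy(logits, targets)
print(rho_hat, brier / loss)
```

The two printed values should agree up to rounding, reflecting the cancellation of the $1/|a_k|$ factors; no backward pass is involved.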

##### Multi-trajectory quantities.

Across the candidate pool, write $G \in \mathbb{R}^{K \times K}$ for the gradient Gram matrix with $G_{ij} = \langle \mathbf{g}_i, \mathbf{g}_j \rangle$, and let $\boldsymbol{\ell} = (\ell_1, \ldots, \ell_K)^\top$. We use the shorthands

$$\bar{\mathbf{g}} \;\triangleq\; \tfrac{1}{K}\textstyle\sum_k \mathbf{g}_k, \qquad \bar{\ell} \;\triangleq\; \tfrac{1}{K}\textstyle\sum_k \ell_k, \tag{45}$$

together with $\hat{\rho}_{\max} \triangleq \max_i \hat{\rho}_i$, $\ell_{\min} \triangleq \min_i \ell_i$, and $\ell_{\max} \triangleq \max_i \ell_i$. We also write $\|\mathbf{v}\|_{\mathbf{p}}^2 \triangleq \sum_k p_k v_k^2$ for the $\mathbf{p}$-weighted Euclidean norm; under uniform $\mathbf{p}$, $\|\mathbf{v}\|_{\mathbf{p}}^2 = \tfrac{1}{K}\sum_k v_k^2$.

#### B.2.2 Structural assumptions

We make four structural assumptions on the reference student $\pi_{\mathrm{ref}}$ and its candidate-pool statistics. All four are mild conditions that are either standard in the convergence analysis of deep networks or directly verifiable from a single forward pass.

###### Assumption B.1 (Hidden-state norm stability).

There exist constants $0 < C_- \leq C_+$ such that for every candidate trajectory $\mathbf{y}_k$ and every token position $t$,

$$C_-^2 \;\leq\; \|\mathbf{h}_t\|^2 \;\leq\; C_+^2. \tag{46}$$

###### Assumption B.2 (Local Jacobian conditioning).

There exists $\Lambda > 0$ such that for every candidate trajectory $\mathbf{y}_k$,

$$\lambda_{\max}\bigl(\mathbf{J}_H(\mathbf{y}_k)\,\mathbf{J}_H(\mathbf{y}_k)^\top\bigr) \;\leq\; \Lambda. \tag{47}$$

###### Assumption B.3 (Matrix cosine alignment).

There exists $\zeta_- > 0$ such that for every candidate trajectory $\mathbf{y}_k$,

$$\zeta(\mathbf{y}_k) \;\triangleq\; \frac{\|\boldsymbol{\Delta}^\top \mathbf{H}\|_F^2}{\|\boldsymbol{\Delta}\|_F^2 \cdot \|\mathbf{H}\|_F^2} \;\geq\; \zeta_- \;>\; 0. \tag{48}$$

###### Assumption B.4 (Strictly positive trajectory loss).

$\ell_{\min} = \min_{i \in [K]} \ell_i > 0$.

### B.3 Auxiliary Lemmas

This subsection collects building-block lemmas that are reused by the proofs in Appendices [B.4](https://arxiv.org/html/2605.30651#A2.SS4)–[B.6](https://arxiv.org/html/2605.30651#A2.SS6). Lemmas [5](https://arxiv.org/html/2605.30651#Thmlemma5)–[7](https://arxiv.org/html/2605.30651#Thmlemma7) are the gradient identity and gradient norm bounds for the LM head and the backbone; they are the main tools for the proof of Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1) in Appendix [B.4](https://arxiv.org/html/2605.30651#A2.SS4). Lemma [8](https://arxiv.org/html/2605.30651#Thmlemma8) gives the Taylor decomposition of $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})$ around the uniform prior $\mathbf{p}$, together with the closed form of $g_k^{\ast}$ and the definition of $\hat{g}_k$ in terms of forward-pass quantities; this is the starting identity for the proof of Lemma [2](https://arxiv.org/html/2605.30651#Thmlemma2) in Appendix [B.5](https://arxiv.org/html/2605.30651#A2.SS5). Lemma [9](https://arxiv.org/html/2605.30651#Thmlemma9) provides a forward-computable upper bound on the entries of the gradient Gram matrix $G$ via the proxy $\hat{\rho}_i$, which controls every $G$-dependent quantity that appears in the proof of Lemma [2](https://arxiv.org/html/2605.30651#Thmlemma2). Finally, Lemma [10](https://arxiv.org/html/2605.30651#Thmlemma10) gives the closed-form Hessian of $\rho$ on $\Delta^K$, used to bound the second-order remainder $R_2$.

#### B.3.1 LM head and backbone gradient identities and bounds

###### Lemma 5 (LM head gradient).

Let $\ell_t = -\log \pi_t^k(y_t^k)$, $\boldsymbol{\pi}_t^k = \mathrm{softmax}(\mathbf{z}_t)$, and $\mathbf{z}_t = \mathbf{W}_{\mathrm{out}}\mathbf{h}_t$. Then

$$\frac{\partial \ell_t}{\partial \mathbf{W}_{\mathrm{out}}} \;=\; \boldsymbol{\delta}_t\,\mathbf{h}_t^\top \;\in\; \mathbb{R}^{|\mathcal{V}| \times d}, \qquad \frac{\partial \ell_t}{\partial \mathbf{h}_t} \;=\; \mathbf{W}_{\mathrm{out}}^\top\,\boldsymbol{\delta}_t. \tag{49}$$

###### Proof\.

We prove the first identity; the second is immediate from the chain rule applied to $\mathbf{z}_t = \mathbf{W}_{\mathrm{out}}\mathbf{h}_t$. For any direction $\mathbf{U} \in \mathbb{R}^{|\mathcal{V}| \times d}$,

$$\begin{aligned}
\left\langle \frac{\partial \ell_t}{\partial \mathbf{W}_{\mathrm{out}}},\,\mathbf{U} \right\rangle_F
&\;\stackrel{(a)}{=}\; \left\langle \frac{\partial \ell_t}{\partial \mathbf{z}_t},\,\frac{\partial \mathbf{z}_t}{\partial \mathbf{W}_{\mathrm{out}}}[\mathbf{U}] \right\rangle \\
&\;\stackrel{(b)}{=}\; \left\langle \frac{\partial \ell_t}{\partial \mathbf{z}_t},\,\mathbf{U}\mathbf{h}_t \right\rangle \\
&\;\stackrel{(c)}{=}\; \sum_{v=1}^{|\mathcal{V}|} \frac{\partial \ell_t}{\partial z_{t,v}}\,[\mathbf{U}\mathbf{h}_t]_v \\
&\;\stackrel{(d)}{=}\; \sum_{v=1}^{|\mathcal{V}|} \bigl(\pi_t^k(v) - \mathbf{1}[v = y_t^k]\bigr)\,[\mathbf{U}\mathbf{h}_t]_v \\
&\;\stackrel{(e)}{=}\; \langle \boldsymbol{\pi}_t^k - \boldsymbol{\delta}(y_t^k),\,\mathbf{U}\mathbf{h}_t \rangle \\
&\;\stackrel{(f)}{=}\; \langle \boldsymbol{\delta}_t\,\mathbf{h}_t^\top,\,\mathbf{U} \rangle_F,
\end{aligned} \tag{50}$$

where (a) is the chain rule for the Fréchet derivative; (b) follows from the linearity of $\mathbf{z}_t = \mathbf{W}_{\mathrm{out}}\mathbf{h}_t$ in $\mathbf{W}_{\mathrm{out}}$, giving $\frac{\partial \mathbf{z}_t}{\partial \mathbf{W}_{\mathrm{out}}}[\mathbf{U}] = \mathbf{U}\mathbf{h}_t$; (c) expands the inner product componentwise; (d) is the standard softmax–cross-entropy gradient identity, $\partial \ell_t/\partial z_{t,v} = \pi_t^k(v) - \mathbf{1}[v = y_t^k]$, obtained by applying the quotient rule to $\pi_t^k(y_t^k) = e^{z_{t,y_t^k}}/\sum_{v'} e^{z_{t,v'}}$; (e) recognizes the sum as $\langle \boldsymbol{\delta}_t, \mathbf{U}\mathbf{h}_t \rangle$ with $\boldsymbol{\delta}_t = \boldsymbol{\pi}_t^k - \boldsymbol{\delta}(y_t^k)$; and (f) uses the trace identity $\langle \mathbf{a}, \mathbf{B}\mathbf{c} \rangle = \langle \mathbf{a}\mathbf{c}^\top, \mathbf{B} \rangle_F$. Since $\mathbf{U}$ was arbitrary, $\partial \ell_t/\partial \mathbf{W}_{\mathrm{out}} = \boldsymbol{\delta}_t \mathbf{h}_t^\top$. ∎
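Both identities in Lemma 5 can be checked against central finite differences. The sketch below is an illustrative check on a single synthetic token; the helper `token_loss`, the dimensions `V` and `d`, and the random inputs are assumptions introduced here for illustration.

```python
import numpy as np

def token_loss(W, h, y):
    """l_t = -log softmax(W h)[y], via a stable log-sum-exp."""
    z = W @ h
    z = z - z.max()
    return float(np.log(np.exp(z).sum()) - z[y])

rng = np.random.default_rng(0)
V, d, y = 7, 4, 3                      # hypothetical vocab size, hidden size, target id
W = rng.normal(size=(V, d))            # stands in for W_out
h = rng.normal(size=d)                 # stands in for h_t

# Residual delta_t = softmax(W h) - onehot(y).
z = W @ h
pi = np.exp(z - z.max())
pi /= pi.sum()
delta = pi.copy()
delta[y] -= 1.0

grad_W = np.outer(delta, h)            # first identity in Eq. (49)
grad_h = W.T @ delta                   # second identity in Eq. (49)

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd_W = np.zeros_like(W)
for i in range(V):
    for j in range(d):
        E = np.zeros_like(W)
        E[i, j] = eps
        fd_W[i, j] = (token_loss(W + E, h, y) - token_loss(W - E, h, y)) / (2 * eps)

fd_h = np.zeros_like(h)
for j in range(d):
    e = np.zeros_like(h)
    e[j] = eps
    fd_h[j] = (token_loss(W, h + e, y) - token_loss(W, h - e, y)) / (2 * eps)

print("max |grad_W - fd_W|:", np.abs(grad_W - fd_W).max())
print("max |grad_h - fd_h|:", np.abs(grad_h - fd_h).max())
```

Both printed discrepancies should be at the level of finite-difference error, far below the size of the gradient entries.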

###### Lemma 6 (LM head gradient bounds).

Under Assumptions [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1) and [B.3](https://arxiv.org/html/2605.30651#A2.Thmappassumption3), for every candidate trajectory $\mathbf{y}_k$,

$$\zeta_-\,C_-^2\,\mathrm{Brier}(\mathbf{y}_k) \;\leq\; \|\nabla_{\mathbf{W}_{\mathrm{out}}}\ell_k\|_F^2 \;\leq\; C_+^2\,\mathrm{Brier}(\mathbf{y}_k). \tag{51}$$

###### Proof\.

By Lemma [5](https://arxiv.org/html/2605.30651#Thmlemma5) and the definition $\ell_k = \tfrac{1}{|a_k|}\sum_t \ell_t$,

$$\nabla_{\mathbf{W}_{\mathrm{out}}}\ell_k \;=\; \tfrac{1}{|a_k|}\textstyle\sum_t \boldsymbol{\delta}_t \mathbf{h}_t^\top \;=\; \tfrac{1}{|a_k|}\boldsymbol{\Delta}^\top \mathbf{H}, \tag{52}$$

where the second equality follows from the definitions of $\boldsymbol{\Delta}, \mathbf{H}$ in ([42](https://arxiv.org/html/2605.30651#A2.E42)). Hence $\|\nabla_{\mathbf{W}_{\mathrm{out}}}\ell_k\|_F^2 = \|\boldsymbol{\Delta}^\top \mathbf{H}\|_F^2/|a_k|^2$.

Lower bound.

$$\begin{aligned}
\|\nabla_{\mathbf{W}_{\mathrm{out}}}\ell_k\|_F^2
&\;\stackrel{(a)}{=}\; \frac{\|\boldsymbol{\Delta}^\top \mathbf{H}\|_F^2}{|a_k|^2} \\
&\;\stackrel{(b)}{\geq}\; \frac{\zeta_-\,\|\boldsymbol{\Delta}\|_F^2\,\|\mathbf{H}\|_F^2}{|a_k|^2} \\
&\;\stackrel{(c)}{\geq}\; \frac{\zeta_-\,|a_k|\,\mathrm{Brier}(\mathbf{y}_k) \cdot |a_k|\,C_-^2}{|a_k|^2} \\
&\;=\; \zeta_-\,C_-^2\,\mathrm{Brier}(\mathbf{y}_k),
\end{aligned} \tag{53}$$

where (a) is ([52](https://arxiv.org/html/2605.30651#A2.E52)); (b) applies Assumption [B.3](https://arxiv.org/html/2605.30651#A2.Thmappassumption3), which gives $\|\boldsymbol{\Delta}^\top \mathbf{H}\|_F^2 \geq \zeta_-\,\|\boldsymbol{\Delta}\|_F^2\,\|\mathbf{H}\|_F^2$; and (c) substitutes $\|\boldsymbol{\Delta}\|_F^2 = |a_k|\,\mathrm{Brier}(\mathbf{y}_k)$ from ([43](https://arxiv.org/html/2605.30651#A2.E43)) together with $\|\mathbf{H}\|_F^2 = \sum_t \|\mathbf{h}_t\|^2 \geq |a_k|\,C_-^2$ from Assumption [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1).

Upper bound.

$$\begin{aligned}
\|\nabla_{\mathbf{W}_{\mathrm{out}}}\ell_k\|_F^2
&\;\stackrel{(d)}{=}\; \frac{\|\boldsymbol{\Delta}^\top \mathbf{H}\|_F^2}{|a_k|^2} \\
&\;\stackrel{(e)}{\leq}\; \frac{\|\boldsymbol{\Delta}\|_F^2\,\|\mathbf{H}\|_F^2}{|a_k|^2} \\
&\;\stackrel{(f)}{\leq}\; \frac{|a_k|\,\mathrm{Brier}(\mathbf{y}_k) \cdot |a_k|\,C_+^2}{|a_k|^2} \\
&\;=\; C_+^2\,\mathrm{Brier}(\mathbf{y}_k),
\end{aligned} \tag{54}$$

where (d) is ([52](https://arxiv.org/html/2605.30651#A2.E52)); (e) is the matrix Cauchy–Schwarz inequality $\|\boldsymbol{\Delta}^\top \mathbf{H}\|_F^2 \leq \|\boldsymbol{\Delta}\|_F^2\,\|\mathbf{H}\|_F^2$; and (f) substitutes $\|\boldsymbol{\Delta}\|_F^2 = |a_k|\,\mathrm{Brier}(\mathbf{y}_k)$ from ([43](https://arxiv.org/html/2605.30651#A2.E43)) together with $\|\mathbf{H}\|_F^2 = \sum_t \|\mathbf{h}_t\|^2 \leq |a_k|\,C_+^2$ from Assumption [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1). ∎
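The sandwich in Lemma 6 can be illustrated on one synthetic trajectory. In the sketch below, random matrices stand in for the hidden states and the LM head, and the constants $C_-^2$, $C_+^2$, and $\zeta_-$ are simply taken to be the values realized on that single trajectory; these choices, and all variable names, are assumptions made here for illustration rather than quantities reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, d = 12, 9, 5                        # hypothetical length, vocab size, hidden size
H = rng.normal(size=(T, d))               # stacked hidden states, rows h_t
W = rng.normal(size=(V, d))               # stands in for W_out
y = rng.integers(0, V, size=T)            # ground-truth token ids

# Teacher-forced softmax and stacked residuals Delta (rows delta_t).
Z = H @ W.T
probs = np.exp(Z - Z.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
Delta = probs.copy()
Delta[np.arange(T), y] -= 1.0

# LM-head gradient of the length-normalized loss, Eq. (52).
grad_sq = np.linalg.norm(Delta.T @ H / T, "fro") ** 2

brier = (Delta ** 2).sum() / T            # Eq. (43)
h_sq = (H ** 2).sum(axis=1)               # ||h_t||^2 per token
C_minus_sq, C_plus_sq = h_sq.min(), h_sq.max()

# Matrix cosine alignment zeta(y_k) from Eq. (48), used here in place of zeta_-.
zeta = np.linalg.norm(Delta.T @ H, "fro") ** 2 / ((Delta ** 2).sum() * (H ** 2).sum())

lower = zeta * C_minus_sq * brier
upper = C_plus_sq * brier
print(lower, grad_sq, upper)
```

The three printed numbers should appear in nondecreasing order. How tight the sandwich is depends on how far $\zeta$ sits below 1 and on the spread of the hidden-state norms.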

###### Lemma 7 (Backbone gradient bound).

Under Assumptions [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1) and [B.2](https://arxiv.org/html/2605.30651#A2.Thmappassumption2), for every candidate trajectory $\mathbf{y}_k$,

$$\|\nabla_{\boldsymbol{\theta}_{\mathrm{rest}}}\ell_k\|^2 \;\leq\; \frac{\Lambda\,\|\mathbf{W}_{\mathrm{out}}\|_{\mathrm{op}}^2}{|a_k|}\,\mathrm{Brier}(\mathbf{y}_k). \tag{55}$$

###### Proof\.

By the chain rule together with the second identity of Lemma [5](https://arxiv.org/html/2605.30651#Thmlemma5), $\partial \ell_t/\partial \mathbf{h}_t = \mathbf{W}_{\mathrm{out}}^\top \boldsymbol{\delta}_t$, hence

$$\nabla_{\boldsymbol{\theta}_{\mathrm{rest}}}\ell_k \;=\; \mathbf{J}_H(\mathbf{y}_k)^\top\,\mathrm{vec}\!\left(\tfrac{1}{|a_k|}\boldsymbol{\Delta}\,\mathbf{W}_{\mathrm{out}}\right). \tag{56}$$

Then

$$\begin{aligned}
\|\nabla_{\boldsymbol{\theta}_{\mathrm{rest}}}\ell_k\|^2
&\;\stackrel{(a)}{\leq}\; \lambda_{\max}\bigl(\mathbf{J}_H(\mathbf{y}_k)\,\mathbf{J}_H(\mathbf{y}_k)^\top\bigr)\left\|\mathrm{vec}\!\left(\tfrac{1}{|a_k|}\boldsymbol{\Delta}\,\mathbf{W}_{\mathrm{out}}\right)\right\|^2 \\
&\;\stackrel{(b)}{\leq}\; \frac{\Lambda}{|a_k|^2}\,\|\boldsymbol{\Delta}\,\mathbf{W}_{\mathrm{out}}\|_F^2 \\
&\;\stackrel{(c)}{\leq}\; \frac{\Lambda\,\|\mathbf{W}_{\mathrm{out}}\|_{\mathrm{op}}^2}{|a_k|^2}\,\|\boldsymbol{\Delta}\|_F^2 \\
&\;\stackrel{(d)}{=}\; \frac{\Lambda\,\|\mathbf{W}_{\mathrm{out}}\|_{\mathrm{op}}^2}{|a_k|}\,\mathrm{Brier}(\mathbf{y}_k),
\end{aligned} \tag{57}$$

where (a) uses $\|\mathbf{A}^\top \mathbf{v}\|^2 \leq \lambda_{\max}(\mathbf{A}\mathbf{A}^\top)\|\mathbf{v}\|^2$ with $\mathbf{A} = \mathbf{J}_H(\mathbf{y}_k)$; (b) applies Assumption [B.2](https://arxiv.org/html/2605.30651#A2.Thmappassumption2) together with the identity $\|\mathrm{vec}(\mathbf{M})\|^2 = \|\mathbf{M}\|_F^2$; (c) uses the submultiplicativity $\|\boldsymbol{\Delta}\,\mathbf{W}_{\mathrm{out}}\|_F \leq \|\mathbf{W}_{\mathrm{out}}\|_{\mathrm{op}}\|\boldsymbol{\Delta}\|_F$; and (d) substitutes $\|\boldsymbol{\Delta}\|_F^2 = |a_k|\,\mathrm{Brier}(\mathbf{y}_k)$ from ([43](https://arxiv.org/html/2605.30651#A2.E43)). ∎

#### B.3.2 Taylor decomposition of $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})$

The next lemma packages the algebraic preliminaries for the linearization argument: the closed form of the exact directional derivative $g_k^{\ast}$, the definition of the forward-pass surrogate $\hat{g}_k$, and the resulting decomposition of the change in $\rho$ into a linear term plus two residuals.

###### Lemma 8 (Taylor decomposition).

Let $N(\mathbf{q}) \triangleq \|\sum_k q_k \mathbf{g}_k\|^2 = \mathbf{q}^\top G\,\mathbf{q}$ and $D(\mathbf{q}) \triangleq \sum_k q_k \ell_k = \boldsymbol{\ell}^\top \mathbf{q}$, so that $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) = N(\mathbf{q})/D(\mathbf{q})$. Then:

- (i) (*Closed form of $g_k^{\ast}$.*) The exact directional derivative at $\mathbf{p}$ satisfies

  $$g_k^{\ast} \;=\; \left.\frac{\partial}{\partial q_k}\frac{N(\mathbf{q})}{D(\mathbf{q})}\right|_{\mathbf{q}=\mathbf{p}} \;=\; \frac{2\,\mathbf{g}_k^\top \bar{\mathbf{g}}}{\bar{\ell}} \;-\; \frac{\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p})\,\ell_k}{\bar{\ell}}. \tag{58}$$

- (ii) (*Forward-pass surrogate.*) Applying the diagonal approximations $\mathbf{g}_k^\top \bar{\mathbf{g}} \approx \|\mathbf{g}_k\|^2/K$ and $\|\bar{\mathbf{g}}\|^2 \approx \tfrac{1}{K^2}\sum_i \|\mathbf{g}_i\|^2$ in ([58](https://arxiv.org/html/2605.30651#A2.E58)), and substituting the proxy $\hat{\rho}_i$ for $\rho_i^{\ast} = \|\mathbf{g}_i\|^2/\ell_i$, gives

  $$\hat{g}_k \;=\; \frac{\ell_k}{\sum_i \ell_i}\!\left(2\hat{\rho}_k - \frac{\sum_i \hat{\rho}_i\,\ell_i}{\sum_i \ell_i}\right). \tag{59}$$

- (iii) (*Decomposition.*) Setting $\boldsymbol{\eta} \triangleq \hat{\mathbf{g}} - \mathbf{g}^{\ast}$, the change in $\rho$ admits the decomposition

  $$\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) - \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p}) \;=\; \langle \mathbf{q} - \mathbf{p},\,\hat{\mathbf{g}} \rangle \;+\; R_1 \;+\; R_2, \tag{60}$$

  where

  $$R_1 \;\triangleq\; -\langle \mathbf{q} - \mathbf{p},\,\boldsymbol{\eta} \rangle, \qquad R_2 \;\triangleq\; \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) - \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p}) - \langle \mathbf{q} - \mathbf{p},\,\mathbf{g}^{\ast} \rangle \tag{61}$$

  are the surrogate substitution error and the second-order Taylor remainder, respectively.

###### Proof\.

For (i), apply the quotient rule to $N(\mathbf{q})/D(\mathbf{q})$:

$$\frac{\partial}{\partial q_k}\frac{N(\mathbf{q})}{D(\mathbf{q})} \;=\; \frac{D(\mathbf{q})\cdot 2\,\mathbf{g}_k^\top \sum_i q_i \mathbf{g}_i - N(\mathbf{q})\,\ell_k}{D(\mathbf{q})^2}. \tag{62}$$

Evaluating at $\mathbf{q}=\mathbf{p}$ with $p_k = 1/K$ gives $\sum_i p_i \mathbf{g}_i = \bar{\mathbf{g}}$, $D(\mathbf{p}) = \bar{\ell}$, and $N(\mathbf{p})/D(\mathbf{p}) = \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p})$, yielding ([58](https://arxiv.org/html/2605.30651#A2.E58)).

For (ii), substitute $\mathbf{g}_k^\top \bar{\mathbf{g}} \approx \|\mathbf{g}_k\|^2/K$ and $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p}) = \|\bar{\mathbf{g}}\|^2/\bar{\ell} \approx \tfrac{1}{K^2\bar{\ell}}\sum_i \|\mathbf{g}_i\|^2$ into ([58](https://arxiv.org/html/2605.30651#A2.E58)), and replace each $\|\mathbf{g}_i\|^2 = \rho_i^*\,\ell_i$ by $\hat{\rho}_i\,\ell_i$. Using $K\bar{\ell} = \sum_i \ell_i$ and simplifying gives ([59](https://arxiv.org/html/2605.30651#A2.E59)).

For (iii), expand $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) - \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p}) = \langle \mathbf{q}-\mathbf{p},\,\mathbf{g}^*\rangle + R_2$ by definition of $R_2$, and add and subtract $\langle \mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle$ to obtain ([60](https://arxiv.org/html/2605.30651#A2.E60)) with $R_1$ as in ([61](https://arxiv.org/html/2605.30651#A2.E61)). ∎
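Parts (i) and (ii) can be sanity-checked numerically. The sketch below is illustrative only and is not taken from the paper's code: the random gradients, the losses, and the function names `rho`, `exact_direction`, and `surrogate_direction` are all assumptions. It compares the closed form (58) against central finite differences of $\rho$ at the uniform prior, and checks that the surrogate (59), fed the exact rates $\rho_i^*$, coincides with $g_k^*$ when the gradients are mutually orthogonal (so that the diagonal approximations are exact).

```python
import random

def rho(q, grads, losses):
    """rho(q) = ||sum_k q_k g_k||^2 / sum_k q_k l_k, i.e. N(q) / D(q)."""
    dim = len(grads[0])
    mix = [sum(q[k] * grads[k][d] for k in range(len(q))) for d in range(dim)]
    num = sum(m * m for m in mix)
    den = sum(qk * lk for qk, lk in zip(q, losses))
    return num / den

def exact_direction(grads, losses):
    """g*_k from (58), evaluated at the uniform prior p_k = 1/K."""
    K, dim = len(grads), len(grads[0])
    g_bar = [sum(g[d] for g in grads) / K for d in range(dim)]
    l_bar = sum(losses) / K
    rho_p = sum(x * x for x in g_bar) / l_bar
    out = []
    for k in range(K):
        dot = sum(grads[k][d] * g_bar[d] for d in range(dim))
        out.append(2.0 * dot / l_bar - rho_p * losses[k] / l_bar)
    return out

def surrogate_direction(rho_hat, losses):
    """g_hat_k from (59): uses only per-trajectory proxies and losses."""
    total = sum(losses)
    weighted = sum(r * l for r, l in zip(rho_hat, losses)) / total
    return [l / total * (2.0 * r - weighted) for r, l in zip(rho_hat, losses)]

random.seed(0)
K, dim = 4, 6  # hypothetical sizes
grads = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(K)]
losses = [random.uniform(0.5, 2.0) for _ in range(K)]
p = [1.0 / K] * K

# Central finite differences of rho at p, one coordinate at a time.
h = 1e-6
fd = []
for k in range(K):
    up = p[:]
    up[k] += h
    dn = p[:]
    dn[k] -= h
    fd.append((rho(up, grads, losses) - rho(dn, grads, losses)) / (2 * h))

g_star = exact_direction(grads, losses)
max_fd_gap = max(abs(a - b) for a, b in zip(fd, g_star))

# Orthogonal gradients make G diagonal, so (59) with exact rates matches (58).
ortho = [[(1.0 + k) if d == k else 0.0 for d in range(dim)] for k in range(K)]
rho_star = [sum(x * x for x in g) / l for g, l in zip(ortho, losses)]
max_diag_gap = max(abs(a - b) for a, b in zip(
    surrogate_direction(rho_star, losses), exact_direction(ortho, losses)))
```

For generic, non-orthogonal gradients the two directions differ; that gap is the quantity $\boldsymbol{\eta}$ entering $R_1$.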

#### B.3.3 Forward-computable bound on the gradient Gram matrix

The next lemma upgrades the per-trajectory bound of Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1) to a uniform, forward-computable bound on every entry of the gradient Gram matrix $G$. It is the main tool that lets us replace gradient-dependent quantities (such as $\mathbf{g}_k^\top \bar{\mathbf{g}}$, $\|\bar{\mathbf{g}}\|^2$, and $\|G\|_{\mathrm{op}}$) by forward-only quantities throughout the proofs of Lemma [2](https://arxiv.org/html/2605.30651#Thmlemma2) and Theorem [1](https://arxiv.org/html/2605.30651#Thmtheorem1).

###### Lemma 9 (Gram-matrix entry bound).

Assume Assumptions [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1)–[B.3](https://arxiv.org/html/2605.30651#A2.Thmappassumption3), and let

$$E_+ \;\triangleq\; C_+^2 \max_{i\in[K]} \bigl(1 + \varepsilon_{\mathrm{head}}(|a_i|)\bigr), \qquad \varepsilon_{\mathrm{head}}(|a_i|) \;=\; \frac{\Lambda\,\|\mathbf{W}_{\mathrm{out}}\|_{\mathrm{op}}^2}{C_+^2\,|a_i|}, \tag{63}$$

so that Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1) yields the uniform inequality $\rho_i^* \leq E_+\,\hat{\rho}_i$ for every $i\in[K]$. Then for every $i,j\in[K]$,

$$|G_{ij}| \;\leq\; \sqrt{G_{ii}\,G_{jj}} \;=\; \sqrt{\rho_i^*\,\rho_j^*\,\ell_i\,\ell_j} \;\leq\; E_+\sqrt{\hat{\rho}_i\,\hat{\rho}_j\,\ell_i\,\ell_j}. \tag{64}$$

###### Proof\.

We have

$$\begin{aligned}
|G_{ij}| &\;\overset{(a)}{\leq}\; \sqrt{G_{ii}\,G_{jj}} \\
&\;\overset{(b)}{=}\; \sqrt{\rho_i^*\,\rho_j^*\,\ell_i\,\ell_j} \\
&\;\overset{(c)}{\leq}\; E_+\sqrt{\hat{\rho}_i\,\hat{\rho}_j\,\ell_i\,\ell_j},
\end{aligned} \tag{65}$$

where (a) is the entry-wise Cauchy–Schwarz bound for inner products, $|\langle \mathbf{g}_i, \mathbf{g}_j\rangle| \leq \|\mathbf{g}_i\|\,\|\mathbf{g}_j\|$; (b) uses $G_{ii} = \|\mathbf{g}_i\|^2 = \rho_i^*\,\ell_i$ from the definition $\rho_i^* = \|\mathbf{g}_i\|^2/\ell_i$; and (c) applies the upper part of Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1) entry-wise to each of $\rho_i^*$ and $\rho_j^*$. ∎
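As a rough numerical illustration of (64), the sketch below builds a Gram matrix from random vectors and compares each entry with the forward-computable bound. Everything here is an assumption made for illustration: the gradients and losses are synthetic, `E_plus = 1.5` is an arbitrary constant, and `rho_hat` is a made-up proxy constructed only so that $\rho_i^* \leq E_+\hat{\rho}_i$ holds; the paper's actual proxy is defined elsewhere.

```python
import math
import random

random.seed(1)
K, dim = 5, 8  # hypothetical sizes
grads = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(K)]
losses = [random.uniform(0.5, 2.0) for _ in range(K)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Gram matrix G_ij = <g_i, g_j> and exact rates rho*_i = ||g_i||^2 / l_i.
G = [[dot(gi, gj) for gj in grads] for gi in grads]
rho_star = [G[i][i] / losses[i] for i in range(K)]

# Hypothetical proxy chosen so that rho*_i <= E_plus * rho_hat_i.
E_plus = 1.5
rho_hat = [r / E_plus * random.uniform(1.05, 1.3) for r in rho_star]

def entry_bound(i, j):
    """Right-hand side of (64), computed from proxies and losses only."""
    return E_plus * math.sqrt(rho_hat[i] * rho_hat[j] * losses[i] * losses[j])

# Smallest gap between the bound and |G_ij| over all entries.
worst_slack = min(entry_bound(i, j) - abs(G[i][j])
                  for i in range(K) for j in range(K))
```

The bound is tightest on the diagonal, where Cauchy–Schwarz holds with equality and only the proxy slack remains.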

#### B.3.4 Hessian of $\rho$ on the simplex

###### Lemma 10 (Hessian formula).

On the relative interior of $\Delta^K$, the Hessian of $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) = N(\mathbf{q})/D(\mathbf{q})$ in $\mathbf{q}$ is

$$\nabla^2_{\mathbf{q}}\,\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) \;=\; \frac{1}{D(\mathbf{q})}\Bigl[\,2\,G \;-\; \boldsymbol{\ell}\,(\nabla\rho)^\top \;-\; (\nabla\rho)\,\boldsymbol{\ell}^\top\,\Bigr], \tag{66}$$

where $\nabla\rho = (2\,G\mathbf{q} - \rho\,\boldsymbol{\ell})/D(\mathbf{q})$.

###### Proof\.

The quotient rule applied to $\rho = N/D$ gives $\nabla\rho = (\nabla N - \rho\,\nabla D)/D = (2\,G\mathbf{q} - \rho\,\boldsymbol{\ell})/D$, or equivalently $D\,\nabla\rho = 2\,G\mathbf{q} - \rho\,\boldsymbol{\ell}$. Differentiating both sides with respect to $\mathbf{q}$, using $\nabla D = \boldsymbol{\ell}$ and the product rule $\nabla(D\,\nabla\rho) = \boldsymbol{\ell}\,(\nabla\rho)^\top + D\,\nabla^2\rho$, gives

$$\boldsymbol{\ell}\,(\nabla\rho)^\top + D\,\nabla^2\rho \;=\; 2\,G \;-\; (\nabla\rho)\,\boldsymbol{\ell}^\top. \tag{67}$$

Solving for $\nabla^2\rho$, and noting that the Hessian is symmetric (so the two rank-one corrections enter in symmetrized form regardless of the Jacobian convention), yields ([66](https://arxiv.org/html/2605.30651#A2.E66)). ∎
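The Hessian formula (66) can be checked against finite differences of the gradient. The following is a minimal sketch under assumed synthetic data (random gradients and losses, and an arbitrary interior point `q`); the helper names `rho`, `grad_rho`, and `hessian_formula` are hypothetical.

```python
import random

random.seed(2)
K, dim = 3, 5  # hypothetical sizes
grads = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(K)]
losses = [random.uniform(0.5, 2.0) for _ in range(K)]
G = [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]

def D(q):
    return sum(qi * li for qi, li in zip(q, losses))

def rho(q):
    num = sum(q[i] * G[i][j] * q[j] for i in range(K) for j in range(K))
    return num / D(q)

def grad_rho(q):
    """grad rho = (2 G q - rho * l) / D(q)."""
    r, d = rho(q), D(q)
    return [(2.0 * sum(G[i][j] * q[j] for j in range(K)) - r * losses[i]) / d
            for i in range(K)]

def hessian_formula(q):
    """Formula (66): (2 G - l grad^T - grad l^T) / D(q)."""
    g, d = grad_rho(q), D(q)
    return [[(2.0 * G[i][j] - losses[i] * g[j] - g[i] * losses[j]) / d
             for j in range(K)] for i in range(K)]

q = [0.5, 0.3, 0.2]  # a point in the relative interior of the simplex
h = 1e-5
H = hessian_formula(q)
max_gap = 0.0
for j in range(K):
    up = q[:]
    up[j] += h
    dn = q[:]
    dn[j] -= h
    gu, gd = grad_rho(up), grad_rho(dn)
    for i in range(K):
        fd = (gu[i] - gd[i]) / (2 * h)  # central difference of d(grad_i)/dq_j
        max_gap = max(max_gap, abs(fd - H[i][j]))

# The closed form is symmetric by construction.
asymmetry = max(abs(H[i][j] - H[j][i]) for i in range(K) for j in range(K))
```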

### B.4 Proof of Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1) (Forward-pass $\rho_k^*$ Estimation)

This subsection states and proves the formal version of Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1). The informal version in Section [4.2](https://arxiv.org/html/2605.30651#S4.SS2) asserts the existence of absolute positive constants $C_1, C_2$ with $\hat{\rho}_k \in [C_1\,\rho_k^*,\,C_2\,\rho_k^*]$. The formal statement below makes the constants $C_1, C_2$ explicit in terms of the structural quantities of Assumptions [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1)–[B.3](https://arxiv.org/html/2605.30651#A2.Thmappassumption3).

###### Lemma 11 (Formal version of Lemma [1](https://arxiv.org/html/2605.30651#Thmlemma1)).

Under Assumptions [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1)–[B.3](https://arxiv.org/html/2605.30651#A2.Thmappassumption3), for every candidate trajectory $\mathbf{y}_k$,

$$C_1\,\rho_k^* \;\leq\; \hat{\rho}_k \;\leq\; C_2\,\rho_k^*, \tag{68}$$

where

$$C_1 \;\triangleq\; \frac{1}{E_+}, \qquad C_2 \;\triangleq\; \frac{1}{\zeta_-\,C_-^2}, \qquad E_+ \;\triangleq\; C_+^2 \max_{i\in[K]} \bigl(1 + \varepsilon_{\mathrm{head}}(|a_i|)\bigr), \tag{69}$$

and the LM-head correction term is

$$\varepsilon_{\mathrm{head}}(|a_i|) \;\triangleq\; \frac{\Lambda\,\|\mathbf{W}_{\mathrm{out}}\|_{\mathrm{op}}^2}{C_+^2\,|a_i|}. \tag{70}$$

###### Proof\.

Since $\boldsymbol{\theta} = (\mathbf{W}_{\mathrm{out}},\,\boldsymbol{\theta}_{\mathrm{rest}})$ splits the parameters into two disjoint blocks,

$$\|\nabla_{\boldsymbol{\theta}}\,\ell_k\|^2 \;=\; \|\nabla_{\mathbf{W}_{\mathrm{out}}}\,\ell_k\|_F^2 \;+\; \|\nabla_{\boldsymbol{\theta}_{\mathrm{rest}}}\,\ell_k\|^2. \tag{71}$$

Using the equivalent form $\hat{\rho}_k = \mathrm{Brier}(\mathbf{y}_k)/\ell_k$ from ([44](https://arxiv.org/html/2605.30651#A2.E44)) together with $\rho_k^* = \|\nabla_{\boldsymbol{\theta}}\,\ell_k\|^2/\ell_k$ from Section [4.2](https://arxiv.org/html/2605.30651#S4.SS2), it suffices to relate $\|\nabla_{\boldsymbol{\theta}}\,\ell_k\|^2$ to $\mathrm{Brier}(\mathbf{y}_k)$.

Upper bound on $\hat{\rho}_k$.

$$\begin{aligned}
\rho_k^* &\;\overset{(a)}{=}\; \frac{\|\nabla_{\boldsymbol{\theta}}\,\ell_k\|^2}{\ell_k} \\
&\;\overset{(b)}{\geq}\; \frac{\|\nabla_{\mathbf{W}_{\mathrm{out}}}\,\ell_k\|_F^2}{\ell_k} \\
&\;\overset{(c)}{\geq}\; \frac{\zeta_-\,C_-^2\,\mathrm{Brier}(\mathbf{y}_k)}{\ell_k} \\
&\;\overset{(d)}{=}\; \zeta_-\,C_-^2\,\hat{\rho}_k,
\end{aligned} \tag{72}$$

where (a) is the definition $\rho_k^* = \|\nabla_{\boldsymbol{\theta}}\,\ell_k\|^2/\ell_k$; (b) applies the decomposition ([71](https://arxiv.org/html/2605.30651#A2.E71)) and discards the non-negative backbone term $\|\nabla_{\boldsymbol{\theta}_{\mathrm{rest}}}\,\ell_k\|^2 \geq 0$; (c) applies the lower bound of Lemma [6](https://arxiv.org/html/2605.30651#Thmlemma6); and (d) substitutes $\hat{\rho}_k = \mathrm{Brier}(\mathbf{y}_k)/\ell_k$ from ([44](https://arxiv.org/html/2605.30651#A2.E44)). Therefore,

$$\hat{\rho}_k \;\leq\; \frac{1}{\zeta_-\,C_-^2}\,\rho_k^* \;=\; C_2\,\rho_k^*. \tag{73}$$

Lower bound on $\hat{\rho}_k$.

$$\begin{aligned}
\rho_k^* &\;\overset{(e)}{=}\; \frac{\|\nabla_{\boldsymbol{\theta}}\,\ell_k\|^2}{\ell_k} \\
&\;\overset{(f)}{=}\; \frac{\|\nabla_{\mathbf{W}_{\mathrm{out}}}\,\ell_k\|_F^2 + \|\nabla_{\boldsymbol{\theta}_{\mathrm{rest}}}\,\ell_k\|^2}{\ell_k} \\
&\;\overset{(g)}{\leq}\; \frac{1}{\ell_k}\left(C_+^2\,\mathrm{Brier}(\mathbf{y}_k) \;+\; \frac{\Lambda\,\|\mathbf{W}_{\mathrm{out}}\|_{\mathrm{op}}^2}{|a_k|}\,\mathrm{Brier}(\mathbf{y}_k)\right) \\
&\;\overset{(h)}{=}\; C_+^2\,\bigl(1 + \varepsilon_{\mathrm{head}}(|a_k|)\bigr)\,\hat{\rho}_k \\
&\;\overset{(i)}{\leq}\; E_+\,\hat{\rho}_k,
\end{aligned} \tag{74}$$

where (e) is the definition $\rho_k^* = \|\nabla_{\boldsymbol{\theta}}\,\ell_k\|^2/\ell_k$; (f) applies the decomposition ([71](https://arxiv.org/html/2605.30651#A2.E71)); (g) applies the upper bound of Lemma [6](https://arxiv.org/html/2605.30651#Thmlemma6) to the LM-head term and Lemma [7](https://arxiv.org/html/2605.30651#Thmlemma7) to the backbone term; (h) substitutes $\hat{\rho}_k = \mathrm{Brier}(\mathbf{y}_k)/\ell_k$ from ([44](https://arxiv.org/html/2605.30651#A2.E44)) together with the definition ([70](https://arxiv.org/html/2605.30651#A2.E70)) of $\varepsilon_{\mathrm{head}}(|a_k|)$; and (i) follows from the definition of $E_+$ in ([69](https://arxiv.org/html/2605.30651#A2.E69)). Therefore,

$$\hat{\rho}_k \;\geq\; \frac{1}{E_+}\,\rho_k^* \;=\; C_1\,\rho_k^*. \tag{75}$$

Combining ([73](https://arxiv.org/html/2605.30651#A2.E73)) and ([75](https://arxiv.org/html/2605.30651#A2.E75)) yields ([68](https://arxiv.org/html/2605.30651#A2.E68)). ∎

### B.5 Proof of Lemma [2](https://arxiv.org/html/2605.30651#Thmlemma2) ($\chi^2$ Controls Both Error Terms)

This subsection states and proves the formal version of Lemma [2](https://arxiv.org/html/2605.30651#Thmlemma2). We use the Taylor decomposition in Lemma [8](https://arxiv.org/html/2605.30651#Thmlemma8). Let $\boldsymbol{\eta} \triangleq \hat{\mathbf{g}} - \mathbf{g}^*$. The two residual terms in ([60](https://arxiv.org/html/2605.30651#A2.E60)) are

$$R_1 \;=\; -\langle \mathbf{q}-\mathbf{p},\,\boldsymbol{\eta}\rangle, \qquad R_2 \;=\; \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) - \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p}) - \langle \mathbf{q}-\mathbf{p},\,\mathbf{g}^*\rangle. \tag{76}$$

Since $p_k = 1/K$, we also have

$$\chi^2(\mathbf{q}\,\|\,\mathbf{p}) = \sum_k \frac{(q_k - p_k)^2}{p_k} = K\,\|\mathbf{q}-\mathbf{p}\|^2. \tag{77}$$

Let

$$\begin{aligned}
E_+ &\triangleq C_+^2 \max_{i\in[K]} \bigl(1 + \varepsilon_{\mathrm{head}}(|a_i|)\bigr), && (78) \\
\Delta_\rho &\triangleq \max\Bigl(\bigl|1 - \zeta_- C_-^2\bigr|,\; \max_{i\in[K]} \bigl|C_+^2\bigl(1 + \varepsilon_{\mathrm{head}}(|a_i|)\bigr) - 1\bigr|\Bigr). && (79)
\end{aligned}$$

By Lemma [11](https://arxiv.org/html/2605.30651#Thmlemma11), for all $i\in[K]$, $\rho_i^* \leq E_+\,\hat{\rho}_i$ and $|\hat{\rho}_i - \rho_i^*| \leq \Delta_\rho\,\hat{\rho}_i$. We also define $\hat{\rho}_{\max} \triangleq \max_i \hat{\rho}_i$, $\ell_{\min} \triangleq \min_i \ell_i$, $\ell_{\max} \triangleq \max_i \ell_i$, and $\bar{\ell} \triangleq K^{-1}\sum_i \ell_i$.
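The inequality $|\hat{\rho}_i - \rho_i^*| \leq \Delta_\rho\,\hat{\rho}_i$ is stated without derivation. One way to see it, assuming only the two one-sided bounds established in ([72](https://arxiv.org/html/2605.30651#A2.E72)) and ([74](https://arxiv.org/html/2605.30651#A2.E74)), is the following short argument (a reconstruction, not text from the paper):

```latex
\begin{align*}
&\zeta_- C_-^2\,\hat{\rho}_i \;\le\; \rho_i^* \;\le\; C_+^2\bigl(1+\varepsilon_{\mathrm{head}}(|a_i|)\bigr)\,\hat{\rho}_i \\
\Longrightarrow\quad
&-\bigl(1-\zeta_- C_-^2\bigr)\,\hat{\rho}_i \;\le\; \rho_i^* - \hat{\rho}_i \;\le\; \bigl(C_+^2(1+\varepsilon_{\mathrm{head}}(|a_i|)) - 1\bigr)\,\hat{\rho}_i \\
\Longrightarrow\quad
&|\rho_i^* - \hat{\rho}_i| \;\le\; \max\Bigl(\bigl|1-\zeta_- C_-^2\bigr|,\;\bigl|C_+^2(1+\varepsilon_{\mathrm{head}}(|a_i|)) - 1\bigr|\Bigr)\,\hat{\rho}_i \;\le\; \Delta_\rho\,\hat{\rho}_i .
\end{align*}
```

The last step uses that a number lying in an interval $[a,b]$ has absolute value at most $\max(|a|,|b|)$, followed by the maximum over $i$ in the definition of $\Delta_\rho$.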

###### Lemma 12 (Formal version of Lemma [2](https://arxiv.org/html/2605.30651#Thmlemma2)).

Under Assumptions [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1)–[B.4](https://arxiv.org/html/2605.30651#A2.Thmappassumption4), for every $\mathbf{q}\in\Delta^K$,

$$|R_1| \;\leq\; \alpha_1\,\chi^2(\mathbf{q}\,\|\,\mathbf{p}) + 1, \qquad |R_2| \;\leq\; \alpha_2\,\chi^2(\mathbf{q}\,\|\,\mathbf{p}), \tag{80}$$

where

$$\begin{aligned}
\alpha_1 &= \frac{1}{4}\left(\frac{3\Delta_\rho\,\hat{\rho}_{\max}}{\sqrt{K}} + \frac{2E_+\sqrt{\hat{\rho}_{\max}\,\ell_{\max}\sum_i \hat{\rho}_i\ell_i}}{\sqrt{K}\,\bar{\ell}} + \frac{E_+\,\ell_{\max}\sum_i \hat{\rho}_i\ell_i}{K\bar{\ell}^2}\right)^2, && (81) \\
\alpha_2 &= \frac{E_+\sum_i \hat{\rho}_i\ell_i}{K\,\ell_{\min}}\left(1 + \frac{\sqrt{K}\,\ell_{\max}}{\ell_{\min}}\right)^2. && (82)
\end{aligned}$$

###### Proof\.

We first control $R_1$. By Cauchy–Schwarz and ([77](https://arxiv.org/html/2605.30651#A2.E77)),

$$\begin{aligned}
|R_1| &= \left|\sum_k (q_k - p_k)\,\eta_k\right| \\
&\leq \sqrt{K\,\|\mathbf{q}-\mathbf{p}\|^2}\,\sqrt{\frac{1}{K}\sum_k \eta_k^2} \\
&= \sqrt{\chi^2(\mathbf{q}\,\|\,\mathbf{p})}\;\|\boldsymbol{\eta}\|_{\mathbf{p}},
\end{aligned} \tag{83}$$

where $\|\mathbf{v}\|_{\mathbf{p}}^2 \triangleq K^{-1}\sum_k v_k^2$ under the uniform prior. It remains to bound $\|\boldsymbol{\eta}\|_{\mathbf{p}}$.

Introduce the intermediate vector

$$\tilde{g}_k \triangleq \frac{\ell_k}{\sum_i \ell_i}\left(2\rho_k^* - \frac{\sum_i \rho_i^*\,\ell_i}{\sum_i \ell_i}\right). \tag{84}$$

This vector keeps the exact per-trajectory rates $\rho_i^*$ but uses the same diagonal approximation that leads to $\hat{g}_k$. Therefore,

$$\boldsymbol{\eta} = (\hat{\mathbf{g}} - \tilde{\mathbf{g}}) + (\tilde{\mathbf{g}} - \mathbf{g}^*) \triangleq \boldsymbol{\eta}_{\mathrm{prox}} + \boldsymbol{\eta}_{\mathrm{diag}}. \tag{85}$$

For the proxy-substitution term, subtracting ([84](https://arxiv.org/html/2605.30651#A2.E84)) from ([59](https://arxiv.org/html/2605.30651#A2.E59)) gives

$$\eta_{\mathrm{prox},k} = \frac{\ell_k}{\sum_i \ell_i}\left(2(\hat{\rho}_k - \rho_k^*) - \frac{\sum_i (\hat{\rho}_i - \rho_i^*)\,\ell_i}{\sum_i \ell_i}\right). \tag{86}$$

Using $|\hat{\rho}_i - \rho_i^*| \leq \Delta_\rho\,\hat{\rho}_{\max}$ for all $i$, we obtain

$$|\eta_{\mathrm{prox},k}| \leq \frac{3\Delta_\rho\,\hat{\rho}_{\max}\,\ell_k}{\sum_i \ell_i}. \tag{87}$$

Hence

$$\begin{aligned}
\|\boldsymbol{\eta}_{\mathrm{prox}}\|_{\mathbf{p}}^2 &= \frac{1}{K}\sum_k |\eta_{\mathrm{prox},k}|^2 \\
&\leq \frac{(3\Delta_\rho\,\hat{\rho}_{\max})^2}{K}\,\frac{\sum_k \ell_k^2}{(\sum_i \ell_i)^2} \\
&\leq \frac{(3\Delta_\rho\,\hat{\rho}_{\max})^2}{K},
\end{aligned} \tag{88}$$

where the last inequality uses $\sum_k \ell_k^2 \leq (\sum_i \ell_i)^2$, which holds because the losses are non-negative. Thus

$$\|\boldsymbol{\eta}_{\mathrm{prox}}\|_{\mathbf{p}} \leq \frac{3\Delta_\rho\,\hat{\rho}_{\max}}{\sqrt{K}}. \tag{89}$$

We next bound the diagonal-approximation error. Define

$$o_k \triangleq \mathbf{g}_k^\top \bar{\mathbf{g}} - \frac{\|\mathbf{g}_k\|^2}{K} = \frac{1}{K}\sum_{i\neq k} G_{ki}, \qquad O \triangleq \|\bar{\mathbf{g}}\|^2 - \frac{1}{K^2}\sum_i \|\mathbf{g}_i\|^2 = \frac{1}{K^2}\sum_{i\neq j} G_{ij}. \tag{90}$$

Combining the closed form of $g_k^*$ in Lemma [8](https://arxiv.org/html/2605.30651#Thmlemma8)(i) with ([84](https://arxiv.org/html/2605.30651#A2.E84)) gives

$$\eta_{\mathrm{diag},k} = \tilde{g}_k - g_k^* = -\frac{2\,o_k}{\bar{\ell}} + \frac{O\,\ell_k}{\bar{\ell}^2}. \tag{91}$$

By Lemma [9](https://arxiv.org/html/2605.30651#Thmlemma9), $|G_{ij}| \leq E_+\sqrt{\hat{\rho}_i\,\hat{\rho}_j\,\ell_i\,\ell_j}$. Therefore,

$$\begin{aligned}
K\,|o_k| &\leq \sum_{i\neq k} |G_{ki}| \\
&\leq E_+\sqrt{\hat{\rho}_k\ell_k}\,\sum_{i\neq k}\sqrt{\hat{\rho}_i\ell_i} \\
&\leq E_+\sqrt{\hat{\rho}_k\ell_k}\,\sqrt{K\sum_i \hat{\rho}_i\ell_i}.
\end{aligned} \tag{92}$$

Taking the maximum over $k$ yields

$$\max_k |o_k| \leq \frac{E_+\sqrt{\hat{\rho}_{\max}\,\ell_{\max}\sum_i \hat{\rho}_i\ell_i}}{\sqrt{K}}. \tag{93}$$

Similarly,

$$\begin{aligned}
K^2\,|O| &\leq \sum_{i\neq j} |G_{ij}| \\
&\leq E_+\left(\sum_i \sqrt{\hat{\rho}_i\ell_i}\right)^2 \\
&\leq E_+\,K\sum_i \hat{\rho}_i\ell_i,
\end{aligned} \tag{94}$$

and therefore

$$|O| \leq \frac{E_+\sum_i \hat{\rho}_i\ell_i}{K}. \tag{95}$$

Using ([91](https://arxiv.org/html/2605.30651#A2.E91)), ([93](https://arxiv.org/html/2605.30651#A2.E93)), ([95](https://arxiv.org/html/2605.30651#A2.E95)), and $\|\mathbf{v}\|_{\mathbf{p}} \leq \max_k |v_k|$, we get

$$\begin{aligned}
\|\boldsymbol{\eta}_{\mathrm{diag}}\|_{\mathbf{p}} &\leq \frac{2\max_k |o_k|}{\bar{\ell}} + \frac{|O|\,\ell_{\max}}{\bar{\ell}^2} \\
&\leq \frac{2E_+\sqrt{\hat{\rho}_{\max}\,\ell_{\max}\sum_i \hat{\rho}_i\ell_i}}{\sqrt{K}\,\bar{\ell}} + \frac{E_+\,\ell_{\max}\sum_i \hat{\rho}_i\ell_i}{K\bar{\ell}^2}.
\end{aligned} \tag{96}$$

Combining ([89](https://arxiv.org/html/2605.30651#A2.E89)) and ([96](https://arxiv.org/html/2605.30651#A2.E96)) gives

$$\|\boldsymbol{\eta}\|_{\mathbf{p}} \leq \alpha_1^{\mathrm{rms}}, \tag{97}$$

where

$$\alpha_1^{\mathrm{rms}} \triangleq \frac{3\Delta_\rho\,\hat{\rho}_{\max}}{\sqrt{K}} + \frac{2E_+\sqrt{\hat{\rho}_{\max}\,\ell_{\max}\sum_i \hat{\rho}_i\ell_i}}{\sqrt{K}\,\bar{\ell}} + \frac{E_+\,\ell_{\max}\sum_i \hat{\rho}_i\ell_i}{K\bar{\ell}^2}. \tag{98}$$

If $\alpha_1^{\mathrm{rms}} = 0$, then $\boldsymbol{\eta} = \mathbf{0}$ and hence $R_1 = 0$, so the desired bound is trivial. Otherwise, applying Young's inequality with $a = \sqrt{\chi^2(\mathbf{q}\,\|\,\mathbf{p})}$, $b = \|\boldsymbol{\eta}\|_{\mathbf{p}}$, and $\gamma = (\alpha_1^{\mathrm{rms}})^2/2$ gives

$$\begin{aligned}
|R_1| &\leq \sqrt{\chi^2(\mathbf{q}\,\|\,\mathbf{p})}\;\|\boldsymbol{\eta}\|_{\mathbf{p}} \\
&\leq \frac{\gamma}{2}\,\chi^2(\mathbf{q}\,\|\,\mathbf{p}) + \frac{1}{2\gamma}\,\|\boldsymbol{\eta}\|_{\mathbf{p}}^2 \\
&\leq \frac{(\alpha_1^{\mathrm{rms}})^2}{4}\,\chi^2(\mathbf{q}\,\|\,\mathbf{p}) + 1,
\end{aligned} \tag{99}$$

where the last step uses $\tfrac{1}{2\gamma}\|\boldsymbol{\eta}\|_{\mathbf{p}}^2 = \|\boldsymbol{\eta}\|_{\mathbf{p}}^2/(\alpha_1^{\mathrm{rms}})^2 \leq 1$ by ([97](https://arxiv.org/html/2605.30651#A2.E97)). This proves the first bound in ([80](https://arxiv.org/html/2605.30651#A2.E80)), with $\alpha_1 = (\alpha_1^{\mathrm{rms}})^2/4$, which is exactly ([81](https://arxiv.org/html/2605.30651#A2.E81)).
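The two inequalities behind the $R_1$ bound, the Cauchy–Schwarz step (83) and the Young step (99), can be spot-checked on random inputs. This is a sketch under assumptions: the error vector `eta` is synthetic noise rather than an actual $\hat{\mathbf{g}} - \mathbf{g}^*$, `alpha_rms` is simply taken to be any upper bound on $\|\boldsymbol{\eta}\|_{\mathbf{p}}$, and the helper names are hypothetical.

```python
import math
import random

def chi2_uniform(q):
    """chi^2(q || p) for the uniform prior p_k = 1/K, as in (77)."""
    K = len(q)
    return sum((qk - 1.0 / K) ** 2 / (1.0 / K) for qk in q)

def weighted_norm(v):
    """||v||_p, defined by ||v||_p^2 = (1/K) * sum_k v_k^2."""
    return math.sqrt(sum(x * x for x in v) / len(v))

def young_bound(chi2, alpha_rms):
    """Right-hand side of (99): (alpha_rms^2 / 4) * chi^2 + 1."""
    return 0.25 * alpha_rms ** 2 * chi2 + 1.0

random.seed(3)
K = 6  # hypothetical number of candidate trajectories
checks = []
for _ in range(200):
    raw = [random.random() for _ in range(K)]
    total = sum(raw)
    q = [r / total for r in raw]                   # random point on the simplex
    eta = [random.gauss(0, 2) for _ in range(K)]   # synthetic error vector
    r1 = -sum((qk - 1.0 / K) * e for qk, e in zip(q, eta))
    chi2 = chi2_uniform(q)
    cs = math.sqrt(chi2) * weighted_norm(eta)      # Cauchy-Schwarz step (83)
    alpha_rms = 1.1 * weighted_norm(eta)           # any upper bound on ||eta||_p
    checks.append(abs(r1) <= cs + 1e-12
                  and cs <= young_bound(chi2, alpha_rms) + 1e-12)
all_hold = all(checks)
```

The additive constant $1$ in (99) is what later becomes $C = -1$ in the final lower bound; a different choice of $\gamma$ would trade a larger $\alpha_1$ for a smaller constant.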

It remains to control $R_2$. By Assumption [B.4](https://arxiv.org/html/2605.30651#A2.Thmappassumption4), $D(\mathbf{q}) = \boldsymbol{\ell}^\top\mathbf{q} \geq \ell_{\min} > 0$ on $\Delta^K$, so $\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})$ is twice continuously differentiable on the simplex. The second-order Taylor remainder gives

$$|R_2| \leq \frac{M}{2}\,\|\mathbf{q}-\mathbf{p}\|^2 = \frac{M}{2K}\,\chi^2(\mathbf{q}\,\|\,\mathbf{p}), \qquad M \triangleq \sup_{\mathbf{q}\in\Delta^K} \bigl\|\nabla_{\mathbf{q}}^2\,\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})\bigr\|_{\mathrm{op}}. \tag{100}$$

We now bound $M$. By Lemma [10](https://arxiv.org/html/2605.30651#Thmlemma10),

$$\nabla_{\mathbf{q}}^2\,\rho = \frac{1}{D(\mathbf{q})}\left[2G - \boldsymbol{\ell}\,(\nabla\rho)^\top - (\nabla\rho)\,\boldsymbol{\ell}^\top\right], \tag{101}$$

which implies

$$\|\nabla_{\mathbf{q}}^2\,\rho\|_{\mathrm{op}} \leq \frac{2\,\|G\|_{\mathrm{op}} + 2\,\|\boldsymbol{\ell}\|\,\|\nabla\rho\|}{D(\mathbf{q})}. \tag{102}$$

For any $\mathbf{q}\in\Delta^K$,

$$\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) = \frac{\mathbf{q}^\top G\,\mathbf{q}}{D(\mathbf{q})} \leq \frac{\|G\|_{\mathrm{op}}\,\|\mathbf{q}\|^2}{\ell_{\min}} \leq \frac{\|G\|_{\mathrm{op}}}{\ell_{\min}}, \tag{103}$$

where we used $\|\mathbf{q}\|^2 \leq \sum_k q_k = 1$. Since $\nabla\rho = (2G\mathbf{q} - \rho\,\boldsymbol{\ell})/D(\mathbf{q})$,

$$\|\nabla\rho\| \leq \frac{2\,\|G\|_{\mathrm{op}}\,\|\mathbf{q}\| + \rho\,\|\boldsymbol{\ell}\|}{D(\mathbf{q})} \leq \frac{\|G\|_{\mathrm{op}}}{\ell_{\min}}\left(2 + \frac{\|\boldsymbol{\ell}\|}{\ell_{\min}}\right). \tag{104}$$

Substituting ([104](https://arxiv.org/html/2605.30651#A2.E104)) into ([102](https://arxiv.org/html/2605.30651#A2.E102)) and using $D(\mathbf{q}) \geq \ell_{\min}$ gives

$$\|\nabla_{\mathbf{q}}^2\,\rho\|_{\mathrm{op}} \leq \frac{2\,\|G\|_{\mathrm{op}}}{\ell_{\min}}\left(1 + \frac{\|\boldsymbol{\ell}\|}{\ell_{\min}}\right)^2. \tag{105}$$

By Lemma [9](https://arxiv.org/html/2605.30651#Thmlemma9),

$$\|G\|_{\mathrm{op}} \leq \|G\|_F \leq E_+\sqrt{\sum_{i,j} \hat{\rho}_i\,\hat{\rho}_j\,\ell_i\,\ell_j} = E_+\sum_i \hat{\rho}_i\ell_i. \tag{106}$$

Together with $\|\boldsymbol{\ell}\| \leq \sqrt{K}\,\ell_{\max}$, this yields

$$M \leq \frac{2E_+\sum_i \hat{\rho}_i\ell_i}{\ell_{\min}}\left(1 + \frac{\sqrt{K}\,\ell_{\max}}{\ell_{\min}}\right)^2. \tag{107}$$

Substituting ([107](https://arxiv.org/html/2605.30651#A2.E107)) into ([100](https://arxiv.org/html/2605.30651#A2.E100)) gives

$$|R_2| \leq \alpha_2\,\chi^2(\mathbf{q}\,\|\,\mathbf{p}),$$

with $\alpha_2$ as in ([82](https://arxiv.org/html/2605.30651#A2.E82)). This proves the second bound in ([80](https://arxiv.org/html/2605.30651#A2.E80)) and completes the proof. ∎
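The curvature argument can be illustrated numerically by comparing the actual Taylor remainder $R_2$ with the bound $\tfrac{M}{2K}\chi^2(\mathbf{q}\,\|\,\mathbf{p})$, using the estimate (105) for $M$ with $\|G\|_{\mathrm{op}}$ replaced by the larger Frobenius norm. This is a sketch on assumed synthetic gradients and losses; it deliberately stops before the proxy-based step (106), since no real proxy $\hat{\rho}_i$ is available here.

```python
import math
import random

random.seed(4)
K, dim = 4, 6  # hypothetical sizes
grads = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(K)]
losses = [random.uniform(0.5, 2.0) for _ in range(K)]
G = [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]
p = [1.0 / K] * K

def D(q):
    return sum(qi * li for qi, li in zip(q, losses))

def rho(q):
    num = sum(q[i] * G[i][j] * q[j] for i in range(K) for j in range(K))
    return num / D(q)

def grad_rho(q):
    r, d = rho(q), D(q)
    return [(2.0 * sum(G[i][j] * q[j] for j in range(K)) - r * losses[i]) / d
            for i in range(K)]

# Curvature bound (105), with ||G||_op replaced by the larger ||G||_F.
G_fro = math.sqrt(sum(x * x for row in G for x in row))
l_min = min(losses)
l_norm = math.sqrt(sum(l * l for l in losses))
M_bound = 2.0 * G_fro / l_min * (1.0 + l_norm / l_min) ** 2

g_star = grad_rho(p)  # equals the closed form (58) at the uniform prior
violations = 0
for _ in range(500):
    raw = [random.random() for _ in range(K)]
    total = sum(raw)
    q = [r / total for r in raw]
    diff = [qk - pk for qk, pk in zip(q, p)]
    r2 = rho(q) - rho(p) - sum(d * g for d, g in zip(diff, g_star))
    chi2 = K * sum(d * d for d in diff)  # identity (77)
    if abs(r2) > M_bound / (2.0 * K) * chi2 + 1e-12:
        violations += 1
```

In practice the bound is typically quite loose, because it uses worst-case norms over the whole simplex.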

### B.6 Proof of Theorem [1](https://arxiv.org/html/2605.30651#Thmtheorem1) (LARK Objective)

This subsection states and proves the formal version of Theorem [1](https://arxiv.org/html/2605.30651#Thmtheorem1). The informal statement in the main text asserts the existence of finite constants $\alpha > 0$ and $C\in\mathbb{R}$, independent of the optimization variable $\mathbf{q}$, realizing a $\chi^2$-regularized lower bound on the learnability improvement; the formal version below makes both constants explicit in terms of the constants $\alpha_1, \alpha_2$ from Lemma [12](https://arxiv.org/html/2605.30651#Thmlemma12).

###### Theorem 2 (Formal version of Theorem [1](https://arxiv.org/html/2605.30651#Thmtheorem1)).

Under Assumptions [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1)–[B.4](https://arxiv.org/html/2605.30651#A2.Thmappassumption4), for every $\mathbf{q}\in\Delta^K$,

$$\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q}) - \rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p}) \;\geq\; \langle \mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle \;-\; \alpha\,\chi^2(\mathbf{q}\,\|\,\mathbf{p}) \;+\; C, \tag{109}$$

where

$$\alpha \;\triangleq\; \alpha_1 + \alpha_2 \;>\; 0, \qquad C \;\triangleq\; -1, \tag{110}$$

and $\alpha_1, \alpha_2 > 0$ are the explicit constants given by Lemma [12](https://arxiv.org/html/2605.30651#Thmlemma12). In particular, $\alpha$ and $C$ are independent of $\mathbf{q}$.

###### Proof\.

By Lemma [8](https://arxiv.org/html/2605.30651#Thmlemma8)(iii), the change in $\rho$ admits the exact decomposition

$$\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})-\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p})\;=\;\langle\mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle\;+\;R_{1}\;+\;R_{2}. \tag{111}$$

By Lemma [12](https://arxiv.org/html/2605.30651#Thmlemma12), the two residuals satisfy

$$|R_{1}|\;\leq\;\alpha_{1}\,\chi^{2}(\mathbf{q}\,\|\,\mathbf{p})+1,\qquad|R_{2}|\;\leq\;\alpha_{2}\,\chi^{2}(\mathbf{q}\,\|\,\mathbf{p}), \tag{112}$$

where $\alpha_{1},\alpha_{2}>0$ are given by ([81](https://arxiv.org/html/2605.30651#A2.E81)) and ([82](https://arxiv.org/html/2605.30651#A2.E82)), respectively. Combining ([111](https://arxiv.org/html/2605.30651#A2.E111)) and ([112](https://arxiv.org/html/2605.30651#A2.E112)) gives

$$\begin{aligned}
\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{q})-\rho(\boldsymbol{\theta}_{\mathrm{ref}};\,\mathbf{p})
&\;\stackrel{(a)}{=}\;\langle\mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle\;+\;R_{1}\;+\;R_{2}\\
&\;\stackrel{(b)}{\geq}\;\langle\mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle\;-\;|R_{1}|\;-\;|R_{2}|\\
&\;\stackrel{(c)}{\geq}\;\langle\mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle\;-\;\bigl(\alpha_{1}\,\chi^{2}(\mathbf{q}\,\|\,\mathbf{p})+1\bigr)\;-\;\alpha_{2}\,\chi^{2}(\mathbf{q}\,\|\,\mathbf{p})\\
&\;\stackrel{(d)}{=}\;\langle\mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle\;-\;(\alpha_{1}+\alpha_{2})\,\chi^{2}(\mathbf{q}\,\|\,\mathbf{p})\;-\;1,
\end{aligned} \tag{113}$$

where (a) is ([111](https://arxiv.org/html/2605.30651#A2.E111)); (b) uses $R_{1},R_{2}\geq-|R_{1}|,-|R_{2}|$; (c) substitutes the bounds ([112](https://arxiv.org/html/2605.30651#A2.E112)) from Lemma [12](https://arxiv.org/html/2605.30651#Thmlemma12); and (d) collects the two $\chi^{2}$ terms. Identifying $\alpha=\alpha_{1}+\alpha_{2}$ and $C=-1$ as in ([110](https://arxiv.org/html/2605.30651#A2.E110)) yields ([109](https://arxiv.org/html/2605.30651#A2.E109)). ∎

### B.7 Proof of Lemma [3](https://arxiv.org/html/2605.30651#Thmlemma3) ($B$-Parameterized Closed Form)

Theorem [2](https://arxiv.org/html/2605.30651#Thmtheorem2) motivates a $\chi^{2}$-regularized linear surrogate. In the fixed-budget setting, we calibrate the regularization temperature so that the optimizer has exactly $B$ active coordinates. By the equivalence ([77](https://arxiv.org/html/2605.30651#A2.E77)) between the $\chi^{2}$-divergence and the squared Euclidean distance under the uniform prior, the corresponding objective can be written as

$$\hat{\mathbf{q}}\;=\;\arg\max_{\mathbf{q}\in\Delta^{K}}\bigg\{\langle\mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle\;-\;\frac{\tau\,K}{2}\,\|\mathbf{q}-\mathbf{p}\|^{2}\bigg\}, \tag{114}$$

where we write $\tau$ for the temperature to be chosen as a function of the budget $B$.

###### Proof of Lemma [3](https://arxiv.org/html/2605.30651#Thmlemma3).

The objective ([114](https://arxiv.org/html/2605.30651#A2.E114)) is strictly concave in $\mathbf{q}$ whenever $\tau K>0$, so the KKT conditions characterize its unique global maximizer over $\Delta^{K}$. Introduce a Lagrange multiplier $\lambda\in\mathbb{R}$ for the equality constraint $\sum_{k}q_{k}=1$ and KKT multipliers $\mu_{k}\geq 0$ for the non-negativity constraints $q_{k}\geq 0$. The Lagrangian is

$$\mathcal{L}(\mathbf{q},\lambda,\boldsymbol{\mu})=\langle\mathbf{q}-\mathbf{p},\,\hat{\mathbf{g}}\rangle-\frac{\tau K}{2}\|\mathbf{q}-\mathbf{p}\|^{2}-\lambda\Bigl(\sum_{k}q_{k}-1\Bigr)+\sum_{k}\mu_{k}q_{k}. \tag{115}$$

Stationarity and complementary slackness give

$$\hat{g}_{k}-\tau K(q_{k}-p_{k})-\lambda+\mu_{k}=0,\qquad\mu_{k}q_{k}=0,\qquad\mu_{k}\geq 0. \tag{116}$$

Since $p_{k}=1/K$, these conditions imply the projection form

$$\hat{q}_{k}=\left[\frac{1}{K}+\frac{\hat{g}_{k}-\lambda}{\tau K}\right]_{+}, \tag{117}$$

where $[x]_{+}=\max\{x,0\}$. The threshold $\lambda$ is determined by the normalization condition $\sum_{k}\hat{q}_{k}=1$.

Sort the scores as $\hat{g}_{(1)}\geq\hat{g}_{(2)}\geq\cdots\geq\hat{g}_{(K)}$, where $(i)$ denotes the sorted index. From ([117](https://arxiv.org/html/2605.30651#A2.E117)), $\hat{q}_{(i)}>0$ if and only if $\hat{g}_{(i)}>\lambda-\tau$. Thus an active set of size $B$ is obtained whenever

$$\hat{g}_{(B)}>\lambda-\tau\geq\hat{g}_{(B+1)}. \tag{118}$$

Under the assumption $\hat{g}_{(B)}>\hat{g}_{(B+1)}$, the top-$B$ active set is well-defined. We choose the boundary value

$$\lambda=\hat{g}_{(B+1)}+\tau, \tag{119}$$

which makes the $(B+1)$-st coordinate exactly inactive while keeping the first $B$ coordinates positive.

Substituting ([119](https://arxiv.org/html/2605.30651#A2.E119)) into ([117](https://arxiv.org/html/2605.30651#A2.E117)), for every $i\leq B$ we obtain

$$\begin{aligned}
\hat{q}_{(i)}&=\frac{1}{K}+\frac{\hat{g}_{(i)}-\hat{g}_{(B+1)}-\tau}{\tau K}\\
&=\frac{\hat{g}_{(i)}-\hat{g}_{(B+1)}}{\tau K}.
\end{aligned} \tag{120}$$

For $i>B$, the projection in ([117](https://arxiv.org/html/2605.30651#A2.E117)) gives $\hat{q}_{(i)}=0$. Imposing the simplex normalization on the active coordinates gives

$$1=\sum_{i=1}^{B}\hat{q}_{(i)}=\frac{1}{\tau K}\sum_{i=1}^{B}\bigl(\hat{g}_{(i)}-\hat{g}_{(B+1)}\bigr). \tag{121}$$

Therefore the budget-calibrated temperature is

$$\tau^{*}(B)=\frac{1}{K}\sum_{i=1}^{B}\bigl(\hat{g}_{(i)}-\hat{g}_{(B+1)}\bigr), \tag{122}$$

which is positive because $\hat{g}_{(B)}>\hat{g}_{(B+1)}$. Substituting ([122](https://arxiv.org/html/2605.30651#A2.E122)) into ([120](https://arxiv.org/html/2605.30651#A2.E120)) yields

$$\hat{q}_{(i)}=\frac{\hat{g}_{(i)}-\hat{g}_{(B+1)}}{\sum_{j=1}^{B}\bigl(\hat{g}_{(j)}-\hat{g}_{(B+1)}\bigr)}\quad\text{for }i\leq B,\qquad\hat{q}_{(i)}=0\quad\text{for }i>B. \tag{123}$$

This is the closed-form $B$-parameterized weighting rule stated in Lemma [3](https://arxiv.org/html/2605.30651#Thmlemma3). ∎
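The closed form in (123) and the calibrated temperature in (122) are straightforward to evaluate numerically. The following is a minimal NumPy sketch, not the authors' released implementation; the function name `lark_closed_form_weights` and its interface are hypothetical, and it assumes $1\leq B<K$ together with the strict gap $\hat{g}_{(B)}>\hat{g}_{(B+1)}$ used in the proof.

```python
import numpy as np

def lark_closed_form_weights(g_hat, budget):
    """Illustrative evaluation of Eq. (123) and Eq. (122); hypothetical helper.

    Returns the weight vector q (in the original candidate order) and the
    budget-calibrated temperature tau*(B).
    """
    g_hat = np.asarray(g_hat, dtype=float)
    K = g_hat.size
    if not 1 <= budget < K:
        raise ValueError("budget must satisfy 1 <= B < K")
    order = np.argsort(-g_hat, kind="stable")      # indices by descending score
    threshold = g_hat[order[budget]]               # g_hat_(B+1)
    margins = g_hat[order[:budget]] - threshold    # g_hat_(i) - g_hat_(B+1), i <= B
    if margins[-1] <= 0:
        raise ValueError("requires g_hat_(B) > g_hat_(B+1)")
    tau_star = margins.sum() / K                   # Eq. (122)
    q = np.zeros(K)
    q[order[:budget]] = margins / margins.sum()    # Eq. (123)
    return q, tau_star

# Usage: K = 4 candidates, budget B = 2.
q, tau_star = lark_closed_form_weights([0.5, 0.1, 0.9, 0.3], budget=2)
```

In this toy call the threshold score is $0.3$, so the two active margins are $0.6$ and $0.2$; the resulting weights agree with the projection form (117) evaluated at $\lambda=\hat{g}_{(B+1)}+\tau^{*}(B)$.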

### B.8 Theoretical and Empirical Validation of the Surrogate $\hat{\mathbf{g}}$

The forward-pass surrogate $\hat{\mathbf{g}}$ is the central computational shortcut underlying the LARK algorithm: it replaces the exact directional derivative $\mathbf{g}^{*}=\nabla_{\mathbf{q}}\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})|_{\mathbf{q}=\mathbf{p}}$, whose evaluation requires one backward pass per candidate trajectory, with a quantity computable from forward passes alone. This subsection studies the relationship between $\hat{\mathbf{g}}$ and $\mathbf{g}^{*}$ from two complementary perspectives: a theoretical RMS error bound and an empirical comparison on real student models.

#### B.8.1 Theoretical Error Bound

The bounds derived in Appendix [B.5](https://arxiv.org/html/2605.30651#A2.SS5) immediately yield a quantitative bound on $\|\hat{\mathbf{g}}-\mathbf{g}^{*}\|$ in the $\mathbf{p}$-weighted norm $\|\mathbf{v}\|_{\mathbf{p}}^{2}\triangleq K^{-1}\sum_{k}v_{k}^{2}$.

###### Corollary 3 ($\mathbf{p}$-weighted RMS error of $\hat{\mathbf{g}}$).

Under Assumptions [B.1](https://arxiv.org/html/2605.30651#A2.Thmappassumption1)–[B.4](https://arxiv.org/html/2605.30651#A2.Thmappassumption4),

$$\|\hat{\mathbf{g}}-\mathbf{g}^{*}\|_{\mathbf{p}}\;=\;\sqrt{\frac{1}{K}\sum_{k=1}^{K}\bigl|\hat{g}_{k}-g_{k}^{*}\bigr|^{2}}\;\leq\;\alpha_{1}^{\mathrm{rms}}, \tag{124}$$

where $\alpha_{1}^{\mathrm{rms}}$ is the RMS upper bound from ([98](https://arxiv.org/html/2605.30651#A2.E98)), given by

$$\alpha_{1}^{\mathrm{rms}}\;=\;\underbrace{\frac{3\,\Delta_{\rho}\,\hat{\rho}_{\max}}{\sqrt{K}}}_{\text{proxy term}}\;+\;\underbrace{\frac{2\,E_{+}\,\sqrt{\hat{\rho}_{\max}\,\ell_{\max}\,\sum_{i}\hat{\rho}_{i}\,\ell_{i}}}{\sqrt{K}\,\bar{\ell}}}_{\text{off-diagonal Gram term}}\;+\;\underbrace{\frac{E_{+}\,\ell_{\max}\,\sum_{i}\hat{\rho}_{i}\,\ell_{i}}{K\,\bar{\ell}^{\,2}}}_{\text{aggregate off-diagonal term}}. \tag{125}$$

The constant $\alpha_{1}$ in Lemma [12](https://arxiv.org/html/2605.30651#Thmlemma12) is recovered as $\alpha_{1}=(\alpha_{1}^{\mathrm{rms}})^{2}/4$.

###### Proof\.

By the triangle inequality, $\|\hat{\mathbf{g}}-\mathbf{g}^{*}\|_{\mathbf{p}}\leq\|\boldsymbol{\eta}_{\mathrm{prox}}\|_{\mathbf{p}}+\|\boldsymbol{\eta}_{\mathrm{diag}}\|_{\mathbf{p}}$, where $\boldsymbol{\eta}_{\mathrm{prox}}$ and $\boldsymbol{\eta}_{\mathrm{diag}}$ are defined in the proof of Lemma [12](https://arxiv.org/html/2605.30651#Thmlemma12). Combining the bounds in ([89](https://arxiv.org/html/2605.30651#A2.E89)) and ([96](https://arxiv.org/html/2605.30651#A2.E96)) gives ([124](https://arxiv.org/html/2605.30651#A2.E124)), with $\alpha_{1}^{\mathrm{rms}}$ given by ([125](https://arxiv.org/html/2605.30651#A2.E125)). ∎

The three terms in ([125](https://arxiv.org/html/2605.30651#A2.E125)) correspond to the three sources of surrogate error. The first term, $3\Delta_{\rho}\hat{\rho}_{\max}/\sqrt{K}$, is the proxy-substitution error: it comes from replacing the exact per-trajectory rates $\rho_{i}^{*}$ in the closed form of $g_{k}^{*}$ by the forward-pass proxies $\hat{\rho}_{i}$. Its size is controlled by $\Delta_{\rho}$, which measures the multiplicative gap between $\rho_{i}^{*}$ and $\hat{\rho}_{i}$ in Lemma [11](https://arxiv.org/html/2605.30651#Thmlemma11). The second term, $2E_{+}\sqrt{\hat{\rho}_{\max}\ell_{\max}\sum_{i}\hat{\rho}_{i}\ell_{i}}/(\sqrt{K}\,\bar{\ell})$, controls the per-coordinate off-diagonal Gram residual $o_{k}=\mathbf{g}_{k}^{\top}\bar{\mathbf{g}}-\|\mathbf{g}_{k}\|^{2}/K$, which is ignored by the diagonal approximation $\mathbf{g}_{k}^{\top}\bar{\mathbf{g}}\approx\|\mathbf{g}_{k}\|^{2}/K$. The third term, $E_{+}\ell_{\max}\sum_{i}\hat{\rho}_{i}\ell_{i}/(K\bar{\ell}^{2})$, controls the aggregate off-diagonal mass $O=\|\bar{\mathbf{g}}\|^{2}-K^{-2}\sum_{i}\|\mathbf{g}_{i}\|^{2}$ left over by the approximation $\|\bar{\mathbf{g}}\|^{2}\approx K^{-2}\sum_{i}\|\mathbf{g}_{i}\|^{2}$. The latter two terms are small when the per-trajectory gradients are close to orthogonal in parameter space.

The bound in ([124](https://arxiv.org/html/2605.30651#A2.E124)) is an aggregate RMS bound across the $K$ candidates, not a per-coordinate guarantee. By the equivalence of finite-dimensional norms,

$$\max_{k\in[K]}|\hat{g}_{k}-g_{k}^{*}|\;\leq\;\sqrt{\sum_{k=1}^{K}|\hat{g}_{k}-g_{k}^{*}|^{2}}\;=\;\sqrt{K}\,\|\hat{\mathbf{g}}-\mathbf{g}^{*}\|_{\mathbf{p}}\;\leq\;\sqrt{K}\,\alpha_{1}^{\mathrm{rms}}. \tag{126}$$

Thus a single coordinate can have a larger error in the worst case, while the RMS error remains controlled. This RMS norm is the relevant quantity for the residual analysis because the Cauchy–Schwarz step in ([83](https://arxiv.org/html/2605.30651#A2.E83)) naturally pairs $\sqrt{\chi^{2}(\mathbf{q}\,\|\,\mathbf{p})}$ with $\|\boldsymbol{\eta}\|_{\mathbf{p}}$.

#### B.8.2 Empirical Correlation between $\hat{\mathbf{g}}$ and $\mathbf{g}^{*}$

The bound in Corollary [3](https://arxiv.org/html/2605.30651#Thmcorollary3) is a worst-case guarantee whose constants depend on the structural quantities of Lemma [11](https://arxiv.org/html/2605.30651#Thmlemma11). We complement this analysis by directly measuring how well $\hat{\mathbf{g}}$ tracks $\mathbf{g}^{*}$ on a real student model.

We sample $50$ questions uniformly from the candidate pool described in Appendix [C.1](https://arxiv.org/html/2605.30651#A3.SS1). For each question, we compute both the LARK surrogate $\hat{g}_{k}$ and the exact directional derivative $g_{k}^{*}=\partial\rho(\boldsymbol{\theta}_{\mathrm{ref}};\mathbf{q})/\partial q_{k}|_{\mathbf{q}=\mathbf{p}}$. The exact derivative is obtained by one full-parameter backward pass per candidate trajectory at the anchor model $\boldsymbol{\theta}_{\mathrm{ref}}$ (Qwen-2.5-7B), rather than by a low-rank or kernel approximation. Across $50$ questions and $K=33$ candidates per question, this produces $1{,}650$ paired values $(\hat{g}_{k},g_{k}^{*})$. To compare values across questions, we apply a per-question affine rescaling that aligns the mean and scale of $\hat{g}_{k}$ with those of $g_{k}^{*}$. This rescaling preserves within-question rankings and therefore does not affect $\mathrm{Recall@}B$.

We report three metrics. First, we compute pooled Pearson $r$ and Spearman $\rho$ between the rescaled $\hat{g}_{k}$ and $g_{k}^{*}$ to measure scalar agreement. Second, because LARK uses $\hat{\mathbf{g}}$ primarily through its top-$B$ ranking, we report $\mathrm{Recall@}B\triangleq|\mathrm{top}_{B}(\hat{\mathbf{g}})\cap\mathrm{top}_{B}(\mathbf{g}^{*})|/B$ averaged over questions for $B\in\{1,\ldots,10\}$, with random baseline $B/K$. Third, we report the per-question relative $\mathbf{p}$-weighted RMS error $\|\hat{\mathbf{g}}-\mathbf{g}^{*}\|_{\mathbf{p}}/\|\mathbf{g}^{*}\|_{\mathbf{p}}$, matching the norm used in Corollary [3](https://arxiv.org/html/2605.30651#Thmcorollary3).
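The per-question diagnostics can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the evaluation code used for Figure 6: the helper names are hypothetical, and matching the mean and standard deviation is only one possible reading of "aligns the mean and scale" (it assumes $\hat{\mathbf{g}}$ is not constant within a question).

```python
import numpy as np

def affine_rescale(g_hat, g_star):
    """Match the mean and standard deviation of g_hat to those of g_star.

    Assumed form of the per-question rescaling; the scale is positive,
    so within-question rankings are unchanged.
    """
    g_hat = np.asarray(g_hat, dtype=float)
    g_star = np.asarray(g_star, dtype=float)
    scale = g_star.std() / g_hat.std()
    return (g_hat - g_hat.mean()) * scale + g_star.mean()

def recall_at_b(g_hat, g_star, b):
    """|top_B(g_hat) ∩ top_B(g_star)| / B for a single question."""
    top_hat = set(np.argsort(-np.asarray(g_hat, dtype=float))[:b].tolist())
    top_star = set(np.argsort(-np.asarray(g_star, dtype=float))[:b].tolist())
    return len(top_hat & top_star) / b

def relative_rms_error(g_hat, g_star):
    """Relative p-weighted RMS error; the 1/K factors cancel in the ratio."""
    g_hat = np.asarray(g_hat, dtype=float)
    g_star = np.asarray(g_star, dtype=float)
    return float(np.sqrt(np.mean((g_hat - g_star) ** 2) / np.mean(g_star ** 2)))

# Toy question with K = 4 candidates (made-up numbers, not paper data).
g_hat = [3.0, 1.0, 2.0, 0.0]
g_star = [10.0, 2.0, 1.0, 5.0]
rescaled = affine_rescale(g_hat, g_star)
recall_2 = recall_at_b(rescaled, g_star, b=2)
```

Averaging `recall_at_b` over questions for each budget gives the curve described above, and the pooled Pearson coefficient can be computed by applying `np.corrcoef` to the concatenated rescaled values.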

![Refer to caption](https://arxiv.org/html/2605.30651v1/x9.png)

Figure 6: Empirical correlation between $\hat{\mathbf{g}}$ and $\mathbf{g}^{*}$ on Qwen-2.5-7B. (a) Per-question affine-rescaled $\hat{g}_{k}$ versus exact $g_{k}^{*}$ across $50\times 33$ candidates; pooled Pearson $r=0.748$ and Spearman $\rho=0.566$. (b) $\mathrm{Recall@}B$ between the $\mathrm{top}_{B}$ rankings of $\hat{\mathbf{g}}$ and $\mathbf{g}^{*}$ as a function of selection budget $B$; the dashed gray line is the random baseline $B/K$. (c) Histogram of per-question relative $\mathbf{p}$-weighted RMS error $\|\hat{\mathbf{g}}-\mathbf{g}^{*}\|_{\mathbf{p}}/\|\mathbf{g}^{*}\|_{\mathbf{p}}$ across the $50$ verification questions; mean $0.574$, median $0.590$.

Figure [6](https://arxiv.org/html/2605.30651#A2.F6)(a) shows a clear positive relationship between $\hat{g}_{k}$ and $g_{k}^{*}$, with pooled Pearson $r=0.748$ and Spearman $\rho=0.566$. This indicates that the surrogate captures a substantial portion of the scalar variation in $g_{k}^{*}$ after per-question affine rescaling, while the remaining gap is consistent with the off-diagonal Gram interactions omitted by the diagonal approximation in ([125](https://arxiv.org/html/2605.30651#A2.E125)).

The ranking agreement is also substantially above chance. As shown in Figure [6](https://arxiv.org/html/2605.30651#A2.F6)(b), $\mathrm{Recall@}1\approx 0.38$, far above the random baseline $1/K\approx 0.03$. At the main experimental budget $B=3$, $\mathrm{Recall@}3\approx 0.56$, compared with the random baseline $3/K\approx 0.09$. The recall remains around $0.6$ for larger budgets, indicating that $\hat{\mathbf{g}}$ preserves meaningful top-$B$ ranking information for the selection rule used by LARK. The error bars also show nontrivial question-level variation, so this result should be interpreted as strong aggregate alignment rather than a perfect per-question match.

Figure [6](https://arxiv.org/html/2605.30651#A2.F6)(c) shows that the relative $\mathbf{p}$-weighted RMS error concentrates around $0.6$ (mean $0.574$, median $0.590$), which is far below the loose worst-case per-coordinate factor $\sqrt{K}\approx 5.7$ allowed by ([126](https://arxiv.org/html/2605.30651#A2.E126)). The empirical error is therefore much smaller than the theoretical worst case, as expected in non-adversarial candidate pools.

Overall, these diagnostics support the use of $\hat{\mathbf{g}}$ as a forward-pass surrogate for top-$B$ trajectory selection. The surrogate is not a high-precision coordinate-wise estimator of $\mathbf{g}^{*}$, but it is positively aligned with $\mathbf{g}^{*}$ at the scalar level and preserves useful top-$B$ ranking information. This is the operative notion of agreement for LARK, since the closed-form selection rule in Lemma [3](https://arxiv.org/html/2605.30651#Thmlemma3) depends on the top-$B$ ordering and the score margins above the threshold, rather than on the exact magnitude of every coordinate. The empirical results therefore support the computational tradeoff made by LARK: avoiding $K$ full-parameter backward passes per question while retaining the ranking signal needed for downstream selection.

## Appendix C Experimental Details

### C.1 Candidate Pool

We adopt the trajectory candidate pool released by Yang et al. [[2026](https://arxiv.org/html/2605.30651#bib.bib3)] without modification. The pool consists of 5,000 mathematical reasoning problems sampled from NuminaMath [Li et al., [2024](https://arxiv.org/html/2605.30651#bib.bib18)], each paired with 33 reasoning trajectories produced by 11 teacher models with 3 independent rollouts per teacher. All selection methods in our experiments, including LARK and every baseline in Appendix [C.2](https://arxiv.org/html/2605.30651#A3.SS2), operate on this identical pool, so any difference in downstream performance is attributable to the selection rule rather than to differences in the underlying candidate set.

##### Teacher pool\.

The 11 teacher models span a range of scales and model families, including both proprietary and open-weight reasoning models: GPT-OSS-20B, GPT-OSS-120B, Phi-4-R+, QwQ-32B, DeepSeek-R1, Qwen3-235B, Qwen3-30B, Qwen3-8B, Qwen3-4B, Nemotron, and Magistral-S. For each (problem, teacher) pair, three independent rollouts are generated, yielding $11\times 3=33$ trajectories per problem. We refer the reader to Yang et al. [[2026](https://arxiv.org/html/2605.30651#bib.bib3)] for the full inference configuration of each teacher (decoding temperature, top-$p$, maximum generation length).

##### Correctness filtering\.

Every trajectory retained in the pool has been verified to reach the correct final answer under the extraction rule of Yang et al. [[2026](https://arxiv.org/html/2605.30651#bib.bib3)], which matches the rule described in Appendix [C.4](https://arxiv.org/html/2605.30651#A3.SS4). Trajectories whose final answer disagrees with the ground truth are not included. As a result, all 33 candidates for each problem are *correct* reasoning trajectories that nonetheless differ in reasoning style, length, and student-side learnability. This setting isolates the selection problem from the orthogonal question of correctness verification.

### C.2 Trajectory Selection Baselines

We compare LARK against the seven baselines reported in Table [1](https://arxiv.org/html/2605.30651#S5.T1). Throughout this subsection, let $\mathbf{x}$ denote the prompt (system and user turns), and let $\{\mathbf{y}_{k}\}_{k=1}^{K}$ with $K=33$ denote the candidate trajectories for $\mathbf{x}$ in the pool $\mathcal{C}(\mathbf{x})$ described in Appendix [C.1](https://arxiv.org/html/2605.30651#A3.SS1). Each trajectory has length $|a_{k}|$, with $t$-th token $y_{t}^{k}$, predictive distribution $\boldsymbol{\pi}_{t}^{k}=\pi_{\boldsymbol{\theta}}(\cdot\mid\mathbf{x},\mathbf{y}_{<t}^{k})\in\Delta^{|\mathcal{V}|}$ under the student $\pi_{\boldsymbol{\theta}}$, ground-truth one-hot $\boldsymbol{\delta}(y_{t}^{k})\in\Delta^{|\mathcal{V}|}$, and length-normalized cross-entropy $\ell_{k}=\tfrac{1}{|a_{k}|}\sum_{t=1}^{|a_{k}|}-\log\boldsymbol{\pi}_{t}^{k}(y_{t}^{k})$. Each baseline assigns a scalar score to every candidate $\mathbf{y}_{k}$ and selects the top-$B$ trajectories under that score with uniform weights $1/B$; LARK instead applies its closed-form $\chi^{2}$-$B$ rule to obtain non-uniform weights.

##### Random\.

We sample one trajectory uniformly at random from $\{\mathbf{y}_{k}\}_{k=1}^{K}$:

$$\mathbf{y}^{\star}\sim\mathrm{Uniform}\bigl(\{\mathbf{y}_{k}\}_{k=1}^{K}\bigr).$$

For $B>1$, we sample $B$ trajectories without replacement.

##### Token Length$_{\max}$.

We score each candidate by its token length and select the longest:

$$\mathrm{score}_{\text{len}}(\mathbf{y}_{k})\triangleq|a_{k}|,\qquad\mathbf{y}^{\star}=\arg\max_{k\in[K]}\mathrm{score}_{\text{len}}(\mathbf{y}_{k}).$$

##### Rule-based Quality$_{\max}$ (LIMO-style).

We implement a lightweight heuristic inspired by LIMO-style filtering [Ye et al., [2025](https://arxiv.org/html/2605.30651#bib.bib1)], combining four indicators computed from the assistant text of $\mathbf{y}_{k}$: (i) elaborated reasoning (total word length), (ii) self-verification (frequency of “check” and “verify”), (iii) exploratory approach (frequency of “perhaps” and “might”), (iv) adaptive granularity (frequency of “therefore” and “since”). For each candidate, the frequency of each keyword group is normalized by the total word count. We $z$-score each indicator across the $K$ candidates within the same pool and take a weighted sum:

$$\mathrm{score}_{\text{rule}}(\mathbf{y}_{k})=0.30\cdot z(\mathrm{len})+0.20\cdot z(\mathrm{selfver})+0.25\cdot z(\mathrm{explore})+0.25\cdot z(\mathrm{granularity}),$$

where $z(\cdot)$ denotes within-pool $z$-scoring. We select $\mathbf{y}^{\star}=\arg\max_{k\in[K]}\mathrm{score}_{\text{rule}}(\mathbf{y}_{k})$.
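The heuristic above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the regex word tokenizer, exact-match keyword counting (so "checking" is not counted as "check"), and the zero-variance guard in `zscore` are all assumptions.

```python
import re
import numpy as np

# Keyword groups and weights as stated in the text.
KEYWORDS = {
    "selfver": ("check", "verify"),
    "explore": ("perhaps", "might"),
    "granularity": ("therefore", "since"),
}
WEIGHTS = {"len": 0.30, "selfver": 0.20, "explore": 0.25, "granularity": 0.25}

def zscore(values):
    """Within-pool z-score; returns zeros if the indicator is constant (assumption)."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    return np.zeros_like(values) if std == 0 else (values - values.mean()) / std

def rule_scores(texts):
    """Weighted sum of z-scored indicators for the K candidate texts of one prompt."""
    words = [re.findall(r"[a-z']+", text.lower()) for text in texts]
    feats = {"len": [len(w) for w in words]}
    for name, keys in KEYWORDS.items():
        # Keyword frequency normalized by the candidate's total word count.
        feats[name] = [sum(w.count(k) for k in keys) / max(len(w), 1) for w in words]
    return sum(WEIGHTS[name] * zscore(feats[name]) for name in WEIGHTS)

# Toy pool with K = 2 candidates (made-up texts).
pool = [
    "We check and verify. Therefore perhaps it might hold since x.",
    "Answer is 4.",
]
scores = rule_scores(pool)
best = int(np.argmax(scores))
```

With only two candidates every z-score is $\pm 1$, so the first text receives $0.30+0.20+0.25+0.25=1.0$ and the second receives $-1.0$.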

##### LLM-judged Quality$_{\max}$.

We use Qwen3-32B-Instruct [Yang et al., [2025](https://arxiv.org/html/2605.30651#bib.bib24)] as a judge model in non-thinking mode. Given $(\mathbf{x},\mathbf{y}_{k})$, the judge outputs a JSON object with an `overall_score` $s_{\text{judge}}(\mathbf{y}_{k})\in[0,1]$ and a textual `overall_reason`. We select $\mathbf{y}^{\star}=\arg\max_{k\in[K]}s_{\text{judge}}(\mathbf{y}_{k})$. The verbatim judge prompt is given below.

Judge Prompt: LLM-as-a-Judge for Trajectory Quality

You are a meticulous and highly critical evaluator of AI reasoning. Your primary goal is to identify and quantify subtle flaws, logical gaps, inefficiencies, and hidden assumptions. Do not default to a high score. Your starting assumption should be critical, and you must rigorously justify every point awarded.

First, please carefully read the following problem statement: <Problem\>\{question\}</Problem\>

Now, please carefully read the following candidate’s chain-of-thought reasoning: <Reasoning\>\{reasoning\_to\_evaluate\}</Reasoning\>

When evaluating this reasoning, you must adhere to the following five key evaluation criteria and the scoring rubric below.

Scoring Guidelines and Calibration: You must use the full 0.0 to 1.0 scale. Scores should not be clustered at the top. Use this rubric to anchor your scores:

- 1.0 (Exceptional/Flawless): Reserved for reasoning that is not only correct but also elegant, insightful, and comprehensive. It is perfectly structured and leaves no room for doubt. This score should be exceedingly rare.
- 0.8 - 0.9 (Excellent but Imperfect): The core reasoning is valid and well-supported, but there may be very minor, superficial issues (e.g., a trivial typo in a formula that doesn’t affect the outcome, a slightly awkward phrasing). The conclusion is unaffected.
- 0.5 - 0.7 (Competent but Flawed): The reasoning is generally on the right track but contains noticeable and non-trivial flaws.
- 0.2 - 0.4 (Poor): The reasoning contains fundamental flaws that largely invalidate the process or conclusion.
- 0.0 - 0.1 (Unacceptable): The reasoning is completely incorrect, irrelevant, nonsensical, or makes no meaningful attempt to solve the problem.

Evaluation Criteria: Factual Accuracy, Logical Rigor, Solution Completeness, Reasoning Efficiency, Presentation Quality.

For each criterion, give a score from 0.0 to 1.0 (in 0.1 increments) and a brief justification in JSON.

Your output must be a single, valid JSON object with:

```
{
  "dimensional_evaluation": {...},
  "overall_score": <float between 0.0 and 1.0>,
  "overall_reason": "<concise summary>"
}
```

##### Global Naturalness (GRAPE)$_{\max}$ [Zhang et al., [2025a](https://arxiv.org/html/2605.30651#bib.bib4)].

GRAPE scores each candidate by the average log-likelihood assigned to its tokens by the student under teacher forcing, equivalently the negative length-normalized cross-entropy:

$$\mathrm{score}_{\text{GRAPE}}(\mathbf{y}_{k})\;\triangleq\;-\,\ell_{k}\;=\;\frac{1}{|a_{k}|}\sum_{t=1}^{|a_{k}|}\log\boldsymbol{\pi}_{t}^{k}(y_{t}^{k}).$$

We select $\mathbf{y}^{\star}=\arg\max_{k\in[K]}\mathrm{score}_{\text{GRAPE}}(\mathbf{y}_{k})$, which is equivalent to selecting the trajectory of *minimum* length-normalized cross-entropy under the student.
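As a rough illustration, the GRAPE score can be computed from the student's teacher-forced logits with a log-softmax followed by a gather. The sketch below uses plain NumPy arrays in place of a real model; the helper names and the `(T, V)` logits layout are assumptions made for illustration.

```python
import numpy as np

def token_log_probs(logits, token_ids):
    """Log-probability of each observed token under teacher forcing.

    logits: array of shape (T, V), one row per trajectory position (assumed layout).
    token_ids: length-T sequence of observed token indices.
    """
    logits = np.asarray(logits, dtype=float)
    token_ids = np.asarray(token_ids, dtype=int)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerically stable
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return log_probs[np.arange(token_ids.size), token_ids]

def grape_score(logits, token_ids):
    """Average token log-likelihood, i.e. the negative length-normalized cross-entropy."""
    return float(token_log_probs(logits, token_ids).mean())

# Toy example: uniform logits over a 4-token vocabulary give a score of -log(4).
uniform_score = grape_score(np.zeros((3, 4)), [0, 1, 2])
```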

##### Local Naturalness$_{\max}$ [Just et al., [2025](https://arxiv.org/html/2605.30651#bib.bib8)].

We follow the Local Naturalness metric. Each trajectory is split into $J_{k}$ sentence-level steps $\mathbf{y}_{k}=(\mathbf{s}_{1},\ldots,\mathbf{s}_{J_{k}})$, where each $\mathbf{s}_{j}$ is a contiguous sub-sequence of tokens. With local context size $m$ (we use $m=4$), we compute

$$\mathrm{score}_{\text{local}}(\mathbf{y}_{k})\;\triangleq\;\frac{1}{J_{k}}\sum_{j=1}^{J_{k}}\left(\frac{1}{|\mathbf{s}_{j}|}\sum_{u=1}^{|\mathbf{s}_{j}|}\log\pi_{\boldsymbol{\theta}}\left(s_{j,u}\,\big|\,\mathbf{x},\,\mathbf{s}_{j-m:j-1},\,\mathbf{s}_{j,<u}\right)\right),$$

where $\mathbf{s}_{j-m:j-1}$ denotes up to $m$ immediately preceding steps and $s_{j,u}$ is the $u$-th token of step $\mathbf{s}_{j}$. We select $\mathbf{y}^{\star}=\arg\max_{k\in[K]}\mathrm{score}_{\text{local}}(\mathbf{y}_{k})$. Unlike GRAPE, this score requires $m+1$ forward passes per trajectory (one per masked-context rollout); see Appendix [D.1](https://arxiv.org/html/2605.30651#A4.SS1).
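The windowing logic of this score can be sketched independently of any particular model. In the sketch below, `step_log_probs` is a hypothetical callback that returns the per-token log-probabilities of a step given the prompt and a truncated context; how those log-probabilities are batched into forward passes is left out, and the loop only illustrates the averaging structure.

```python
import numpy as np

def local_naturalness(steps, step_log_probs, m=4):
    """Mean over steps of the mean token log-probability under a local context.

    steps: list of steps s_1..s_J (each a sequence of tokens).
    step_log_probs(context_steps, step): hypothetical callback returning one
        log-probability per token of `step`, conditioned on the prompt and on
        `context_steps` (at most m immediately preceding steps).
    """
    if not steps:
        raise ValueError("trajectory must contain at least one step")
    per_step = []
    for j, step in enumerate(steps):
        context = steps[max(0, j - m):j]          # s_{j-m : j-1}
        per_step.append(float(np.mean(step_log_probs(context, step))))
    return float(np.mean(per_step))

# Toy callback: pretend each token's log-probability is minus the context size.
def fake_log_probs(context_steps, step):
    return [-float(len(context_steps))] * len(step)

toy_steps = [[1, 2], [3], [4, 5, 6]]
toy_score = local_naturalness(toy_steps, fake_log_probs, m=1)
```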

##### Rank-Surprisal Ratio (RSR)$_{\min}$ [Yang et al., [2026](https://arxiv.org/html/2605.30651#bib.bib3)].

RSR computes token-level surprisal and rank under the student. For each token $y_{t}^{k}$ with context $\mathbf{c}_{t}^{k}=(\mathbf{x},\mathbf{y}_{<t}^{k})$, define the surprisal and rank as

$$s_{t}^{k}\;\triangleq\;-\log\boldsymbol{\pi}_{t}^{k}(y_{t}^{k}),\qquad r_{t}^{k}\;\triangleq\;1+\sum_{v\in\mathcal{V}}\mathbb{I}\left[\pi_{\boldsymbol{\theta}}(v\mid\mathbf{c}_{t}^{k})>\boldsymbol{\pi}_{t}^{k}(y_{t}^{k})\right].$$

Ranks are clipped at $r_{\max}$ (we use $r_{\max}=100$): $\bar{r}_{t}^{k}\triangleq\min(r_{t}^{k},r_{\max})$. The trajectory-level score is

$$\mathrm{score}_{\text{RSR}}(\mathbf{y}_{k})\;\triangleq\;\frac{\sum_{t=1}^{|a_{k}|}\bar{r}_{t}^{k}}{\sum_{t=1}^{|a_{k}|}s_{t}^{k}},\qquad\mathbf{y}^{\star}=\arg\min_{k\in[K]}\mathrm{score}_{\text{RSR}}(\mathbf{y}_{k}).$$

Surprisal and clipped rank are computed from a single forward pass; clipping at $r_{\max}$ avoids the $O(|\mathcal{V}|)$ full sort while remaining exact (Appendix [C.3](https://arxiv.org/html/2605.30651#A3.SS3)).
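A minimal NumPy sketch of the RSR computation from teacher-forced logits follows. It is not the implementation of Yang et al.; the function name and the `(T, V)` logits layout are assumptions, and for simplicity it counts higher-probability tokens over the full vocabulary rather than using any clipping-based shortcut. It also assumes the total surprisal is non-zero.

```python
import numpy as np

def rsr_score(logits, token_ids, r_max=100):
    """Sum of clipped ranks divided by sum of surprisals for one trajectory.

    logits: array of shape (T, V) from a teacher-forced forward pass (assumed layout).
    token_ids: length-T sequence of observed token indices.
    """
    logits = np.asarray(logits, dtype=float)
    token_ids = np.asarray(token_ids, dtype=int)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    target = log_probs[np.arange(token_ids.size), token_ids]
    surprisal = -target                                      # s_t
    # Rank = 1 + number of vocabulary entries strictly more likely than y_t;
    # comparing log-probabilities is equivalent to comparing probabilities.
    rank = 1 + (log_probs > target[:, None]).sum(axis=-1)    # r_t
    clipped_rank = np.minimum(rank, r_max)                   # \bar{r}_t
    return float(clipped_rank.sum() / surprisal.sum())

# Toy example: T = 2 positions, V = 3 tokens; the observed token is index 0 twice.
toy_logits = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
toy_rsr = rsr_score(toy_logits, [0, 0])
```

In the toy example the observed token has rank 1 at the first position and rank 3 at the second, so the numerator is 4; lowering `r_max` to 2 clips the second rank and reduces the numerator to 3.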

### C.3 Implementation Details

##### SFT training\.

We use a single hyperparameter recipe across all three students \(Qwen\-2\.5\-7B, Qwen\-2\.5\-1\.5B, Llama\-3\.2\-3B\) and across all selection methods, so that any difference in downstream performance can be attributed to the selection rule rather than to per\-method training tuning\. Each student is fine\-tuned with full\-parameter SFT under the following configuration:

- Maximum sequence length: 32,768 tokens, kept identical across selection scoring, SFT, and evaluation; trajectories exceeding this length (less than 1% of the pool) are truncated from the right.
- Learning rate: $1.0\times10^{-5}$ for the main 5,000-trajectory experiments and $5.0\times10^{-6}$ for the 500-trajectory verification of Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) in Appendix [A.3](https://arxiv.org/html/2605.30651#A1.SS3), linearly scaled to the smaller dataset to avoid early-step instability.
- Effective global batch size: 64 trajectories, realized as 8 GPUs $\times$ per-device batch 1 $\times$ gradient accumulation 8.
- Epochs: 5.
- Optimizer: AdamW (PyTorch implementation) with default $(\beta_1,\beta_2,\varepsilon)=(0.9,0.999,10^{-8})$ and weight decay 0.
- Learning-rate schedule: cosine decay with linear warm-up over the first 5% of total steps.
- Gradient clipping: max-norm 1.0, delegated to DeepSpeed via `gradient_clipping: auto`.
- Precision: bf16 mixed precision throughout, with fp32 communication for ZeRO collectives. We disable bf16 reduced-precision matmul accumulation (`torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False`) to suppress occasional NaN gradients in the LM-head backward on A100; fp16 is never used.
- Memory and throughput optimizations: non-reentrant gradient checkpointing, FlashAttention-2, sequence packing with greedy bin-packing and reset `position_ids`, and Liger fused kernels (RoPE, RMSNorm, SwiGLU). The fused linear-CE kernel is disabled because it is incompatible with sequence packing.
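For concreteness, the batch-size arithmetic and the learning-rate schedule in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `lr_at` is hypothetical, and the cosine phase is assumed to decay to zero, since the text does not state a floor.

```python
import math

# Effective global batch size from the recipe above.
GPUS, PER_DEVICE_BATCH, GRAD_ACCUM = 8, 1, 8
EFFECTIVE_BATCH = GPUS * PER_DEVICE_BATCH * GRAD_ACCUM  # 64 trajectories per optimizer step


def lr_at(step, total_steps, peak_lr=1.0e-5, warmup_frac=0.05):
    """Learning rate at 0-indexed optimizer step `step`.

    Linear warm-up over the first `warmup_frac` of steps, then cosine decay.
    Assumption: the decay ends at zero (no minimum learning rate is given).
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For the 500-trajectory runs, the same function would be called with `peak_lr=5.0e-6`.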

##### Software stack\.

Training and inference run inside a single conda environment with PyTorch 2.9.1 (CUDA 12.8), Transformers 4.57.1, PEFT 0.18.0, FlashAttention 2.8.3, Liger-Kernel, and DeepSpeed. Distributed training is launched via `torchrun --nproc_per_node=8` on top of the HuggingFace `Trainer`, using DeepSpeed ZeRO Stage 2 (`zero2_bf16.json`: `stage 2`, `communication_data_type fp32`, `reduce/allgather_bucket_size` $=5\times10^{8}$). Under ZeRO-2, optimizer states and gradients are sharded across the 8 GPUs while parameters are replicated, which is sufficient for the $\leq 8$B-parameter students considered. Evaluation uses vLLM 0.15.0 for batched multi-sample decoding, with SGLang 0.5.9 available as a fallback engine.
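The actual `zero2_bf16.json` is not reproduced in this appendix. As a hedged sketch, a DeepSpeed configuration consistent with the fields named above might look like the dictionary below; the `"auto"` entries are an assumption (they rely on the HuggingFace `Trainer` integration filling in values from its own arguments), and any field not mentioned in the text is illustrative only.

```python
import json

# Hypothetical reconstruction of zero2_bf16.json from the fields named in the text.
# Only stage, communication dtype, and bucket sizes are taken from the source;
# the "auto" entries are assumptions about how the file defers to the Trainer.
zero2_bf16 = {
    "bf16": {"enabled": True},
    "communication_data_type": "fp32",       # fp32 communication for ZeRO collectives
    "zero_optimization": {
        "stage": 2,                          # shard optimizer states and gradients
        "reduce_bucket_size": int(5e8),
        "allgather_bucket_size": int(5e8),
    },
    "gradient_clipping": "auto",             # resolved from the Trainer's max_grad_norm
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

print(json.dumps(zero2_bf16, indent=2))
```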

##### Hardware and wall\-clock\.

All experiments run on a single node with $8\times$ NVIDIA A100 80GB (SXM4) GPUs. Each 5,000-trajectory SFT run completes in roughly 3–5 hours depending on the student size, and each 500-trajectory verification run for Appendix [D.2](https://arxiv.org/html/2605.30651#A4.SS2) completes in about 25–45 minutes. Aggregating across 3 students, 7 selection baselines (Appendix [C.2](https://arxiv.org/html/2605.30651#A3.SS2)), and the LARK ablation grid (top-$B$ with $B\in\{1,3,5,10\}$, together with $\chi^{2}$- and KL-tempered weighting variants), the total training budget is approximately 1,600 A100-hours. Evaluation is sharded across the same 8 GPUs with `tensor_parallel_size=1` per shard, with generations merged across shards afterwards, adding on average $\sim 15$ minutes per (student, benchmark) pair.

##### Prompt consistency and truncation\.

A single chat template is applied throughout the pipeline. The same `tokenizer.apply_chat_template` call is used for (i) selection-time scoring (when computing the forward-pass quantities $\hat{\rho}_k$ and $\hat{g}_k$ on candidate trajectories under $\boldsymbol{\theta}_{\mathrm{ref}}$), (ii) SFT example construction (masking out the system and user turns so that only the assistant turn contributes to the loss), and (iii) evaluation prompting (generation prompt appended via `add_generation_prompt=True`). All three stages truncate to the same 32,768-token limit. The reasoning instruction "Please reason step by step, and put your final answer within `\boxed{}`" is held fixed across selection, training, and evaluation, ensuring that selection scores measured on $\boldsymbol{\theta}_{\mathrm{ref}}$ remain meaningful for the fine-tuned model under matched prompting.
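Stage (ii) above, where only the assistant turn contributes to the loss, can be illustrated with a small label-masking sketch. It assumes the tokens have already been produced by the chat template and that each token is tagged with the role of the turn it came from; the helper name and the per-token role list are hypothetical and not part of the released code.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default
MAX_LEN = 32768      # shared truncation limit from the text


def build_labels(token_ids, roles, max_len=MAX_LEN):
    """Build SFT labels that keep assistant tokens and mask everything else.

    token_ids: token ids of the fully templated conversation.
    roles:     role string per token ("system", "user", or "assistant");
               this per-token tagging is an illustrative assumption.
    Sequences longer than max_len are truncated from the right.
    """
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    token_ids, roles = token_ids[:max_len], roles[:max_len]
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]
```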

##### Reproducibility\.

We fix the global seed to 42 for `random`, `numpy`, and `torch` in all selection-scoring and Condition [1](https://arxiv.org/html/2605.30651#Thmcondition1) verification experiments, and rely on the HuggingFace `Trainer` default seed of 42 for SFT. All headline numbers in Table [1](https://arxiv.org/html/2605.30651#S5.T1) are reported as the mean $\pm$ standard deviation over 3 independent evaluation runs that differ only in the sampling seed of the inference engine, with decoding hyperparameters `temperature` $=0.6$, `top_p` $=0.95$, `repetition_penalty` $=1.1$, and maximum generation length 32,768, evaluated under ACC@5 with 5 samples per problem (Appendix [C.4](https://arxiv.org/html/2605.30651#A3.SS4)). To facilitate reproduction we will release, under an MIT licence, the full training code, the YAML configuration files (`train/configs/sft_full.yaml`, `train/configs/sft_full_500.yaml`, `train/configs/zero2_bf16.json`), the precomputed per-trajectory scores (`data/Q1/<student>/grape_g/scores.json`), and the selection rules used in each ablation.

### C.4 Evaluation Protocol

##### Benchmarks\.

We evaluate on the same four reasoning benchmarks reported in Section [5.1](https://arxiv.org/html/2605.30651#S5.SS1), which span competition mathematics, grade-level mathematics, and graduate-level science: AIME-2024 [[1](https://arxiv.org/html/2605.30651#bib.bib20)] (30 problems from the 2024 American Invitational Mathematics Examination), AMC [Mathematical Association of America, [2025](https://arxiv.org/html/2605.30651#bib.bib19)] (83 problems from the American Mathematics Competitions), MATH-500 [Hendrycks et al., [2021](https://arxiv.org/html/2605.30651#bib.bib22)] (500 problems sampled from the MATH test set, which covers algebra, geometry, number theory, probability, and precalculus), and GPQA-Diamond [Rein et al., [2024](https://arxiv.org/html/2605.30651#bib.bib21)] (198 multiple-choice questions written by domain experts in physics, chemistry, and biology). The four benchmarks together cover both numeric-answer and multiple-choice formats, which lets us assess whether the improvements brought by LARK transfer beyond the mathematical reasoning domain on which the candidate pool is constructed. The Avg column in Table [1](https://arxiv.org/html/2605.30651#S5.T1) is the unweighted mean of these four benchmarks; because the four splits differ in the number of problems, the per-benchmark numbers remain the primary basis of comparison.

##### Decoding configuration\.

For every benchmark and every fine-tuned student, we generate 5 independent samples per problem with the following decoding parameters:

- temperature $=0.6$
- top-$p$ $=0.95$
- top-$k$ $=-1$ (no top-$k$ truncation)
- maximum generation length $=32{,}768$ tokens (matching the SFT context length)

The same decoding configuration is used across all baselines and LARK to ensure that any difference in downstream accuracy is attributable to the selection rule rather than to inference\-time differences\.

##### Metric \(ACC@5\)\.

We report ACC@5 as our primary evaluation metric. For each test problem, we sample 5 independent generations under the decoding configuration above; the problem is marked *correct* if any of the 5 generations contains the correct final answer under the extraction rule described below. ACC@5 captures whether the student model is capable of producing a correct reasoning trajectory for the problem within a small sampling budget, and is less sensitive to single-sample variance than greedy ACC@1. Each experiment is repeated under three independent decoding seeds, with the selected training data and the SFT seed held fixed; we report the mean and standard deviation of ACC@5 across these three decoding seeds, so the reported variability captures inference-time sampling noise rather than training-seed noise.
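The metric and its aggregation over decoding seeds can be sketched as below. The function names are hypothetical, and the use of the sample standard deviation (divisor $n-1$) is an assumption, since the text does not specify which estimator is reported.

```python
from statistics import mean, stdev


def acc_at_k(per_problem_flags):
    """ACC@k for one evaluation run.

    per_problem_flags[i] holds k booleans, one per sampled generation for
    problem i; a problem counts as correct if any generation is correct.
    """
    return mean(1.0 if any(flags) else 0.0 for flags in per_problem_flags)


def summarize(seed_runs):
    """Mean and standard deviation of ACC@k across decoding seeds.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    accs = [acc_at_k(run) for run in seed_runs]
    return mean(accs), stdev(accs)
```

With three decoding seeds, `summarize` receives three runs over the same problems and returns the two numbers reported per table cell.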

##### System prompts\.

We use a fixed system prompt per benchmark family, distinguishing numeric-answer tasks (AIME, AMC), multiple-choice tasks (GPQA), and free-form-answer tasks (MATH-500). All three prompts share the same overall structure (enforcing English output, separating the reasoning chain from the final answer, and requiring the answer in a `\boxed{...}` environment for unambiguous extraction) and differ only in the answer-format specification.

System Prompt: AIME / AMC (Numeric Answer)

You are an AI mathematician. All content you output MUST be in English. Use the question to deduce the correct numeric answer. Finish all your reasoning, then on a NEW line output only `boxed{<number>}`. The content inside `boxed{<number>}` must be the final numeric answer only, with no expressions, variables, or additional text.

System Prompt: GPQA (Multiple Choice)

You are an AI mathematician. All content you output MUST be in English. Use the question to deduce the correct choice. Finish all your reasoning, then on a NEW line output only `boxed{<letter>}`. The content inside `boxed{<letter>}` must be exactly one of A, B, C, D.

System Prompt: MATH (Free-form Answer)

You are an AI mathematician. All content you output MUST be in English. Use the question to deduce the correct answer. Finish all your reasoning, then on a NEW line output only `boxed{<answer>}`.

##### Answer extraction rule\.

We extract the predicted final answer from each generation using the following deterministic rule:

- If the generation contains at least one `\boxed{...}` token, we take the content inside the *last* `\boxed{...}` as the predicted answer. The last (rather than first) occurrence is used because models occasionally produce intermediate boxed expressions during reasoning; only the final boxed value is treated as the answer.
- If the generation contains no `\boxed{...}` token, the generation is treated as *incorrect*, regardless of whether the surrounding text contains a correct value.

For numeric-answer tasks (AIME, AMC), the extracted string is canonicalized by stripping surrounding whitespace and converting fractions $a/b$ to their decimal form before comparison with the ground truth. For multiple-choice tasks (GPQA), the extracted string is uppercased and compared against the gold letter in $\{\mathrm{A},\mathrm{B},\mathrm{C},\mathrm{D}\}$. For free-form answers (MATH-500), we follow the canonicalization rules of Hendrycks et al. [[2021](https://arxiv.org/html/2605.30651#bib.bib22)], which normalize fraction, square-root, and exponent notation prior to string matching.
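A minimal sketch of the extraction rule and the numeric canonicalization follows. The helper names are hypothetical; the brace matching is one straightforward way to recover the content of the last boxed expression, and the sketch handles only plain `a/b` fractions, not LaTeX `\frac` forms or the MATH-500 normalization rules.

```python
from fractions import Fraction

BOX_OPEN = "\\boxed{"


def extract_last_boxed(text):
    """Return the content of the last boxed expression, or None if absent."""
    start = text.rfind(BOX_OPEN)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(BOX_OPEN):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces: treated as no answer


def canonical_numeric(s):
    """Strip whitespace and convert a plain fraction a/b to a decimal value."""
    s = s.strip()
    try:
        if "/" in s:
            num, den = s.split("/", 1)
            return float(Fraction(int(num), int(den)))
        return float(s)
    except (ValueError, ZeroDivisionError):
        return None


def is_correct_numeric(generation, gold):
    """A generation with no boxed answer is incorrect by construction."""
    pred = extract_last_boxed(generation)
    if pred is None:
        return False
    value = canonical_numeric(pred)
    return value is not None and value == canonical_numeric(gold)
```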

##### Efficient scoring for rank\-based metrics\.

The RSR baseline of Appendix [C.2](https://arxiv.org/html/2605.30651#A3.SS2) requires the clipped rank $\bar{r}_t^k=\min(r_t^k,r_{\max})$ at every token position. A naive implementation would sort the full $|\mathcal{V}|$-dimensional logit vector at each token, costing $O(|\mathcal{V}|\log|\mathcal{V}|)$ per token. We avoid this by observing that under rank clipping at $r_{\max}$, only the relative order of the top-$r_{\max}$ logits matters: any token whose true rank exceeds $r_{\max}$ is mapped to $r_{\max}$ regardless of its exact rank. We therefore extract the top-$r_{\max}$ logits via `torch.topk` (cost $O(|\mathcal{V}|\log r_{\max})$ per token), check whether the gold token $y_t^k$ is among them, and assign $\bar{r}_t^k$ accordingly. This procedure is exact under rank clipping and reduces the per-token cost by a factor of $\log|\mathcal{V}|/\log r_{\max}$, which is roughly $2.6\times$ for $|\mathcal{V}|\approx 1.5\times10^{5}$ and $r_{\max}=100$. The same forward pass that produces the surprisal $s_t^k=-\log\boldsymbol{\pi}_t^k(y_t^k)$ is reused to produce the top-$r_{\max}$ logits, so RSR scoring requires only one forward pass per candidate trajectory.
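The exactness argument can be checked with a small sketch that compares a naive clipped rank against one computed from a partial top-$r_{\max}$ selection. The standard-library `heapq.nlargest` stands in for `torch.topk` here purely for illustration; the function names are hypothetical. Ranking by logits is equivalent to ranking by probabilities because softmax is monotone.

```python
import heapq


def clipped_rank_naive(logits, gold, r_max):
    """Clipped rank from a full comparison against every vocabulary entry."""
    rank = 1 + sum(1 for x in logits if x > logits[gold])
    return min(rank, r_max)


def clipped_rank_topk(logits, gold, r_max):
    """Clipped rank from the r_max largest logits only.

    If fewer than r_max of the selected logits exceed the gold logit, no
    logit outside the selection can exceed it either, so the count is exact.
    Otherwise the true rank is above r_max and is clipped to r_max.
    """
    top = heapq.nlargest(r_max, logits)  # partial selection, no full sort
    higher = sum(1 for x in top if x > logits[gold])
    return min(1 + higher, r_max)
```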

##### Prompt consistency\.

We use a consistent chat template across (i) selection-time scoring (when the student model evaluates each candidate trajectory under teacher forcing), (ii) SFT loss formatting (when the model is fine-tuned on the selected trajectories), and (iii) evaluation-time prompting (when the fine-tuned model is queried on benchmark problems). Using the same chat template at all three stages reduces train–test mismatch and ensures that selection scores measured on $\boldsymbol{\theta}_{\mathrm{ref}}$ remain meaningful for the fine-tuned model. All inputs are truncated to the same maximum sequence length of 32,768 tokens used for SFT.

### C.5 Train/Eval Contamination Audit

To ensure that the downstream evaluation results in Table [1](https://arxiv.org/html/2605.30651#S5.T1) reflect genuine generalization rather than memorization of training-time supervision, we audit the overlap between the 5,000-problem NuminaMath candidate pool of Appendix [C.1](https://arxiv.org/html/2605.30651#A3.SS1) and every problem in the evaluation benchmarks of Appendix [C.4](https://arxiv.org/html/2605.30651#A3.SS4).

##### Method\.

For each evaluation problem we compute the character 8-gram Jaccard similarity against every problem in the candidate pool, after a normalization step that lowercases the text, collapses whitespace, and strips redundant LaTeX spacing macros. We report two statistics: (i) *exact matches* (Jaccard $=1$ after normalization), which would indicate verbatim duplication; and (ii) *near-matches* at three progressively stricter thresholds, $J\geq0.4$, $J\geq0.7$, and $J\geq0.85$, to capture different degrees of surface overlap. The metric is sensitive to both surface and structural duplication: an exact-template variant with all numerals changed typically scores $J\in[0.7,0.95]$, while a fully independent problem on the same topic typically scores $J<0.2$.
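The audit metric can be sketched in a few lines. The function names are hypothetical, and the exact set of LaTeX spacing macros removed during normalization is an assumption, since the text does not list them.

```python
import re

# Assumed set of "redundant LaTeX spacing macros": \, \; \! \: \quad \qquad
_SPACING_MACROS = re.compile(r"\\(?:,|;|!|:|qquad\b|quad\b)")


def normalize(text):
    """Lowercase, drop spacing macros, and collapse runs of whitespace."""
    text = _SPACING_MACROS.sub(" ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def char_ngrams(text, n=8):
    """Set of all character n-grams of an already-normalized string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=8):
    """Character n-gram Jaccard similarity after normalization."""
    na, nb = normalize(a), normalize(b)
    ga, gb = char_ngrams(na, n), char_ngrams(nb, n)
    if not ga and not gb:  # both shorter than n characters
        return 1.0 if na == nb else 0.0
    return len(ga & gb) / len(ga | gb)


def max_similarity(eval_problem, pool, n=8):
    """Highest similarity of one evaluation problem against the whole pool."""
    return max(jaccard(eval_problem, p, n) for p in pool)
```

An evaluation problem would then be flagged as a near-match at threshold $J_0$ whenever `max_similarity` is at least $J_0$.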

##### Results\.

Table [2](https://arxiv.org/html/2605.30651#A3.T2) summarizes the audit across all four evaluation benchmarks (AIME-2024, AMC, GPQA-Diamond, and MATH-500, broken down by difficulty levels L1–L5). The audit detects $\mathbf{0}$ exact matches across all 811 evaluation problems. AIME-2024, AMC, GPQA-Diamond, and MATH-L1 contain 0 near-matches at every threshold; GPQA-Diamond is graduate-level physics, chemistry, and biology with no domain overlap with the math-only NuminaMath training pool, while AIME, AMC, and MATH-L1 simply do not appear in the pool under any threshold. MATH-L2 through L5 contain 31 near-matches at $J\geq0.4$ ($\sim 3.7\%$ of the MATH-500 split), of which only 1 reaches $J\geq0.85$.

Table 2: Train/eval contamination audit. For each evaluation benchmark, we report the number of problems whose character 8-gram Jaccard similarity against the NuminaMath candidate pool of Appendix [C.1](https://arxiv.org/html/2605.30651#A3.SS1) exceeds the indicated threshold. *Exact* denotes Jaccard $=1$ after text normalization. MATH-L1 through MATH-L5 are the five difficulty levels of MATH-500.

##### Categorization of MATH\-L2–L5 near\-matches\.

Manual inspection of all 31 flagged pairs places them in three categories, none of which constitute contamination:

1. Template-shared, numerically perturbed. Same algebraic template, different constants or signs, leading to a different polynomial and a different answer. The highest-similarity case (MATH-L2 #11, $J=0.869$) asks for the smallest $n$ such that all roots of $z^{4}+z^{2}+1=0$ are $n$-th roots of unity, while the closest training problem asks the same question for $z^{4}-z^{2}+1=0$.
2. Template-shared, different question on the same setup. Same physical or geometric setup, but a different quantity is asked, with no derivable relation between the two answers. For example, MATH-L5 #46 ($J=0.693$) describes a hot-air balloon held by four ropes anchored at points $A,B,C,D$; the training problem on the same setup asks for the length of $OH$, whereas the evaluation problem asks for the greatest rope length saved by replacing $HC+HD$ with a single rope at a chosen point. These are companion problems from the parent MATH dataset.
3. Surface-level template variation. Same problem stem with different numerical inputs (e.g., coefficients in a quadratic completion task). MATH-L4 #38 ($J=0.664$) asks to rewrite $x^{2}+2.6x+3.6$ in the form $(x+b)^{2}+c$, while the training problem performs the same operation on $x^{2}-20x+36$. The procedure is identical but the inputs and outputs differ.

In none of the 31 flagged cases does the answer to a training problem transfer to the corresponding evaluation problem, and no exact duplicates of any AIME-2024, AMC, GPQA-Diamond, or MATH-500 problem appear in the training pool. The downstream Acc@5 results in Table [1](https://arxiv.org/html/2605.30651#S5.T1) therefore reflect generalization to held-out problems rather than retrieval of memorized supervision.

## Appendix D Diagnostic Analyses of LARK

This appendix collects additional evaluation results and diagnostic analyses that validate the design choices of LARK beyond the main downstream comparison. Appendix [D.1](https://arxiv.org/html/2605.30651#A4.SS1) verifies the efficiency claim through both an asymptotic complexity derivation and a wall-clock measurement on 8 A100 GPUs, showing that LARK matches the lowest cost achieved by any per-trajectory scoring baseline. Appendix [D.2](https://arxiv.org/html/2605.30651#A4.SS2) sweeps the selection budget $B$ from 1 to 20 on a 500-problem diagnostic subset and studies how performance changes with the budget. Appendix [D.3](https://arxiv.org/html/2605.30651#A4.SS3) zooms into a single training problem to illustrate, end to end, the score landscape over candidate trajectories, the token-level structure of the chosen trajectory, the $\chi^{2}$-$B$ weighting it induces, and the resulting weighted SFT objective for that problem.

### D.1 Computational Cost

We separate every per-trajectory score into two parts: a single forward pass over the trajectory through the student model, with cost $C_{\mathrm{fwd}}(|a_k|)$, and a per-token post-processing step on the resulting logits. Decomposing the total cost in this way exposes the differences across methods, which would otherwise be hidden by the dominant forward-pass term.

##### Per\-trajectory cost decomposition\.

A single forward pass through a transformer with $L$ layers, hidden dimension $d$, and vocabulary size $|\mathcal{V}|$ on a trajectory of length $|a_k|$ has cost

$$
C_{\mathrm{fwd}}(|a_k|) = \underbrace{\mathcal{O}(L|a_k|d^{2})}_{\text{MLP}} + \underbrace{\mathcal{O}(L|a_k|^{2}d)}_{\text{attention}} + \underbrace{\mathcal{O}(|a_k|\cdot|\mathcal{V}|\cdot d)}_{\text{LM head}}.
$$

After this forward pass, every per-trajectory selection method applies a post-processing operation to the logits at each token. GRAPE computes the cross-entropy of the gold token from a $|\mathcal{V}|$-dimensional log-sum-exp, with per-token cost $\mathcal{O}(|\mathcal{V}|)$. LARK computes both the cross-entropy and the Brier component $\sum_{v}\boldsymbol{\pi}_t^k(v)^{2}$, again with per-token cost $\mathcal{O}(|\mathcal{V}|)$. RSR additionally selects the top-$r_{\max}$ entries of each token distribution to form a partial rank statistic, with per-token cost $\mathcal{O}(|\mathcal{V}|\log r_{\max})$. Local Naturalness instead runs $m+1$ independent forward passes per trajectory (one for each masked rollout), each followed by an $\mathcal{O}(|\mathcal{V}|)$ per-token operation.
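To make the $\mathcal{O}(|\mathcal{V}|)$ per-token claim concrete, the sketch below computes the gold-token cross-entropy and the Brier component $\sum_v \boldsymbol{\pi}_t^k(v)^2$ from one logit vector using a constant number of linear passes. It is an illustration only: the function name is hypothetical, and how LARK combines the two quantities into its score is defined in the main text, not here.

```python
import math


def ce_and_brier_sum(logits, gold):
    """Cross-entropy of the gold token and sum of squared probabilities.

    Each step below is a single pass over the vocabulary, so the total
    per-token cost is linear in the vocabulary size.
    """
    m = max(logits)                             # stabilizer for log-sum-exp
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    log_z = m + math.log(z)                     # log-sum-exp over the vocabulary
    cross_entropy = log_z - logits[gold]        # -log pi(gold)
    sum_sq = sum((e / z) ** 2 for e in exps)    # sum_v pi(v)^2
    return cross_entropy, sum_sq
```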

##### Total complexity\.

Let $N$ denote the number of training problems and $K$ the number of candidate trajectories per problem. Aggregating the forward and the post-processing terms over all $NK$ candidates yields the total cost of each method:

$$
\begin{aligned}
\text{GRAPE}: &\quad \mathcal{O}\big(NK\cdot[C_{\mathrm{fwd}}(|a_k|)+|a_k|\cdot|\mathcal{V}|]\big),\\
\text{LARK}: &\quad \mathcal{O}\big(NK\cdot[C_{\mathrm{fwd}}(|a_k|)+|a_k|\cdot|\mathcal{V}|]\big),\\
\text{RSR}: &\quad \mathcal{O}\big(NK\cdot[C_{\mathrm{fwd}}(|a_k|)+|a_k|\cdot|\mathcal{V}|\log r_{\max}]\big),\\
\text{Local Naturalness}: &\quad \mathcal{O}\big(NK\cdot[(m+1)\,C_{\mathrm{fwd}}(|a_k|)+|a_k|\cdot|\mathcal{V}|]\big).
\end{aligned}
$$

LARK matches the lowest cost achieved by GRAPE. RSR adds a $\log r_{\max}$ factor on the per-token term because of its rank selection step, and Local Naturalness multiplies the dominant forward pass by $m+1$ because of its repeated rollouts. The strict ordering $\text{GRAPE}=\text{LARK}<\text{RSR}$ and $\text{GRAPE}=\text{LARK}<\text{Local Naturalness}$ follows from $|a_k|\cdot|\mathcal{V}|<|a_k|\cdot|\mathcal{V}|\log r_{\max}$ and from $C_{\mathrm{fwd}}<(m+1)\,C_{\mathrm{fwd}}$ for $m\geq1$. We do not assign a model-independent asymptotic formula to LLM-judged Quality because its cost depends on the judge model and serving setup; we therefore report it in the wall-clock comparison.

##### Empirical wall\-clock cost\.

We benchmark each scoring method by measuring its per-sample wall-clock cost on 8 A100 GPUs. (Random, Token Length, and Rule-based Quality require no forward pass through the student and run in negligible time, so we omit them from the figure.) Figure [7](https://arxiv.org/html/2605.30651#A4.F7) reports the resulting cost across all five scoring methods. LARK (1.32 s/sample) matches GRAPE (1.28 s/sample) to within 3%, which reflects the fact that they share the same forward pass and differ only by an additional $\mathcal{O}(|\mathcal{V}|)$ Brier sum per token. LLM-judged Quality is $1.6\times$ slower at 2.11 s/sample because of the auxiliary judge call. RSR is $2.7\times$ slower at 3.44 s/sample, which is consistent with the extra $\log r_{\max}$ rank-selection factor in its per-token operation. Local Naturalness is the most expensive at 5.06 s/sample, which is $4.0\times$ slower than LARK and matches the $(m+1)$-fold forward-pass cost predicted by the asymptotic analysis. The empirical ordering matches the constant-factor ranking implied by the complexity formulas.
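The slowdown factors can be recomputed from the per-sample timings quoted above. The short sketch below normalizes each cost by the fastest method; the dictionary and function names are illustrative, and the timings are copied from the text rather than measured here.

```python
# Per-sample wall-clock costs in seconds, as quoted in the text.
SECONDS_PER_SAMPLE = {
    "GRAPE": 1.28,
    "LARK": 1.32,
    "LLM-judged Quality": 2.11,
    "RSR": 3.44,
    "Local Naturalness": 5.06,
}


def slowdowns(costs):
    """Cost of each method relative to the fastest one (fastest = 1.0)."""
    fastest = min(costs.values())
    return {name: cost / fastest for name, cost in costs.items()}


for name, ratio in sorted(slowdowns(SECONDS_PER_SAMPLE).items(), key=lambda kv: kv[1]):
    print(f"{name:>20s}: {ratio:.2f}x")
```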

![Refer to caption](https://arxiv.org/html/2605.30651v1/x10.png)

Figure 7: Per-sample wall-clock cost of trajectory scoring on 8 A100 GPUs. The number in parentheses is the slowdown relative to the fastest method. LARK is essentially as fast as GRAPE, $1.6\times$ faster than LLM-judged Quality, $2.7\times$ faster than RSR, and $4.0\times$ faster than Local Naturalness. Methods that require no forward pass through the student (Random, Token Length, Rule-based Quality) are omitted because their cost is negligible.

### D.2 Selection Budget Scaling

We study how LARK behaves as the number of selected trajectories per question changes. This analysis sweeps the selection budget $B\in\{1,3,5,10,20\}$ on Qwen-2.5-7B and evaluates the resulting student on AMC Acc@5. Because this sweep is conducted on the same 500-problem subset used for our diagnostic experiments, rather than the full 5,000-problem training set used in Table [1](https://arxiv.org/html/2605.30651#S5.T1), these results are intended to show the relative trend across budgets rather than to replace the main evaluation results.

For each budget $B$, we apply the LARK selection rule to choose $B$ trajectories per problem, fine-tune Qwen-2.5-7B with the same SFT protocol described in Appendix [C](https://arxiv.org/html/2605.30651#A3), and report AMC Acc@5 in Figure [8](https://arxiv.org/html/2605.30651#A4.F8). LARK remains effective across the entire budget range: accuracy increases from 49.4% at $B=1$ to 65.1% at $B=3$, reaches its highest value of 68.7% at $B=10$, and remains strong at 66.3% when $B=20$.

The trend is not strictly monotonic. Moving from $B=1$ to $B=3$ brings a large improvement, indicating that using multiple selected trajectories can provide complementary supervision. However, increasing the budget beyond $B=3$ gives smaller and less stable gains. In particular, $B=5$ underperforms $B=3$, while $B=10$ gives the best result and $B=20$ slightly decreases. This pattern suggests that a small number of carefully selected trajectories already captures most of the useful training signal, while adding more lower-ranked trajectories can introduce weaker or less informative supervision.

![Refer to caption](https://arxiv.org/html/2605.30651v1/x11.png)

Figure 8: Selection budget scaling of LARK on Qwen-2.5-7B, evaluated on AMC after fine-tuning on the 500-problem subset. LARK achieves strong performance across $B\in\{1,3,5,10,20\}$, with the best result obtained at $B=10$.

Overall, the budget sweep shows that LARK is not sensitive to a single fixed choice of $B$. The method performs well with a small budget ($B=3$) and remains competitive as the budget increases. This supports the use of a small top-$B$ selection strategy in the main experiments, where the goal is to improve distillation efficiency while avoiding unnecessary supervision from less informative trajectories.

### D.3 Case Study: A Single-Problem Walkthrough

While Appendices [D.1](https://arxiv.org/html/2605.30651#A4.SS1) and [D.2](https://arxiv.org/html/2605.30651#A4.SS2) support the efficiency and budget-scaling analyses of LARK at the aggregate level, this appendix zooms into a single training problem to illustrate the end-to-end behavior of the selection rule. We walk through (i) the problem and its candidate pool, (ii) the LARK score landscape, (iii) the top-1 trajectory selected by LARK, and (iv) the closed-form $\chi^{2}$-$B$ weights at $B=3$.

##### Problem and candidate pool\.

We illustrate on problem $\text{pid}=839$ from the candidate pool of Appendix [C.1](https://arxiv.org/html/2605.30651#A3.SS1):

User Question (pid $=839$)

If $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ are vectors such that $\|\mathbf{a}\|=\|\mathbf{b}\|=1$, $\|\mathbf{a}+\mathbf{b}\|=\sqrt{3}$, and $\mathbf{c}-\mathbf{a}-2\mathbf{b}=3(\mathbf{a}\times\mathbf{b})$, then find $\mathbf{b}\cdot\mathbf{c}$.

The ground-truth answer is $\boxed{\dfrac{5}{2}}$. The pool contains $K=33$ candidate trajectories produced by 11 teacher models with 3 rollouts each; all reach the correct final answer per the construction of Appendix [C.1](https://arxiv.org/html/2605.30651#A3.SS1).
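As a quick sanity check of the stated answer, the problem can be instantiated numerically. The particular vectors below are one convenient choice satisfying the constraints (an assumption for illustration; any unit vectors with $\mathbf{a}\cdot\mathbf{b}=1/2$ would do).

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


# Unit vectors at a 60-degree angle, so a.b = 1/2 and |a + b| = sqrt(3).
a = (1.0, 0.0, 0.0)
b = (0.5, math.sqrt(3) / 2, 0.0)

a_plus_b = tuple(x + y for x, y in zip(a, b))
norm_a_plus_b = math.sqrt(dot(a_plus_b, a_plus_b))

# c = a + 2b + 3 (a x b), rearranged from the problem statement.
axb = cross(a, b)
c = tuple(ai + 2 * bi + 3 * xi for ai, bi, xi in zip(a, b, axb))

b_dot_c = dot(b, c)
print(norm_a_plus_b, b_dot_c)
```

The cross-product term drops out of $\mathbf{b}\cdot\mathbf{c}$ because $\mathbf{a}\times\mathbf{b}$ is orthogonal to $\mathbf{b}$, which is why the result does not depend on the specific choice of vectors.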

##### Score landscape\.

Figure [9](https://arxiv.org/html/2605.30651#A4.F9) shows the LARK score $\hat{g}_k$ for all 33 candidates, grouped by teacher. The top-1 trajectory is the second rollout of `phi4-reason-plus` (highlighted in red, $\hat{g}=0.01782$); the same teacher's third rollout takes rank 2 ($\hat{g}=0.01687$). Notably, `phi4-reason-plus` rollout 1 ranks only 25 ($\hat{g}=0.00960$): the within-teacher spread on this teacher alone is $1.86\times$, exceeding the spread between several pairs of teachers. The same pattern recurs across the pool (e.g., `nemotron-super-v15` rollouts span $0.0052$ to $0.0078$), which underscores that LARK selects at the trajectory level rather than at the teacher level. Teachers commonly perceived as strong on aggregate benchmarks (`deepseek-r1-0528`, `qwen3-235b-2507`, `qwq-32b`) do not contribute any top-3 trajectory on this problem, with their rollouts clustering near the median.

![Refer to caption](https://arxiv.org/html/2605.30651v1/x12.png)

Figure 9: $\hat{g}_k$ for all 33 candidate trajectories on problem $\text{pid}=839$, grouped by teacher. Each teacher contributes 3 rollouts (indexed .1/.2/.3 along the $x$-axis). The top-1 trajectory selected by LARK (`phi4-reason-plus` rollout 2, $\hat{g}=0.01782$) is highlighted with a red border. Within-teacher variation is comparable to between-teacher variation.

##### Top\-1 trajectory selected by LARK\.

We quote the top-1 trajectory (`phi4-reason-plus` rollout 2, $\hat{g}=0.01782$) in full:

Top-1 trajectory selected by LARK (`phi4-reason-plus` rollout 2)

<think\> We are given: Vectors $a$, $b$, $c$. Conditions: $\|a\|=\|b\|=1$, and $\|a+b\|=\sqrt{3}$. Also $c-a-2b=3(a\times b)$. We need find $b\cdot c$.

Step: First, note: $\|a+b\|^{2}=(a+b)^{2}=a^{2}+2a\cdot b+b^{2}=1+2\cos\theta+1=2+2\cos\theta=3$. So $2+2\cos\theta=3\Rightarrow\cos\theta=1/2$ so $a\cdot b=1/2$.

Now, note: $a\times b$ is perpendicular to both $a$ and $b$. So it is orthogonal to $a$, $b$. Then, $c=a+2b+3(a\times b)$. Dotting with $b$: $b\cdot c=b\cdot a+2b\cdot b+3b\cdot(a\times b)=a\cdot b+2\cdot 1+3\cdot 0=1/2+2=5/2$. So answer: $5/2$.

Thus final answer: $5/2$. We’ll produce answer in box: `\boxed{5/2}`. </think\> `\boxed{\frac{5}{2}}`

The trajectory is substantially more compact than the average candidate in this pool. Importantly, its length-normalized cross-entropy $\ell_k$ under $\theta_{\text{ref}}$ is *not* the smallest in the pool; what raises $\hat{g}_k$ appears to be the structure of the residual rather than low loss alone. A qualitative inspection suggests that high-residual tokens around numeric expressions and key algebraic steps carry a large portion of the cumulative Brier mass, matching the heuristic reading of $\hat{g}_k$ in Remark [2](https://arxiv.org/html/2605.30651#Thmremark2).

##### $\chi^{2}$-$B$ weighting at $B=3$.

WithB=3B=3, the closed\-formχ2\\chi^\{2\}\-BBrule gives the weights from the score margins above the thresholdg^\(B\+1\)\\hat\{g\}\_\{\(B\+1\)\}\. The four relevant scores on this problem are:

g^\(1\)\\displaystyle\\hat\{g\}\_\{\(1\)\}=0\.01782\(phi4\-reason\-plus​\.2\),\\displaystyle=0\.01782\\quad\(\\texttt\{phi4\-reason\-plus\}\.2\),g^\(2\)\\displaystyle\\hat\{g\}\_\{\(2\)\}=0\.01687\(phi4\-reason\-plus​\.3\),\\displaystyle=0\.01687\\quad\(\\texttt\{phi4\-reason\-plus\}\.3\),g^\(3\)\\displaystyle\\hat\{g\}\_\{\(3\)\}=0\.01426\(gptoss\-20b\-high​\.1\),\\displaystyle=0\.01426\\quad\(\\texttt\{gptoss\-20b\-high\}\.1\),g^\(4\)\\displaystyle\\hat\{g\}\_\{\(4\)\}=0\.01285\(gptoss\-120b\-high​\.2;threshold\)\.\\displaystyle=0\.01285\\quad\(\\texttt\{gptoss\-120b\-high\}\.2;\\ \\text\{threshold\}\)\.The score margins above the threshold are

g^\(1\)−g^\(4\)=0\.00497,g^\(2\)−g^\(4\)=0\.00402,g^\(3\)−g^\(4\)=0\.00141,\\hat\{g\}\_\{\(1\)\}\-\\hat\{g\}\_\{\(4\)\}=0\.00497,\\qquad\\hat\{g\}\_\{\(2\)\}\-\\hat\{g\}\_\{\(4\)\}=0\.00402,\\qquad\\hat\{g\}\_\{\(3\)\}\-\\hat\{g\}\_\{\(4\)\}=0\.00141,summing to∑j=13\(g^\(j\)−g^\(4\)\)=0\.01040\\sum\_\{j=1\}^\{3\}\(\\hat\{g\}\_\{\(j\)\}\-\\hat\{g\}\_\{\(4\)\}\)=0\.01040\. The closed\-form temperature is thereforeτ∗​\(B\)=0\.01040/K≈3\.15×10−4\\tau^\{\*\}\(B\)=0\.01040/K\\approx 3\.15\\times 10^\{\-4\}, and the LARK weights are

$$
\hat{q}_{(1)}=0.478,\qquad \hat{q}_{(2)}=0.387,\qquad \hat{q}_{(3)}=0.136.
$$

The contrast with hard top-$B$ truncation (uniform weight $1/B\approx 0.333$ each) is substantial: LARK assigns $43\%$ more weight to the rank-1 trajectory than uniform top-3 would, and only $41\%$ as much weight to the rank-3 trajectory. The asymmetry directly reflects the score gap between the two `phi4-reason-plus` rollouts and the remainder of the pool. The resulting per-problem contribution to the SFT loss is

$$
\mathcal{L}_{\text{SFT}}^{(\text{pid}=839)}(\theta)=0.478\,\ell(\theta;\mathbf{y}_{16})+0.387\,\ell(\theta;\mathbf{y}_{17})+0.136\,\ell(\theta;\mathbf{y}_{6}),
$$

with all other 30 candidates carrying zero weight on this problem.
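The arithmetic in this walkthrough can be reproduced in a few lines. The sketch below is illustrative and is not taken from the authors' code: the function name `chi2_b_weights` is hypothetical, the normalization of each margin by the sum of margins is inferred from the reported weights, and the pool size $K=33$ is inferred from the three selected trajectories plus the 30 zero-weight candidates.

```python
def chi2_b_weights(scores, budget, pool_size):
    """Illustrative weighting: normalize score margins above the (budget+1)-th score.

    Assumptions (not confirmed by the source): each weight is its margin divided
    by the sum of margins, and the temperature is that sum divided by pool_size.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= budget:
        raise ValueError("need at least budget + 1 scores to define the threshold")
    threshold = ranked[budget]  # score of the rank-(budget+1) trajectory
    margins = [s - threshold for s in ranked[:budget]]
    total = sum(margins)
    if total <= 0:
        raise ValueError("all top scores equal the threshold; margins are zero")
    weights = [m / total for m in margins]
    tau = total / pool_size
    return weights, tau


# Top-4 scores reported for this problem; the fourth is the threshold.
top4_scores = [0.01782, 0.01687, 0.01426, 0.01285]
K = 33  # assumed pool size: 3 selected + 30 zero-weight candidates

weights, tau = chi2_b_weights(top4_scores, budget=3, pool_size=K)
uniform = 1.0 / 3.0
ratios_to_uniform = [w / uniform for w in weights]

print([round(w, 3) for w in weights])
print(tau)
print([round(r, 2) for r in ratios_to_uniform])
```

Only the four top-ranked scores are needed here, because every candidate ranked below the threshold receives zero weight under this reading of the rule.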

##### Summary\.

The walkthrough illustrates three points anchored at a concrete problem: trajectory-level (rather than teacher-level) variation in $\hat{g}_k$ can be the dominant source of selection signal; the LARK-selected trajectory is not the one with the lowest $\ell_k$ but the one with structured residual signal; and the $\chi^2$-$B$ weighting deviates substantially from uniform top-$B$ truncation when the top-ranked trajectory is sharply separated from the threshold.

## Appendix E Limitations

LARK provides a practical and theoretically motivated approach for trajectory\-level selection in reasoning distillation\. Its forward\-pass score makes the method efficient to apply to large candidate pools, and the closed\-form weighting rule avoids additional tuning while preserving the benefits of multi\-trajectory supervision\. The empirical results across multiple student models and benchmarks suggest that learnability is a useful principle for selecting reasoning trajectories\.

There are two natural limitations of the present study\. First, our experiments use a correctness\-verified candidate pool, where all retained trajectories reach the correct final answer\. This setting isolates the trajectory\-selection problem from the separate problem of correctness verification, but future work could study LARK in noisier pools that include incorrect or partially correct teacher trajectories\. Second, due to the computational cost of full\-parameter SFT, our reported standard deviations are computed over independent decoding seeds while keeping the training configuration fixed\. Running multiple independent SFT seeds would provide an even more complete estimate of training\-time variance\.
