LoRi: Low-Rank Distillation for Implicit Reasoning
Summary
LoRi proposes a low-rank distillation framework for implicit chain-of-thought reasoning that aligns teacher and student trajectories in a shared low-rank subspace, improving performance on mathematical reasoning benchmarks.
# LoRi: Low-Rank Distillation for Implicit Reasoning
Source: [https://arxiv.org/html/2606.05315](https://arxiv.org/html/2606.05315)
Ryan Solgi¹, Jiayi Tian¹, Zheng Zhang¹
¹University of California-Santa Barbara, USA
solgi@ucsb.edu, zhengzhang@ece.ucsb.edu
###### Abstract
Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models, but often underperform explicit CoT prompting. We empirically find that hidden-state reasoning trajectories exhibit low-rank structure. Motivated by this observation, we propose a low-rank distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank tensor subspace using first- and second-order statistics. The resulting formulation captures the global structure of reasoning while supporting a compact latent reasoning process. We evaluate the method across multiple model families, including LLaMA and Qwen, at different scales on mathematical reasoning benchmarks. Our approach consistently improves performance, especially on challenging multi-step tasks, approaching explicit CoT accuracy and outperforming prior iCoT distillation methods. The code is available at [https://github.com/rmsolgi/lori](https://github.com/rmsolgi/lori).
## 1 Introduction
Large language models (LLMs) exhibit strong reasoning abilities under explicit chain-of-thought (CoT) prompting (Wei et al., [2022](https://arxiv.org/html/2606.05315#bib.bib18); Zelikman et al., [2022](https://arxiv.org/html/2606.05315#bib.bib20)). However, explicit CoT is computationally expensive, can encourage reliance on non-robust patterns, and is sensitive to inference procedures (Li et al., [2024](https://arxiv.org/html/2606.05315#bib.bib19); Lin et al., [2025](https://arxiv.org/html/2606.05315#bib.bib21); Wang et al., [2022](https://arxiv.org/html/2606.05315#bib.bib22); Yao et al., [2023](https://arxiv.org/html/2606.05315#bib.bib24)). As a result, recent work has explored more efficient reasoning paradigms that reduce dependence on explicit textual rationales (He et al., [2026](https://arxiv.org/html/2606.05315#bib.bib23); Li et al., [2025b](https://arxiv.org/html/2606.05315#bib.bib25)).
Implicit chain-of-thought (iCoT) encodes reasoning in latent representations rather than explicit text (Deng et al., [2024a](https://arxiv.org/html/2606.05315#bib.bib26); Hao et al., [2024](https://arxiv.org/html/2606.05315#bib.bib27)). Recent iCoT distillation approaches transfer explicit rationales into latent reasoning states (Shen et al., [2025](https://arxiv.org/html/2606.05315#bib.bib6); Wu et al., [2025](https://arxiv.org/html/2606.05315#bib.bib28); Kuzina et al., [2025](https://arxiv.org/html/2606.05315#bib.bib7); Wei et al., [2025](https://arxiv.org/html/2606.05315#bib.bib8)), but still underperform explicit CoT on difficult mathematical reasoning tasks. A key challenge is that explicit reasoning steps lack a clear correspondence to latent dynamics, making token-to-latent transfer inherently ill-posed. Existing methods rely on local supervision (Shen et al., [2025](https://arxiv.org/html/2606.05315#bib.bib6)) or sampled intermediate states (Kuzina et al., [2025](https://arxiv.org/html/2606.05315#bib.bib7)), which may not fully capture the global reasoning trajectory.
(a) High-level overview of LoRi.
(b) Representative reasoning results on GSM8K-Hard.
Figure 1: Overview of the proposed low-rank iCoT distillation framework and its reasoning performance. (a) Low-rank factors learned from the teacher hidden states are used to project teacher and student representations into a shared low-rank subspace. The projected low-rank representations are then matched through a loss function. (b) LoRi improves reasoning accuracy over prior implicit CoT baselines across model scales on GSM8K-Hard.
Prior work suggests that model representations exhibit low-dimensional geometric structure (Yu and Wu, [2023](https://arxiv.org/html/2606.05315#bib.bib9); Golowich et al., [2025](https://arxiv.org/html/2606.05315#bib.bib10); Modell et al., [2025](https://arxiv.org/html/2606.05315#bib.bib13); Park et al., [2024](https://arxiv.org/html/2606.05315#bib.bib11)). While recent studies analyze reasoning through trajectory geometry and separability (Sun et al., [2026](https://arxiv.org/html/2606.05315#bib.bib12); Zhou et al., [2026](https://arxiv.org/html/2606.05315#bib.bib17); Li et al., [2025a](https://arxiv.org/html/2606.05315#bib.bib16)), they do not directly examine the low-rank structure of hidden states across tokens and layers. By stacking hidden states across layers and CoT tokens, we find that the normalized cumulative singular values grow rapidly with rank (Appendix [A](https://arxiv.org/html/2606.05315#A1)), indicating that reasoning trajectories are well-approximated by low-dimensional subspaces.
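The low-rank check described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' analysis code: the tensor is a synthetic stand-in for real teacher hidden states, the helper name `cumulative_energy` is hypothetical, and normalizing by the plain sum of singular values (rather than their squares) is an assumption.

```python
import numpy as np

def cumulative_energy(hidden, mode):
    """Normalized cumulative singular values of one mode unfolding.

    hidden: array of shape (N, L, H) = (tokens, layers, hidden dim).
    mode:   axis to keep as rows (0 = tokens, 1 = layers, 2 = hidden).
    """
    # Mode unfolding: bring the chosen axis to the front, flatten the rest.
    unfolded = np.moveaxis(hidden, mode, 0).reshape(hidden.shape[mode], -1)
    s = np.linalg.svd(unfolded, compute_uv=False)
    # Assumption: normalize by the sum of singular values.
    return np.cumsum(s) / np.sum(s)

# Synthetic stand-in for stacked hidden states: rank-4 signal plus small noise.
rng = np.random.default_rng(0)
N, L, H, r = 40, 12, 64, 4
signal = np.einsum(
    "nr,lr,hr->nlh",
    rng.normal(size=(N, r)),
    rng.normal(size=(L, r)),
    rng.normal(size=(H, r)),
)
hidden = signal + 0.01 * rng.normal(size=(N, L, H))

energy_hidden = cumulative_energy(hidden, mode=2)
print(energy_hidden[:6])  # the curve rises quickly over the first r ranks
```

On real hidden states, a curve that saturates at a rank far below the full dimension would be the kind of evidence the paragraph refers to.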
Motivated by this observation, we propose a low\-rank iCoT distillation framework that aligns the global geometry of the teacher’s reasoning trajectory through low\-rank statistical representations\. Instead of mimicking explicit reasoning steps, the student learns the low\-dimensional subspace underlying the teacher’s reasoning dynamics \[Fig\.[1](https://arxiv.org/html/2606.05315#S1.F1)\(a\)\]\. This allows the student to capture the principal reasoning structure with a short latent trajectory, enabling length\-invariant and efficient distillation\.
Our main contributions are summarized below:
- Low-rank iCoT distillation. We propose an iCoT distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank subspace using first- and second-order hidden-state statistics.
- Efficient, length-invariant global reasoning transfer. The formulation transfers long CoT reasoning into short latent trajectories independent of sequence length, yielding efficient reasoning with a small number of latent steps and without requiring additional intermediate supervision.
- Consistent gains across models and benchmarks. Across models and scales, LoRi consistently outperforms prior iCoT methods, improving accuracy by up to $\sim$12% and achieving strong gains on GSM8K-Hard \[up to $\sim$10%, as shown in Fig. [1](https://arxiv.org/html/2606.05315#S1.F1)(b)\]. At larger scales, LoRi substantially narrows the gap to explicit CoT.
## 2 Background and Related Work
#### iCoT Distillation.
In reasoning, a model predicts the final answer $\boldsymbol{y}$ conditioned on an intermediate reasoning process $\boldsymbol{\tau}$ given an input $\boldsymbol{x}$,
$$p(\boldsymbol{y},\boldsymbol{\tau}\mid\boldsymbol{x})=p(\boldsymbol{\tau}\mid\boldsymbol{x})\,p(\boldsymbol{y}\mid\boldsymbol{x},\boldsymbol{\tau}),$$
where $\boldsymbol{\tau}$ may correspond to explicit textual rationales (explicit CoT) or implicit internal reasoning processes (iCoT). Following prior work (Shen et al., [2025](https://arxiv.org/html/2606.05315#bib.bib6); Kuzina et al., [2025](https://arxiv.org/html/2606.05315#bib.bib7); Wei et al., [2025](https://arxiv.org/html/2606.05315#bib.bib8)), we consider a teacher–student distillation framework to transfer reasoning from CoT into iCoT.
- Teacher Model. The teacher generates a natural language reasoning trajectory $\boldsymbol{r}=(r_{1},\cdots,r_{N})$ and defines the joint conditional distribution $p_{T}(\boldsymbol{y},\boldsymbol{r}\mid\boldsymbol{x})=p_{T}(\boldsymbol{r}\mid\boldsymbol{x})\,p_{T}(\boldsymbol{y}\mid\boldsymbol{x},\boldsymbol{r})$.
- Student Model. The student constructs a latent reasoning trajectory $\{\boldsymbol{z}_{t}\}_{t=1}^{K}$, where $K\ll N$ and each $\boldsymbol{z}_{t}$ represents a hidden reasoning state. The trajectory is generated recursively as $\boldsymbol{z}_{t}=f_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z}_{1:t-1})$ for $t=1,\ldots,K$, where $f_{\boldsymbol{\theta}}$ is the student model, with $\boldsymbol{z}_{1}$ initialized from $\boldsymbol{x}$. The final answer is then generated autoregressively from $p_{S}(\boldsymbol{y}\mid\boldsymbol{x},\boldsymbol{z}_{1},\ldots,\boldsymbol{z}_{K})$.
- Training. The student is trained through an objective of the form $\mathcal{L}=\mathcal{L}_{\mathrm{reason}}+\lambda\mathcal{L}_{\mathrm{task}}$, where $\mathcal{L}_{\mathrm{reason}}$ transfers reasoning behavior from the explicit CoT teacher into latent reasoning dynamics, $\mathcal{L}_{\mathrm{task}}$ supervises answer prediction, and $\lambda$ balances the two objectives.
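The recursive construction of the student's latent trajectory can be illustrated with a toy rollout. This is a sketch under strong assumptions: `toy_f` is a hypothetical stand-in for the student model, used only to show the data flow in which each state depends on the input and all earlier states; a real student would be a language model feeding hidden states back as inputs.

```python
import numpy as np

def rollout(f_theta, x, K):
    """Generate z_1..z_K, where each z_t is computed from x and z_1..z_{t-1}."""
    zs = []
    for _ in range(K):
        zs.append(f_theta(x, zs))  # zs holds z_1..z_{t-1} at this point
    return np.stack(zs)

def toy_f(x, prev):
    # Hypothetical student: mixes the input with the most recent latent state.
    # With no previous state, z_1 depends on x alone (initialized from x).
    last = prev[-1] if prev else np.zeros_like(x)
    return np.tanh(0.5 * x + 0.5 * last)

x = np.array([0.2, -0.4, 1.0, 0.0])
Z = rollout(toy_f, x, K=5)
print(Z.shape)  # K latent states, each with the same dimension as x
```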
Although iCoT improves inference efficiency, it often reduces reasoning accuracy. Existing methods mainly differ in how they transfer reasoning from explicit CoT into iCoT. Stepwise Internalization (Deng et al., [2024a](https://arxiv.org/html/2606.05315#bib.bib26)) gradually removes CoT tokens through iterative fine-tuning, while COCONUT (Hao et al., [2024](https://arxiv.org/html/2606.05315#bib.bib27)) replaces textual reasoning with hidden-state dynamics. Recent approaches rely on distillation (Deng et al., [2024b](https://arxiv.org/html/2606.05315#bib.bib32)): CODI (Shen et al., [2025](https://arxiv.org/html/2606.05315#bib.bib6)) distills reasoning at the answer boundary, PCCoT (Wu et al., [2025](https://arxiv.org/html/2606.05315#bib.bib28)) adds tokens for intermediate reasoning states, KAVA (Kuzina et al., [2025](https://arxiv.org/html/2606.05315#bib.bib7)) aligns KV-cache dynamics, and SIM-CoT (Wei et al., [2025](https://arxiv.org/html/2606.05315#bib.bib8)) uses step-level supervision to guide reasoning trajectories.
#### Low-Dimensional Structure.
The manifold hypothesis states that high-dimensional data often lie on a low-dimensional manifold capturing their intrinsic structure (Bengio et al., [2013](https://arxiv.org/html/2606.05315#bib.bib15); Fefferman et al., [2016](https://arxiv.org/html/2606.05315#bib.bib14); Chen et al., [2022](https://arxiv.org/html/2606.05315#bib.bib3)). Recent work suggests that LLM representations exhibit similar low-dimensional geometry, including approximately low-rank structure of activations (Yu and Wu, [2023](https://arxiv.org/html/2606.05315#bib.bib9); Chen et al., [2024](https://arxiv.org/html/2606.05315#bib.bib4); Liu et al., [2025](https://arxiv.org/html/2606.05315#bib.bib5)), linear semantic directions (Park et al., [2024](https://arxiv.org/html/2606.05315#bib.bib11)), representation manifold structure (Modell et al., [2025](https://arxiv.org/html/2606.05315#bib.bib13)), and low-rank behavior in output logits (Golowich et al., [2025](https://arxiv.org/html/2606.05315#bib.bib10)).
#### Reasoning Geometry.
In reasoning tasks, Sun et al. ([2026](https://arxiv.org/html/2606.05315#bib.bib12)) show that reasoning trajectories pass through step-specific subspaces that become increasingly separable with depth. Zhou et al. ([2026](https://arxiv.org/html/2606.05315#bib.bib17)) characterize reasoning as smooth flows shaped by logical structure, while Li et al. ([2025a](https://arxiv.org/html/2606.05315#bib.bib16)) propose a reasoning manifold where correct trajectories concentrate in low-dimensional regions and errors arise from deviations. Together, these studies suggest that high-dimensional reasoning trajectories can be effectively approximated by low-dimensional subspaces or manifolds.
## 3 The LoRi Method
We view the teacher’s reasoning process as a trajectory in hidden-state space that lies near a low-dimensional subspace. From this perspective, distillation need not enforce step-by-step correspondence; instead, it should align the student’s latent reasoning dynamics with the dominant structure of the teacher’s trajectory. Motivated by this view, we propose LoRi (Low-Rank iCoT), a distillation framework that transfers reasoning from explicit chain-of-thought into a compact implicit latent process. LoRi combines two complementary objectives: rationale-level alignment, which preserves the global geometry of the teacher’s trajectory, and anchor-level alignment, which regularizes the transition from latent reasoning to answer generation. Together with an efficient training scheme based on precomputed low-rank representations, LoRi enables scalable, length-invariant distillation from a CoT teacher to an iCoT student.
Figure 2: Overview of low-rank rationale-level alignment in LoRi.
### 3.1 Low-Rank iCoT Distillation
#### Distillation loss\.
The goal of distillation is to transfer the reasoning structure encoded in the teacher’s explicit rationales $r$ into the student’s latent reasoning dynamics $z$, while preserving the student’s ability to generate explicit step-by-step solutions and final answers. We train the student with the composite objective
$$\mathcal{L}=\mathcal{L}_{\mathrm{LR}}+\lambda\,\mathcal{L}_{\mathrm{CE}},$$
where $\mathcal{L}_{\mathrm{CE}}$ is a cross-entropy loss that supervises the student’s explicit output sequence, $\mathcal{L}_{\mathrm{LR}}$ aligns the student’s latent reasoning states with the teacher’s hidden representations in a low-rank subspace, and $\lambda$ balances the two terms. Intuitively, $\mathcal{L}_{\mathrm{LR}}$ encourages both models to share the dominant low-rank structure of the teacher’s reasoning trajectory, improving the student’s stability and generalization.
#### Low\-rank term\.
We define $\mathcal{L}_{\mathrm{LR}}$ as the sum of two complementary components,
$$\mathcal{L}_{\mathrm{LR}}=\mathcal{L}_{\mathrm{rationale}}+\beta\,\mathcal{L}_{\mathrm{anchor}},$$
where $\mathcal{L}_{\mathrm{rationale}}$ aligns the student’s latent reasoning dynamics with the teacher’s hidden states over the full reasoning trajectory, $\mathcal{L}_{\mathrm{anchor}}$ aligns the student’s representation at the answer prediction position with the teacher’s corresponding hidden state, and $\beta$ controls the relative weight of the anchor term. The former captures the global low-rank structure of the teacher’s reasoning process, while the latter provides a localized signal that guides the transition from internal reasoning to answer generation.
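Substituting the low-rank term into the composite objective gives the full training loss in a single expression. This only restates the two definitions above and introduces no new quantities:

```latex
\mathcal{L}
  = \underbrace{\mathcal{L}_{\mathrm{rationale}}
      + \beta\,\mathcal{L}_{\mathrm{anchor}}}_{\mathcal{L}_{\mathrm{LR}}}
  + \lambda\,\mathcal{L}_{\mathrm{CE}}
```

Read this way, $\beta$ trades off trajectory-level against answer-boundary alignment inside the low-rank term, while $\lambda$ trades off all latent alignment against supervision of the explicit output sequence.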
### 3.2 Rationale-Level Alignment ($\mathcal{L}_{\mathrm{rationale}}$)
To transfer reasoning capability, we align the student’s hidden states with the teacher’s reasoning trajectory in a low-rank subspace. Let $\boldsymbol{\mathcal{H}}_{T}\in\mathbb{R}^{N\times L\times H}$ denote the teacher’s hidden states for rationale tokens, where $L$ is the number of layers and $H$ is the hidden dimension. Similarly, let $\boldsymbol{\mathcal{H}}_{S}\in\mathbb{R}^{K\times L\times H}$ be the student’s hidden states for its implicit reasoning steps. Instead of directly matching these high-dimensional tensors, we project them into a shared low-rank subspace and align their low-dimensional representations through a statistical matching objective (Fig. [2](https://arxiv.org/html/2606.05315#S3.F2)).
#### Tucker Representation for Reasoning Trajectories\.
According to the Tucker decomposition (De Lathauwer et al., [2000](https://arxiv.org/html/2606.05315#bib.bib36)), a tensor $\boldsymbol{\mathcal{X}}\in\mathbb{R}^{N\times L\times H}$ is factorized into a tensor core $\boldsymbol{\mathcal{Q}}\in\mathbb{R}^{r_{N}\times r_{L}\times r_{H}}$ and orthonormal factor matrices $\mathbf{U}_{N}\in\mathbb{R}^{N\times r_{N}}$, $\mathbf{U}_{L}\in\mathbb{R}^{L\times r_{L}}$, $\mathbf{U}_{H}\in\mathbb{R}^{H\times r_{H}}$:
$$\boldsymbol{\mathcal{X}}\approx\boldsymbol{\mathcal{Q}}\times_{1}\mathbf{U}_{N}\times_{2}\mathbf{U}_{L}\times_{3}\mathbf{U}_{H},$$
where $\times_{n}$ denotes the mode-$n$ tensor product (see Appendix [D](https://arxiv.org/html/2606.05315#A4)). We construct a low-rank Tucker-style representation of the teacher’s hidden states along the layer and hidden dimensions by extracting low-rank factor matrices via SVD (top left singular vectors) of unfolded $\boldsymbol{\mathcal{H}}_{T}$ along the layer and hidden dimensions, yielding orthonormal column matrices $\mathbf{U}_{L}\in\mathbb{R}^{L\times r_{L}}$ and $\mathbf{U}_{H}\in\mathbb{R}^{H\times r_{H}}$, respectively, which capture the dominant subspaces of the teacher’s hidden-state representations. Each reasoning state is then projected as
$$\boldsymbol{\mathcal{G}}=\boldsymbol{\mathcal{X}}\times_{2}\mathbf{U}_{L}^{\top}\times_{3}\mathbf{U}_{H}^{\top},$$
where $\boldsymbol{\mathcal{X}}$ denotes either $\boldsymbol{\mathcal{H}}_{T}$ or $\boldsymbol{\mathcal{H}}_{S}$. This yields low-dimensional representations $\boldsymbol{\mathcal{G}}_{T}\in\mathbb{R}^{N\times r_{L}\times r_{H}}$ for the teacher and $\boldsymbol{\mathcal{G}}_{S}\in\mathbb{R}^{K\times r_{L}\times r_{H}}$ for the student. We then reshape these tensors along the last two modes to obtain matrix representations
$$\mathbf{G}_{T}\in\mathbb{R}^{N\times(r_{L}r_{H})},\qquad\mathbf{G}_{S}\in\mathbb{R}^{K\times(r_{L}r_{H})},$$
where each row corresponds to a reasoning step represented in the shared low-rank subspace.
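The projection step can be sketched with NumPy as follows. This is an illustrative sketch, not the released implementation: the function names, ranks, and random tensors are assumptions, and the factors are computed here from a single teacher tensor, whereas the paper only states that they are precomputed from the teacher.

```python
import numpy as np

def mode_factor(tensor, mode, rank):
    """Top-`rank` left singular vectors of the mode-`mode` unfolding."""
    unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u[:, :rank]  # orthonormal columns

def project(hidden, U_L, U_H):
    """Compute X x_2 U_L^T x_3 U_H^T, then flatten the last two modes."""
    G = np.einsum("mlh,la,hb->mab", hidden, U_L, U_H)
    return G.reshape(hidden.shape[0], -1)

rng = np.random.default_rng(0)
N, K, L, H = 30, 5, 8, 32      # hypothetical sizes
r_L, r_H = 3, 6                # hypothetical ranks
H_T = rng.normal(size=(N, L, H))  # stand-in for teacher hidden states
H_S = rng.normal(size=(K, L, H))  # stand-in for student hidden states

# Factors come from the teacher only and are reused for the student.
U_L = mode_factor(H_T, mode=1, rank=r_L)
U_H = mode_factor(H_T, mode=2, rank=r_H)

G_T = project(H_T, U_L, U_H)   # shape (N, r_L * r_H)
G_S = project(H_S, U_L, U_H)   # shape (K, r_L * r_H)
```

Because the token mode is left untouched, teacher and student keep different row counts but share the same column space of size $r_L r_H$.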
Figure 3: Overview of low-rank anchor alignment in LoRi.
#### Statistical Alignment in Low-rank Space.
We align student and teacher representations by matching the first- and second-order statistics of their projections in a shared low-rank subspace. For $\mathbf{G}\in\mathbb{R}^{M\times(r_{L}r_{H})}$, where $M=N$ for the teacher and $M=K$ for the student, we define
$$\boldsymbol{\mu}=\frac{1}{M}\sum_{m=1}^{M}\mathbf{G}_{m,:},\qquad\mathbf{C}=\frac{1}{M}\mathbf{G}^{\top}\mathbf{G}.$$
Let $(\mathbf{C}_{T},\boldsymbol{\mu}_{T})$ and $(\mathbf{C}_{S},\boldsymbol{\mu}_{S})$ denote the statistics computed from $\mathbf{G}_{T}$ and $\mathbf{G}_{S}$, respectively. We define the rationale-level loss as
$$\mathcal{L}_{\mathrm{rationale}}=\omega_{C}\,\left\|\mathbf{C}_{S}-\mathbf{C}_{T}\right\|_{F}^{2}\,+\,\omega_{\mu}\,\left\|\boldsymbol{\mu}_{S}-\boldsymbol{\mu}_{T}\right\|_{2}^{2},$$
where $\omega_{C}$ and $\omega_{\mu}$ weight covariance and mean alignment. Matching covariance aligns the principal low-rank structure, while matching means aligns the trajectory centroids. Together, these constraints guide the student toward the teacher’s reasoning geometry and discourage degenerate solutions such as collapsed latent trajectories.
This objective transfers the global structure of the teacher’s reasoning process without token-level alignment and remains invariant to reasoning length. The low-rank factors $\mathbf{U}_{L}$ and $\mathbf{U}_{H}$ are precomputed from the teacher and fixed during training, enabling efficient projection without repeated factorization.
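A minimal NumPy sketch of the rationale-level loss follows, written directly from the definitions of $\boldsymbol{\mu}$ and $\mathbf{C}$. The function names and default weights are assumptions, and the inputs are random stand-ins. Note that $\mathbf{C}$ as defined is the uncentered second moment $\mathbf{G}^{\top}\mathbf{G}/M$, and the code follows that definition.

```python
import numpy as np

def moments(G):
    """First- and second-order statistics of the rows of G (shape M x d)."""
    M = G.shape[0]
    mu = G.mean(axis=0)   # shape (d,)
    C = G.T @ G / M       # shape (d, d), uncentered as in the definition
    return mu, C

def rationale_loss(G_S, G_T, w_C=1.0, w_mu=1.0):
    """Weighted squared Frobenius and L2 distances between the statistics."""
    mu_S, C_S = moments(G_S)
    mu_T, C_T = moments(G_T)
    cov_term = np.sum((C_S - C_T) ** 2)
    mean_term = np.sum((mu_S - mu_T) ** 2)
    return w_C * cov_term + w_mu * mean_term

rng = np.random.default_rng(0)
G_S = rng.normal(size=(5, 18))        # short student trajectory (K = 5)
G_T_same = np.tile(G_S, (6, 1))       # 30 rows with identical statistics
G_T_shift = G_T_same + 1.0            # shifted centroid

loss_same = rationale_loss(G_S, G_T_same)    # numerically zero
loss_shift = rationale_loss(G_S, G_T_shift)  # strictly positive
```

The first comparison shows the length invariance in miniature: a trajectory six times longer yields zero loss when its aggregated statistics coincide with the student's.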
### 3.3 Anchor-Level Alignment ($\mathcal{L}_{\mathrm{anchor}}$)
While the rationale\-level objective captures global reasoning structure, it does not explicitly constrain the transition from latent reasoning to answer generation\. To address this, we introduce a localized alignment term at the answer prediction position\. This position corresponds to a fixed prompt\-aligned token \(e\.g\., “The final answer is”\), which serves as a consistent transition between reasoning and answer generation rather than part of the reasoning trajectory itself \(Figure[3](https://arxiv.org/html/2606.05315#S3.F3)\)\.
Let $\boldsymbol{\mathcal{H}}_{T}^{\mathrm{anchor}}\in\mathbb{R}^{B\times L\times H}$ denote the collection of teacher hidden states at the answer prediction position across $B$ training samples. We construct a low-rank Tucker decomposition of this tensor, yielding factor matrices
$$\mathbf{U}_{B}^{(a)}\in\mathbb{R}^{B\times r_{B}},\quad\mathbf{U}_{L}^{(a)}\in\mathbb{R}^{L\times r_{L}},\quad\mathbf{U}_{H}^{(a)}\in\mathbb{R}^{H\times r_{H}}.$$
In our formulation, we use the hidden-mode factor $\mathbf{U}_{H}^{(a)}$, which captures the dominant subspace of the teacher’s hidden representations at the anchor position across the dataset. For a given sample, let $\boldsymbol{h}_{\ell},\boldsymbol{h}_{\ell}^{\prime}\in\mathbb{R}^{H}$, for $\ell=1,\ldots,L$, denote the teacher and student hidden states at the answer token across layers. The anchor-level loss is defined as
$$\mathcal{L}_{\mathrm{anchor}}=\frac{1}{L}\sum_{\ell=1}^{L}\left\|\mathbf{U}_{H}^{(a)\top}\boldsymbol{h}_{\ell}-\mathbf{U}_{H}^{(a)\top}\boldsymbol{h}_{\ell}^{\prime}\right\|_{2}^{2}.$$
This formulation leverages a shared low-rank subspace learned across training samples, ensuring that the student’s representation at the answer prediction point aligns with the dominant structure of the teacher’s hidden states at that location, while remaining complementary to the global alignment enforced by $\mathcal{L}_{\mathrm{rationale}}$.
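The anchor-level loss can be sketched in the same style. The helper names, sizes, and random data below are assumptions. The hidden-mode factor is obtained from a truncated SVD of the hidden-mode unfolding, which is one common way to compute a Tucker factor and mirrors the SVD-based construction used for the rationale-level factors; the paper does not specify the solver for the anchor tensor.

```python
import numpy as np

def anchor_factor(H_anchor, rank):
    """Hidden-mode factor of a (B, L, H) tensor via SVD of its unfolding."""
    unfolded = np.moveaxis(H_anchor, 2, 0).reshape(H_anchor.shape[2], -1)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u[:, :rank]  # shape (H, rank), orthonormal columns

def anchor_loss(h_teacher, h_student, U_H_a):
    """Mean over layers of the squared distance in the projected subspace.

    h_teacher, h_student: (L, H) hidden states at the anchor position.
    """
    diff = (h_teacher - h_student) @ U_H_a  # row l equals U^T (h_l - h'_l)
    return np.mean(np.sum(diff ** 2, axis=1))

rng = np.random.default_rng(0)
B, L, H, r_H = 16, 8, 32, 6               # hypothetical sizes and rank
H_anchor = rng.normal(size=(B, L, H))     # stand-in for teacher anchor states
U_H_a = anchor_factor(H_anchor, r_H)

h_teacher = H_anchor[0]
h_student = h_teacher + 0.1 * rng.normal(size=(L, H))
loss = anchor_loss(h_teacher, h_student, U_H_a)
full = np.mean(np.sum((h_teacher - h_student) ** 2, axis=1))
```

Because the columns of the factor are orthonormal, the projected loss never exceeds the unprojected squared distance `full`; differences orthogonal to the teacher's dominant subspace are ignored.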
### 3\.4Implications for Reasoning
#### Reasoning compression\.
The low\-rank structure provides a geometric explanation for why long chain\-of\-thought reasoning can be compressed into a shorter latent trajectory\. Since the teacher’s hidden\-state trajectory lies near a low\-rank subspace, its dominant variation can be represented using a limited number of degrees of freedom, largely independent of sequence length\. By aligning the student within this subspace, the student is encouraged to reproduce the principal reasoning dynamics without explicitly matching each reasoning step\.
#### Length invariance\.
Importantly, our formulation is invariant to the length of the reasoning trajectory\. Since alignment is performed through aggregated statistics rather than step\-wise correspondence, the student can construct a shorter latent trajectory that matches the global geometry of the teacher\. This provides an intuitive justification for transferring reasoning from long explicit CoT sequences to compact implicit reasoning processes\.
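The length invariance can be read off from the shapes of the matched statistics. As a worked illustration (the ranks $r_L=4$, $r_H=16$ and the teacher length $N=200$ are hypothetical values chosen for this example, not settings reported in the paper; $K=5$ matches the number of latent steps used in the experiments):

```latex
\mathbf{G}_T \in \mathbb{R}^{N \times (r_L r_H)} = \mathbb{R}^{200 \times 64},
\qquad
\mathbf{G}_S \in \mathbb{R}^{K \times (r_L r_H)} = \mathbb{R}^{5 \times 64}
\\[4pt]
\boldsymbol{\mu}_T,\ \boldsymbol{\mu}_S \in \mathbb{R}^{64},
\qquad
\mathbf{C}_T,\ \mathbf{C}_S \in \mathbb{R}^{64 \times 64}
```

Both pairs of statistics have identical shapes even though the teacher trajectory is 40 times longer, so the loss can be evaluated without any token-to-latent correspondence.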
Table 1: Accuracy (%) of different reasoning methods on math benchmarks across models. The best iCoT method is shown in bold, and the second-best method is underlined. Explicit CoT results are shown in gray. Parentheses denote CODI/KAVA results without their final-step dropping heuristic.
### 3.5 Relation to Prior Work
Our method differs from prior iCoT distillation approaches in both supervision and training. In the following, we compare LoRi with representative prior work: CODI (Shen et al., [2025](https://arxiv.org/html/2606.05315#bib.bib6)), KAVA (Kuzina et al., [2025](https://arxiv.org/html/2606.05315#bib.bib7)), and SIM-CoT (Wei et al., [2025](https://arxiv.org/html/2606.05315#bib.bib8)):
- CODI aligns representations at a single boundary token, reducing supervision to a point-wise constraint and overlooking the structure of the reasoning trajectory. As a result, it struggles to transfer multi-step reasoning dynamics. In contrast, our method captures the full reasoning trajectory through a low-rank representation, supplemented by a localized anchor term that constrains the transition from reasoning to answer generation at a fixed answer-boundary position.
- KAVA distills teacher KV-cache trajectories through step-wise alignment in KV space using sampled rationale tokens. By contrast, LoRi models the reasoning process through low-rank hidden-state geometry, enabling length-invariant distillation without token-level sampling.
- SIM-CoT (Wei et al., [2025](https://arxiv.org/html/2606.05315#bib.bib8)) applies step-level supervision by aligning latent states with explicit reasoning tokens via an auxiliary decoder. Our method instead avoids explicit step-wise alignment and supervises the student through low-rank trajectory structure.
Finally, existing iCoT distillation methods rely on joint teacher–student training with online teacher inference\. In contrast, we adopt a two\-stage procedure: teacher\-derived low\-rank factors are precomputed once, after which the student is fine\-tuned independently on a subset of the training data\. This substantially reduces training cost while preserving effective reasoning transfer\.
## 4 Results
### 4.1 Experimental Setup
Following prior work, we evaluate LoRi on standard mathematical reasoning benchmarks, which provide rigorous tests of multi-step reasoning through structured problem solving and verifiable answers (Sprague et al., [2024](https://arxiv.org/html/2606.05315#bib.bib29)).
We compare LoRi against NoCoT, CODI (Shen et al., [2025](https://arxiv.org/html/2606.05315#bib.bib6)), KAVA (Kuzina et al., [2025](https://arxiv.org/html/2606.05315#bib.bib7)), SIM-CoT (Wei et al., [2025](https://arxiv.org/html/2606.05315#bib.bib8)), PCCoT (Wu et al., [2025](https://arxiv.org/html/2606.05315#bib.bib28)), COCONUT (Hao et al., [2024](https://arxiv.org/html/2606.05315#bib.bib27)), and SFT-CoT across Qwen (Bai and others, [2023](https://arxiv.org/html/2606.05315#bib.bib31)) and LLaMA (Dubey et al., [2024](https://arxiv.org/html/2606.05315#bib.bib30)) model families. To study scalability, we evaluate models ranging from 0.5B to 8B parameters. SFT-CoT serves as an upper-bound reference since it relies on explicit reasoning at inference time, whereas LoRi targets implicit reasoning.
The teacher model is first fine-tuned on the full GSM8K-Aug dataset (Deng et al., [2024a](https://arxiv.org/html/2606.05315#bib.bib26), [b](https://arxiv.org/html/2606.05315#bib.bib32)) and then frozen during distillation. The student is initialized from the same base model and trained on a random 128-sample subset of GSM8K-Aug. We report final-answer accuracy based on extracted numerical outputs. Additional implementation details and hyperparameters are provided in Appendix [C](https://arxiv.org/html/2606.05315#A3).
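Final-answer accuracy from extracted numerical outputs could be computed along the following lines. The extraction rule here (take the last number in the generated text, ignoring thousands separators) is an assumption for illustration; the paper defers its exact evaluation details to the appendix, and the function names are hypothetical.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def extract_answer(text):
    """Return the last number in `text` as a float, or None if none is found."""
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None

def accuracy(predictions, references, tol=1e-6):
    """Percentage of predictions whose extracted number matches the reference."""
    correct = 0
    for pred, ref in zip(predictions, references):
        value = extract_answer(pred)
        if value is not None and abs(value - ref) <= tol:
            correct += 1
    return 100.0 * correct / len(references)

preds = ["The final answer is 18", "The final answer is 7"]
refs = [18, 8]
score = accuracy(preds, refs)  # one of two predictions is correct
```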
### 4.2 Accuracy Comparison
Table [1](https://arxiv.org/html/2606.05315#S3.T1) summarizes performance on several benchmarks and model scales. Overall, LoRi consistently outperforms prior implicit reasoning methods and substantially narrows the gap to full CoT models. This suggests that low-rank alignment captures global reasoning dependencies that token-level methods fail to model.
#### Performance w.r.t. Model Scale.
For small models, LoRi provides strong gains over prior implicit reasoning methods. On Qwen2.5-0.5B, LoRi achieves an average accuracy of 42.7%, substantially outperforming CODI (30.9%), KAVA (36.1%), and PCCoT (19.2%). On LLaMA-3.2-1B, LoRi reaches the best average accuracy among iCoT methods (44.3%), although its advantage over CODI and KAVA is more nuanced: CODI and KAVA obtain slightly higher accuracy on GSM8K, while LoRi performs better on GSM8K-Hard and SVAMP. Moreover, when CODI and KAVA are evaluated without their final-step dropping heuristic (Kuzina et al., [2025](https://arxiv.org/html/2606.05315#bib.bib7)), LoRi outperforms them by a clear margin. This is notable because LoRi does not use step-level supervision or heuristic rationale truncation.
For larger models, the gains are pronounced. On LLaMA-3.2-3B, LoRi achieves 55.6% average accuracy, outperforming all implicit baselines by a clear margin. On LLaMA-3.1-8B, LoRi reaches 62.9%, approaching the performance of full CoT models (64.0%). These results suggest that the proposed method scales effectively, substantially narrowing the gap between implicit and explicit reasoning.
#### Hard Reasoning Tasks\.
The improvements are particularly notable on GSM8K-Hard, which requires more complex multi-step reasoning. For example, on LLaMA-3.2-3B, LoRi improves performance from 15.2% (KAVA) to 21.9%; on LLaMA-3.1-8B, LoRi achieves 26.1%, substantially outperforming CODI (15.5%) and SIM-CoT (16.3%). This suggests that capturing the global structure of reasoning through low-rank alignment is especially beneficial for harder reasoning tasks.
#### Capability of Explicit Reasoning\.
To evaluate whether LoRi preserves explicit reasoning ability, we test LoRi-E, which uses the same student model but performs standard CoT generation at inference time. LoRi-E achieves performance comparable to SFT-CoT across all models and benchmarks. For instance, on LLaMA-3.1-8B, LoRi-E attains an average accuracy of 63.6%, close to 64.0% for SFT-CoT. These results suggest that the proposed distillation method preserves explicit reasoning capability while enabling both implicit and explicit reasoning at inference time.
### 4.3 Training Efficiency
Table 2: Training cost on NVIDIA RTX 6000 GPUs. FT: teacher fine-tuning. Distill: reasoning distillation. Other iCoT distillation methods (e.g., PCCoT, KAVA, and SIM-CoT) have training cost similar to CODI.
Beyond accuracy, LoRi also improves training efficiency over prior implicit reasoning methods. Existing iCoT distillation approaches, including CODI, KAVA, and SIM-CoT, rely on joint teacher–student training with repeated teacher forward passes to obtain intermediate representations, resulting in substantial computational overhead. In contrast, LoRi uses a two-stage procedure: all teacher-derived quantities are precomputed once, and the student is trained on only a subset of the training data. Despite this simplified pipeline, LoRi maintains strong benchmark performance while significantly improving training efficiency and scalability, as summarized in Table [2](https://arxiv.org/html/2606.05315#S4.T2). In practice, the distillation overhead becomes effectively negligible, making LoRi a more scalable and accessible approach for implicit reasoning distillation. We use CODI as a representative baseline for training-cost comparison since related iCoT distillation methods (e.g., PCCoT, KAVA, and SIM-CoT) employ similar joint teacher–student training procedures with repeated teacher forward passes, leading to very similar computational overhead.
Figure 4: Inference latency comparison. Note: LoRi and prior iCoT methods have almost the same inference cost since they use the same latent-step inference procedure.

Figure 5: Ablation studies on reasoning steps (left) and training sample size (right).
### 4.4 Inference Latency
We compare the inference latency of LoRi-based iCoT and explicit CoT. Latency is measured as average wall-clock time per sample. CoT generates full reasoning sequences autoregressively, whereas iCoT uses a fixed number of latent steps ($K=5$) followed by short answer generation, significantly reducing decoding cost. As shown in Figure [4](https://arxiv.org/html/2606.05315#S4.F4), iCoT consistently achieves lower latency across all model scales. Specifically, it is about 6.9$\times$ faster for LLaMA-3.2-1B and 5.1$\times$ faster for both LLaMA-3.2-3B and LLaMA-3.1-8B. These results confirm the efficiency advantages of iCoT while showing that LoRi preserves this benefit alongside improved reasoning performance. Since LoRi and prior iCoT methods use the same latent-step inference procedure, their inference complexity is almost the same; the main difference lies in the distillation strategy during training.
### 4.5 Ablation Studies
We study how model performance depends on the number of latent reasoning steps $K$ and on the number of training samples used for distillation.
#### Ablation on Latent Reasoning Steps\.
Figure [5](https://arxiv.org/html/2606.05315#S4.F5) reports benchmark accuracy as the number of latent steps $K$ varies from 2 to 8. Performance consistently improves from $K=2$ to $K=5$, suggesting that a minimum number of latent reasoning iterations is needed to capture the underlying reasoning dynamics. The best results are achieved at $K=5$ across all datasets, indicating that this setting provides sufficient capacity to represent the dominant low-rank reasoning structure. Increasing $K$ beyond 5 yields no further gains and occasionally causes slight degradation, implying that additional steps introduce redundancy rather than useful computation. Overall, these results support the view that reasoning trajectories can be compressed into a small number of latent steps without sacrificing essential structure.
#### Ablation on Training Sample Size\.
Figure[5](https://arxiv.org/html/2606.05315#S4.F5)shows that accuracy improves rapidly in the low\-data regime and saturates after roughly 128 samples across all model scales, indicating that strong performance can be achieved with only a small subset of the training data\. One explanation is that the effective degrees of freedom of the distillation problem are limited: although the teacher produces long chain\-of\-thought trajectories, the information needed for the student may lie in a low\-dimensional subspace\. Under this view, distillation only needs to recover the dominant reasoning structure rather than the full trajectory\. Consequently, relatively few samples are sufficient to estimate this structure, leading to rapid gains followed by saturation\. This observation supports our low\-rank formulation and suggests that the reasoning signal relevant for distillation is highly compressible\.
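The sample-efficiency argument above can be illustrated with a synthetic toy, which is not the paper's experiment: if data lie near a rank-$r$ subspace, that subspace can be estimated from a number of samples only modestly larger than $r$, after which more samples help little. All names and settings below (`subspace_error`, `dim`, `rank`, `noise`) are assumptions chosen for illustration.

```python
import numpy as np


def subspace_error(n_samples, dim=64, rank=4, noise=0.01, seed=0):
    """Relative error when held-out data are projected onto a subspace
    estimated from `n_samples` noisy points (synthetic toy, not LoRi itself)."""
    rng = np.random.default_rng(seed)
    # Ground-truth orthonormal basis of a rank-`rank` subspace (dim x rank).
    basis, _ = np.linalg.qr(rng.standard_normal((dim, rank)))

    def sample(n):
        coeffs = rng.standard_normal((n, rank))
        return coeffs @ basis.T + noise * rng.standard_normal((n, dim))

    train, test = sample(n_samples), sample(256)
    # Leading right singular vectors of the training data span the estimate.
    _, _, vt = np.linalg.svd(train, full_matrices=False)
    est = vt[:rank].T  # dim x min(rank, n_samples)
    residual = test - test @ est @ est.T
    return np.linalg.norm(residual) / np.linalg.norm(test)


if __name__ == "__main__":
    for n in (2, 4, 8, 32, 128):
        print(n, round(float(subspace_error(n)), 4))
```

In this toy setting the error is large while the sample count is below the rank and flattens to roughly the noise level once the subspace is identified, which mirrors the rapid-gain-then-saturation shape described above without claiming to reproduce it.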
## 5Conclusion
We have proposed LoRi, a low\-rank formulation for iCoT distillation\. Empirically, LoRi outperformed prior implicit reasoning methods across model families, scales, and benchmarks and reduced the gap of iCoT with explicit CoT while preserving the efficiency advantages of implicit reasoning\. In addition, the proposed method significantly reduced computational overhead, making distillation lightweight and scalable\. The results also support a geometric perspective of reasoning: long chain\-of\-thought trajectories can be effectively compressed by preserving their low\-rank structure\. This perspective provides a principled and efficient alternative to existing iCoT distillation approaches and suggests that the essential dynamics of reasoning are governed by a low\-dimensional subspace that can be exploited for both learning and inference\.
## Limitations
The proposed formulation is motivated by an empirical observation that hidden\-state reasoning trajectories exhibit strong low\-rank structures across models and benchmarks\. While LoRi improves reasoning accuracy and training efficiency, the theoretical relationship between low\-rank representation geometry and reasoning capability is still not fully understood\. Future work could further investigate the geometry of reasoning trajectories and the role of low\-dimensional structure in CoT reasoning from a theoretical perspective\.
## References
- Qwen technical report\.arXiv:2309\.16609\.External Links:2309\.16609Cited by:[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p2.1)\.
- Y\. Bengio, A\. Courville, and P\. Vincent \(2013\)Representation learning: a review and new perspectives\.IEEE Transactions on Pattern Analysis and Machine Intelligence35\(8\),pp\. 1798–1828\.External Links:[Document](https://dx.doi.org/10.1109/TPAMI.2013.50)Cited by:[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Chen, Q\. Li, and Z\. Zhang \(2022\)Self\-healing robust neural networks via closed\-loop control\.Journal of machine learning research23\(319\),pp\. 1–54\.Cited by:[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Chen, Z\. Wang, Y\. Yang, Q\. Li, and Z\. Zhang \(2024\)PID control\-based self\-healing to improve the robustness of large language models\.Transactions on machine learning research\.Cited by:[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[Appendix C](https://arxiv.org/html/2606.05315#A3.p2.6)\.
- L\. De Lathauwer, B\. De Moor, and J\. Vandewalle \(2000\)A multilinear singular value decomposition\.SIAM Journal on Matrix Analysis and Applications21\(4\),pp\. 1253–1278\.Cited by:[§3\.2](https://arxiv.org/html/2606.05315#S3.SS2.SSS0.Px1.p1.5)\.
- Y\. Deng, Y\. Choi, and S\. Shieber \(2024a\)From explicit cot to implicit cot: learning to internalize cot step by step\.arXiv preprint arXiv:2405\.14838\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p2.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.5),[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p3.1)\.
- Y\. Deng, K\. Prasad, R\. Fernandez, P\. Smolensky, V\. Chaudhary, and S\. Shieber \(2024b\)Implicit chain\-of\-thought reasoning via knowledge distillation\.arXiv preprint arXiv:2311\.01460\.Cited by:[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.5),[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p3.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey,et al\.\(2024\)The llama 3 herd of models\.arXiv:2407\.21783\.Cited by:[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p2.1)\.
- C\. Fefferman, S\. Mitter, and H\. Narayanan \(2016\)Testing the manifold hypothesis\.Journal of the American Mathematical Society29\(4\),pp\. 983–1049\.Cited by:[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2022\)PAL: program\-aided language models\.arXiv\.External Links:2211\.10435Cited by:[Appendix C](https://arxiv.org/html/2606.05315#A3.p2.6)\.
- N\. Golowich, A\. Liu, and A\. Shetty \(2025\)Sequences of logits reveal the low rank structure of language models\.arXiv preprint arXiv:2510\.24966\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p3.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. Weston, and Y\. Tian \(2024\)Training large language models to reason in a continuous latent space\.arXiv preprint arXiv:2412\.06769\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p2.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.5),[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p2.1)\.
- Z\. He, G\. Xiong, B\. Liu, S\. Sinha, and A\. Zhang \(2026\)Reasoning beyond chain\-of\-thought: a latent computational mode in large language models\.arXiv preprint arXiv:2601\.08058\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p1.1)\.
- A\. Kuzina, M\. Pioro, P\. N\. Whatmough, and B\. Ehteshami Bejnordi \(2025\)KaVa: latent reasoning via compressed kv\-cache distillation\.arXiv preprint arXiv:2510\.02312\.External Links:[Link](https://arxiv.org/abs/2510.02312)Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p2.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.4),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.5),[§3\.5](https://arxiv.org/html/2606.05315#S3.SS5.p1.1),[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p2.1),[§4\.2](https://arxiv.org/html/2606.05315#S4.SS2.SSS0.Px1.p1.5)\.
- B\. Li, G\. Deng, R\. Chen, J\. Yue, S\. Zhang, Q\. Zhao, L\. Song, and L\. Wen \(2025a\)REMA: a unified reasoning manifold framework for interpreting large language models\.arXiv preprint arXiv:2509\.22518\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p3.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Li, Y\. Fu, L\. Fan, J\. Liu, Y\. Shu, C\. Qin, M\. Yang, I\. King, and R\. Ying \(2025b\)Implicit reasoning in large language models: a comprehensive survey\.arXiv preprint arXiv:2509\.02350\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p1.1)\.
- Z\. Li, H\. Liu, D\. Zhou, and T\. Ma \(2024\)Chain of thought empowers transformers to solve inherently serial problems\.InThe Twelfth International Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p1.1)\.
- Z\. Lin, T\. Liang, J\. Xu, Q\. Liu, X\. Wang, R\. Luo, C\. Shi, S\. Li, Y\. Yang, and Z\. Tu \(2025\)Critical tokens matter: token\-level contrastive estimation enhances llm’s reasoning capability\.InProceedings of the 42nd International Conference on Machine Learning \(ICML\),Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p1.1)\.
- Z\. Liu, R\. Zhang, Z\. Wang, M\. Yan, Z\. Yang, P\. D\. Hovland, B\. Nicolae, F\. Cappello, S\. Tang, and Z\. Zhang \(2025\)CoLA: compute\-efficient pre\-training of llms via low\-rank activation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 4627–4645\.Cited by:[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Modell, P\. Rubin\-Delanchy, and N\. Whiteley \(2025\)The origins of representation manifolds in large language models\.arXiv preprint arXiv:2505\.18235\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p3.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Park, Y\. J\. Choe, and V\. Veitch \(2024\)The linear representation hypothesis and the geometry of large language models\.InProceedings of the 41st International Conference on Machine Learning \(ICML\),Proceedings of Machine Learning Research, Vol\.235,pp\. 39643–39666\.External Links:[Link](https://proceedings.mlr.press/v235/park24c.html)Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p3.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Patel, S\. Bhattamishra, and N\. Goyal \(2021\)SVAMP: a dataset of verbally perturbed math word problems\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics \(NAACL\),Cited by:[Appendix C](https://arxiv.org/html/2606.05315#A3.p2.6)\.
- Z\. Shen, H\. Yan, L\. Zhang, Z\. Hu, Y\. Du, and Y\. He \(2025\)CODI: compressing chain\-of\-thought into continuous space via self\-distillation\.arXiv preprint arXiv:2502\.21074\.External Links:[Link](https://arxiv.org/abs/2502.21074)Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p2.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.4),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.5),[§3\.5](https://arxiv.org/html/2606.05315#S3.SS5.p1.1),[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p2.1)\.
- Z\. Sprague, F\. Yin, J\. D\. Rodriguez, D\. Jiang, M\. Wadhwa, P\. Singhal, X\. Zhao, X\. Ye, K\. Mahowald, and G\. Durrett \(2024\)To cot or not to cot? chain\-of\-thought helps mainly on math and symbolic reasoning\.arXiv preprint arXiv:2409\.12183\.Cited by:[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p1.1)\.
- L\. Sun, H\. Dong, B\. Qiao, Q\. Lin, D\. Zhang, and S\. Rajmohan \(2026\)LLM reasoning as trajectories: step\-specific representation geometry and correctness signals\.arXiv preprint arXiv:2604\.05655\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p3.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2022\)Self\-consistency improves chain of thought reasoning in language models\.arXiv preprint arXiv:2203\.11171\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 24824–24837\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p1.1)\.
- X\. Wei, X\. Liu, Y\. Zang, X\. Dong, Y\. Cao, J\. Wang, X\. Qiu, and D\. Lin \(2025\)SIM\-cot: supervised implicit chain\-of\-thought\.arXiv preprint arXiv:2509\.20317\.External Links:[Link](https://arxiv.org/abs/2509.20317)Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p2.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.4),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.5),[3rd item](https://arxiv.org/html/2606.05315#S3.I1.i3.p1.1),[§3\.5](https://arxiv.org/html/2606.05315#S3.SS5.p1.1),[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p2.1)\.
- H\. Wu, Z\. Teng, and K\. Tu \(2025\)Parallel continuous chain\-of\-thought with jacobi iteration\.arXiv preprint arXiv:2506\.18582\.Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p2.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px1.p1.5),[§4\.1](https://arxiv.org/html/2606.05315#S4.SS1.p2.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p1.1)\.
- H\. Yu and J\. Wu \(2023\)Compressing transformers: features are low\-rank, but weights are not\!\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.37,pp\. 11007–11015\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v37i9.26304),[Link](https://ojs.aaai.org/index.php/AAAI/article/view/26304)Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p3.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px2.p1.1)\.
- E\. Zelikman, Y\. Wu, J\. Mu, and N\. D\. Goodman \(2022\)STaR: bootstrapping reasoning with reasoning\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p1.1)\.
- Y\. Zhou, Y\. Wang, X\. Yin, S\. Zhou, and A\. R\. Zhang \(2026\)The geometry of reasoning: flowing logics in representation space\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.05315#S1.p3.1),[§2](https://arxiv.org/html/2606.05315#S2.SS0.SSS0.Px3.p1.1)\.
## Appendix A Low-rank Analysis of CoT Hidden States
Figure 6: Normalized sum of singular values versus rank for stacked hidden states of all layers and all CoT tokens of the LLaMA-3.2-1B model, where each curve corresponds to a single GSM8K example.

For this analysis, we stack the hidden states corresponding to all transformer layers and all CoT rationale tokens into a single matrix and compute its singular value decomposition (SVD). Figure [6](https://arxiv.org/html/2606.05315#A1.F6) shows the normalized cumulative singular values across GSM8K examples for LLaMA-3.2-1B. Although the model has a hidden dimension of 2048, the singular values decay rapidly and are largely concentrated within a relatively small rank. This strong low-rank structure suggests significant compressibility in hidden-state reasoning dynamics, motivating the use of low-rank representations for implicit reasoning distillation.
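The analysis described above can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' script: the input layout `(num_layers, num_tokens, hidden_dim)`, the function names, and the 0.9 threshold are assumptions, and the demo uses synthetic data instead of real LLaMA hidden states.

```python
import numpy as np


def cumulative_spectrum(hidden_states):
    """Normalized cumulative sum of singular values.

    hidden_states: array of shape (num_layers, num_tokens, hidden_dim)
    (an assumed layout). Layers and tokens are stacked as matrix rows.
    """
    stacked = hidden_states.reshape(-1, hidden_states.shape[-1])
    singular_values = np.linalg.svd(stacked, compute_uv=False)
    return np.cumsum(singular_values) / np.sum(singular_values)


def rank_for_energy(curve, threshold=0.9):
    """Smallest rank whose normalized cumulative sum reaches `threshold`."""
    return int(np.searchsorted(curve, threshold) + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in: rank-8 signal plus small noise, 4 layers x 50 tokens.
    signal = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 256))
    states = (signal + 1e-3 * rng.standard_normal((200, 256))).reshape(4, 50, 256)
    curve = cumulative_spectrum(states)
    print(rank_for_energy(curve, 0.9))
```

Plotting `curve` against rank for each example would give one line per example, in the manner of Figure 6.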
## Appendix B Ablation of the Anchor Loss Term
Table [3](https://arxiv.org/html/2606.05315#A2.T3) shows that the anchor-level loss $\mathcal{L}_{\mathrm{anchor}}$ generally improves performance, with the largest gains observed for smaller models. For Qwen2.5-0.5B, it yields substantial improvements on both GSM8K and SVAMP, indicating a strong additional supervision signal. For larger models, the effect is less pronounced, with improvements on some benchmarks and minor degradations on others. Notably, the magnitude of these degradations is considerably smaller than the observed gains, suggesting that the anchor term acts as a stable complementary signal to the low-rank rationale alignment. Importantly, the proposed distillation method remains effective even without the anchor term, indicating that the global low-rank alignment alone captures a substantial portion of the reasoning signal.

Table 3: Effect of the anchor loss term on accuracy across models.
## Appendix C Implementation Details and Hyperparameters
Table 4: Hyperparameter settings.

We assume a pretrained teacher model that produces high-quality chain-of-thought rationales. The teacher is kept fixed and used to extract hidden states for both the rationale tokens and the answer prediction position. From these, we precompute the low-rank projection factors and anchor-level targets as described in Sec. [3.2](https://arxiv.org/html/2606.05315#S3.SS2) and Sec. [3.3](https://arxiv.org/html/2606.05315#S3.SS3). The student model is then trained on sampled training examples using the proposed loss in Eq. ([3.1](https://arxiv.org/html/2606.05315#S3.Ex6)).

We evaluate the model on standard mathematical reasoning benchmarks, including GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.05315#bib.bib33)), GSM8K-Hard (Gao et al., [2022](https://arxiv.org/html/2606.05315#bib.bib35)), and SVAMP (Patel et al., [2021](https://arxiv.org/html/2606.05315#bib.bib34)). The hyperparameter settings for LoRi corresponding to the results in Table [1](https://arxiv.org/html/2606.05315#S3.T1) are reported in Table [4](https://arxiv.org/html/2606.05315#A3.T4). Here, $R_L$ and $R_H$ denote the ranks used for the low-rank decomposition of rationale representations, while $R_L^{(a)}$, $R_H^{(a)}$, and $R_B^{(a)}$ are the Tucker ranks for anchor factorization. We use a shared hyperparameter configuration across all models, except that $R_L^{(a)}$ is increased to 16 for larger models (e.g., LLaMA 3B and 8B) due to their greater depth; all other parameters remain fixed.
## Appendix D Tensor Contraction
The operation $\times_n$ denotes contraction of a tensor with a matrix along mode $n$. In index form, the expression

$$\boldsymbol{\mathcal{X}} = \boldsymbol{\mathcal{Q}} \times_1 \mathbf{U}_N \times_2 \mathbf{U}_L \times_3 \mathbf{U}_H$$

corresponds to

$$X_{n,t,h} = \sum_{m=1}^{r_N} \sum_{i=1}^{r_L} \sum_{j=1}^{r_H} Q_{m,i,j}\, (\mathbf{U}_N)_{n,m}\, (\mathbf{U}_L)_{t,i}\, (\mathbf{U}_H)_{h,j}, \tag{1}$$

which contracts $\boldsymbol{\mathcal{Q}}$ with the factor matrices along its first, second, and third modes.
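The index form in Eq. (1) maps directly onto a single `numpy.einsum` call. The sketch below is illustrative only: the function names and the small shapes are assumptions, and `mode_product` is a generic mode-$n$ product included to check that three successive contractions give the same result as the one-shot formula.

```python
import numpy as np


def tucker_reconstruct(core, u_n, u_l, u_h):
    """X[n,t,h] = sum_{m,i,j} Q[m,i,j] * U_N[n,m] * U_L[t,i] * U_H[h,j]."""
    return np.einsum("mij,nm,ti,hj->nth", core, u_n, u_l, u_h)


def mode_product(tensor, matrix, mode):
    """Contract axis `mode` of `tensor` with the columns of `matrix`.

    tensor has size r along `mode`; matrix has shape (n, r); the result
    has size n along `mode`.
    """
    moved = np.moveaxis(tensor, mode, -1)  # (..., r)
    out = moved @ matrix.T                 # (..., n)
    return np.moveaxis(out, -1, mode)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    core = rng.standard_normal((2, 3, 4))  # (r_N, r_L, r_H)
    u_n = rng.standard_normal((5, 2))      # (N, r_N)
    u_l = rng.standard_normal((6, 3))      # (L, r_L)
    u_h = rng.standard_normal((7, 4))      # (H, r_H)

    direct = tucker_reconstruct(core, u_n, u_l, u_h)
    stepwise = mode_product(mode_product(mode_product(core, u_n, 0), u_l, 1), u_h, 2)
    print(direct.shape, np.allclose(direct, stepwise))
```

Mode products along distinct modes commute, so the three contractions can be applied in any order.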