SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning
Summary
Introduces Hankel Reduced order Model (HRM) adapter, an SSM-based residual module initialized via Balanced Truncation for parameter-efficient fine-tuning, outperforming LoRA on long-context tasks.
Cached at: 06/26/26, 05:17 AM
# SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning
Source: [https://arxiv.org/html/2606.26290](https://arxiv.org/html/2606.26290)
###### Abstract
While parameter-efficient fine-tuning (PEFT) typically targets attention projectors, its efficacy for tasks requiring sequential state accumulation remains under-explored. We examine whether PEFT for such tasks can benefit from state space model (SSM) adapters, and whether MLP blocks are better injection sites. We introduce the Hankel Reduced order Model (HRM) adapter, an SSM-based residual module initialized via Balanced Truncation of empirical Hankel Grammians. By leveraging the time-invariance of the system matrix $\bar{A}$, HRM enables an exact FFT-based parallel scan, achieving computational parity with LoRA across all context lengths. In iso-parametric evaluations on Mistral-7B (8.4M trainable parameters), HRM outperforms LoRA variants on LongBench tasks, including QuALITY (+34.8% relative accuracy) and QMSum (+71.6% relative ROUGE-1). HRM further demonstrates consistent superiority across 18 configurations of synthetic state-tracking (DFA, Parity) and character-level language modeling (enwik8). Gate analysis reveals that HRM adapters effectively learn to modulate recurrence, providing a robust architectural alternative to low-rank adaptation for long-context sequence modeling.
State Space Models, Controllability, Observability, Grammian, Hankel rank reduction
## 1 Introduction
Parameter-efficient fine-tuning (PEFT) is a dominant paradigm for adapting large pre-trained language models (LLMs) to downstream tasks. Rather than updating all model weights, PEFT methods insert adapters or modify a small subset of parameters, keeping the model backbone frozen. Low-Rank Adaptation (LoRA) (Hu et al., [2022](https://arxiv.org/html/2606.26290#bib.bib1)) is the most widely adopted PEFT method, achieving strong results across language understanding, generation, and instruction-following tasks while adding $\sim$0.1–1% extra parameters. LoRA parameterizes a weight update as $\Delta W = BA$, where $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$. The matrices $A, B$ are learned with rank $r \ll \min(d_{out}, d_{in})$, and the full forward pass through the adapted layer becomes:

$$h_t = W_0 x_t + BA x_t = (W_0 + BA) x_t \tag{1}$$

where the model weights $W_0$ are kept frozen and $x_t$ is the input at position $t$. We observe that the computation used to adapt weights in LoRA (and its related methods: DoRA (Liu et al., [2024](https://arxiv.org/html/2606.26290#bib.bib2)), QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2606.26290#bib.bib5)), AdaLoRA (Zhang et al., [2023b](https://arxiv.org/html/2606.26290#bib.bib4))) is a static linear function of the input $x_t$. As a result, the adapter output at position $t$ has no access to the prior positions $x_{t-1}, x_{t-2}, \cdots$. This is not a failure of reduced-rank modeling: no choice of $r$ will give LoRA access to temporal memory.
To motivate this central issue, consider fine-tuning a model to simulate a 4-state Deterministic Finite Automaton (DFA). At each step, the correct output depends not on the current input symbol alone, but on the accumulated sequence of transitions since the start. A DFA with 4 states can be in any of 4 configurations depending on the entire history $x_1, x_2, \cdots, x_{t-1}$. LoRA, regardless of rank, collapses the current state to a static function of $x_t$; it therefore structurally cannot represent a state that persists across positions. Despite this, the success of LoRA, DoRA, AdaLoRA, and QLoRA on tasks where adaptation is position-independent, such as domain style transfer, factual knowledge injection, and instruction following, is well established in the literature.
To this end, we investigate the following question: *is it possible to construct a PEFT adapter that (1) adds temporal recurrent state to a frozen transformer, (2) is provably compressible to a minimal state dimension, and (3) is computationally equivalent to LoRA, while achieving better performance on long-range tasks across diverse domains?*
[Figure 1 diagram. LoRA panel: frozen weights $W_0$ with low-rank factors $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$ on the Q, K, V projections. HRM panel: attention (frozen, unchanged) on Q, K, V, followed by the frozen MLP with a gated SSM adapter (Hankel Reduced Order Model) in parallel.]
Figure 1: Architecture comparison. LoRA modifies weight matrices; its output at position $t$ is a static function of $x_t$. The HRM adapter inserts a parallel recurrent branch whose hidden state integrates all prior representations.
## 2 Related Works
All major PEFT methods share the structural property of position-independence (position-agnostic weight fine-tuning). LoRA (Hu et al., [2022](https://arxiv.org/html/2606.26290#bib.bib1)), AdaLoRA (Zhang et al., [2023b](https://arxiv.org/html/2606.26290#bib.bib4)), QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2606.26290#bib.bib5)), and LoRA+ (Hayou et al., [2024](https://arxiv.org/html/2606.26290#bib.bib6)) all compute $h = f(x_t)$ with no dependence on $t$ or on prior positions. AdaLoRA adaptively allocates rank, but the resulting update is still a static matrix product. IA3 (Liu et al., [2022](https://arxiv.org/html/2606.26290#bib.bib7)) applies learned vectors to rescale hidden states, a multiplication by position-independent scaling factors, which also results in a static update (see Fig. [1](https://arxiv.org/html/2606.26290#S1.F1)).
Some works prepend learned soft tokens to the input (Lester et al., [2021](https://arxiv.org/html/2606.26290#bib.bib8); Li and Liang, [2021](https://arxiv.org/html/2606.26290#bib.bib9)). These tokens provide context at the input but do not define a recurrent state, and the transformer still processes each position independently after the prefix. Foundational adapter methods for PEFT (Houlsby et al., [2019](https://arxiv.org/html/2606.26290#bib.bib10); Pfeiffer et al., [2020](https://arxiv.org/html/2606.26290#bib.bib11)) insert small MLP bottlenecks into pre-trained transformer models. The bottleneck $h = W_2 \cdot \sigma(W_1 \cdot x_t)$ depends only on $x_t$, without memory or recurrence. Houlsby et al. ([2019](https://arxiv.org/html/2606.26290#bib.bib10)) place two adapter bottleneck modules into each transformer layer, while Pfeiffer et al. ([2020](https://arxiv.org/html/2606.26290#bib.bib11)) place a single adapter, halving the number of trainable parameters.
On the other hand, state space models (SSMs) have shown promise in alleviating the quadratic cost of attention over long contexts. Structured State Space Sequence (S4) models were introduced by Gu et al. ([2021](https://arxiv.org/html/2606.26290#bib.bib14)) as state space layers with HiPPO-based initialization and convolution-mode inference. S4D (Gu et al., [2022](https://arxiv.org/html/2606.26290#bib.bib13)) simplified this design by restricting the state matrix $\bar{A}$ to be diagonal, losing some expressiveness but enabling simpler inference. Finally, Mamba models (Gu and Dao, [2023](https://arxiv.org/html/2606.26290#bib.bib15)) introduced input-dependent state matrices $(A_t, B_t)$, enabling selective memory.
Hybrid model architectures such as Griffin (De et al., [2024](https://arxiv.org/html/2606.26290#bib.bib16)), MambaFormer (Park et al., [2024](https://arxiv.org/html/2606.26290#bib.bib17)), and Jamba (Lieber et al., [2024](https://arxiv.org/html/2606.26290#bib.bib18)) combine SSM layers with transformers. On the surface they seem similar to our proposed work (HRM inserts a $d=32$ SSM at each MLP block; MambaFormer inserts Mamba layers between attention blocks). The critical distinction is one of training regime: every hybrid architecture requires joint training from scratch on billions of tokens. HRM is the first method that adds SSM-style temporal memory in the PEFT setting: the backbone is frozen, the adapter has $\sim 0.1\%$ parameters, and no pre-training data beyond the fine-tuning task is required. As a result, a user with a frozen, pre-trained GPT-2 cannot apply MambaFormer to it, but they can apply HRM.
The combination of (a) a recurrent hidden state, (b) provable compression via model-order reduction, and (c) computational parity with static adapters does not appear in the literature. To the best of our knowledge, the closest related work is SLoRA (Sheng et al., [2024](https://arxiv.org/html/2606.26290#bib.bib24)) and related low-rank SSM approaches that treat SSMs as a structured alternative to LoRA rank approximations. However, these do not apply Balanced Truncation, do not provide error bounds, and do not address the computational overhead of the recurrence.
## 3 Background
#### LoRA
Low-Rank Adaptation (LoRA) (Hu et al., [2022](https://arxiv.org/html/2606.26290#bib.bib1)) relies on the observation that the weight updates $\Delta W \in \mathbb{R}^{d_{out} \times d_{in}}$ of pre-trained models during fine-tuning have a low intrinsic dimension (Aghajanyan et al., [2021](https://arxiv.org/html/2606.26290#bib.bib33)). This motivates parameterizing the update as a rank-$r$ product:

$$\Delta W = BA, \quad B \in \mathbb{R}^{d_{out} \times r} \text{ and } A \in \mathbb{R}^{r \times d_{in}} \tag{2}$$

During training $W_0$ is frozen, and only $B$ and $A$ are updated. At inference, the update is absorbed as $W_{eff} = W_0 + BA$, adding no inference latency, with the forward pass:

$$h = W_{eff} x = (W_0 + BA)x = W_0 x + B(Ax) \tag{3}$$

In standard practice, LoRA is applied to the $Q$ and $V$ projection matrices of each self-attention block. For a model with $n_{layers}$ layers and attention dimension $d_{model}$, this contributes $4r \cdot n_{layers} \cdot d_{model}$ trainable parameters. The mapping $h = (W_0 + BA)x$ is a linear function of $x$ alone, and the matrix $W_{eff}$ is the same at all positions. If we index the sequence position as $t$, the LoRA output at position $t$ is $h^{LoRA}_t = (W_0 + BA)x_t$, with no dependence on the previous inputs $x_{t-1}$, etc.
The adapter applies the same linear transformation $BA$ to every token, independently of position or context, and is therefore memoryless. AdaLoRA (Zhang et al., [2023b](https://arxiv.org/html/2606.26290#bib.bib4)) addresses rank allocation but not memory. It parameterizes $\Delta W = P \Lambda Q$, where $P$, $Q$ are orthogonal and $\Lambda$ is diagonal (a singular value decomposition structure), pruning entries of $\Lambda$ based on importance. The result is still a static linear map of the current token. QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2606.26290#bib.bib5)) addresses memory efficiency (4-bit quantization of $W_0$), and DoRA (Liu et al., [2024](https://arxiv.org/html/2606.26290#bib.bib2)) decomposes the weights into magnitude and direction components. Both remain static functions of the current token. The memoryless property is therefore preserved in existing LoRA variants.
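The memoryless property can be checked numerically. The following is a minimal sketch, assuming NumPy and small illustrative dimensions that are not taken from the paper: it applies $h_t = (W_0 + BA)x_t$ to a sequence, then replaces every token before the last one and confirms that the last output does not change.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, T = 8, 8, 2, 5        # illustrative sizes (assumption)

W0 = rng.normal(size=(d_out, d_in))   # frozen backbone weight
B = rng.normal(size=(d_out, r))       # LoRA factor B
A = rng.normal(size=(r, d_in))        # LoRA factor A


def lora_forward(x_seq):
    """Apply h_t = (W0 + B A) x_t to every row x_t of a (T, d_in) array."""
    return x_seq @ (W0 + B @ A).T


x = rng.normal(size=(T, d_in))
h = lora_forward(x)

# Replace every token before the last one; the last output must not move.
x_perturbed = x.copy()
x_perturbed[:-1] = rng.normal(size=(T - 1, d_in))
h_perturbed = lora_forward(x_perturbed)

last_output_unchanged = np.allclose(h[-1], h_perturbed[-1])
print("last output unchanged:", last_output_unchanged)
```

Any adapter of the form $f(x_t)$ passes this check, which is exactly why it cannot track a state that depends on earlier positions.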
#### SSMs
A continuous-time linear state-space model (SSM) is defined by the equations:

$$\dot{x}(t) = Ax(t) + Bu(t), \quad y(t) = Cx(t) + Du(t) \tag{4}$$

for hidden state $x \in \mathbb{R}^{d}$, input $u \in \mathbb{R}^{m}$, and output $y \in \mathbb{R}^{p}$, where $A$ is the state-transition (or system) matrix, $B$ the input matrix, $C$ the output matrix, and $D$ the feed-through (or skip) matrix. For sequence modeling, the continuous-time system is discretized to obtain a recurrence relation. Given a time step $\Delta t$, the Zero-Order Hold (ZOH) discretization yields:

$$\begin{split} x_t &= \bar{A} x_{t-1} + \bar{B} u_t, \quad y_t = C x_t \\ \bar{A} &= e^{A \Delta t}, \quad \bar{B} = A^{-1}(e^{A \Delta t} - I)B \end{split} \tag{5}$$
The discrete SSM defines a linear map from the input sequence $\{u_1, \dots, u_T\}$ to the output sequence $\{y_1, \dots, y_T\}$:

$$y_t = \sum_{k=0}^{t} \underbrace{C \bar{A}^{t-k} \bar{B}}_{g_{t-k}} u_k = (g \star u)_t \tag{6}$$

where $g_k$ is the impulse response of the system. The output sequence can therefore be written as the causal convolution of the impulse response with the input.
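The equivalence between the recurrence and the causal convolution can be verified on a toy system. This is a minimal sketch assuming NumPy, a diagonal $\bar{A}$, a single input and a single output, a zero initial state, and zero-based indexing of the sequence; the sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 16                              # illustrative sizes (assumption)

a_bar = rng.uniform(0.2, 0.9, size=d)     # diagonal of A_bar, inside the unit circle
B_bar = rng.normal(size=d)                # single-input B_bar
C = rng.normal(size=d)                    # single-output C
u = rng.normal(size=T)                    # scalar input sequence u_0 .. u_{T-1}

# Sequential recurrence: x_t = A_bar x_{t-1} + B_bar u_t, y_t = C x_t
x = np.zeros(d)
y_rec = np.empty(T)
for t in range(T):
    x = a_bar * x + B_bar * u[t]
    y_rec[t] = C @ x

# Impulse response g_k = C A_bar^k B_bar for k = 0 .. T-1
k = np.arange(T)
g = (C * a_bar ** k[:, None] * B_bar).sum(axis=1)

# Causal convolution (g * u)_t, truncated to the first T outputs
y_conv = np.convolve(g, u)[:T]

print("max abs difference:", np.abs(y_rec - y_conv).max())
```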
Structured State Spaces (S4) (Gu et al., [2022](https://arxiv.org/html/2606.26290#bib.bib13)) showed that when $\bar{A}$ is initialized as a specific Normal Plus Low-Rank (NPLR) matrix, the SSM can model long-range dependencies with a stable impulse response that decays slowly. The key computational insight of S4 is that the causal convolution $(g \star u)$ can be computed in $\mathcal{O}(T \log T)$ via FFT. We also use this fact for our computation in the subsequent sections.
Finally, the *stability of the discrete SSM* in ([5](https://arxiv.org/html/2606.26290#S3.E5)) requires all eigenvalues of $\bar{A}$ to lie strictly within the unit circle, i.e., $\max_i \lvert \lambda_i(\bar{A}) \rvert < 1$. For diagonal $\bar{A}$ with real entries, this requires $\lvert \bar{A}_{ii} \rvert < 1$. We enforce this by parameterization, as our reduced-order modeling requires stability of the underlying linear time-invariant (LTI) system.
#### Balanced Truncation in LTI Systems
Consider LTI dynamics (i.e., fixed $G \triangleq (\bar{A}, \bar{B}, C, D)$) in ([5](https://arxiv.org/html/2606.26290#S3.E5)), with state dimension $d$. The reduced-order modeling problem for the LTI dynamical system $G$ is to find a reduced-order system $\hat{G}$ with state dimension $\hat{d} < d$, such that the input-output behaviors of $G$ and $\hat{G}$ are as close as possible, with a quantified error bound. *Balanced Truncation* (BT) (Moore, [2003](https://arxiv.org/html/2606.26290#bib.bib32)) is the canonical solution to this problem for stable LTI systems.
A state $v \in \mathbb{R}^{d}$ of the LTI system is *controllable* if there exists an input sequence that drives $G$ from the origin to $v$. Controllability of the LTI system is characterized by the *Controllability Grammian* $W_c \in \mathbb{R}^{d \times d}$, a positive semi-definite matrix defined as:

$$W_c = \sum_{k=0}^{\infty} \bar{A}^{k} \bar{B} \bar{B}^{T} (\bar{A}^{T})^{k} \tag{7}$$

Equivalently, $W_c$ is the solution of the discrete-time Lyapunov equation (Corless and Frazho, [2003](https://arxiv.org/html/2606.26290#bib.bib30)):

$$\bar{A} W_c \bar{A}^{T} - W_c + \bar{B} \bar{B}^{T} = 0 \tag{8}$$
Conversely, a state $v$ is *observable* if, taken as the initial state, it can be uniquely determined from the output sequence $\{y_k\}$. Similarly, observability of the LTI system is characterized by its *Observability Grammian* $W_o \in \mathbb{R}^{d \times d}$, defined as:

$$W_o = \sum_{k=0}^{\infty} (\bar{A}^{T})^{k} C^{T} C \bar{A}^{k} \tag{9}$$

with its corresponding Lyapunov equation:

$$\bar{A}^{T} W_o \bar{A} - W_o + C^{T} C = 0 \tag{10}$$
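Both Grammians can be computed directly for a small system. The sketch below assumes NumPy and a diagonal $\bar{A}$ with illustrative values; it sums the truncated series for $W_c$ and $W_o$, compares them with the closed form that holds for diagonal $\bar{A}$, namely $(W_c)_{ij} = (\bar{B}\bar{B}^{T})_{ij} / (1 - \bar{a}_i \bar{a}_j)$, and evaluates the Lyapunov residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, p = 4, 3, 2                          # illustrative sizes (assumption)
a_bar = np.array([0.9, 0.7, 0.4, 0.2])     # diagonal of A_bar, all inside (0, 1)
A = np.diag(a_bar)
B = rng.normal(size=(d, m))
C = rng.normal(size=(p, d))


def gramian_series(A, M, n_terms=500):
    """Truncated sum over k of A^k M (A^T)^k."""
    W = np.zeros_like(M)
    term = M.copy()
    for _ in range(n_terms):
        W += term
        term = A @ term @ A.T
    return W


Wc = gramian_series(A, B @ B.T)            # controllability Grammian
Wo = gramian_series(A.T, C.T @ C)          # observability Grammian

# Closed form for diagonal A_bar: divide entrywise by (1 - a_i a_j)
denom = 1.0 - np.outer(a_bar, a_bar)
Wc_closed = (B @ B.T) / denom
Wo_closed = (C.T @ C) / denom

# Residuals of the two discrete-time Lyapunov equations
res_c = A @ Wc @ A.T - Wc + B @ B.T
res_o = A.T @ Wo @ A - Wo + C.T @ C
print(np.abs(res_c).max(), np.abs(res_o).max())
```

For a general (non-diagonal) $\bar{A}$, a dedicated Lyapunov solver would normally replace the truncated series.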
The matrices $W_c$ and $W_o$ play an important role in the balanced truncation of $G$ by forming a joint Hankel operator $\mathcal{H}: \{\text{past inputs}\} \to \{\text{future outputs}\}$. For a discrete-time LTI system, $\mathcal{H}$ is the fixed matrix $\Gamma \triangleq W_c W_o$, which maps the full causal history of inputs to all future outputs. To perform truncation, we need to align the coordinate system so that state directions are ordered by their joint controllability and observability. This ordering is given by the *Hankel singular values* (HSVs):

$$\sigma_i = \sqrt{\lambda_i(W_c W_o)}, \quad \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_d \geq 0 \tag{11}$$

Here $\sigma_i$ is also the $i^{\text{th}}$ singular value of the Hankel operator; a small $\sigma_i$ marks a state direction that is nearly irrelevant to the past-to-future input-output map. The balancing transformation is a coordinate transform $T \in \mathbb{R}^{d \times d}$ such that the Grammians are simultaneously diagonalized:

$$T W_c T^{T} = T^{-T} W_o T^{-1} = \Sigma = \mathrm{diag}(\sigma_1, \cdots, \sigma_d) \tag{12}$$

As a result, the transformed system $(T \bar{A} T^{-1}, T \bar{B}, C T^{-1})$ has the property that each state direction has equal controllability and observability, equal to $\sigma_i$. Such a system is called a balanced system.
A balanced truncation $\hat{G}$ of the system $G$ can now be formed by partitioning the balanced system into "important" ($1, \cdots, \hat{d}$) and "unimportant" ($\hat{d}+1, \cdots, d$) blocks and keeping the former:

$$\begin{split} \hat{A} &= \left(T \bar{A} T^{-1}\right)_{[1:\hat{d},\, 1:\hat{d}]}, \\ \hat{B} &= \left(T \bar{B}\right)_{[1:\hat{d},\, :]}, \quad \hat{C} = \left(C T^{-1}\right)_{[:,\, 1:\hat{d}]} \end{split} \tag{13}$$

Finally, Glover's error bound (Glover, [1984](https://arxiv.org/html/2606.26290#bib.bib31)) states that the truncated system deviates from the original by at most twice the sum of the discarded HSVs:

$$\left\lVert G - \hat{G} \right\rVert_{\mathcal{H}_\infty} \leq 2 \sum_{k=\hat{d}+1}^{d} \sigma_k \tag{14}$$

This is a worst-case bound over all inputs and all frequencies. Furthermore, $\hat{G}$ is stable, and Glover's bound is tight.
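The balancing transform, the truncation, and the error bound can be exercised end to end on a toy system. The following is a sketch, not the authors' implementation: it assumes NumPy, a diagonal $\bar{A}$ with illustrative values, and the standard square-root construction of $T$ from Cholesky factors of the Grammians, which is one of several ways to obtain a balancing transform. The $\mathcal{H}_\infty$ error is estimated on a frequency grid, so the estimate is a lower bound on the true norm.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, p, d_hat = 4, 4, 4, 2                # illustrative sizes (assumption)
a_bar = np.array([0.9, 0.6, 0.3, 0.1])
A = np.diag(a_bar)
B = rng.normal(size=(d, m))
C = rng.normal(size=(p, d))

# Grammians in closed form (valid because A_bar is diagonal)
denom = 1.0 - np.outer(a_bar, a_bar)
Wc = (B @ B.T) / denom
Wo = (C.T @ C) / denom

# Square-root balancing: Wc = Lc Lc^T, Wo = Lo Lo^T, SVD of Lo^T Lc
Lc = np.linalg.cholesky(Wc)
Lo = np.linalg.cholesky(Wo)
U, hsv, Vt = np.linalg.svd(Lo.T @ Lc)      # hsv = Hankel singular values
T = np.diag(hsv ** -0.5) @ U.T @ Lo.T
T_inv = Lc @ Vt.T @ np.diag(hsv ** -0.5)

# Balanced system, then truncation to the leading d_hat states
A_bal, B_bal, C_bal = T @ A @ T_inv, T @ B, C @ T_inv
A_hat, B_hat, C_hat = A_bal[:d_hat, :d_hat], B_bal[:d_hat], C_bal[:, :d_hat]


def freq_response(A, B, C, w):
    """Transfer matrix C (e^{jw} I - A)^{-1} B at frequency w."""
    n = A.shape[0]
    return C @ np.linalg.solve(np.exp(1j * w) * np.eye(n) - A, B)


ws = np.linspace(0.0, np.pi, 400)
err = max(
    np.linalg.norm(freq_response(A, B, C, w) - freq_response(A_hat, B_hat, C_hat, w), 2)
    for w in ws
)
bound = 2.0 * hsv[d_hat:].sum()
print("grid H-infinity error:", err, "bound:", bound)
```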
## 4 Method: Hankel-Reduced order Model Adapter
#### Empirical Grammians for SSMs
HSV-based balanced truncation requires that the system $G$ be LTI. The proposed HRM adapter's $\bar{A}$ is time-invariant, so the theory applies directly. However, the selective SSM extension (input-dependent $B_t, C_t$ in the selective scan (Gu and Dao, [2023](https://arxiv.org/html/2606.26290#bib.bib15))) violates the LTI assumption. We address this extension via the *empirical Grammians* approach (Lall et al., [1999](https://arxiv.org/html/2606.26290#bib.bib29)), which extends balanced truncation to time-varying systems by approximating the Grammians from observed state trajectories. This involves running the system forward on a representative calibration dataset of $N$ sequences. At each time step $t$ of each sequence $n$, we record the state vector $s_{n,t}$ and compute the empirical controllability Grammian as:

$$W^{emp}_c = \frac{1}{N \cdot t_{cal}} \sum_{n=1}^{N} \sum_{t=1}^{t_{cal}} s_{n,t} s_{n,t}^{T} \in \mathbb{R}^{d \times d} \tag{15}$$

This estimates the covariance of state trajectories under typical inputs, a proxy for controllability. A similar proxy for observability is the empirical observability Grammian:

$$W^{emp}_o = \frac{1}{N \cdot t_{cal}} \sum_{n=1}^{N} \sum_{t=1}^{t_{cal}} y_{n,t}^{T} y_{n,t} \in \mathbb{R}^{d \times d} \tag{16}$$

By Lall et al. ([1999](https://arxiv.org/html/2606.26290#bib.bib29)), $W^{emp}_c \to W_c$ and $W^{emp}_o \to W_o$ as $N \to \infty$, and the estimates converge at a rate of $\mathcal{O}(1/\sqrt{N})$.
For our case, this gives an $\mathcal{O}(N \cdot t_{cal} \cdot d^{2})$ cost to compute the HSVs, from which the balancing transform and truncation proceed exactly as in the LTI case. For the time-invariant HRM adapter, however, both the analytical Lyapunov and the empirical Grammian approaches are available.
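The empirical controllability Grammian of (15) amounts to averaging outer products of recorded states. The sketch below assumes NumPy, a diagonal $\bar{A}$, and Gaussian noise as a stand-in for a real calibration set; the function name and all sizes are hypothetical. It covers only the controllability estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_model, N, t_cal = 4, 6, 64, 50        # illustrative sizes (assumption)
a_bar = rng.uniform(0.3, 0.9, size=d)      # diagonal of A_bar
B_bar = rng.normal(size=(d, d_model)) / np.sqrt(d_model)


def empirical_controllability_gramian(inputs):
    """Average of s s^T over all recorded states; inputs has shape (N, t_cal, d_model)."""
    n_seq, n_steps, _ = inputs.shape
    W = np.zeros((d, d))
    for n in range(n_seq):
        s = np.zeros(d)                    # each sequence starts from the zero state
        for t in range(n_steps):
            s = a_bar * s + B_bar @ inputs[n, t]
            W += np.outer(s, s)            # one O(d^2) update per recorded state
    return W / (n_seq * n_steps)


calib = rng.normal(size=(N, t_cal, d_model))   # stand-in calibration data (assumption)
Wc_emp = empirical_controllability_gramian(calib)
print(np.round(Wc_emp, 3))
```

The double loop performs $N \cdot t_{cal}$ outer-product updates of size $d \times d$, which matches the stated $\mathcal{O}(N \cdot t_{cal} \cdot d^{2})$ cost.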
#### HRM Adapter Architecture
We are now ready to design the HRM adapter based on Hankel order reduction for our SSM. Consider a standard pre-norm transformer layer $l$ with input $x_t \in \mathbb{R}^{d_{model}}$. The layer applies self-attention followed by an MLP sub-layer, each with a residual connection and layer normalization (LN):

$$\begin{split} a_t &= x_t + \mathrm{Attn}(\mathrm{LN}(x_t)), \\ h^{MLP}_t &= a_t + \mathrm{MLP}(\mathrm{LN}(a_t)) \end{split} \tag{17}$$

All weights in $\mathrm{Attn}(\bullet)$ and $\mathrm{MLP}(\bullet)$ are frozen during adapter training. The HRM adapter is inserted parallel to the $l^{\text{th}}$ MLP sub-layer, adding a recurrent correction to the MLP output:

$$h^{out,(l)}_t = h^{MLP,(l)}_t + \alpha^{(l)} \cdot y^{(l)}_t \tag{18}$$

where $\alpha^{(l)} \in \mathbb{R}$ is a layer-specific learnable gate scalar and $y_t^{(l)}$ is the adapter output for layer $l$ at position $t$.
The adapter at layer $l$ defines a recurrent hidden state $s_t^{(l)} \in \mathbb{R}^{d}$ that integrates the token representations $h_1, h_2, \cdots, h_t$ as they are processed:

$$s_t^{(l)} = \bar{A}^{(l)} s_{t-1}^{(l)} + \bar{B}^{(l)} h_t^{MLP,(l)}, \quad y_t^{(l)} = C^{(l)} s_t^{(l)} \in \mathbb{R}^{d_{model}} \tag{19}$$

where $\bar{A} \in \mathbb{R}^{d \times d}$ is a learnable (diagonal) state transition matrix, $\bar{B} \in \mathbb{R}^{d \times d_{model}}$ maps the current hidden state into the adapter's state space, $C \in \mathbb{R}^{d_{model} \times d}$ maps the adapter state back to the hidden representation, and $\alpha$ is a learnable scalar gate. Using ([18](https://arxiv.org/html/2606.26290#S4.E18)) and ([19](https://arxiv.org/html/2606.26290#S4.E19)), the adapter's output can be unrolled as:

$$y_t^{(l)} = C^{(l)} \sum_{k=0}^{t} \left(\bar{A}^{(l)}\right)^{t-k} \bar{B}^{(l)} h^{MLP,(l)}_k \tag{20}$$

Therefore, the combined gated addition ([18](https://arxiv.org/html/2606.26290#S4.E18)) gives the unrolled layer computation as:

$$h^{out,(l)}_t = h^{MLP,(l)}_t + \alpha^{(l)} C^{(l)} \sum_{k=0}^{t} \left(\bar{A}^{(l)}\right)^{t-k} \bar{B}^{(l)} h^{MLP,(l)}_k \tag{21}$$
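The gated recurrent correction and its unrolled form can be compared on random data. This is a minimal sketch for a single layer, assuming NumPy, a diagonal $\bar{A}$, a zero initial state, and illustrative sizes and gate value; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d = 12, 8, 4                    # illustrative sizes (assumption)
a_bar = rng.uniform(0.2, 0.9, size=d)       # diagonal of A_bar
B_bar = 0.1 * rng.normal(size=(d, d_model))
C = 0.1 * rng.normal(size=(d_model, d))
h_mlp = rng.normal(size=(T, d_model))       # stand-in for the frozen MLP sub-layer output


def hrm_layer(h_mlp, alpha):
    """Recurrent form: s_t = A_bar s_{t-1} + B_bar h_t, out_t = h_t + alpha * C s_t."""
    s = np.zeros(d)
    out = np.empty_like(h_mlp)
    for t in range(len(h_mlp)):
        s = a_bar * s + B_bar @ h_mlp[t]
        out[t] = h_mlp[t] + alpha * (C @ s)
    return out


def hrm_layer_unrolled(h_mlp, alpha):
    """Unrolled form: out_t = h_t + alpha * C sum_k A_bar^(t-k) B_bar h_k."""
    out = np.empty_like(h_mlp)
    for t in range(len(h_mlp)):
        acc = np.zeros(d)
        for k in range(t + 1):
            acc += a_bar ** (t - k) * (B_bar @ h_mlp[k])
        out[t] = h_mlp[t] + alpha * (C @ acc)
    return out


out_rec = hrm_layer(h_mlp, alpha=0.1)
out_unrolled = hrm_layer_unrolled(h_mlp, alpha=0.1)
print("max abs difference:", np.abs(out_rec - out_unrolled).max())
```

With `alpha=0.0` the layer returns `h_mlp` unchanged, which is the sense in which a small gate keeps the adapted model close to the frozen backbone.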
The adapter is placed parallel to the MLP, with a learnable scalar weight $\alpha$. There are two reasons for this placement. First, the attention mechanism already computes a weighted sum over all past positions, providing global context. Adding a recurrent branch to attention would interact with the causal mask in a non-trivial way and could disturb the attention distribution. Second, the MLP sub-layer is the natural site of position-independent computation, as it applies the same learned function to each token representation independently. A recurrent adapter at this site adds the missing dependence on prior positions. A sequential insertion, by contrast, would mean that the MLP receives adapter-modified input, potentially causing large gradient flows through the frozen MLP. The parallel insertion ensures that the frozen MLP is always evaluated on the original attention output, while the adapter's contribution is additive and controlled by $\alpha$. The gating scalar $\alpha$ is initialized to a small value so that at the start of training the HRM adapter's contribution is small and the model starts from the behavior of the pre-trained backbone. The gate then grows with training as the adapter learns a useful temporal correction. Experiments show that initializing at $\alpha_0 = 1.0$ causes divergence on all tested configurations.
Another design choice is the time-invariant $\bar{A}$. A natural concern is that Mamba-style selective SSMs (with input-dependent $\bar{A}_t, B_t, C_t$) are more expressive, as they can selectively forget irrelevant tokens by adjusting the state decay on the fly. We deliberately fix a time-invariant $\bar{A}$ in our architecture because the pre-trained model's attention already handles selectivity to a certain degree. The frozen self-attention mechanism performs global, content-based retrieval at every layer, choosing which past tokens to attend to. The HRM adapter's role is complementary: it provides a continuous recurrent state that integrates the local MLP output stream, accumulating context that the position-independent MLP stream cannot represent. Further, a time-invariant $\bar{A}$ makes $\bar{A}^{k}$ a geometric sequence, enabling the exact FFT convolution shortcut. This eliminates the compute overhead and makes the HRM adapter practical while keeping it temporally causal. Time-invariance also makes the trained adapter an LTI system, for which Balanced Truncation with the Glover $\mathcal{H}_\infty$ bound applies analytically, and it allows the Grammians to be computed more easily from the Lyapunov equations. The expressivity trade-off is knowingly accepted in exchange for theoretical tractability and computational efficiency.
#### HRM State Transition Matrix Parameterization & Stability
We parameterize the state transition matrix $\bar{A}$ in ([19](https://arxiv.org/html/2606.26290#S4.E19)) as a diagonal matrix (this is common in Mamba- and S4D-like SSMs (Gu and Dao, [2023](https://arxiv.org/html/2606.26290#bib.bib15); Gu et al., [2022](https://arxiv.org/html/2606.26290#bib.bib13))) to keep the matrix-vector product cost at $\mathcal{O}(d)$, instead of $\mathcal{O}(d^{2})$ for a general $\bar{A}$:

$$\bar{A}^{(l)} = \mathrm{diag}\left(\bar{a}^{(l)}_1, \cdots, \bar{a}^{(l)}_d\right), \quad \bar{a}^{(l)}_i = \exp\left(-\exp\left(\log A^{(l)}_i\right)\right) \tag{22}$$

where $\log A^{(l)}_i \in \mathbb{R}$ is the raw (unconstrained) learnable parameter. This parameterization ensures that $0 < \bar{a}_i^{(l)} < 1$ and therefore $\max_i \lvert \lambda_i(\bar{A}^{(l)}) \rvert < 1$ for each layer. As a result, the HRM dynamics are unconditionally stable for any parameter values during all stages of training.
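The stability parameterization is a two-line computation. The sketch below uses only the Python standard library; the helper name and the sampled raw values are illustrative assumptions.

```python
import math


def a_bar_from_raw(log_a):
    """Map an unconstrained raw parameter log_a to a_bar = exp(-exp(log_a))."""
    return math.exp(-math.exp(log_a))


raw_values = [-5.0, -1.0, 0.0, 1.0, 3.0]    # illustrative raw parameters (assumption)
a_bars = [a_bar_from_raw(v) for v in raw_values]

for raw, a in zip(raw_values, a_bars):
    print(f"log A = {raw:+.1f}  ->  a_bar = {a:.6f}")
```

Mathematically the map lands in the open interval $(0, 1)$ for every real input. In floating point, extreme raw values can round to exactly 0 or 1, so an implementation may want to clamp the raw parameter to a moderate range.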
The parametrization above has a ZOH interpretation, as $\bar{A} = \exp(A \Delta t)$ for the continuous-time system in ([4](https://arxiv.org/html/2606.26290#S3.E4)) with ZOH discretization step $\Delta t$. In $\bar{a}_i = \exp(-\exp(\log A_i))$, the term $\exp(\log A_i)$ plays the role of $\Delta t \cdot \lvert A_i \rvert \in (0, \infty)$. The combined parameter $\log A_i$ thus absorbs both the magnitude of the continuous $A$ eigenvalue and the discretization step $\Delta t$. $\bar{B}$ and $C$ are unconstrained dense matrices with learnable parameters $\bar{B}^{(l)} \in \mathbb{R}^{d \times d_{model}}$ and $C^{(l)} \in \mathbb{R}^{d_{model} \times d}$, both initialized with small-variance Gaussian entries. Finally, a learnable parameter $\log \Delta t^{(l)}$ associated with each layer $l$ is used to compute the ZOH discretization step $\Delta t^{(l)}$. This allows the adapter to learn an appropriate timescale for the task. As a result, the learnable parameters per layer are $\bar{B}$: $d \times d_{model}$; $C$: $d_{model} \times d$; $\log A_i$: $d$; $\log \Delta t$: $d$; and the gate $\alpha$: $1$, for a total of $2d \cdot d_{model} + 2d + 1$ parameters. Therefore, the total number of parameters (compared with LoRA) is:

$$\begin{split} P_{HRM} &= n_{layers} \cdot (2d \cdot d_{model} + 2d + 1) \\ P_{LoRA} &= n_{layers} \cdot 2 \cdot 2 \cdot r \cdot d_{model} \end{split} \tag{23}$$

Following this, the state is compressed: for a fixed $r$, the iso-parametric $P_{HRM}$ is compressed to a Glover bound of $\varepsilon = 0.01$. That is, the top $90\%$ of the HSVs are kept, and the remainder are discarded. This HSV-based compression means that $P_{HRM}(\hat{d}) \leq P_{HRM} \approx P_{LoRA}$.
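The two counts in (23) are easy to evaluate. The sketch below plugs in values that are assumptions for illustration: $n_{layers} = 32$ and $d_{model} = 4096$ (a Mistral-7B-like configuration), the state dimension $d = 32$ mentioned in the text, and a LoRA rank $r = 16$ chosen here because it makes the two counts nearly equal. The function names are hypothetical.

```python
def hrm_params(n_layers, d_model, d):
    """P_HRM = n_layers * (2 d d_model + 2 d + 1)."""
    return n_layers * (2 * d * d_model + 2 * d + 1)


def lora_params(n_layers, d_model, r):
    """P_LoRA = n_layers * 2 * 2 * r * d_model (two projections, two factors each)."""
    return n_layers * 2 * 2 * r * d_model


n_layers, d_model = 32, 4096                      # assumed backbone configuration
p_hrm = hrm_params(n_layers, d_model, d=32)       # 8,390,688
p_lora = lora_params(n_layers, d_model, r=16)     # 8,388,608 (r=16 is an assumption)

print(f"P_HRM  = {p_hrm:,}")
print(f"P_LoRA = {p_lora:,}")
```

Under these assumed values both counts are roughly 8.4M. The formulas treat every adapted projection as $d_{model} \times d_{model}$, as (23) does.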
#### Parallel Scan for HRM Adapter
Since our $\bar{A}$ is time-invariant, the HRM recurrence computes a causal linear convolution. This convolution can be evaluated in $\mathcal{O}(T \log T)$ via the Fast Fourier Transform, replacing $\mathcal{O}(T)$ sequential Python-level dispatches with three FFT calls and achieving empirical compute parity with LoRA at all tested context lengths. This parity is shown in the wall-clock compute times for HRM and LoRA in Appendix [B](https://arxiv.org/html/2606.26290#A2).
Using FFT-based parallel scan arguments from (Gu and Dao, [2023](https://arxiv.org/html/2606.26290#bib.bib15), [2024](https://arxiv.org/html/2606.26290#bib.bib12)), let $\{h_t\}_{t=0}^{T-1} \subset \mathbb{R}^{d_{model}}$ be the input sequence to the HRM adapter. Let $s_t$ and $y_t$ be the HRM sequential recurrence state and output, respectively, for $s_0 = 0$. For the impulse response $g_k$ defined in ([6](https://arxiv.org/html/2606.26290#S3.E6)), define the zero-padded sequences:

$$\begin{split} \tilde{g}[k] &= \begin{cases} g_k & 0 \leq k \leq T-1 \\ 0 & T \leq k \leq 2T-1 \end{cases} \\ \tilde{h}[k] &= \begin{cases} h_k & 0 \leq k \leq T-1 \\ 0 & T \leq k \leq 2T-1 \end{cases} \end{split} \tag{24}$$

Then the output sequence can be computed as:

$$y_t = \sum_{k=0}^{t} g_{t-k} h_k = \left[\mathrm{IFFT}\left(\mathrm{FFT}(\tilde{g}) \odot \mathrm{FFT}(\tilde{h})\right)\right]_t \tag{25}$$

for $0 \leq t \leq T-1$, where $\odot$ is the element-wise product and FFT/IFFT operate on the length-$2T$ sequences. Since all operations in the FFT scan (`torch.fft.rfft`, element-wise multiply, `torch.fft.irfft`) are differentiable in PyTorch, gradients flow back through the FFT to $\bar{A}, \bar{B}, C$ without any custom CUDA kernels.
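The FFT scan can be checked against the sequential recurrence. This is a minimal sketch, assuming a diagonal $\bar{A}$ so that the convolution runs independently per state channel, and using `numpy.fft.rfft` / `numpy.fft.irfft` as a stand-in for the PyTorch calls named in the text; sizes and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d = 32, 6, 4                     # illustrative sizes (assumption)
a_bar = rng.uniform(0.2, 0.95, size=d)       # diagonal of A_bar
B_bar = 0.3 * rng.normal(size=(d, d_model))
C = 0.3 * rng.normal(size=(d_model, d))
h = rng.normal(size=(T, d_model))


def hrm_sequential(h):
    """Reference: step the recurrence one position at a time."""
    s = np.zeros(d)
    y = np.empty_like(h)
    for t in range(len(h)):
        s = a_bar * s + B_bar @ h[t]
        y[t] = C @ s
    return y


def hrm_fft(h):
    """Same outputs via zero-padded FFT convolution along the time axis."""
    T = len(h)
    u = h @ B_bar.T                                   # (T, d): projected inputs B_bar h_t
    kernel = a_bar[None, :] ** np.arange(T)[:, None]  # (T, d): a_i^k for k = 0 .. T-1
    n = 2 * T                                         # pad to 2T to avoid circular wrap-around
    U = np.fft.rfft(u, n=n, axis=0)
    K = np.fft.rfft(kernel, n=n, axis=0)
    s = np.fft.irfft(U * K, n=n, axis=0)[:T]          # (T, d): states s_t
    return s @ C.T                                    # (T, d_model): outputs C s_t


y_seq = hrm_sequential(h)
y_fft = hrm_fft(h)
print("max abs difference:", np.abs(y_seq - y_fft).max())
```

The sketch uses two forward transforms and one inverse transform, in line with the three FFT calls mentioned above. Padding to length $2T$ is what turns the circular convolution computed by the FFT into the causal linear convolution of (25).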
## 5 Experiments
### 5.1 Synthetic Task: DFA State Tracking
Consider a deterministic finite automaton (DFA), a 5-tuple $(Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ an input alphabet, $\delta: Q \times \Sigma \to Q$ the transition function, $q_0$ the initial state, and $F \subseteq Q$ the set of accepting states. Suppose we are given an input sequence $(\sigma_1, \cdots, \sigma_T) \in \Sigma^{T}$ and a DFA with $k$ states and binary alphabet $\Sigma = \{0, 1\}$. The state tracking task is to predict the current DFA state at each position $t$: given $(\sigma_1, \cdots, \sigma_t)$, predict $q_t = \delta(q_{t-1}, \sigma_t)$.
This requires exact state accumulation, sinceqtq\_\{t\}depends on the full history\(σ1,⋯,σt\)\(\\sigma\_\{1\},\\cdots,\\sigma\_\{t\}\)through the transition functions\. A model that cannot maintain state across positions will fail asTTgrows as it must somehow compress the DFA state into the current token representation alone\.
We experiment withk∈\{2,4,8\}k\\in\\\{2,4,8\\\}, binary alphabet, context lengthsT∈\{64,128,256,512\}T\\in\\\{64,128,256,512\\\}, with each DFA instance having a random fixed transition tableδ\\delta\. The model must output akk\-way classification at each position, with sequences sampled uniformly at random from all valid DFA paths, 10,000 training sequences \(1,000 validation sequences\), across 3 different seeds\.
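To make the task concrete, here is a small data-generation sketch. It is an illustration under stated assumptions rather than the paper's pipeline: the function names are hypothetical, the initial state is taken as $q_0 = 0$, and input symbols are drawn i.i.d. uniformly (for a total DFA over a binary alphabet, every input string induces a valid path).

```python
import random

def make_dfa(k, seed):
    """Random fixed transition table: delta[q][sigma] is the next state."""
    rng = random.Random(seed)
    return [[rng.randrange(k) for _ in range(2)] for _ in range(k)]

def sample_example(delta, T, rng, q0=0):
    """Return (inputs, states) where states[t] is the DFA state after reading inputs[t]."""
    inputs = [rng.randrange(2) for _ in range(T)]
    states, q = [], q0
    for sigma in inputs:
        q = delta[q][sigma]      # q_t = delta(q_{t-1}, sigma_t)
        states.append(q)
    return inputs, states

delta = make_dfa(k=4, seed=0)
rng = random.Random(1)
train = [sample_example(delta, T=64, rng=rng) for _ in range(100)]
print(train[0][0][:8], train[0][1][:8])
```

Each target depends on the whole prefix of inputs, which is what makes the labels unreachable for a model that sees only the current token.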
Figure 2: DFA state tracking results: (left) HRM vs. HRM with balanced truncation vs. LoRA, (right) Hankel singular value (HSV) decay rate for the task, with HSV cutoff threshold = 0.01.

In Fig. [2](https://arxiv.org/html/2606.26290#S5.F2), we observe that HRM-BT dominates LoRA at all values of $T$, and the gap grows with $T$, consistent with the memory hypothesis. Additionally, balanced truncation outperforms no truncation due to BT regularization. The HSV decay curve $\sigma_i/\sigma_1$ drops below 0.01 by $i = 7$, justifying $\hat{d} = 6$. The HSV spectrum confirms that *DFA dynamics are intrinsically 6-dimensional, despite training with $d = 32$ state dimensions*.
## 6 MAESTRO Piano Language Modeling
MAESTRO v2 (Hawthorne et al., [2018](https://arxiv.org/html/2606.26290#bib.bib3)) is a dataset of $\sim$200 hours of professional piano performances in symbolic MIDI format. We treat it as a character-level language modeling task, where each MIDI event (note-on, note-off, time-shift, velocity) is encoded as a single token and the model is trained to predict the next token given the context. The vocabulary has $\sim$300 distinct event tokens. Piano music is a natural testbed for long-range temporal modeling: (1) melodic phrases span dozens to hundreds of notes; (2) harmonic progressions follow conventions (ii-V-I, etc.) that span 8–16 measures; (3) rhythmic structure repeats at multiple timescales. A model that can only attend to recent tokens (or a static adapter applied at each position) will fail to capture these structural regularities. Finally, SSMs are generally regarded as better suited than transformers to audio-style sequence tasks, a benefit we expect to carry over to the HRM adapter.
For this task, we use a frozen TinyGPT backbone (4 layers, $d_{model} = 128$) with context length $T = 512$ events, 80K training events, and 5K validation events.
Figure 3: MAESTRO piano language modeling. (left) HRM vs. LoRA, (right) Final BPC.

The results are shown in Fig. [3](https://arxiv.org/html/2606.26290#S6.F3). HRM achieves a lower BPC at convergence, with substantially smaller variance across seeds (the band is nearly invisible for HRM). (Right) HRM accuracy is 0.3966 ± 0.0003 vs. 0.3843 ± 0.0033 for LoRA at epoch 40 ($t = 7.01$, $p < 0.001$). Both adapters use identical parameter budgets (Tier 2, 33K parameters).
### 6.1 enwik8 Character Language Modeling
enwik8 is a widely used character-level language modeling benchmark consisting of 100 million bytes of XML-formatted Wikipedia text (Mahoney, [2013](https://arxiv.org/html/2606.26290#bib.bib26)). The task is to minimize bits-per-character (BPC), the average number of bits required to encode each character. Since English text has word-level, sentence-level, and paragraph-level structure, enwik8 is appropriate for testing the long-range capability of adapters in PEFT. Standard benchmarks use context lengths of 512–8192 characters, which are long by transformer standards.
To compare HRM with LoRA, we use a frozen TinyGPT backbone (4 layers, $d_{model} = 128$), context lengths $T \in \{512, 1024, 2048\}$, and 3 tiers of model capacity ($r$ and $\hat{d}$ values). Due to different convergence rates, we use 25 epochs for $T = 512$ and 40 epochs for $T = 1024$ and $2048$, with a batch size of 32, 10,000 training examples, and 1,000 validation examples. The findings across these cases are in Table [3](https://arxiv.org/html/2606.26290#A4.T3), with HRM achieving a lower BPC than the LoRA adapter on *all configurations*. The BPC-$T$ relation is detailed in Appendix [D](https://arxiv.org/html/2606.26290#A4).
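For reference, BPC is the mean per-character negative log-likelihood expressed in bits. The helper below is a hypothetical illustration of that conversion from the natural-log losses that most training frameworks report; it is not taken from the paper's code.

```python
import math

def bits_per_character(nll_nats):
    """Convert per-character negative log-likelihoods (in nats) to bits-per-character."""
    return sum(nll_nats) / (len(nll_nats) * math.log(2))

# Sanity check: a uniform model over 256 byte values pays log(256) nats per character.
uniform_bpc = bits_per_character([math.log(256)] * 10)
print(uniform_bpc)
```

A uniform byte-level model therefore sits at 8 BPC, which gives a scale for reading the values in Table 3.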
## 7 Mistral-7B LongBench
In this final experiment, we evaluate HRM against four LoRA-family baselines: LoRA, AdaLoRA, DoRA, and QLoRA. LongBench is a comprehensive benchmark for evaluating the ability of LLMs to understand and process long-context information across various tasks (Bai et al., [2024](https://arxiv.org/html/2606.26290#bib.bib25)). We evaluate on three LongBench tasks, using the pre-trained Mistral-7B-v0.1 model with 7.25B parameters (Jiang et al., [2023](https://arxiv.org/html/2606.26290#bib.bib23)). The three tasks are QuALITY (Pang et al., [2022](https://arxiv.org/html/2606.26290#bib.bib19)), QMSum (Zhong et al., [2021](https://arxiv.org/html/2606.26290#bib.bib21)), and NarrativeQA (Kočiskỳ et al., [2018](https://arxiv.org/html/2606.26290#bib.bib22)).
QuALITY is a multiple-choice reading comprehension dataset over long articles (avg. $\sim$4,000 tokens), where each example presents an article with a 4-answer multiple-choice question. The fine-tuned model must learn to select the correct option, with the cognitive bottleneck of *sequential evidence integration*. The model is judged on top-1 accuracy, i.e., exact match of the predicted option letter to the gold answer.

Figure 4: Mistral-7B HRM: Hankel singular value decay curves (left) QuALITY, (middle) QMSum, (right) NarrativeQA.

QMSum is a query-focused summarization dataset of meeting transcripts (avg. $\sim$10,000 tokens per meeting, truncated to 4096 tokens). Given a query and a transcript, the fine-tuned model must learn to generate a paragraph-length summary addressing the query. Since meeting transcripts are inherently sequential, turn-taking, speaker contributions, and topic shifts follow a temporal order that is meaningful for summarization. We measure ROUGE-1, ROUGE-2, and ROUGE-L against reference summaries.
Finally, NarrativeQA is an open-ended question answering dataset over full books and movie scripts (avg. $\sim$50,000 tokens per document). For each document-question pair, the answer is a specific phrase or sentence from the document. The fine-tuned model must identify and generate the exact answer phrase from within the document context. We measure token-level F1 score (case-insensitive, following standard NarrativeQA evaluation).
All methods are iso-parametric, with HRM state dimension $d = 32$ and LoRA rank $r = 16$, both yielding $\approx 8.4$M trainable parameters, roughly 0.116% of the total. The training protocol was identical across all methods: 5 epochs, learning rate $5 \times 10^{-4}$, batch size 1, gradient accumulation 8 (effective batch size 8), maximum input length 4096, and the AdamW optimizer. Evaluation uses the held-out 10% test split for QuALITY and QMSum, and the official test split for NarrativeQA.
Of the three tasks, HRM outperformed all four baselines (LoRA, DoRA, QLoRA, and AdaLoRA) on two, QuALITY and QMSum, while it drastically underperformed all of them on NarrativeQA. On QuALITY, HRM exceeded the best baselines (LoRA and AdaLoRA) by +34.8% relative accuracy (47.43% for HRM vs. 35.18% for LoRA/AdaLoRA). On QMSum, HRM exceeded the best baseline (QLoRA) on all three metrics: ROUGE-1, ROUGE-2, and ROUGE-L. ROUGE-1 improved by +54.3% relative to QLoRA (0.2531 for HRM vs. 0.1641 for QLoRA), by +73.59% relative to DoRA, and by +71.36% relative to both LoRA and AdaLoRA. On NarrativeQA, however, HRM reached an F1 score of only 0.0391 against 0.1592 for the best model (LoRA), a relative deficit of 75.4% (see Appendix [F](https://arxiv.org/html/2606.26290#A6) for details).
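The relative figures quoted above follow from the reported absolute scores. A short check, using only the numbers stated in this section (the helper name is an illustrative choice):

```python
def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

quality = relative_change(47.43, 35.18)       # accuracy (%), HRM vs. LoRA/AdaLoRA
qmsum_r1 = relative_change(0.2531, 0.1641)    # ROUGE-1, HRM vs. QLoRA
narrative = relative_change(0.0391, 0.1592)   # F1, HRM vs. LoRA

print(f"QuALITY {quality:+.1f}%  QMSum R-1 {qmsum_r1:+.1f}%  NarrativeQA {narrative:+.1f}%")
```

The recomputed values agree with the quoted percentages up to the rounding of the reported scores.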
QuALITY requires the model to read a $\sim$4K-token article and answer a multiple-choice question, a prototypical sequential integration problem. Geva et al. ([2021](https://arxiv.org/html/2606.26290#bib.bib20)) establish that transformer MLP blocks function as key-value memories that perform content integration and associative recall, while attention performs positional retrieval. Fine-tuning with LoRA modifies the attention $Q, K, V$ projections, improving the model's ability to attend to relevant spans. However, for multiple-choice reasoning, the pre-trained model can already attend to relevant spans, and the bottleneck is integrating evidence across multiple spans into a coherent conclusion. The HRM adapter provides a recurrent SSM residual that maintains a running integration of the MLP's content representations across sequence positions. The relative improvement is therefore consistent with this mechanistic prediction.
QMSum requires generating a focused summary of a meeting transcript in response to a query. By the same reasoning, HRM's relative gains over the LoRA family can be attributed to running integration of content across positions.
NarrativeQA, on the other hand, fails completely. Part of the reason is the hardware constraint of a dual NVIDIA GeForce RTX 4090 GPU setup with 48GB of total VRAM. Because of this, the complete NarrativeQA context of $\sim$50,000 tokens could not be used and was truncated to 5,000 for training. As a result, all methods in the comparison failed the benchmark, since the task was not suited to the hardware, and in this truncated setting HRM vastly underperformed the entire LoRA family.
Figure 5: HRM gate values per layer for Mistral-7B, all three LongBench tasks (gate_init = 0.1).

At the coarser threshold $\varepsilon = 0.10$, the complexity ordering extends to LLM-scale tasks. From BT on trained Mistral-7B checkpoints (32 layers), $\hat{d}(\text{DFA}, \varepsilon = 0.01) = 5 < \hat{d}(\text{QuALITY}, \varepsilon = 0.10) \approx 11 < \hat{d}(\text{QMSum}, \varepsilon = 0.10) \approx 13 < \hat{d}(\text{enwik8}, \varepsilon = 0.01) = 32 \le \hat{d}(\text{MAESTRO}, \varepsilon = 0.01) = 32$. The HSV decay plot for the three tasks is shown in Fig. [4](https://arxiv.org/html/2606.26290#S7.F4). Gate analysis reveals that HRM gates converge near zero despite non-zero initialization and explicit weight-decay exclusion: the gradient itself drives closure. Yet frozen-gate probing confirms that the trained SSM weights $(A, B, C)$ retain 1180$\times$ larger contribution capacity at gate = 0.1. The final gate values learned for each layer, for each of the three tasks, are shown in Fig. [5](https://arxiv.org/html/2606.26290#S7.F5).
## 8 Conclusion and Discussion
In this work, we proposed the Hankel reduced-order model (HRM) adapter for parameter-efficient fine-tuning (PEFT). HRM adds a provably compressible recurrent temporal state to any frozen pre-trained transformer backbone and, unlike prior PEFT methods, the adapter is temporally causal. The model reduction is achieved via balanced truncation of the underlying linear system, using controllability and observability Grammians from control theory. This yields a (tight) error bound, making model reduction a certified compression rather than a heuristic. The PEFT literature has inadvertently restricted itself to the class of zero-memory adapters. Our results show that a gated SSM, inserted in parallel to a frozen transformer's MLP blocks, can be trained efficiently, compressed with theoretical guarantees, and consistently outperforms the best static alternative on temporal tasks. We demonstrate that HRM outperforms LoRA, QLoRA, AdaLoRA, and DoRA on 6 different tasks from three qualitatively different task families that share the requirement of causal state accumulation. We also observe, as a by-product, that Hankel singular values and the associated Grammians are a strong metric for the training task's memory requirements.
#### Open Problems & Next Steps
Several immediate next steps emerge from the presented HRM work.

1. Extending Hankel singular value-based balanced truncation to selective SSMs (Mamba-type, with input-dependent $\bar{A}_t$) would benefit the proposed adapter. This would involve computationally tractable parameter-varying empirical Grammian computations whose overhead does not exceed that of the intermediate FFT/IFFT calculations.
2. Currently, the BT compression takes place after an initial phase of SSM adapter training. An enhancement would be to *adaptively allocate Hankel ranks for each layer*, i.e., $\hat{d}^{(l)}$ for layer $l$. This is analogous to AdaLoRA's rank allocation, but would be informed by the per-layer HSV spectra $\sigma_i^{(l)}$.
3. More complete benchmarking on LongBench is needed, covering more tasks and longer context lengths.
4. Mixed injection of the SSM adapter needs to be ablated: attention injection site vs. MLP injection site and, more importantly, identifying the task signatures suited to each injection site and examining whether simultaneous injection would benefit retrieval + integration tasks.
5. Frozen-gate training scenarios (e.g., gate.requires_grad=False) should be investigated to enforce sustained HRM adapter contribution and to directly test whether longer training times translate to performance gains.
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\.
## References
- A\. Aghajanyan, S\. Gupta, and L\. Zettlemoyer \(2021\)Intrinsic dimensionality explains the effectiveness of language model fine\-tuning\.InProceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing \(volume 1: long papers\),pp\. 7319–7328\.Cited by:[§3](https://arxiv.org/html/2606.26290#S3.SS0.SSS0.Px1.p1.2)\.
- Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou,et al\.\(2024\)Longbench: a bilingual, multitask benchmark for long context understanding\.InProceedings of the 62nd annual meeting of the association for computational linguistics \(volume 1: Long papers\),pp\. 3119–3137\.Cited by:[§7](https://arxiv.org/html/2606.26290#S7.p1.1)\.
- M\. J\. Corless and A\. Frazho \(2003\)Linear systems and control: an operator perspective\.CRC Press\.Cited by:[§3](https://arxiv.org/html/2606.26290#S3.SS0.SSS0.Px3.p2.5)\.
- S\. De, S\. L\. Smith, A\. Fernando, A\. Botev, G\. Cristian\-Muraru, A\. Gu, R\. Haroun, L\. Berrada, Y\. Chen, S\. Srinivasan,et al\.\(2024\)Griffin: mixing gated linear recurrences with local attention for efficient language models\.arXiv preprint arXiv:2402\.19427\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p4.2)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\)Qlora: efficient finetuning of quantized llms\.Advances in neural information processing systems36,pp\. 10088–10115\.Cited by:[§1](https://arxiv.org/html/2606.26290#S1.p1.13),[§2](https://arxiv.org/html/2606.26290#S2.p1.2),[§3](https://arxiv.org/html/2606.26290#S3.SS0.SSS0.Px1.p2.6)\.
- M\. Geva, R\. Schuster, J\. Berant, and O\. Levy \(2021\)Transformer feed\-forward layers are key\-value memories\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 5484–5495\.Cited by:[§7](https://arxiv.org/html/2606.26290#S7.p7.2)\.
- K\. Glover \(1984\)All optimal hankel\-norm approximations of linear multivariable systems and theirL∞L^\{\\infty\}\-error bounds\.International journal of control39\(6\),pp\. 1115–1193\.Cited by:[§3](https://arxiv.org/html/2606.26290#S3.SS0.SSS0.Px3.p5.6)\.
- A\. Gu and T\. Dao \(2023\)Mamba: linear\-time sequence modeling with selective state spaces\.arXiv preprint arXiv:2312\.00752\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p3.2),[§4](https://arxiv.org/html/2606.26290#S4.SS0.SSS0.Px1.p1.7),[§4](https://arxiv.org/html/2606.26290#S4.SS0.SSS0.Px3.p1.4),[§4](https://arxiv.org/html/2606.26290#S4.SS0.SSS0.Px4.p2.5)\.
- A\. Gu and T\. Dao \(2024\)Mamba: linear\-time sequence modeling with selective state spaces\.InFirst conference on language modeling,Cited by:[§4](https://arxiv.org/html/2606.26290#S4.SS0.SSS0.Px4.p2.5)\.
- A\. Gu, K\. Goel, A\. Gupta, and C\. Ré \(2022\)On the parameterization and initialization of diagonal state space models\.Advances in neural information processing systems35,pp\. 35971–35983\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p3.2),[§3](https://arxiv.org/html/2606.26290#S3.SS0.SSS0.Px2.p3.3),[§4](https://arxiv.org/html/2606.26290#S4.SS0.SSS0.Px3.p1.4)\.
- A\. Gu, K\. Goel, and C\. Ré \(2021\)Efficiently modeling long sequences with structured state spaces\.arXiv preprint arXiv:2111\.00396\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p3.2)\.
- Y\. Hao, Y\. Cao, and L\. Mou \(2024\)Flora: low\-rank adapters are secretly gradient compressors\.arXiv preprint arXiv:2402\.03293\.Cited by:[Appendix C](https://arxiv.org/html/2606.26290#A3.p1.6)\.
- C\. Hawthorne, A\. Stasyuk, A\. Roberts, I\. Simon, C\. A\. Huang, S\. Dieleman, E\. Elsen, J\. Engel, and D\. Eck \(2018\)Enabling factorized piano music modeling and generation with the maestro dataset\.arXiv preprint arXiv:1810\.12247\.Cited by:[§6](https://arxiv.org/html/2606.26290#S6.p1.2)\.
- S\. Hayou, N\. Ghosh, and B\. Yu \(2024\)Lora\+: efficient low rank adaptation of large models\.arXiv preprint arXiv:2402\.12354\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p1.2)\.
- N\. Houlsby, A\. Giurgiu, S\. Jastrzebski, B\. Morrone, Q\. De Laroussilhe, A\. Gesmundo, M\. Attariyan, and S\. Gelly \(2019\)Parameter\-efficient transfer learning for nlp\.InInternational conference on machine learning,pp\. 2790–2799\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p2.2)\.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2606.26290#S1.p1.6), [§2](https://arxiv.org/html/2606.26290#S2.p1.2), [§3](https://arxiv.org/html/2606.26290#S3.SS0.SSS0.Px1.p1.2).
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier,et al\.\(2023\)Mistral 7b\.arXiv preprint arXiv:2310\.06825\.External Links:[Link](https://arxiv.org/abs/2310.06825)Cited by:[§7](https://arxiv.org/html/2606.26290#S7.p1.1)\.
- T\. Kočiskỳ, J\. Schwarz, P\. Blunsom, C\. Dyer, K\. M\. Hermann, G\. Melis, and E\. Grefenstette \(2018\)The narrativeqa reading comprehension challenge\.Transactions of the Association for Computational Linguistics6,pp\. 317–328\.Cited by:[§7](https://arxiv.org/html/2606.26290#S7.p1.1)\.
- S\. Lall, J\. E\. Marsden, and S\. Glavaški \(1999\)Empirical model reduction of controlled nonlinear systems\.IFAC Proceedings Volumes32\(2\),pp\. 2598–2603\.Cited by:[§4](https://arxiv.org/html/2606.26290#S4.SS0.SSS0.Px1.p1.11),[§4](https://arxiv.org/html/2606.26290#S4.SS0.SSS0.Px1.p1.7)\.
- B\. Lester, R\. Al\-Rfou, and N\. Constant \(2021\)The power of scale for parameter\-efficient prompt tuning\.InProceedings of the 2021 conference on empirical methods in natural language processing,pp\. 3045–3059\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p2.2)\.
- X\. L\. Li and P\. Liang \(2021\)Prefix\-tuning: optimizing continuous prompts for generation\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 4582–4597\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p2.2)\.
- O\. Lieber, B\. Lenz, H\. Bata, G\. Cohen, J\. Osin, I\. Dalmedigos, E\. Safahi, S\. Meirom, Y\. Belinkov, S\. Shalev\-Shwartz,et al\.\(2024\)Jamba: a hybrid transformer\-mamba language model\.arXiv preprint arXiv:2403\.19887\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p4.2)\.
- H\. Liu, D\. Tam, M\. Muqeeth, J\. Mohta, T\. Huang, M\. Bansal, and C\. A\. Raffel \(2022\)Few\-shot parameter\-efficient fine\-tuning is better and cheaper than in\-context learning\.Advances in Neural Information Processing Systems35,pp\. 1950–1965\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p1.2)\.
- S\. Liu, C\. Wang, H\. Yin, P\. Molchanov, Y\. F\. Wang, K\. Cheng, and M\. Chen \(2024\)Dora: weight\-decomposed low\-rank adaptation\.InForty\-first International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2606.26290#S1.p1.13),[§3](https://arxiv.org/html/2606.26290#S3.SS0.SSS0.Px1.p2.6)\.
- M. Mahoney (2013) Large text compression benchmark, 2011. URL http://www.mattmahoney.net/dc/text.html. Cited by: [§6.1](https://arxiv.org/html/2606.26290#S6.SS1.p1.1).
- B\. Moore \(2003\)Principal component analysis in linear systems: controllability, observability, and model reduction\.IEEE transactions on automatic control26\(1\),pp\. 17–32\.Cited by:[§3](https://arxiv.org/html/2606.26290#S3.SS0.SSS0.Px3.p1.7)\.
- R\. Y\. Pang, A\. Parrish, N\. Joshi, N\. Nangia, J\. Phang, A\. Chen, V\. Padmakumar, J\. Ma, J\. Thompson, H\. He,et al\.\(2022\)QuALITY: question answering with long input texts, yes\!\.InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 5336–5358\.Cited by:[§7](https://arxiv.org/html/2606.26290#S7.p1.1)\.
- J\. Park, J\. Park, Z\. Xiong, N\. Lee, J\. Cho, S\. Oymak, K\. Lee, and D\. Papailiopoulos \(2024\)Can mamba learn how to learn? a comparative study on in\-context learning tasks\.arXiv preprint arXiv:2402\.04248\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p4.2)\.
- J\. Pfeiffer, A\. Rücklé, C\. Poth, A\. Kamath, I\. Vulić, S\. Ruder, K\. Cho, and I\. Gurevych \(2020\)Adapterhub: a framework for adapting transformers\.InProceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations,pp\. 46–54\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p2.2)\.
- Y\. Sheng, S\. Cao, D\. Li, C\. Hooper, N\. Lee, S\. Yang, C\. Chou, B\. Zhu, L\. Zheng, K\. Keutzer,et al\.\(2024\)Slora: scalable serving of thousands of lora adapters\.Proceedings of Machine Learning and Systems6,pp\. 296–311\.Cited by:[§2](https://arxiv.org/html/2606.26290#S2.p5.1)\.
- M\. Zhang, H\. Chen, C\. Shen, Z\. Yang, L\. Ou, X\. Yu, and B\. Zhuang \(2023a\)Loraprune: pruning meets low\-rank parameter\-efficient fine\-tuning\.Cited by:[Appendix C](https://arxiv.org/html/2606.26290#A3.p1.6)\.
- Q\. Zhang, M\. Chen, A\. Bukharin, N\. Karampatziakis, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao \(2023b\)Adalora: adaptive budget allocation for parameter\-efficient fine\-tuning\.arXiv preprint arXiv:2303\.10512\.Cited by:[§1](https://arxiv.org/html/2606.26290#S1.p1.13),[§2](https://arxiv.org/html/2606.26290#S2.p1.2),[§3](https://arxiv.org/html/2606.26290#S3.SS0.SSS0.Px1.p2.6)\.
- M\. Zhong, D\. Yin, T\. Yu, A\. Zaidi, M\. Mutuma, R\. Jha, A\. Hassan, A\. Celikyilmaz, Y\. Liu, X\. Qiu,et al\.\(2021\)QMSum: a new benchmark for query\-based multi\-domain meeting summarization\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 5905–5921\.Cited by:[§7](https://arxiv.org/html/2606.26290#S7.p1.1)\.
## Appendix A Iso-parameter Comparison
In order to compare HRM and LoRA on an equal footing, we ensure that $r$ and $\hat{d}$ are chosen such that $\left\lvert P_{LoRA} - P_{HRM} \right\rvert \le 0.1\%$. The iso-parametric choices of $r$ and $\hat{d}$ are listed in Table 1. All experiments in the paper use all three tiers to demonstrate consistency, and conclusions are drawn only from the pattern across tiers rather than from individual tier performance.
Table 1:Comparison of parameter counts between LoRA and HRM configurations\.
## Appendix B Parallel Scan for HRM
Since $\bar{A}$ is diagonal with entries $\bar{a}_1, \cdots, \bar{a}_d$, the impulse response factorizes: $g_k = C\bar{A}^k\bar{B} = C\,\mathrm{diag}(\bar{a}_1^k, \cdots, \bar{a}_d^k)\,\bar{B}$. The $i^{\mathrm{th}}$ row of $\bar{A}^k\bar{B}$ is $\bar{a}_i^k \cdot \bar{B}_i$, where $\bar{B}_i$ is the $i^{\mathrm{th}}$ row of $\bar{B}$, so the full convolution can be implemented as $d$ scalar convolutions (one per state dimension), each with a geometric impulse response. This is the efficient form used in the code. The FFT forward pass first computes, for each state dimension $i$, the kernel $g^{(i)}_k = \bar{a}_i^k$ for $k = 0, \cdots, T-1$, which is $\mathcal{O}(T)$ per dimension. Next, the FFTs of $\tilde{g}^{(i)}$ and $\tilde{h}^{(i)}$ are $\mathcal{O}(T\log T)$ per dimension, giving a total of $\mathcal{O}(dT\log T)$. The element-wise product $\odot$ and IFFT computations in ([25](https://arxiv.org/html/2606.26290#S4.E25)) are $\mathcal{O}(dT\log T)$, and the final $\bar{B}, C$ projections are $\mathcal{O}(d \cdot d_{model} \cdot T)$. The net complexity of the FFT forward pass is therefore $\mathcal{O}(d_{model}\, d\, T + dT\log T)$. For example, with model order $d$ (or $\hat{d}$) equal to 16 and $d_{model} = 128$, we have $\log T \ll d_{model}$ at the context lengths considered, so the net complexity is dominated by $\mathcal{O}(d_{model} \cdot d \cdot T)$.
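As a rough illustration of which term dominates, the sketch below evaluates the two leading-order counts with all constant factors dropped. The function name and the use of $\log_2(2T)$ for the padded FFT length are illustrative assumptions, not measurements.

```python
import math

def fft_scan_cost(T, d, d_model):
    """Leading-order operation counts (constants dropped) for the two complexity terms."""
    projection = d_model * d * T            # B-bar and C projections
    fft = d * T * math.log2(2 * T)          # FFT, product, and IFFT on length-2T sequences
    return projection, fft

for T in (512, 1024, 2048):
    proj, fft = fft_scan_cost(T, d=16, d_model=128)
    print(T, proj, round(fft), f"projection/fft = {proj / fft:.1f}")
```

Under these assumptions the ratio of the two terms is $d_{model}/\log_2(2T)$, independent of $d$, which is why the projection term dominates whenever $d_{model}$ is much larger than $\log T$.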
Table 2: Epoch training times on a MacBook Pro M2 (MPS backend). HRM-sequential is 10–14$\times$ slower than LoRA. HRM-FFT matches LoRA exactly at the tested context lengths.

| Context length $T$ | LoRA wall-clock time/epoch (s) | HRM-seq. wall-clock time/epoch (s) | HRM-seq. / LoRA | HRM-FFT wall-clock time/epoch (s) | HRM-FFT / LoRA |
|---|---|---|---|---|---|
| 512 | 9 | $\sim$90 | 10$\times$ | 9 | 1$\times$ |
| 1024 | 70 | $\sim$980 | 14$\times$ | 70 | 1$\times$ |
| 2048 | 265 | $\sim$3710 | 14$\times$ | 265 | 1$\times$ |

Figure 6: Maximum absolute error between FFT and sequential scan outputs over 100 random input sequences at varying $T$.

Since FFT operations in float32 introduce rounding errors of order $10^{-7}$, we empirically compare sequential and FFT outputs across 100 random sequences. We find that the maximum absolute error is $< 5 \times 10^{-6}$, negligible for gradient computation, confirming that the FFT equivalence is exact up to floating-point rounding (Fig. [6](https://arxiv.org/html/2606.26290#A2.F6)).
We compared the compute cost with and without the FFT-based parallel scan over varying context lengths. Sequential training at $T = 2048$ with batch size 32 requires issuing 2048 Python-level dispatcher calls per layer per forward pass. At 4 layers and batch size 32, this is 4 × 2048 × 32 = 262,144 sequential operations, so the run time is dominated by Python/PyTorch dispatch overhead.
## Appendix C Is there a Case for SVD over HSVs?
A singular value decomposition (SVD) to reduce model order seems like a natural alternative to balanced truncation. Several works apply this idea to LoRA matrices: FLORA (Hao et al., [2024](https://arxiv.org/html/2606.26290#bib.bib28)), LoRA pruning (Zhang et al., [2023a](https://arxiv.org/html/2606.26290#bib.bib27)). For static matrix order/rank reduction, SVD is the more natural choice. We argue, however, that SVD is the wrong tool for a dynamical system, and BT is the correct one. For instance, a state direction may have a large singular value in $\bar{B}$ (i.e., the input strongly drives that direction) yet be completely unobservable, i.e., produce zero output through $C$. SVD of $\bar{B}$ would retain this direction as important, whereas its Hankel singular value (HSV) would be $\sigma_i = 0$, so BT would discard it. Conversely, a direction with a small $\bar{B}$ singular value might be the only one observable through $C$; SVD would drop it, while its HSV would preserve it.

Figure 7: HSV decay for: DFA, enwik8, MAESTRO, QuALITY, and QMSum.

SVD identifies directions that are large in individual matrices. HSVs identify directions that are simultaneously reachable and observable in the complete input-output system. Due to the causal input-output relation encoded by the Hankel operator, a singular value of $\bar{B}$ alone cannot drive model reduction. Only the Grammian product $W_c W_o$ gives the correct importance measure in this case, since it accounts for the full dynamics. We empirically confirm this on the DFA task, where the largest singular value of $\bar{B}$ does not correspond to the dominant Hankel singular value $\sigma_1$. If we had truncated by SVD($\bar{B}$), we would have retained different state directions than BT and achieved a worse compression ratio. The BT result $\hat{d} = 6$ with $< 0.3\%$ accuracy loss would likely not have been achievable by SVD of any single weight matrix.
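The argument can be illustrated on a two-state toy system. This is a hypothetical example, not the paper's DFA experiment: it assumes a stable diagonal $\bar{A}$, for which the discrete-time Grammians have the closed form $(W_c)_{ij} = (\bar{B}\bar{B}^\top)_{ij}/(1 - \bar{a}_i\bar{a}_j)$ and $(W_o)_{ij} = (C^\top C)_{ij}/(1 - \bar{a}_i\bar{a}_j)$, and it picks $\bar{B}$ and $C$ so that the strongly driven state is unobservable.

```python
import numpy as np

a = np.array([0.9, 0.9])        # diagonal of A-bar (stable: |a_i| < 1)
B = np.diag([10.0, 0.1])        # state 0 is strongly driven, state 1 weakly driven
C = np.array([[0.0, 1.0]])      # only state 1 reaches the output

denom = 1.0 - np.outer(a, a)    # closed form of sum_k (a_i * a_j) ** k
Wc = (B @ B.T) / denom          # controllability Grammian
Wo = (C.T @ C) / denom          # observability Grammian

# HSVs are the square roots of the eigenvalues of Wc @ Wo.
hsv = np.sort(np.sqrt(np.abs(np.linalg.eigvals(Wc @ Wo))))[::-1]

# Both Grammians are diagonal here, so each state has its own HSV.
hsv_per_state = np.sqrt(np.diag(Wc) * np.diag(Wo))
sv_B_per_state = np.linalg.norm(B, axis=1)   # singular values of the diagonal B

print("HSV per state:       ", hsv_per_state)
print("sigma(B) per state:  ", sv_B_per_state)
print("SVD keeps state", int(np.argmax(sv_B_per_state)),
      "| BT keeps state", int(np.argmax(hsv_per_state)))
```

In this toy system a rank-1 truncation guided by the singular values of $\bar{B}$ keeps the state that never reaches the output, while the HSV ranking keeps the only state that does.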
### C.1 Further Insights from HSVs
Since the Hankel operator encodes the dynamical input-output relation as $\mathcal{H}: \{\text{past inputs}\} \to \{\text{future outputs}\}$, the HSV decay curve $\sigma_i/\sigma_1$ characterizes the task's temporal complexity. For instance, a steep decay curve signifies that most of the temporal complexity of the dynamics is contained in very few dynamical modes; therefore, the HRM is highly compressible ($d \to \hat{d}$). Conversely, a gradual decay indicates a genuine high-dimensional memory requirement. To this end, *HSVs and the associated (empirical) Grammians are a strong metric for the training task's memory requirement*. Fig. [7](https://arxiv.org/html/2606.26290#A3.F7) collects all available HSV decay results presented in this work.
## Appendix D enwik8 Task
Figure 8: enwik8 BPC learning curves at Tier 2 (HRM $d = 32$ vs. LoRA $r = 16$) for $T \in \{512, 1024, 2048\}$. The region between curves represents the BPC advantage of HRM over LoRA.

The HRM adapter's dominant state mode has a learned eigenvalue $\bar{a}_{\max}$ ($\bar{a}_{\max} \approx 0.97$–$0.99$ after training). The fraction of signal retained from a token $k$ steps ago is $\bar{a}_{\max}^k$. At $T = 512$, the adapter retains $\bar{a}_{\max}^{256} \approx 0.97^{256} \approx 0.0004$ of the signal from the midpoint of the context window. This means that only the most recent $\sim$70 tokens contribute substantially (at roughly a 10% threshold for $\bar{a}_{\max} = 0.97$). At $T = 2048$, the same adapter covers proportionally less of the window: now $\bar{a}_{\max}^{1024} \approx 3 \times 10^{-14}$, but $\bar{a}_{\max}^{70} \approx 0.12$ still holds. The key insight is that the adapter's effective memory depth is fixed by the trained eigenvalues, so the fraction of context it covers shrinks as $T$ grows. At $T = 2048$, the 70-step memory covers 3.4% of the context; at $T = 512$, it covers 13.7%. The LoRA static $\Delta W$ covers none. As $T$ grows, HRM's absolute reach stays roughly constant while LoRA's relative disadvantage grows.
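The retention figures in this appendix are simple powers of the dominant eigenvalue and can be recomputed directly. The helper names below are hypothetical, and the eigenvalue 0.97 is the lower end of the range quoted above.

```python
import math

def retained(a_max, k):
    """Fraction of signal retained from a token k steps back."""
    return a_max ** k

def memory_depth(a_max, threshold):
    """Smallest k for which a_max ** k has fallen to `threshold` or below."""
    return math.ceil(math.log(threshold) / math.log(a_max))

for T in (512, 2048):
    print(f"T={T}: midpoint retention {retained(0.97, T // 2):.1e}, "
          f"70-step window covers {100 * 70 / T:.1f}% of context")

print("retention at k=70:", round(retained(0.97, 70), 3))
print("depth at 10% and 1% thresholds:",
      memory_depth(0.97, 0.10), memory_depth(0.97, 0.01))
```

Because the depth depends only on the eigenvalue and the chosen threshold, it is the same for every context length, while the covered fraction of the window scales as $1/T$.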
Table 3:Comparison of BPC across different adapter capacities and context lengths\.
## Appendix E Parity Task
The parity task requires predicting, at each position $t$, the parity bit $p_t = \bigoplus_{k \le t} \sigma_k$ (the running sum modulo 2), where $\sigma_k \in \{0, 1\}$. Unlike DFA state tracking (which requires tracking $k = 4$ states), parity has a minimal state of exactly 1 bit. An ideal adapter would learn $\hat{d} = 1$. In practice, HRM converges to $\hat{d} \approx 6$–$7$ (BT threshold $\varepsilon = 0.01$), which is higher than expected, suggesting the model learns redundant but numerically stable representations of the parity state. The task serves as a lower bound: if HRM cannot outperform LoRA here, it is unlikely to help on any sequential task.
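For concreteness, the parity targets are a running XOR over the input bits. A minimal sketch follows; the function name and sequence length are illustrative choices, not the paper's data pipeline.

```python
import random
from itertools import accumulate
from operator import xor

def parity_targets(bits):
    """Running parity p_t = sigma_1 XOR ... XOR sigma_t."""
    return list(accumulate(bits, xor))

rng = random.Random(0)
bits = [rng.randrange(2) for _ in range(16)]
targets = parity_targets(bits)
print(bits)
print(targets)
```

Parity is the special case of DFA state tracking with two states and transition $\delta(q, \sigma) = q \oplus \sigma$, which is why a single bit of state suffices in principle.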

Figure 9: (left) Parity accuracy at medium model capacity, 3-seed mean $\pm$ std. All adapters are near chance (0.50); parity is near-intractable for a small frozen backbone at $T=256$. (middle) HRM advantage (HRM mean minus LoRA mean) on DFA vs. parity. (right) Training curve.
Our observations (Fig. [9](https://arxiv.org/html/2606.26290#A5.F9)) show that DFA exhibits a large HRM advantage across values of $T$, while parity shows essentially zero HRM benefit. This contrast validates the memory hypothesis: HRM helps when multi-dimensional state is required, not when the task can be solved by single-bit counting.
## Appendix F LongBench Tasks
Table 4: Comparison of HRM against baselines on LongBench: QuALITY, QMSum, NarrativeQA
## Appendix G Ablations
#### BT Threshold $\varepsilon$
The BT threshold $\varepsilon$ controls the trade-off between compression ratio ($\hat{d}/d$) and accuracy loss. A smaller $\varepsilon$ prunes fewer dimensions, and the resulting higher $\hat{d}$ gives better accuracy but less compression. Conversely, a larger $\varepsilon$ results in more aggressive pruning. Fig. [10](https://arxiv.org/html/2606.26290#A7.F10) shows $\hat{d}$ as a function of $\varepsilon$ for DFA at $T \in \{64, 128, 256, 512\}$: $\hat{d}$ drops from 30–32 at $\varepsilon=0.001$ to 2–5 at $\varepsilon=0.2$. The elbow in the $\hat{d}$ vs $\varepsilon$ curve near $\varepsilon=0.01$ is the natural compression point; further pruning causes accuracy degradation (right panel). The default $\varepsilon=0.01$ is chosen at this elbow.
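The way a threshold maps to a retained order $\hat{d}$ can be illustrated with a short sketch. The exact truncation criterion is not stated in this section, so the rule below (keep every Hankel singular value whose ratio to the largest one is at least $\varepsilon$) is an assumption, as are the function name `truncation_order` and the geometrically decaying example spectrum.

```python
def truncation_order(hankel_sv, eps):
    """Count Hankel singular values kept under an ASSUMED relative rule:
    keep sigma_i when sigma_i / sigma_max >= eps."""
    sv = sorted(hankel_sv, reverse=True)
    if not sv or sv[0] <= 0:
        return 0
    return sum(1 for s in sv if s / sv[0] >= eps)


if __name__ == "__main__":
    # Hypothetical spectrum for a d = 32 state: geometric decay, not measured data.
    spectrum = [0.8 ** i for i in range(32)]
    for eps in (0.001, 0.01, 0.05, 0.2):
        d_hat = truncation_order(spectrum, eps)
        print(f"eps={eps}: d_hat={d_hat}, compression ratio={d_hat / len(spectrum):.2f}")
```

Under this rule $\hat{d}$ is non-increasing in $\varepsilon$, which matches the qualitative trend described above; a criterion based on the sum of discarded singular values would be an equally plausible reading.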


Figure 10: BT threshold ablation on DFA. (left) $\hat{d}$ vs $\varepsilon$ threshold (layer 0); (right) accuracy vs $\varepsilon$ threshold.

Figure 11: DFA $T=128$ and $T=256$: LoRA (rank) vs HRM (state_dim).
#### Rank vs\. Accuracy
To verify that HRM's advantage is not merely a parameter count artifact, we sweep LoRA rank $r \in \{4, 8, 16, 32, 64, 128\}$ and HRM state_dim $d \in \{4, 8, 16, 32, 64\}$ on DFA at $T=128$ and $T=256$. From Fig. [11](https://arxiv.org/html/2606.26290#A7.F11), HRM (blue) shows monotonically improving accuracy with state_dim on DFA. LoRA (orange) plateaus and even degrades at high rank ($r=64, 128$), a sign of overfitting to position-level features. The HRM accuracy at $d=32$ (0.562) exceeds the best LoRA at any rank (0.522 at $r=16$), confirming the advantage is structural, not parametric.
Figure 12: Data efficiency for DFA and Parity at $T=128$: validation accuracy vs. n_train.
#### Data Efficiency
HRM requires training to discover useful dynamics, followed by balanced truncation. We test whether HRM is less data-efficient than LoRA at small n_train. Fig. [12](https://arxiv.org/html/2606.26290#A7.F12) shows that on the DFA task, HRM requires n_train $\geq 2500$ to match LoRA; below that, LoRA's simpler parameterization is more data-efficient. At n_train $= 10000$ (paper default), HRM substantially outperforms LoRA. On the parity task, both adapters are near chance at all data sizes. The practical guidance from this ablation (for these tasks) is that HRM requires a minimum number of training sequences ($\sim$2000 in this case) to be competitive with LoRA on DFA-difficulty tasks.