DLR: Zero-Inference-Cost Latent Residuals for Low-Rank Pre-Training
Summary
Introduces Duplicated Latent Residual (DLR), a training-only, parameter-free plug-in for low-rank pre-training that improves perplexity across LLaMA models from 60M to 7B parameters, and can be folded into the model after training with zero inference cost.
# Zero-Inference-Cost Latent Residuals for Low-Rank Pre-Training
Source: [https://arxiv.org/html/2606.28932](https://arxiv.org/html/2606.28932)
Dong Wang⋄, Wenwu Tang⋄, Yun Cheng†, Olga Saukh⋄. ⋄Graz University of Technology, Austria; †Swiss Data Science Center, Switzerland. {dong.wang, wenwu.tang, saukh}@tugraz.at, yun.cheng@sdsc.ethz.ch
###### Abstract
Large language models have driven recent progress in language and multimodal AI, yet pre-training them at scale is prohibitively expensive. Low-rank pre-training, which factorizes each weight matrix into a rank-$r$ product to reduce both parameters and FLOPs, is a promising response but typically lags full-rank training in quality. We propose Duplicated Latent Residual (DLR), a training-only, parameter-free, foldable plug-in for low-rank pre-training. DLR augments the standard low-rank output $Bz$ with a fixed structured residual $\tfrac{\alpha}{\sqrt{K}}\,\mathrm{Expand}_{K}(z)$ that replicates each latent coordinate $K=\lceil d_{\mathrm{out}}/r\rceil$ times across the output. With $\alpha$ fixed, DLR adds zero learnable parameters per layer; after training, it is absorbed into the up-projection in closed form, $B^{\star}=B+\tfrac{\alpha}{\sqrt{K}}R^{\top}$, so deployment parameter count, FLOPs, and memory match the underlying low-rank backbone exactly. Across LLaMA models from 60M to 7B parameters, DLR strengthens low-rank pre-training on C4 validation perplexity in most settings, with the clearest gains at 130M and above; folded checkpoints transfer cleanly to supervised fine-tuning on standard benchmarks.
## 1 Introduction
Large language models (LLMs) have transformed machine learning across language, vision, and multimodal domains, yet this success comes at an unsustainable cost. Pre-training modern LLMs requires vast compute, memory, and energy resources, with models like LLaMA-3 (Grattafiori et al., [2024](https://arxiv.org/html/2606.28932#bib.bib1)), Qwen-3 (Yang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib39)), DeepSeek-V3 (Liu et al., [2024](https://arxiv.org/html/2606.28932#bib.bib40)), and GPT-4 (OpenAI et al., [2024](https://arxiv.org/html/2606.28932#bib.bib2)) consuming thousands of GPU-months. This has spurred significant research into efficient pre-training methods that preserve model quality while reducing training cost.

Figure 1: Overview and perplexity–throughput trade-off. (a) Duplicated Latent Residual (DLR) is a training-only residual branch for low-rank decoders that is removed by folding before inference. (b) DLR improves the perplexity–throughput trade-off at LLaMA-1B; full results are in Tab. [1](https://arxiv.org/html/2606.28932#S4.T1).

Existing efficient pre-training methods take three forms: extending LoRA-style low-rank updates to pre-training (Lialin et al., [2023](https://arxiv.org/html/2606.28932#bib.bib8); Zhou et al., [2025](https://arxiv.org/html/2606.28932#bib.bib9); Mo et al., [2025](https://arxiv.org/html/2606.28932#bib.bib16)), projecting gradients into low-rank subspaces to compress optimizer states (Zhao et al., [2024](https://arxiv.org/html/2606.28932#bib.bib10); Chen et al., [2024](https://arxiv.org/html/2606.28932#bib.bib11)), and modifying the network structure with low-rank parameterization, possibly augmented by sparse compensation or latent-space mixing (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13); Li et al., [2025](https://arxiv.org/html/2606.28932#bib.bib15); Liu et al., [2025](https://arxiv.org/html/2606.28932#bib.bib12); Zhang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib17)). The third line is the most direct route to reducing both trainable parameters and forward FLOPs, but many additional pathways introduced to compensate for the rank-$r$ bottleneck (sparse residuals, channel-sparse compensation, or inter-layer latent gates) *persist at inference*. Deploying these augmented variants therefore requires shipping a modified inference graph rather than the vanilla rank-$r$ decoder, forfeiting one of low-rank pre-training's most attractive practical properties: drop-in compatibility with the existing low-rank inference stack.
A natural question arises: *can we improve the training dynamics of low-rank backbones without sacrificing their parameter and compute efficiency?*

We propose Duplicated Latent Residual (DLR), a *training-only*, *parameter-free*, *foldable* plug-in for low-rank pre-training. DLR augments the standard low-rank output $Bz$ with a fixed structured residual that replicates each latent coordinate $K=\lceil d_{\mathrm{out}}/r\rceil$ times across the output dimension and rescales it by $\alpha/\sqrt{K}$, providing an additional gradient pathway that does not depend on the learned decoder (Fig. [1](https://arxiv.org/html/2606.28932#S1.F1)). With $\alpha$ held fixed, no learnable parameters are added per layer at training time. After training, the residual is absorbed into the up-projection in closed form via $B^{\star}=B+(\alpha/\sqrt{K})\,R^{\top}$, where $R\in\{0,1\}^{r\times d_{\mathrm{out}}}$ is the structured binary replication matrix induced by the duplication ($\mathrm{Expand}_{K}(z)=R^{\top}z$, see Eq. [10](https://arxiv.org/html/2606.28932#S3.E10)). The folded checkpoint runs on the underlying low-rank inference stack with the same parameter count, FLOPs, and memory footprint as the base backbone.

Empirically, DLR strengthens four representative low-rank backbones (linear Low-Rank, CoLA (Liu et al., [2025](https://arxiv.org/html/2606.28932#bib.bib12)), LOST (Li et al., [2025](https://arxiv.org/html/2606.28932#bib.bib15)), and LaX (Zhang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib17))) in most settings across LLaMA pre-training scales from 60M to 7B using a single untuned configuration, with mixed results only at the smallest 60M scale. The DLR+LOST combination at 1B improves PPL from 15.02 to 14.74, and in our single-seed 7B comparison DLR+CoLA outperforms reported LOST baselines from 40K steps onward while using $2.3\times$ less per-GPU memory (Tab. [1](https://arxiv.org/html/2606.28932#S4.T1), Tab. [2](https://arxiv.org/html/2606.28932#S4.T2)). Folded checkpoints transfer cleanly to supervised fine-tuning on Alpaca-cleaned, where DLR+CoLA yields a slightly higher average accuracy among the three checkpoints we compare, despite deploying through the same inference graph as vanilla CoLA.
We summarize our contributions as follows:
- **Foldable training residual.** We give a closed-form identity, $B^{\star}=B+\tfrac{\alpha}{\sqrt{K}}\,R^{\top}$, that absorbs DLR into the up-projection of any low-rank parameterization that exposes a latent-to-output decoder, yielding the same deployed graph shape, parameter count, FLOPs, and memory footprint as the base method.
- **Parameter-free training overhead.** With $\alpha=1$ fixed, DLR adds zero learnable parameters per layer. Its training-time forward cost is a single `repeat_interleave` plus add, with small overhead under `torch.compile` and zero inference overhead after folding.
- **Backbone-agnostic plug-in.** A single DLR configuration improves four low-rank backbones (linear Low-Rank, CoLA, LOST, LaX) across LLaMA pre-training scales from 130M to 7B on C4 perplexity, with mixed results at 60M (Tab. [1](https://arxiv.org/html/2606.28932#S4.T1), Tab. [2](https://arxiv.org/html/2606.28932#S4.T2)); folded checkpoints transfer cleanly to Alpaca-cleaned SFT, where DLR+CoLA yields a slightly higher average accuracy among the compared 1B checkpoints (Tab. [4](https://arxiv.org/html/2606.28932#S4.T4)).
## 2 Low-rank efficient pre-training
### 2.1 Setup and a Unified View
Consider a linear map in a transformer block \(bias omitted\):
$$y = Wx,\quad W\in\mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}},\ x\in\mathbb{R}^{d_{\mathrm{in}}}. \tag{1}$$

Many efficient pre-training methods can be expressed as

$$y \;=\; \underbrace{\Phi_{\mathrm{LR}}(x;\theta)}_{\text{low-rank backbone}} \;+\; \underbrace{\Phi_{\mathrm{EX}}(x;\psi)}_{\text{extra compensation}}, \tag{2}$$

where $\Phi_{\mathrm{LR}}$ is a low-rank backbone (possibly nonlinear in latent space), and $\Phi_{\mathrm{EX}}$ is an additional compensation pathway.

We use a shared notation throughout. Let $r\ll\min(d_{\mathrm{in}},d_{\mathrm{out}})$ be the rank, $A\in\mathbb{R}^{d_{\mathrm{in}}\times r}$ be the down-projection, $B\in\mathbb{R}^{d_{\mathrm{out}}\times r}$ be the up-projection, and define the latent

$$z \;=\; s\cdot\phi(A^{\top}x)\in\mathbb{R}^{r}, \tag{3}$$

where $\phi(\cdot)$ is an optional activation and $s$ is a scalar scaling (fixed or learnable). Unless stated otherwise, we focus on *low-rank pre-training*, where the linear map is parameterized by $(A,B)$ (rather than an adapter added on top of a separate frozen base weight).

This unified decomposition allows us to compare seemingly different efficient pre-training methods through the lens of where and how they introduce additional learning signals beyond a rank-$r$ backbone. In particular, it makes clear whether improvements stem from modifying the latent representation or from adding new weight-space pathways.
### 2.2 Baselines in the Unified Notation
**Low-rank (linear)** (Hu et al., [2021](https://arxiv.org/html/2606.28932#bib.bib7)). The linear low-rank parameterization replaces $W$ by $BA^{\top}$:

$$\Phi_{\mathrm{LR}}(x)=B(A^{\top}x),\qquad \Phi_{\mathrm{EX}}(x)\equiv 0. \tag{4}$$

This cuts multiplies from $d_{\mathrm{out}}d_{\mathrm{in}}$ to $r(d_{\mathrm{out}}+d_{\mathrm{in}})$ and reduces optimizer states to the size of $(A,B)$, but the fixed rank often underperforms full-rank pre-training (Kamalakara et al., [2022](https://arxiv.org/html/2606.28932#bib.bib27); Khodak et al., [2021](https://arxiv.org/html/2606.28932#bib.bib25)).
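To make the multiply count concrete, here is a minimal arithmetic sketch. It assumes a square projection with $d_{\mathrm{in}}=d_{\mathrm{out}}=2048$ and $r=512$ (the $r/d$ setting listed for the 1B scale in Tab. 1); the function name `multiply_counts` is illustrative and not part of the paper.

```python
from typing import Tuple


def multiply_counts(d_out: int, d_in: int, r: int) -> Tuple[int, int]:
    """Return (full-rank, low-rank) multiplies for one input vector."""
    full = d_out * d_in        # y = W x
    low = r * (d_in + d_out)   # A^T x costs r * d_in, then B (.) costs d_out * r
    return full, low


# Assumed shapes: a square projection at the 1B setting (r / d = 512 / 2048).
full, low = multiply_counts(d_out=2048, d_in=2048, r=512)
ratio = low / full
print(full, low, ratio)
```

Under these assumed shapes the factorized layer needs half the multiplies of the dense one; the saving grows as $r$ shrinks relative to $d_{\mathrm{in}}$ and $d_{\mathrm{out}}$.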
**SLTrain (low-rank + element-wise sparse residual)** (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13)). SLTrain augments the low-rank factorization with a fixed-support sparse residual:

$$W=BA^{\top}+S,\qquad \mathrm{supp}(S)=\Omega,\ |\Omega|=k, \tag{5}$$

where $\Omega\subseteq[d_{\mathrm{out}}]\times[d_{\mathrm{in}}]$ is sampled once at initialization. Equivalently,

$$\Phi_{\mathrm{LR}}(x)=B(A^{\top}x),\qquad \Phi_{\mathrm{EX}}(x)=Sx. \tag{6}$$

Thus SLTrain improves expressivity through an explicit sparse weight-space path, unlike DLR's fixed latent-space residual that is folded into $B$.
**LOST (low-rank + channel-sparse compensation)** (Li et al., [2025](https://arxiv.org/html/2606.28932#bib.bib15)). LOST combines a low-rank branch with an SVD-guided channel-sparse component intended to cover complementary directions:

$$\Phi_{\mathrm{LR}}(x)=Bz,\qquad \Phi_{\mathrm{EX}}(x)=W_{s}\,P_{I}x, \tag{7}$$

where $P_{I}\in\{0,1\}^{k\times d_{\mathrm{in}}}$ selects a subset $I$ of $k$ input channels, so $P_{I}x\in\mathbb{R}^{k}$ contains exactly the channels feeding the learned compensation matrix $W_{s}\in\mathbb{R}^{d_{\mathrm{out}}\times k}$; equivalently, the columns of $W_{s}$ correspond to the $P_{I}$-selected channels. The latent $z$ is defined in Eq. [3](https://arxiv.org/html/2606.28932#S2.E3). The compensation path is learned and remains part of the deployed graph, whereas DLR adds a fixed latent residual during training and folds it into $B$.
**CoLA (nonlinear latent low-rank)** (Liu et al., [2025](https://arxiv.org/html/2606.28932#bib.bib12)). CoLA keeps a pure low-rank decoder but makes the latent representation nonlinear:

$$\Phi_{\mathrm{LR}}(x)=Bz,\qquad z=s\cdot\phi(A^{\top}x),\qquad \Phi_{\mathrm{EX}}(x)\equiv 0. \tag{8}$$

It enriches trainability through latent gating without adding a separate weight-space compensation path, making it a natural backbone for DLR.
**LaX (inter-layer latent residual)** (Zhang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib17)). LaX improves low-rank training by introducing an *inter-layer* residual in latent space. For a layer latent $z_{i}=s\cdot\phi(A_{i}^{\top}x_{i})$, it forms

$$\tilde{z}_{i}=z_{i}+G(z_{i-1}),\qquad y_{i}=\mathrm{LN}(B_{i}\tilde{z}_{i}), \tag{9}$$

where $G$ is a lightweight alignment gate and the output LayerNorm follows the original formulation. In the unified view, LaX keeps the same low-rank decoder form but adds information flow from the previous layer's latent, improving trainability without increasing the target rank $r$.

Taken together, these methods illustrate two dominant strategies for improving low-rank pre-training: (i) adding auxiliary weight-space pathways to compensate for rank deficiency (*e.g.*, SLTrain, LOST), or (ii) enriching the latent representation while keeping a single low-rank decoder (*e.g.*, CoLA, LaX). DLR is most naturally described as a latent-space plug-in, but it can attach to any backbone that exposes a low-rank latent-to-output decoder $Bz$, including the low-rank branch of hybrid methods such as LOST. It therefore targets training dynamics through a fixed, non-learned expansion without tying the design to one particular low-rank backbone.
## 3 Duplicated Latent Residual (DLR)
**Key idea.** Given an existing low-rank layer with latent $z\in\mathbb{R}^{r}$ and output $Bz$, DLR attaches an *intra-layer latent-space* residual that expands the same latent to output width with a *fixed structured decoder*, as shown in Fig. [1](https://arxiv.org/html/2606.28932#S1.F1). Let $K=\lceil d_{\mathrm{out}}/r\rceil$ and define an expansion operator

$$\mathrm{Expand}_{K}:\mathbb{R}^{r}\to\mathbb{R}^{d_{\mathrm{out}}},\qquad [\mathrm{Expand}_{K}(z)]_{i}=z_{\lfloor i/K\rfloor}, \tag{10}$$

with the last block truncated if $rK>d_{\mathrm{out}}$. Equivalently, $\mathrm{Expand}_{K}(z)=R^{\top}z$ for a fixed binary replication matrix $R\in\{0,1\}^{r\times d_{\mathrm{out}}}$ with $R_{j,i}=\mathbb{1}\{j=\lfloor i/K\rfloor\}$.
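The expansion in Eq. 10 and its matrix form can be checked on a toy example. The following is a minimal pure-Python sketch (list-based, zero-indexed as in Eq. 10); the helper names `expand`, `replication_matrix`, and `rt_matvec` are illustrative assumptions and are not the paper's reference code, which uses `repeat_interleave` on tensors.

```python
def expand(z, d_out):
    """Expand_K: copy each latent coordinate K times, then truncate to d_out."""
    r = len(z)
    K = (d_out + r - 1) // r                          # K = ceil(d_out / r)
    repeated = [z_j for z_j in z for _ in range(K)]   # mimics repeat_interleave
    return repeated[:d_out]


def replication_matrix(r, d_out):
    """R in {0,1}^(r x d_out) with R[j][i] = 1 iff j == floor(i / K)."""
    K = (d_out + r - 1) // r
    return [[1 if j == i // K else 0 for i in range(d_out)] for j in range(r)]


def rt_matvec(R, z):
    """Compute R^T z without materializing the transpose."""
    d_out = len(R[0])
    return [sum(R[j][i] * z[j] for j in range(len(z))) for i in range(d_out)]


z = [1.0, 2.0, 3.0, 4.0]
d_out = 10                       # K = ceil(10 / 4) = 3; the last block keeps 1 entry
y = expand(z, d_out)
R = replication_matrix(len(z), d_out)
assert y == rt_matvec(R, z)      # Expand_K(z) equals R^T z
print(y)
```

With these toy sizes, the first three latent coordinates are each copied three times and the fourth appears once, which is the truncated last block described above.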
The plug\-in form used during training is:
$$y \;=\; Bz \;+\; \frac{\alpha}{\sqrt{K}}\cdot\mathrm{Expand}_{K}(z) \;+\; b, \tag{11}$$

where $\alpha\in\mathbb{R}$ controls the residual strength (fixed in all experiments). Duplicated Latent Residual (DLR) therefore augments, rather than replaces, a standard low-rank layer: the backbone still supplies $A$, $B$, the activation, and any other latent-space structure, while DLR contributes only the fixed residual term $\frac{\alpha}{\sqrt{K}}\,\mathrm{Expand}_{K}(z)$ during training. Intuitively, DLR views the output space as partially redundant and leverages structured replication to expose each latent coordinate to multiple output channels. Unlike sparse or channel-selective compensation, the resulting residual is dense, fixed, and highly structured, leading to uniform coverage of the output space without introducing irregular sparsity. As a result, DLR provides an additional learning signal that bypasses the conditioning of the learned decoder, while retaining the computational characteristics of the underlying low-rank backbone.
**Foldability: zero inference overhead.** A defining property of DLR is that, after training, the duplicated-residual branch can be *exactly* absorbed into the up-projection $B$ in closed form, leaving the inference graph identical to the underlying low-rank decoder. Recall from Eq. [10](https://arxiv.org/html/2606.28932#S3.E10) that $R\in\{0,1\}^{r\times d_{\mathrm{out}}}$ with $\mathrm{Expand}_{K}(z)=R^{\top}z$, so $R^{\top}\in\{0,1\}^{d_{\mathrm{out}}\times r}$ has the same shape as $B$ (each column of $R^{\top}$ is the indicator of one latent group). Substituting into Eq. [11](https://arxiv.org/html/2606.28932#S3.E11) and collecting linear terms in $z$ yields the algebraic identity

$$y \;=\; Bz \;+\; \tfrac{\alpha}{\sqrt{K}}\,R^{\top}z \;+\; b \;=\; \underbrace{\Big(B+\tfrac{\alpha}{\sqrt{K}}\,R^{\top}\Big)}_{B^{\star}}\,z \;+\; b \;=\; B^{\star}z \;+\; b, \tag{12}$$

where $B^{\star}\in\mathbb{R}^{d_{\mathrm{out}}\times r}$ has *the same shape* as the original up-projection $B$. Foldability follows by construction: the operation $B\leftarrow B+(\alpha/\sqrt{K})\,R^{\top}$ is performed once at training termination, and the resulting checkpoint is a drop-in replacement that runs on any code path implementing standard rank-$r$ inference (the $\mathrm{Expand}_{K}$ branch, the buffer storing $R$, and the residual scale $\alpha$ are all discarded). Hence the deployment artifact has the same graph structure, parameter count, FLOPs, and peak-memory footprint as the corresponding low-rank baseline, while still benefiting from the better-trained $B^{\star}$. This contrasts with prior residual or auxiliary-pathway approaches over low-rank backbones (e.g., LoR2C (Zhao et al., [2025](https://arxiv.org/html/2606.28932#bib.bib18)), ResLoRA (Wang et al., [2025b](https://arxiv.org/html/2606.28932#bib.bib20)), sparse-additive variants such as SLTrain (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13)), or inter-layer latent residuals such as LaX (Zhang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib17))): they retain auxiliary parameters or non-mergeable nonlinearities at inference time and therefore modify the deployed graph. Appendix [G](https://arxiv.org/html/2606.28932#A7) further situates DLR among residual-style methods and clarifies why foldability differs from mergeable adapter residuals (Tab. [11](https://arxiv.org/html/2606.28932#A7.T11)).

A reference `fold()` implementation in $\sim$10 lines is provided in Appendix [A](https://arxiv.org/html/2606.28932#A1.SSx1); the operation is the only DLR-specific code that runs in the deployment pipeline, after which inference is indistinguishable from a vanilla rank-$r$ decoder.
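The folding identity in Eq. 12 can be verified numerically on a toy layer. The sketch below is an independent pure-Python illustration, not the Appendix A listing; the helper names (`ceil_div`, `matvec`, `dlr_train_forward`, `fold`) and the toy sizes $d_{\mathrm{out}}=10$, $r=4$ are assumptions made for the example.

```python
import math
import random


def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return (a + b - 1) // b


def matvec(M, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dlr_train_forward(B, z, b, alpha=1.0):
    """Training-time output: y = B z + (alpha / sqrt(K)) * Expand_K(z) + b."""
    d_out, r = len(B), len(z)
    K = ceil_div(d_out, r)
    beta = alpha / math.sqrt(K)
    Bz = matvec(B, z)
    return [Bz[i] + beta * z[i // K] + b[i] for i in range(d_out)]


def fold(B, alpha=1.0):
    """Return B_star = B + (alpha / sqrt(K)) * R^T, with the same shape as B."""
    d_out, r = len(B), len(B[0])
    K = ceil_div(d_out, r)
    beta = alpha / math.sqrt(K)
    # Row i of R^T has a single 1 in column floor(i / K).
    return [[B[i][j] + (beta if j == i // K else 0.0) for j in range(r)]
            for i in range(d_out)]


rng = random.Random(0)
d_out, r = 10, 4
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
b = [rng.gauss(0, 1) for _ in range(d_out)]
z = [rng.gauss(0, 1) for _ in range(r)]

y_train = dlr_train_forward(B, z, b)                           # residual branch active
B_star = fold(B)                                               # one-time fold
y_folded = [v + b_i for v, b_i in zip(matvec(B_star, z), b)]   # plain rank-r decoder
max_gap = max(abs(u - v) for u, v in zip(y_train, y_folded))
print(max_gap)
```

The two outputs agree up to floating-point rounding, and `B_star` has the same shape as `B`, so nothing beyond the updated up-projection needs to be stored.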
**Variance-preserving correction.** A naive duplication would amplify the residual energy by a factor of $K_{j}$ in the $j$-th block, where $K_{j}$ denotes the number of output coordinates in that block. We therefore use the global scaling $\beta=\alpha/\sqrt{K}$ in Eq. [11](https://arxiv.org/html/2606.28932#S3.E11). When $d_{\mathrm{out}}=rK$ (all blocks have size $K$), this choice preserves per-latent energy: $\|\beta\,\mathrm{Expand}_{K}(z)\|_{2}^{2}=\alpha^{2}\|z\|_{2}^{2}$. Equivalently, if $\mathrm{Var}(z_{j})=\sigma_{z}^{2}$, then each duplicated coordinate has variance $\mathrm{Var}(\beta z_{j})=\alpha^{2}\sigma_{z}^{2}/K$, so the *total* variance mass assigned to the block remains $\alpha^{2}\sigma_{z}^{2}$. If the last block is truncated ($K_{r-1}<K$ under the zero-based block indexing of Eq. [10](https://arxiv.org/html/2606.28932#S3.E10)), the same scaling yields a slightly smaller energy contribution for that block, which is benign and avoids data- or shape-dependent re-scaling. This correction mirrors the variance-preserving "repair" principle used in folding/merging operations (Wang et al., [2025a](https://arxiv.org/html/2606.28932#bib.bib3); Saukh et al., [2026](https://arxiv.org/html/2606.28932#bib.bib41)) and is essential for stable pre-training (Jordan et al., [2022](https://arxiv.org/html/2606.28932#bib.bib6)).
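A small numeric check of the scaling argument, as a sketch under assumed toy values: with $\alpha=1$ and all blocks full, the scaled residual carries the same squared energy as $z$; with a truncated last block it carries slightly less. The function name `residual_energy` and the example vector are illustrative only.

```python
import math


def residual_energy(z, d_out, alpha=1.0):
    """Squared L2 norm of (alpha / sqrt(K)) * Expand_K(z)."""
    r = len(z)
    K = (d_out + r - 1) // r                      # K = ceil(d_out / r)
    beta = alpha / math.sqrt(K)
    expanded = [z[i // K] for i in range(d_out)]  # contiguous duplication
    return sum((beta * v) ** 2 for v in expanded)


z = [0.5, -1.0, 2.0, 1.5]
latent_energy = sum(v * v for v in z)             # 0.25 + 1 + 4 + 2.25 = 7.5

full_blocks = residual_energy(z, d_out=12)        # K = 3 and d_out = r * K
truncated = residual_energy(z, d_out=10)          # K = 3, last block has 1 entry

print(latent_energy, full_blocks, truncated)
```

In the truncated case the last coordinate contributes $z_{3}^{2}/3$ instead of $z_{3}^{2}$, so the total drops from about 7.5 to about 6.0 for this vector.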
**Duplication map.** DLR employs a *deterministic, uniform* duplication structure. The expansion operator $\mathrm{Expand}_{K}$ partitions the $d_{\mathrm{out}}$ output coordinates into $r$ consecutive blocks of size $K$ (with the last block possibly truncated to $d_{\mathrm{out}}-(r-1)K$ elements). Each block $\mathcal{G}_{j}=\{jK,\,jK+1,\,\ldots,\,\min((j+1)K-1,\,d_{\mathrm{out}}-1)\}$ receives a copy of the latent coordinate $z_{j}$. Formally, the replication matrix $R\in\{0,1\}^{r\times d_{\mathrm{out}}}$ has row $j$ containing ones at columns $\mathcal{G}_{j}$ and zeros elsewhere. This fixed, balanced structure ensures that each latent dimension contributes equally to the residual pathway and requires no hyperparameter tuning beyond the rank $r$. We implement $\mathrm{Expand}_{K}(z)$ as a single `repeat_interleave` followed by truncation; the *contiguous* form is our hardware-efficient choice, since more complex fixed maps (*e.g.*, permuted masks or non-contiguous/grouped duplication) force indexed gather/scatter operations (e.g., `index_select`) that cannot be fused as cleanly. We quantify this implementation choice via a random-duplication ablation in Sec. [4](https://arxiv.org/html/2606.28932#S4).
**Design rationale.** DLR is built around two deliberately simple choices: a fixed latent-to-output duplication map and a shape-dependent scale $\beta=\alpha/\sqrt{K}$. The scale prevents the residual branch from growing merely because each latent coordinate is copied multiple times. For the group $\mathcal{G}_{j}$ that receives copies of $z_{j}$, the duplicated residual block satisfies

$$\big\|\beta\,(R^{\top}z)_{\mathcal{G}_{j}}\big\|_{2}^{2}=\beta^{2}K_{j}z_{j}^{2}=\alpha^{2}\frac{K_{j}}{K}z_{j}^{2},$$

where $K_{j}=|\mathcal{G}_{j}|$, so when the group is full ($K_{j}=K$) and $\alpha=1$, the residual assigns the block the same squared energy as the original latent coordinate. This normalization is purely shape-based: it keeps the forward residual on the same scale as $z$ without introducing data-dependent statistics, per-layer tuning, or additional learned parameters.
The same fixed branch also changes the backward path in a controlled way. Let $g_{y}=\partial\mathcal{L}/\partial y\in\mathbb{R}^{d_{\mathrm{out}}}$. Differentiating Eq. [11](https://arxiv.org/html/2606.28932#S3.E11) with respect to the latent variable gives

$$g_{z} \;=\; \frac{\partial\mathcal{L}}{\partial z} \;=\; B^{\top}g_{y} \;+\; \beta R\,g_{y}. \tag{13}$$

The first term is the usual gradient route through the learned decoder $B$, while the second term is a fixed, decoder-independent route that aggregates output gradients according to the same groups used in the forward duplication. This is the mechanism we intend DLR to provide: during training, the encoder $A$ receives an additional calibrated signal that does not have to pass through the current state of $B$; after training, the entire route is folded into $B^{\star}$ by Eq. [12](https://arxiv.org/html/2606.28932#S3.E12). The effectiveness of this design is evaluated empirically in the gradient-path measurements of Sec. [4](https://arxiv.org/html/2606.28932#S4), where the residual-induced component is strong early in training and remains nearly orthogonal to the learned-decoder component, and in the variance-correction ablation of Tab. [7](https://arxiv.org/html/2606.28932#A2.T7).
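Eq. 13 can be sanity-checked against finite differences. The sketch below assumes a toy linear loss $\mathcal{L}(y)=\sum_i c_i y_i$ (so that $g_y=c$), omits the bias since it does not affect $g_z$, and uses illustrative helper names (`forward`, `latent_grad_terms`); none of this is taken from the paper's code.

```python
import math
import random


def forward(B, z, alpha=1.0):
    """y = B z + (alpha / sqrt(K)) * Expand_K(z), bias omitted."""
    d_out, r = len(B), len(z)
    K = (d_out + r - 1) // r
    beta = alpha / math.sqrt(K)
    return [sum(B[i][j] * z[j] for j in range(r)) + beta * z[i // K]
            for i in range(d_out)]


def latent_grad_terms(B, g_y, alpha=1.0):
    """Return the two terms of Eq. 13: (B^T g_y, beta * R g_y)."""
    d_out, r = len(B), len(B[0])
    K = (d_out + r - 1) // r
    beta = alpha / math.sqrt(K)
    learned = [sum(B[i][j] * g_y[i] for i in range(d_out)) for j in range(r)]
    # R g_y sums the output gradients over each duplication block.
    fixed = [beta * sum(g_y[i] for i in range(d_out) if i // K == j)
             for j in range(r)]
    return learned, fixed


rng = random.Random(0)
d_out, r = 10, 4
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
z = [rng.gauss(0, 1) for _ in range(r)]
c = [rng.gauss(0, 1) for _ in range(d_out)]   # toy loss L(y) = sum_i c_i * y_i


def loss(z_vec):
    return sum(c_i * y_i for c_i, y_i in zip(c, forward(B, z_vec)))


learned, fixed = latent_grad_terms(B, c)       # g_y = c for the toy loss
g_z = [a + f for a, f in zip(learned, fixed)]

eps = 1e-4
numeric = []
for j in range(r):
    z_plus = list(z)
    z_plus[j] += eps
    z_minus = list(z)
    z_minus[j] -= eps
    numeric.append((loss(z_plus) - loss(z_minus)) / (2 * eps))

max_err = max(abs(a - n) for a, n in zip(g_z, numeric))
print(max_err)
```

The `fixed` term depends only on $g_y$ and the block structure, not on $B$, which is the decoder-independent route described above.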
**Computational complexity.** DLR does not replace the low-rank backbone; it adds a training-time residual branch on top of the backbone's existing latent $z$ and up-projection $B$. For any backbone that outputs $Bz$, the only extra forward work during training is the structured expansion-and-add term $\frac{\alpha}{\sqrt{K}}\,R^{\top}z$ (equivalently $\frac{\alpha}{\sqrt{K}}\,\mathrm{Expand}_{K}(z)$), implemented as `repeat_interleave` followed by truncation (see the reference listing in Appendix [A](https://arxiv.org/html/2606.28932#A1)). Crucially, the plug-in introduces no extra large GEMMs, no learned matrices, and no sparse/indexed matrix multiplications as in Eq. [5](https://arxiv.org/html/2606.28932#S2.E5)–Eq. [7](https://arxiv.org/html/2606.28932#S2.E7); under `torch.compile`, our profiler traces do not show the expansion as a separate `aten::repeat_interleave` kernel, suggesting fusion into surrounding computation. After folding, even this residual branch is removed from the inference graph; end-to-end throughput and peak memory are reported in Sec. [4](https://arxiv.org/html/2606.28932#S4) (Tab. [3](https://arxiv.org/html/2606.28932#S4.T3)).
## 4 Experiments
### 4.1 Evaluation Setup and Protocol
**Pre-training data.** We pre-train all models on the Colossal Clean Crawled Corpus (C4) (Raffel et al., [2020](https://arxiv.org/html/2606.28932#bib.bib21)), a cleaned and deduplicated snapshot of Common Crawl that is widely used for language modeling.

**Model family and token budgets.** Following the experimental protocols used by Han et al. ([2024](https://arxiv.org/html/2606.28932#bib.bib13)), Glentis et al. ([2025](https://arxiv.org/html/2606.28932#bib.bib14)), and Li et al. ([2025](https://arxiv.org/html/2606.28932#bib.bib15)), we adopt the same LLaMA-style architecture family and match the token budget for each model scale. Concretely, we pre-train models with 60M, 130M, 350M, 1B, and 7B parameters (Touvron et al., [2023b](https://arxiv.org/html/2606.28932#bib.bib28), [a](https://arxiv.org/html/2606.28932#bib.bib29)), using the same number of training tokens as prior work at each size (see Tab. [1](https://arxiv.org/html/2606.28932#S4.T1)). Training details are provided in Appendix [A](https://arxiv.org/html/2606.28932#A1).
**Baselines.** We compare against *Full-Rank* training and representative efficient pre-training methods, including *Low-rank* (Hu et al., [2021](https://arxiv.org/html/2606.28932#bib.bib7)), *GaLore* (Zhao et al., [2024](https://arxiv.org/html/2606.28932#bib.bib10)), *Fira* (Chen et al., [2024](https://arxiv.org/html/2606.28932#bib.bib11)), *LORO* (Mo et al., [2025](https://arxiv.org/html/2606.28932#bib.bib16)), *SLTrain* (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13)), *LOST* (Li et al., [2025](https://arxiv.org/html/2606.28932#bib.bib15)), *CoLA* (Liu et al., [2025](https://arxiv.org/html/2606.28932#bib.bib12)), and *LaX* (Zhang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib17)). Unless otherwise stated, all baselines use the authors' recommended hyperparameters and the same optimizer/schedule.
**Plugging DLR into LLaMA backbones.** For each low-rank LLaMA backbone, we attach the DLR residual to every low-rank linear projection in the Transformer blocks, including self-attention projections ($W_{Q}$, $W_{K}$, $W_{V}$, $W_{O}$) and the MLP/FFN projections. The backbone still determines the learned matrices, rank, activation, and any inter-layer latent structure. DLR only adds the fixed duplicated-latent branch during training and folds it into the corresponding up-projection before deployment (Sec. [3](https://arxiv.org/html/2606.28932#S3)).

To ensure a fair comparison, we use the same AdamW optimizer and closely follow the training recipe in prior work, including learning-rate schedule, warmup ratio, and packed-data training, whenever applicable (Zhao et al., [2024](https://arxiv.org/html/2606.28932#bib.bib10); Glentis et al., [2025](https://arxiv.org/html/2606.28932#bib.bib14)). Additional implementation details (including rank settings, scaling choices, and compilation/profiling configurations) are provided in Appendix [A](https://arxiv.org/html/2606.28932#A1).
### 4.2 End-to-End Performance
**Main results.** We focus on the trade-off between validation perplexity (PPL) and training efficiency under matched token budgets. Improvements are meaningful only if they preserve the throughput and memory profile of the underlying low-rank backbone. Tab. [1](https://arxiv.org/html/2606.28932#S4.T1) reports validation PPL on C4 across LLaMA scales 60M / 130M / 350M / 1B, together with parameter count and estimated memory, paired by backbone (Low-Rank, CoLA, LOST, Low-Rank+LaX, CoLA+LaX) so that each base method can be read off against its +DLR counterpart. A single untuned DLR configuration ($\alpha=1$, fixed, uniform $\mathrm{Expand}_{K}$) consistently improves all four backbones at 130M and above while keeping the same parameter budget as the corresponding backbone. In particular, CoLA+DLR achieves substantial PPL reductions over CoLA at fixed rank (*e.g.*, $15.76\rightarrow14.26$ at 1B; $25.61\rightarrow23.80$ at 130M), and DLR also improves the strong LOST baseline (*e.g.*, $15.02\rightarrow14.74$ at 1B), yielding the best perplexity–efficiency trade-off among the foldable plug-in variants we report.
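For scale, the quoted PPL pairs can be converted into relative reductions. This is a small arithmetic sketch using only the numbers cited in the paragraph above; the dictionary labels and the helper `relative_reduction` are illustrative.

```python
# (base PPL, +DLR PPL) pairs quoted in the main-results paragraph.
pairs = {
    "CoLA @ 1B": (15.76, 14.26),
    "CoLA @ 130M": (25.61, 23.80),
    "LOST @ 1B": (15.02, 14.74),
}


def relative_reduction(base: float, with_dlr: float) -> float:
    """Fractional PPL reduction of the +DLR run relative to its base run."""
    return (base - with_dlr) / base


percent = {name: round(100 * relative_reduction(base, dlr), 1)
           for name, (base, dlr) in pairs.items()}

for name, value in percent.items():
    print(f"{name}: {value}% lower PPL")
```

Read this way, the CoLA gains are on the order of 7 to 10 percent, while the gain over the stronger LOST baseline is closer to 2 percent.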
Table 1: Main C4 validation results across LLaMA scales. We report validation perplexity (PPL), parameter count (M), and estimated optimizer-state memory (GB) for 60M–1B models under matched token budgets. Rows are paired by backbone; **bold** marks the better PPL within each base/+DLR pair. DLR adds no learned parameters and folds into the base decoder, so each +DLR row has the same deployed parameter count, FLOPs, and memory footprint as its base row. Non-DLR baselines are reproduced by us or taken from prior work (Glentis et al., [2025](https://arxiv.org/html/2606.28932#bib.bib14); Li et al., [2025](https://arxiv.org/html/2606.28932#bib.bib15); Mo et al., [2025](https://arxiv.org/html/2606.28932#bib.bib16); Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13); Zhang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib17); Liu et al., [2025](https://arxiv.org/html/2606.28932#bib.bib12); Zhao et al., [2024](https://arxiv.org/html/2606.28932#bib.bib10); Chen et al., [2024](https://arxiv.org/html/2606.28932#bib.bib11)); memory follows the optimizer-state estimate convention of (Zhao et al., [2024](https://arxiv.org/html/2606.28932#bib.bib10); Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13)).

| Setting | 60M | 130M | 350M | 1B |
|---|---|---|---|---|
| $r$ / $d$ | 128 / 512 | 256 / 768 | 256 / 1024 | 512 / 2048 |
| Tokens | 1.4B | 2.6B | 7.8B | 13.1B |

| Method | 60M PPL | 60M Param | 60M Mem | 130M PPL | 130M Param | 130M Mem | 350M PPL | 350M Param | 350M Mem | 1B PPL | 1B Param | 1B Mem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full-Model | 30.27 | 58 | 0.35 | 23.13 | 134 | 0.81 | 18.76 | 368 | 2.21 | 13.40 | 1339 | 8.04 |
| LoRA (Hu et al., [2021](https://arxiv.org/html/2606.28932#bib.bib7)) | 35.30 | 43 | 0.36 | 25.07 | 94 | 0.84 | 19.13 | 185 | 1.85 | 15.83 | 609 | 6.34 |
| GaLore (Zhao et al., [2024](https://arxiv.org/html/2606.28932#bib.bib10)) | 34.58 | 58 | 0.28 | 25.31 | 134 | 0.61 | 19.37 | 368 | 1.59 | 15.57 | 1339 | 4.76 |
| Fira (Chen et al., [2024](https://arxiv.org/html/2606.28932#bib.bib11)) | 30.34 | 58 | 0.28 | 22.96 | 134 | 0.61 | 16.82 | 368 | 1.59 | 15.10 | 1339 | 4.76 |
| SLTrain (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13)) | 32.58 | 47 | 0.30 | 24.17 | 104 | 0.67 | 18.59 | 215 | 1.54 | 15.40 | 732 | 5.33 |
| LORO (Mo et al., [2025](https://arxiv.org/html/2606.28932#bib.bib16)) | 33.87 | 43 | 0.24 | 24.78 | 94 | 0.57 | 19.66 | 185 | 1.11 | 15.53 | 609 | 3.66 |
| Low-Rank | 35.13 | 43 | 0.24 | 26.71 | 94 | 0.57 | 21.77 | 185 | 1.11 | 18.22 | 609 | 3.66 |
| +DLR | **35.01** ± 0.18 | 43 | 0.24 | **25.00** ± 0.35 | 94 | 0.57 | **18.75** ± 0.03 | 185 | 1.11 | **15.72** ± 0.74 | 609 | 3.66 |
| CoLA (Liu et al., [2025](https://arxiv.org/html/2606.28932#bib.bib12)) | 34.10 | 43 | 0.24 | 25.61 | 94 | 0.57 | 19.75 | 185 | 1.11 | 15.76 | 609 | 3.66 |
| +DLR | **32.96** ± 0.06 | 43 | 0.24 | **23.80** ± 0.40 | 94 | 0.57 | **18.38** ± 0.03 | 185 | 1.11 | **14.26** ± 0.02 | 609 | 3.66 |
| LOST (Li et al., [2025](https://arxiv.org/html/2606.28932#bib.bib15)) | **32.25** | 43 | 0.24 | 24.05 | 94 | 0.57 | 18.95 | 185 | 1.11 | 15.02 | 609 | 3.66 |
| +DLR | 33.15 ± 0.12 | 43 | 0.24 | **24.03** ± 0.03 | 94 | 0.57 | **18.88** ± 0.02 | 185 | 1.11 | **14.74** ± 0.01 | 609 | 3.66 |
| Low-Rank+LaX (Zhang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib17)) | **33.54** | 44 | 0.33 | 24.63 | 94 | 0.70 | 18.90 | 185 | 1.38 | 15.51 | 609 | 4.54 |
| +DLR | 33.71 ± 0.05 | 44 | 0.33 | **24.38** ± 0.02 | 94 | 0.70 | **18.85** ± 0.01 | 185 | 1.38 | **15.29** ± 0.48 | 609 | 4.54 |
| CoLA+LaX (Zhang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib17)) | **33.21** | 44 | 0.33 | 24.21 | 99 | 0.74 | 18.51 | 196 | 1.46 | 14.78 | 609 | 4.54 |
| +DLR | 34.89 ± 0.07 | 44 | 0.33 | **24.08** ± 0.07 | 99 | 0.74 | **18.37** ± 0.02 | 196 | 1.46 | **14.67** ± 0.00 | 609 | 4.54 |
The 60M scale shows a few sub-cells (LOST+DLR, Low-Rank+LaX+DLR, CoLA+LaX+DLR) where the +DLR row matches or slightly trails the corresponding base row. We attribute this to multi-seed noise at the smallest scale: with the lowest token budget in our study (1.4B vs. 13.1B at 1B) and the smallest hidden width, each backbone's run-to-run PPL spread is wider relative to the absolute PPL gap a single-layer plug-in can move. The ± std reported on the +DLR rows (0.05–0.18) is consistent with this regime, and the deviations vanish from 130M onward, where the token-to-parameter ratio is larger. We therefore do not draw conclusions from the 60M sub-cells in isolation, and report 60M primarily as the small end of a 4-scale trajectory rather than as a stand-alone benchmark.
Scaling-up results at 7B are reported separately in Tab. [2](https://arxiv.org/html/2606.28932#S4.T2), and downstream results after instruction fine-tuning are presented in [Section 4.3](https://arxiv.org/html/2606.28932#S4.SS3) (Tab. [4](https://arxiv.org/html/2606.28932#S4.T4)). We additionally report pre-fine-tuning zero-shot evaluation for the 1B models in Appendix [F](https://arxiv.org/html/2606.28932#A6) (Tab. [10](https://arxiv.org/html/2606.28932#A6.T10)).
Variance/scale correction ($1/\sqrt{K}$). The duplicated residual copies each latent coordinate into $K$ output coordinates, so we scale it by $1/\sqrt{K}$ to keep the residual energy comparable across ranks and output widths. Ablating this correction consistently worsens validation perplexity for DLR+CoLA across scales, indicating that the normalization is important in our pre-training setting. Full results are reported in Appendix [B](https://arxiv.org/html/2606.28932#A2) (Tab. [7](https://arxiv.org/html/2606.28932#A2.T7)).
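The effect of the $1/\sqrt{K}$ factor can be checked numerically. The sketch below is illustrative only: it uses NumPy's `np.repeat` as a stand-in for the contiguous duplication, assumes $d_{\mathrm{out}} = K r$ exactly, and uses hypothetical sizes (`r = 8`, `k = 4`) that are not taken from the paper.

```python
import numpy as np

def expand_k(z, k):
    """Contiguous duplication: repeat each latent coordinate k times."""
    return np.repeat(z, k, axis=-1)

rng = np.random.default_rng(0)
r, k = 8, 4                          # hypothetical sizes; d_out = k * r
z = rng.standard_normal((16, r))     # a batch of latent vectors

raw = expand_k(z, k)                 # unscaled duplicated residual
scaled = raw / np.sqrt(k)            # with the 1/sqrt(K) correction

# Ratio of squared norms relative to the latent vector itself.
z_energy = (z ** 2).sum(axis=-1)
raw_ratio = (raw ** 2).sum(axis=-1) / z_energy
scaled_ratio = (scaled ** 2).sum(axis=-1) / z_energy
```

Without the correction the duplicated residual carries $K$ times the squared norm of $z$; with it, the squared norm equals that of $z$ regardless of the rank or output width.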
Effective-rank / target-rank sensitivity. We also test whether DLR remains useful under tighter rank budgets by reducing the default rank to $0.75\,r_0$ and $0.5\,r_0$. Perplexity degrades smoothly rather than catastrophically, indicating a predictable quality–efficiency trade-off rather than dependence on a single tuned rank. The full rank-sensitivity table is in Appendix [B](https://arxiv.org/html/2606.28932#A2) (Tab. [9](https://arxiv.org/html/2606.28932#A2.T9)).
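Because $K = \lceil d_{\mathrm{out}}/r \rceil$ depends on the rank, shrinking $r$ also changes the duplication factor and the residual scale $\alpha/\sqrt{K}$. A small sketch of that arithmetic follows, assuming an illustrative output width of 2048 and a default rank $r_0 = 512$ (the 1B $r/d$ setting in Tab. 1; actual per-layer output widths may differ). The helper name is hypothetical.

```python
import math

def duplication_factor(d_out, r):
    """K = ceil(d_out / r): how many outputs each latent coordinate feeds."""
    return math.ceil(d_out / r)

d_out, r0, alpha = 2048, 512, 1.0    # illustrative sizes; alpha = 1.0 as in the text

settings = []
for fraction in (1.0, 0.75, 0.5):
    r = int(fraction * r0)
    k = duplication_factor(d_out, r)
    settings.append((r, k, alpha / math.sqrt(k)))

for r, k, beta in settings:
    print(f"r={r:4d}  K={k}  alpha/sqrt(K)={beta:.3f}")
```

At $0.75\,r_0$ the output width is no longer a multiple of the rank; how the last partial block is handled is not specified in this section, so the sketch only reports $K$.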
Scaling-law token budget (1B). To check whether the 1B gains persist beyond the standard 13.1B-token budget, we additionally train the 1B setting for 26B tokens. The low-rank backbone equipped with DLR continues to closely track the full-rank trajectory throughout training, with a small but consistent convergence gap, matching the final ordering in Tab. [1](https://arxiv.org/html/2606.28932#S4.T1). The full trajectory is shown in Appendix [B](https://arxiv.org/html/2606.28932#A2) (Fig. [2](https://arxiv.org/html/2606.28932#A2.F2)).
Gradient-path analysis (LLaMA-350M). To understand why DLR helps low-rank pre-training, we decompose the latent gradient into the learned-decoder path $B^\top g_y$ and the duplicated-residual path $\beta R g_y$, where $\beta = \alpha/\sqrt{K}$ (cf. Eq. [13](https://arxiv.org/html/2606.28932#S3.E13)). During LLaMA-350M pre-training, we track their relative magnitude $\rho(t) = \|\beta R g_y\| / \|B^\top g_y\|$ and directional alignment $\cos_t$ across representative projections (details in Appendix [C](https://arxiv.org/html/2606.28932#A3)). The residual-induced component is strongest early in training and then decays as the learned decoder takes over, while its cosine with the decoder path remains close to zero. This supports the view that DLR improves optimization by adding a calibrated, complementary gradient route rather than merely rescaling the existing low-rank update. A complementary initialization-offset ablation in Appendix [B.2](https://arxiv.org/html/2606.28932#A2.SS2) separates this active training-time effect from simply initializing $B$ with the folded block-constant offset.
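A minimal sketch of the two quantities tracked above, computed for a single output-gradient vector. It assumes $R \in \mathbb{R}^{r \times d_{\mathrm{out}}}$ is the 0/1 contiguous duplication matrix with $d_{\mathrm{out}} = K r$. The function names are hypothetical, and the aggregation over tokens and layers used in the paper (its Appendix C) is not reproduced here.

```python
import numpy as np

def duplication_matrix(r, k):
    """R with shape (r, k*r): R[i, j] = 1 when output j is a copy of latent i."""
    return np.repeat(np.eye(r), k, axis=1)

def path_diagnostics(B, g_y, alpha=1.0):
    """Return (rho, cosine) between the residual and decoder gradient paths."""
    d_out, r = B.shape
    k = d_out // r                        # assumes d_out is a multiple of r
    beta = alpha / np.sqrt(k)
    decoder_path = B.T @ g_y              # B^T g_y
    residual_path = beta * (duplication_matrix(r, k) @ g_y)   # beta * R g_y
    n_dec = np.linalg.norm(decoder_path)
    n_res = np.linalg.norm(residual_path)
    rho = n_res / n_dec
    cosine = float(decoder_path @ residual_path) / (n_dec * n_res)
    return rho, cosine

rng = np.random.default_rng(0)
B = 0.02 * rng.standard_normal((32, 8))   # hypothetical small-init decoder
g_y = rng.standard_normal(32)             # hypothetical output gradient
rho, cosine = path_diagnostics(B, g_y)
```

Logging these two scalars per projection over training steps would produce curves of the kind described in the paragraph above.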
Choosing the residual scale $\alpha$. Our formulation allows at most one scalar $\alpha$ per layer to control the strength of the duplicated-latent residual in Eq. [11](https://arxiv.org/html/2606.28932#S3.E11). In practice, however, we find that $\alpha$ is not a sensitive knob in our pre-training setting: fixing $\alpha = 1.0$ yields essentially identical training dynamics and final perplexity compared to making $\alpha$ learnable, and a coarse hyperparameter sweep consistently selects values near $\alpha = 1.0$ as optimal. We therefore use a fixed, non-trainable $\alpha = 1.0$ in all experiments, which keeps the plug-in parameter-free in practice.
The role of $B$ and why we use a fixed contiguous duplication. We ablate two design choices on LLaMA-60M: whether the learnable decoder $B$ remains necessary, and whether the fixed duplication must be implemented as a contiguous $\mathrm{Expand}_K$. Removing $B$ substantially degrades convergence, showing that DLR is a complement to the learned low-rank decoder rather than a replacement for it. Replacing contiguous duplication with a fixed random per-output mapping reaches similar loss but is much slower under `torch.compile`, because indexing-based gathers do not fuse as cleanly as `repeat_interleave`. Thus the default design keeps $B$, uses a fixed contiguous map, and remains both foldable at inference and cheap during training; details are in Appendix [D](https://arxiv.org/html/2606.28932#A4).
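The two duplication variants compared in this ablation can be sketched as follows. This is a NumPy illustration, not the PyTorch code used in the paper: `np.repeat` plays the role of `repeat_interleave`, the random mapping is drawn once and kept fixed, and all sizes and variable names are hypothetical. It says nothing about speed, which in the paper is a property of `torch.compile` fusion.

```python
import numpy as np

rng = np.random.default_rng(0)
r, k = 4, 3
d_out = r * k                          # assumes d_out is a multiple of r
z = rng.standard_normal((2, r))        # batch of latent vectors

# Default design: contiguous duplication, output j copies latent j // k.
contiguous = np.repeat(z, k, axis=-1)

# Ablated design: a fixed random per-output mapping, implemented as a gather.
source_index = rng.integers(0, r, size=d_out)   # latent feeding each output
gathered = z[:, source_index]

# The contiguous map is the special case source_index[j] = j // k.
contiguous_index = np.arange(d_out) // k
same_as_gather = np.array_equal(contiguous, z[:, contiguous_index])
```

Both variants are fixed linear maps of $z$, so either one could be folded into $B$; the contiguous form simply avoids the index lookup.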
Scaling to LLaMA-7B: DLR+CoLA enables low-memory pre-training at scale. To probe scaling beyond 1B, we attach DLR to a CoLA LLaMA-7B backbone and compare against Full-Rank Adam, 8-bit SLTrain, and LOST on the same C4 schedule (Tab. [2](https://arxiv.org/html/2606.28932#S4.T2)). Baselines are taken from prior work (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13); Li et al., [2025](https://arxiv.org/html/2606.28932#bib.bib15); Glentis et al., [2025](https://arxiv.org/html/2606.28932#bib.bib14)). Our DLR+CoLA run uses the matched protocol described here (rank $r = 1024$, batch size 8, BF16, Adam states, single seed). From 40K steps onward in this comparison, DLR+CoLA achieves the lowest perplexity at every milestone (12.08 at 150K vs. LOST 12.80) while using $2.3\times$ less per-GPU memory than LOST (27.03 GB vs. 62.15 GB). Full-Rank Adam and 8-bit SLTrain encounter OOM at this batch size. This shows that the foldable plug-in extends to a memory-bound 7B setting while achieving the best perplexity among the compared runs.
Table 2: LLaMA-7B pre-training on C4. Validation perplexity at 10K–150K steps and steady-state per-GPU peak memory. Baselines are reported from (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13); Li et al., [2025](https://arxiv.org/html/2606.28932#bib.bib15); Glentis et al., [2025](https://arxiv.org/html/2606.28932#bib.bib14)); DLR+CoLA uses the matched protocol described in this paper. "OOM" marks runs that cannot advance at the listed batch size; **bold** indicates the best PPL at each milestone.

| Method | Batch | Mem (GB) | 10K | 40K | 80K | 120K | 150K |
|---|---|---|---|---|---|---|---|
| Full-Rank Adam | 4 | 49.53 | 24.95 | 20.05 | - | OOM | - |
| 8-bit SLTrain | 8 | 60.91 | 27.59 | - | OOM | - | - |
| LOST | 8 | 62.15 | **24.41** | 16.48 | 14.01 | 12.93 | 12.80 |
| DLR+CoLA (ours) | 8 | 27.03 | 24.98 | **15.82** | **13.71** | **12.69** | **12.08** |

System measurements (LLaMA 1B). To corroborate that DLR improves perplexity without introducing a practical efficiency bottleneck, we report end-to-end system measurements on 1B models with `torch.compile` enabled in Tab. [3](https://arxiv.org/html/2606.28932#S4.T3). Each entry reports validation PPL together with peak GPU memory after initialization and training throughput after warmup, under the same multi-node H100 setup described in the table caption. On the nonlinear low-rank backbone, adding DLR improves CoLA from PPL 15.76 to 14.26 while keeping memory and throughput nearly unchanged (13.10 → 13.09 GB; 1,079,350 → 1,044,264 tok/s). On the linear low-rank backbone, DLR improves PPL 18.22 → 15.72 with only a small memory increase (12.58 → 13.09 GB) and a modest throughput decrease (1,099,699 → 1,020,316 tok/s). In contrast, sparse-compensation baselines show substantially lower throughput (SLTrain 640,132 tok/s; LOST 515,971 tok/s) and/or higher peak memory (14.55 GB and 19.33 GB), highlighting that DLR retains the efficiency profile of pure low-rank training while recovering much of the quality gap.
Table 3: System measurements on 1B models with `torch.compile` enabled. All runs use DDP on 8 nodes × 4 NVIDIA H100 SXM5 GPUs (94 GB per GPU), a local 7.68 TB NVMe drive, and 4× InfiniBand NDR200. We report validation PPL, total parameter count, max GPU memory after initialization (`max_memory`), and throughput (global tokens/s) after warmup.

| Method | PPL ↓ | Params (M) | Memory (GB) | Throughput (tok/s) |
|---|---|---|---|---|
| Full-Rank | 13.40 | 1339 | 17.45 | 700,697 |
| Low-Rank | 18.22 | 609 | 12.58 | 1,099,699 |
| Low-Rank + DLR | 15.72 | 609 | 13.09 | 1,020,316 |
| Low-Rank + LaX + DLR | 15.29 | 610 | 14.96 | 878,076 |
| CoLA | 15.76 | 609 | 13.10 | 1,079,350 |
| CoLA + DLR | 14.26 | 609 | 13.09 | 1,044,264 |
| CoLA + LaX | 14.78 | 610 | 15.58 | 846,626 |
| CoLA + LaX + DLR | 14.67 | 610 | 15.59 | 841,800 |
| SLTrain | 15.40 | 609 | 14.55 | 640,132 |
| LOST | 15.02 | 609 | 19.33 | 515,971 |
| LOST + DLR | 14.74 | 609 | 19.33 | 595,978 |
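To put "nearly unchanged" and "modest" in relative terms, the training-time overhead of DLR can be computed directly from the Tab. 3 entries. The snippet below only re-derives percentages from the reported numbers; the dictionary layout and helper name are illustrative.

```python
# (memory in GB, throughput in tok/s) copied from Tab. 3
measurements = {
    "Low-Rank":       (12.58, 1_099_699),
    "Low-Rank + DLR": (13.09, 1_020_316),
    "CoLA":           (13.10, 1_079_350),
    "CoLA + DLR":     (13.09, 1_044_264),
}

def relative_change(base, with_dlr):
    """Percentage change of with_dlr relative to base."""
    return 100.0 * (with_dlr - base) / base

overhead = {}
for base in ("Low-Rank", "CoLA"):
    mem0, tps0 = measurements[base]
    mem1, tps1 = measurements[base + " + DLR"]
    overhead[base] = (relative_change(mem0, mem1), relative_change(tps0, tps1))

for name, (d_mem, d_tps) in overhead.items():
    print(f"{name}: memory {d_mem:+.1f}%, throughput {d_tps:+.1f}%")
```

Under these figures, adding DLR costs roughly 3% of throughput on CoLA and roughly 7% on the linear low-rank backbone during training; none of this overhead remains after folding.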
### 4.3 Folded checkpoints transfer cleanly to supervised fine-tuning
A core appeal of foldability is that the training-only residual leaves no trace at deployment. We test this end-to-end by taking the 1B Full-Rank, CoLA, and DLR+CoLA pre-trained checkpoints, folding the DLR branch into the up-projection (Sec. [3](https://arxiv.org/html/2606.28932#S3)), running full-parameter SFT on Alpaca-cleaned, and evaluating six standard benchmarks with `lm-evaluation-harness` (Gao et al., [2024](https://arxiv.org/html/2606.28932#bib.bib42)) (details in Appendix [E](https://arxiv.org/html/2606.28932#A5)).
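The folding step can be illustrated with a small numerical check that the training-time output $Bz + \tfrac{\alpha}{\sqrt{K}}\mathrm{Expand}_K(z)$ equals the output of a plain low-rank layer using $B^\star = B + \tfrac{\alpha}{\sqrt{K}} R^\top$. This is a NumPy sketch under the assumption $d_{\mathrm{out}} = K r$; the function names and sizes are hypothetical and do not reflect the authors' implementation.

```python
import numpy as np

def dlr_forward(B, z, alpha=1.0):
    """Training-time output: B z + (alpha / sqrt(K)) * Expand_K(z)."""
    d_out, r = B.shape
    k = d_out // r                              # assumes d_out is a multiple of r
    return z @ B.T + (alpha / np.sqrt(k)) * np.repeat(z, k, axis=-1)

def fold(B, alpha=1.0):
    """Absorb the residual: B* = B + (alpha / sqrt(K)) * R^T."""
    d_out, r = B.shape
    k = d_out // r
    R = np.repeat(np.eye(r), k, axis=1)         # (r, d_out) duplication matrix
    return B + (alpha / np.sqrt(k)) * R.T

rng = np.random.default_rng(0)
r, k = 8, 4
B = rng.standard_normal((k * r, r))             # hypothetical up-projection
z = rng.standard_normal((5, r))                 # batch of latent vectors

y_train = dlr_forward(B, z)                     # with the residual branch
B_star = fold(B)
y_deploy = z @ B_star.T                         # plain low-rank layer, no branch
```

The folded matrix has the same shape as $B$, which is why the deployed parameter count and FLOPs match the base low-rank layer.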
Tab. [4](https://arxiv.org/html/2606.28932#S4.T4) reports mean ± std accuracy across three SFT seeds applied to the same fixed pre-trained checkpoint per method. DLR+CoLA attains a slightly higher average (39.81%, vs. 39.55% Full-Rank and 39.39% CoLA), narrows the HellaSwag gap to Full-Rank, and achieves the highest PIQA accuracy. This supports the *train with DLR → fold → deploy* workflow: the downstream gain appears after folding, while deployment keeps the same graph, parameter count, FLOPs, and memory footprint as vanilla CoLA.
Table 4: Downstream evaluation after Alpaca-cleaned SFT (1B models). Each cell reports zero-shot accuracy as mean ± standard deviation across three SFT seeds applied to the same fixed pre-trained checkpoint. DLR+CoLA achieves a slightly higher average while folding to the same deployed graph as vanilla CoLA; **bold** indicates the best per-task mean.

| Task | DLR+CoLA | CoLA | Full-Rank |
|---|---|---|---|
| ARC-Challenge | 0.2327 ± 0.0027 | **0.2361** ± 0.0010 | 0.2190 ± 0.0013 |
| BoolQ | 0.3783 ± 0.0000 | 0.3783 ± 0.0000 | 0.3783 ± 0.0000 |
| HellaSwag | 0.3519 ± 0.0011 | 0.3296 ± 0.0009 | **0.3564** ± 0.0001 |
| MMLU | 0.2295 ± 0.0000 | 0.2295 ± 0.0000 | 0.2295 ± 0.0000 |
| PIQA | **0.6750** ± 0.0022 | 0.6649 ± 0.0005 | 0.6674 ± 0.0027 |
| WinoGrande | 0.5212 ± 0.0012 | **0.5254** ± 0.0061 | 0.5222 ± 0.0020 |
| Average | **0.3981** ± 0.0004 | 0.3939 ± 0.0011 | 0.3955 ± 0.0006 |
## 5 Conclusion, Limitations and Outlook
We introduced DLR, a training-only, parameter-free residual plug-in for low-rank pre-training that is folded into the up-projection after training, leaving the deployed low-rank graph unchanged. Across LLaMA scales from 60M to 7B, DLR improves representative low-rank backbones in most matched-token-budget settings, retains the efficiency profile of the base decoder, and transfers cleanly after supervised fine-tuning. Together, these results support foldable latent residuals as a practical way to improve low-rank pre-training dynamics without adding inference-time complexity.
Limitations. DLR improves training dynamics but does not remove the approximation ceiling imposed by a fixed rank-$r$ bottleneck. Its scale is shape-based and fixed at $\alpha = 1$ in our experiments. We do not yet characterize how to retune it for architectures or rank regimes with substantially different latent statistics. Our evaluation focuses on LLaMA-style language models trained on C4 and instruction-tuned on Alpaca-cleaned, leaving other modalities, architectures, and tasks for future work.
Outlook. More broadly, any training-time latent intervention that remains linear in $z$ may admit a closed-form fold, suggesting a design space of inference-free training plug-ins beyond the particular contiguous duplication map studied here.
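For the additive case, this observation can be written out in one line. In the sketch below, $M \in \mathbb{R}^{d_{\mathrm{out}} \times r}$ is a generic fixed matrix introduced purely for illustration; it is not notation from the paper, and interventions that are linear in $z$ but enter the layer in some other way are not covered.

```latex
\begin{aligned}
y &= Bz + Mz = (B + M)\,z = B^{\star} z, & B^{\star} &= B + M, \\
\text{DLR:}\quad M &= \tfrac{\alpha}{\sqrt{K}}\,R^{\top}
  \;\Longrightarrow\; & B^{\star} &= B + \tfrac{\alpha}{\sqrt{K}}\,R^{\top}.
\end{aligned}
```

The second line recovers the fold stated in the abstract; any other fixed choice of $M$ would fold in the same way, leaving the deployed layer a plain rank-$r$ up-projection.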
## Acknowledgements
This work has been supported by the FFG COMET K1 Center “Pro2Future II” \(Cognitive and Sustainable Products and Production Systems of the Future\), Contract No\. 911655\. The results presented in this paper were computed using the computational resources of Pro2Future GmbH, the Central IT Services of Graz University of Technology \(ZID\), and the Austrian Scientific Computing \(ASC\) infrastructure\.
## References
- Fira: can we achieve full\-rank training of llms under low\-rank constraint?\.External Links:2410\.01623,[Link](https://arxiv.org/abs/2410.01623)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§1](https://arxiv.org/html/2606.28932#S1.p2.2),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.32.28.35.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.4.2.2)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§4\.3](https://arxiv.org/html/2606.28932#S4.SS3.p1.1)\.
- A\. Glentis, J\. Li, Q\. Shang, A\. Han, I\. Tsaknakis, Q\. Wei, and M\. Hong \(2025\)Scalable parameter and memory efficient pretraining for llm: recent algorithmic advances and benchmarking\.External Links:2505\.22922,[Link](https://arxiv.org/abs/2505.22922)Cited by:[Appendix A](https://arxiv.org/html/2606.28932#A1.p2.1),[Appendix E](https://arxiv.org/html/2606.28932#A5.p1.2),[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[Appendix H](https://arxiv.org/html/2606.28932#A8.SS0.SSS0.Px2.p1.1),[Appendix I](https://arxiv.org/html/2606.28932#A9.p1.1),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p6.1),[§4\.2](https://arxiv.org/html/2606.28932#S4.SS2.p10.2),[Table 1](https://arxiv.org/html/2606.28932#S4.T1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.4.2.2),[Table 2](https://arxiv.org/html/2606.28932#S4.T2),[Table 2](https://arxiv.org/html/2606.28932#S4.T2.5.2.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§1](https://arxiv.org/html/2606.28932#S1.p1.1)\.
- A\. Han, J\. Li, W\. Huang, M\. Hong, A\. Takeda, P\. Jawanpuria, and B\. Mishra \(2024\)SLTrain: a sparse plus low\-rank approach for parameter and memory efficient pretraining\.External Links:2406\.02214,[Link](https://arxiv.org/abs/2406.02214)Cited by:[Appendix A](https://arxiv.org/html/2606.28932#A1.p2.1),[Appendix E](https://arxiv.org/html/2606.28932#A5.p1.2),[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§1](https://arxiv.org/html/2606.28932#S1.p2.2),[§2\.2](https://arxiv.org/html/2606.28932#S2.SS2.p2.3.1),[§3](https://arxiv.org/html/2606.28932#S3.p3.17),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2606.28932#S4.SS2.p10.2),[Table 1](https://arxiv.org/html/2606.28932#S4.T1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.32.28.36.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.4.2.2),[Table 2](https://arxiv.org/html/2606.28932#S4.T2),[Table 2](https://arxiv.org/html/2606.28932#S4.T2.5.2.1)\.
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2015\)Deep residual learning for image recognition\.External Links:1512\.03385,[Link](https://arxiv.org/abs/1512.03385)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p2.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\)LoRA: low\-rank adaptation of large language models\.External Links:2106\.09685,[Link](https://arxiv.org/abs/2106.09685)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§2\.2](https://arxiv.org/html/2606.28932#S2.SS2.p1.2.1),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.32.28.33.1)\.
- Y\. Hu, K\. Zhao, W\. Huang, J\. Chen, and J\. Zhu \(2024\)Accelerating transformer pre\-training with 2:4 sparsity\.External Links:2404\.01847,[Link](https://arxiv.org/abs/2404.01847)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5)\.
- K\. Jordan, H\. Sedghi, O\. Saukh, R\. Entezari, and B\. Neyshabur \(2022\)Repair: renormalizing permuted activations for interpolation repair\.arXiv preprint arXiv:2211\.08403\.Cited by:[§3](https://arxiv.org/html/2606.28932#S3.p4.10)\.
- S\. R\. Kamalakara, A\. Locatelli, B\. Venkitesh, J\. Ba, Y\. Gal, and A\. N\. Gomez \(2022\)Exploring low rank training of deep neural networks\.arXiv preprint arXiv:2209\.13569\.Cited by:[§2\.2](https://arxiv.org/html/2606.28932#S2.SS2.p1.5)\.
- M\. Khodak, N\. Tenenholtz, L\. Mackey, and N\. Fusi \(2021\)Initialization and regularization of factorized neural layers\.arXiv preprint arXiv:2105\.01029\.Cited by:[§2\.2](https://arxiv.org/html/2606.28932#S2.SS2.p1.5)\.
- J\. Li, L\. Yin, L\. Shen, J\. Xu, L\. Xu, T\. Huang, W\. Wang, S\. Liu, and X\. Wang \(2025\)LOST: low\-rank and sparse pre\-training for large language models\.External Links:2508\.02668,[Link](https://arxiv.org/abs/2508.02668)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§1](https://arxiv.org/html/2606.28932#S1.p2.2),[§1](https://arxiv.org/html/2606.28932#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.28932#S2.SS2.p3.10.1),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p3.1),[§4\.2](https://arxiv.org/html/2606.28932#S4.SS2.p10.2),[Table 1](https://arxiv.org/html/2606.28932#S4.T1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.32.28.40.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.4.2.2),[Table 2](https://arxiv.org/html/2606.28932#S4.T2),[Table 2](https://arxiv.org/html/2606.28932#S4.T2.5.2.1)\.
- V\. Lialin, N\. Shivagunde, S\. Muckatira, and A\. Rumshisky \(2023\)Relora: high\-rank training through low\-rank updates\.arXiv preprint arXiv:2307\.05695\.Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§1](https://arxiv.org/html/2606.28932#S1.p2.2)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§1](https://arxiv.org/html/2606.28932#S1.p1.1)\.
- Z\. Liu, R\. Zhang, Z\. Wang, Z\. Yang, P\. Hovland, B\. Nicolae, F\. Cappello, and Z\. Zhang \(2025\)CoLA: compute\-efficient pre\-training of llms via low\-rank activation\.External Links:2502\.10940,[Link](https://arxiv.org/abs/2502.10940)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§1](https://arxiv.org/html/2606.28932#S1.p2.2),[§1](https://arxiv.org/html/2606.28932#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.28932#S2.SS2.p4.1.1),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.32.28.39.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.4.2.2)\.
- R\. Miles, P\. Reddy, I\. Elezi, and J\. Deng \(2024\)VeLoRA: memory efficient training using rank\-1 sub\-token projections\.External Links:2405\.17991,[Link](https://arxiv.org/abs/2405.17991)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5)\.
- Z\. Mo, L\. Huang, and S\. J\. Pan \(2025\)Parameter and memory efficient pretraining via low\-rank riemannian optimization\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§1](https://arxiv.org/html/2606.28932#S1.p2.2),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.32.28.37.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.4.2.2)\.
- M\. Mozaffari, A\. Yazdanbakhsh, Z\. Zhang, and M\. M\. Dehnavi \(2025\)SLoPe: double\-pruned sparse plus lazy low\-rank adapter pretraining of llms\.External Links:2405\.16325,[Link](https://arxiv.org/abs/2405.16325)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5)\.
- L\. Nguyen, A\. Quélennec, E\. Tartaglione, S\. Tardieu, and V\. Nguyen \(2024\)Activation map compression through tensor decomposition for deep learning\.External Links:2411\.06346,[Link](https://arxiv.org/abs/2411.06346)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5)\.
- OpenAI, J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, R\. Avila, I\. Babuschkin, S\. Balaji, V\. Balcom, P\. Baltescu, H\. Bao, M\. Bavarian, J\. Belgum, I\. Bello, J\. Berdine, G\. Bernadett\-Shapiro, C\. Berner, L\. Bogdonoff, O\. Boiko, M\. Boyd, A\. Brakman, G\. Brockman, T\. Brooks, M\. Brundage, K\. Button, T\. Cai, R\. Campbell, A\. Cann, B\. Carey, C\. Carlson, R\. Carmichael, B\. Chan, C\. Chang, F\. Chantzis, D\. Chen, S\. Chen, R\. Chen, J\. Chen, M\. Chen, B\. Chess, C\. Cho, C\. Chu, H\. W\. Chung, D\. Cummings, J\. Currier, Y\. Dai, C\. Decareaux, T\. Degry, N\. Deutsch, D\. Deville, A\. Dhar, D\. Dohan, S\. Dowling, S\. Dunning, A\. Ecoffet, A\. Eleti, T\. Eloundou, D\. Farhi, L\. Fedus, N\. Felix, S\. P\. Fishman, J\. Forte, I\. Fulford, L\. Gao, E\. Georges, C\. Gibson, V\. Goel, T\. Gogineni, G\. Goh, R\. Gontijo\-Lopes, J\. Gordon, M\. Grafstein, S\. Gray, R\. Greene, J\. Gross, S\. S\. Gu, Y\. Guo, C\. Hallacy, J\. Han, J\. Harris, Y\. He, M\. Heaton, J\. Heidecke, C\. Hesse, A\. Hickey, W\. Hickey, P\. Hoeschele, B\. Houghton, K\. Hsu, S\. Hu, X\. Hu, J\. Huizinga, S\. Jain, S\. Jain, J\. Jang, A\. Jiang, R\. Jiang, H\. Jin, D\. Jin, S\. Jomoto, B\. Jonn, H\. Jun, T\. Kaftan, Ł\. Kaiser, A\. Kamali, I\. Kanitscheider, N\. S\. Keskar, T\. Khan, L\. Kilpatrick, J\. W\. Kim, C\. Kim, Y\. Kim, J\. H\. Kirchner, J\. Kiros, M\. Knight, D\. Kokotajlo, Ł\. Kondraciuk, A\. Kondrich, A\. Konstantinidis, K\. Kosic, G\. Krueger, V\. Kuo, M\. Lampe, I\. Lan, T\. Lee, J\. Leike, J\. Leung, D\. Levy, C\. M\. Li, R\. Lim, M\. Lin, S\. Lin, M\. Litwin, T\. Lopez, R\. Lowe, P\. Lue, A\. Makanju, K\. Malfacini, S\. Manning, T\. Markov, Y\. Markovski, B\. Martin, K\. Mayer, A\. Mayne, B\. McGrew, S\. M\. McKinney, C\. McLeavey, P\. McMillan, J\. McNeil, D\. Medina, A\. Mehta, J\. Menick, L\. Metz, A\. Mishchenko, P\. Mishkin, V\. Monaco, E\. Morikawa, D\. Mossing, T\. Mu, M\. 
Murati, O\. Murk, D\. Mély, A\. Nair, R\. Nakano, R\. Nayak, A\. Neelakantan, R\. Ngo, H\. Noh, L\. Ouyang, C\. O’Keefe, J\. Pachocki, A\. Paino, J\. Palermo, A\. Pantuliano, G\. Parascandolo, J\. Parish, E\. Parparita, A\. Passos, M\. Pavlov, A\. Peng, A\. Perelman, F\. de Avila Belbute Peres, M\. Petrov, H\. P\. de Oliveira Pinto, Michael, Pokorny, M\. Pokrass, V\. H\. Pong, T\. Powell, A\. Power, B\. Power, E\. Proehl, R\. Puri, A\. Radford, J\. Rae, A\. Ramesh, C\. Raymond, F\. Real, K\. Rimbach, C\. Ross, B\. Rotsted, H\. Roussez, N\. Ryder, M\. Saltarelli, T\. Sanders, S\. Santurkar, G\. Sastry, H\. Schmidt, D\. Schnurr, J\. Schulman, D\. Selsam, K\. Sheppard, T\. Sherbakov, J\. Shieh, S\. Shoker, P\. Shyam, S\. Sidor, E\. Sigler, M\. Simens, J\. Sitkin, K\. Slama, I\. Sohl, B\. Sokolowsky, Y\. Song, N\. Staudacher, F\. P\. Such, N\. Summers, I\. Sutskever, J\. Tang, N\. Tezak, M\. B\. Thompson, P\. Tillet, A\. Tootoonchian, E\. Tseng, P\. Tuggle, N\. Turley, J\. Tworek, J\. F\. C\. Uribe, A\. Vallone, A\. Vijayvergiya, C\. Voss, C\. Wainwright, J\. J\. Wang, A\. Wang, B\. Wang, J\. Ward, J\. Wei, C\. Weinmann, A\. Welihinda, P\. Welinder, J\. Weng, L\. Weng, M\. Wiethoff, D\. Willner, C\. Winter, S\. Wolrich, H\. Wong, L\. Workman, S\. Wu, J\. Wu, M\. Wu, K\. Xiao, T\. Xu, S\. Yoo, K\. Yu, Q\. Yuan, W\. Zaremba, R\. Zellers, C\. Zhang, M\. Zhang, S\. Zhao, T\. Zheng, J\. Zhuang, W\. Zhuk, and B\. Zoph \(2024\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[§1](https://arxiv.org/html/2606.28932#S1.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2023\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.External Links:1910\.10683,[Link](https://arxiv.org/abs/1910.10683)Cited by:[Appendix A](https://arxiv.org/html/2606.28932#A1.p2.1),[Appendix H](https://arxiv.org/html/2606.28932#A8.SS0.SSS0.Px2.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[Appendix H](https://arxiv.org/html/2606.28932#A8.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p1.1)\.
- O\. Saukh, D\. Wang, H\. Šikić, Y\. Cheng, and L\. Thiele \(2026\)Cut less, fold more: model compression through the lens of projection geometry\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=JV9CEtKLQF)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§3](https://arxiv.org/html/2606.28932#S3.p4.10)\.
- Y\. Shamshoum, N\. Hodos, Y\. Sieradzki, and A\. Schuster \(2025\)CompAct: compressed activations for memory\-efficient llm training\.External Links:2410\.15352,[Link](https://arxiv.org/abs/2410.15352)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, A\. Rodriguez, A\. Joulin, E\. Grave, and G\. Lample \(2023a\)LLaMA: open and efficient foundation language models\.External Links:2302\.13971,[Link](https://arxiv.org/abs/2302.13971)Cited by:[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p2.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, D\. Bikel, L\. Blecher, C\. C\. Ferrer, M\. Chen, G\. Cucurull, D\. Esiobu, J\. Fernandes, J\. Fu, W\. Fu, B\. Fuller, C\. Gao, V\. Goswami, N\. Goyal, A\. Hartshorn, S\. Hosseini, R\. Hou, H\. Inan, M\. Kardas, V\. Kerkez, M\. Khabsa, I\. Kloumann, A\. Korenev, P\. S\. Koura, M\. Lachaux, T\. Lavril, J\. Lee, D\. Liskovich, Y\. Lu, Y\. Mao, X\. Martinet, T\. Mihaylov, P\. Mishra, I\. Molybog, Y\. Nie, A\. Poulton, J\. Reizenstein, R\. Rungta, K\. Saladi, A\. Schelten, R\. Silva, E\. M\. Smith, R\. Subramanian, X\. E\. Tan, B\. Tang, R\. Taylor, A\. Williams, J\. X\. Kuan, P\. Xu, Z\. Yan, I\. Zarov, Y\. Zhang, A\. Fan, M\. Kambadur, S\. Narang, A\. Rodriguez, R\. Stojnic, S\. Edunov, and T\. Scialom \(2023b\)Llama 2: open foundation and fine\-tuned chat models\.External Links:2307\.09288,[Link](https://arxiv.org/abs/2307.09288)Cited by:[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p2.1)\.
- D\. Wang, H\. Šikić, L\. Thiele, and O\. Saukh \(2025a\) Forget the data and fine\-tuning\! Just fold the network to compress\.
- Z\. Wang, J\. Liang, R\. He, Z\. Wang, and T\. Tan \(2025b\)LoRA\-pro: are low\-rank adapters properly optimized?\.External Links:2407\.18242,[Link](https://arxiv.org/abs/2407.18242)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§3](https://arxiv.org/html/2606.28932#S3.p3.17)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§1](https://arxiv.org/html/2606.28932#S1.p1.1)\.
- R\. Zhang, Z\. Liu, Z\. Wang, and Z\. Zhang \(2025\)LaX: boosting low\-rank training of foundation models via latent crossing\.External Links:2505\.21732,[Link](https://arxiv.org/abs/2505.21732)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§1](https://arxiv.org/html/2606.28932#S1.p2.2),[§1](https://arxiv.org/html/2606.28932#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.28932#S2.SS2.p5.1.1),[§3](https://arxiv.org/html/2606.28932#S3.p3.17),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.21.17.17.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.27.23.23.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.4.2.2)\.
- Z\. Zhang, A\. Jaiswal, L\. Yin, S\. Liu, J\. Zhao, Y\. Tian, and Z\. Wang \(2024\)Q\-galore: quantized galore with int4 projection and layer\-adaptive low\-rank gradients\.arXiv preprint arXiv:2407\.08296\.Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5)\.
- J\. Zhao, X\. Yu, Y\. Zhang, and Z\. Yang \(2025\)LoR2C : low\-rank residual connection adaptation for parameter\-efficient fine\-tuning\.External Links:2503\.00572,[Link](https://arxiv.org/abs/2503.00572)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§3](https://arxiv.org/html/2606.28932#S3.p3.17)\.
- J\. Zhao, Z\. Zhang, B\. Chen, Z\. Wang, A\. Anandkumar, and Y\. Tian \(2024\)GaLore: memory\-efficient llm training by gradient low\-rank projection\.External Links:2403\.03507,[Link](https://arxiv.org/abs/2403.03507)Cited by:[Appendix A](https://arxiv.org/html/2606.28932#A1.p2.1),[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§1](https://arxiv.org/html/2606.28932#S1.p2.2),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p3.1),[§4\.1](https://arxiv.org/html/2606.28932#S4.SS1.p6.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.32.28.34.1),[Table 1](https://arxiv.org/html/2606.28932#S4.T1.4.2.2)\.
- K\. Zhou, S\. Wang, and J\. Xu \(2025\)SwitchLoRA: switched low\-rank adaptation can learn full\-rank information\.External Links:2406\.06564,[Link](https://arxiv.org/abs/2406.06564)Cited by:[Appendix G](https://arxiv.org/html/2606.28932#A7.p1.5),[§1](https://arxiv.org/html/2606.28932#S1.p2.2)\.
## Appendix
The following sections provide supplementary information omitted from the main text:
- Appendix [H](https://arxiv.org/html/2606.28932#A8): Impact statement and existing assets.
- Appendix [A](https://arxiv.org/html/2606.28932#A1): Implementation details.
- Appendix [B](https://arxiv.org/html/2606.28932#A2): Additional ablations and long-horizon results.
- Appendix [C](https://arxiv.org/html/2606.28932#A3): Gradient-path analysis results.
- Appendix [D](https://arxiv.org/html/2606.28932#A4): The impact of DLR on training dynamics.
- Appendix [E](https://arxiv.org/html/2606.28932#A5): Supervised fine-tuning details.
- Appendix [F](https://arxiv.org/html/2606.28932#A6): Zero-shot evaluation.
- Appendix [G](https://arxiv.org/html/2606.28932#A7): Related work.
- Appendix [I](https://arxiv.org/html/2606.28932#A9): Use of Large Language Models.
## Appendix A Implementation details
We trained over 100 models to evaluate DLR. Experiments ran on an NVIDIA H100 SLURM cluster in which each node has 4 NVIDIA H100 GPUs (94 GB memory each), a local 7.68 TB NVMe drive, and 4× InfiniBand NDR200 adapters. Training time ranges from 20 minutes (LLaMA 60M) to 5 hours (LLaMA 1B), depending on model size. We load datasets from the Hugging Face Hub (https://huggingface.co/docs/hub/index) and log training history, training results, and evaluation metrics with Weights & Biases (W&B, https://wandb.ai). The source code is available at [https://github.com/nanguoyu/DLR](https://github.com/nanguoyu/DLR).
This section outlines the LLaMA architectures and the hyperparameters used during pre-training. To ensure a fair comparison, we follow the experimental settings of Zhao et al. ([2024](https://arxiv.org/html/2606.28932#bib.bib10)) and Glentis et al. ([2025](https://arxiv.org/html/2606.28932#bib.bib14)). Tab. [6](https://arxiv.org/html/2606.28932#A1.T6) summarizes the hyperparameters for different model scales. Across all architectures, we adopt a maximum sequence length of 256 and a batch size of 131,072 tokens. The learning rate is warmed up linearly during the first 10% of training steps and then follows a cosine annealing schedule that decays to 10% of the peak learning rate. We use the T5-base tokenizer (Raffel et al., [2023](https://arxiv.org/html/2606.28932#bib.bib31)), consistent with prior work (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13); Glentis et al., [2025](https://arxiv.org/html/2606.28932#bib.bib14)).
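The warm-up and cosine schedule described above can be sketched as follows. This is a minimal illustration rather than the released training code: the helper name `lr_at_step`, the warm-up starting from zero, and the cosine spanning exactly the remaining steps are assumptions; only the 10% warm-up fraction and the 10% floor come from the text.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10, min_ratio=0.10):
    """Hypothetical helper: linear warm-up, then cosine decay to min_ratio * peak_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Assumed: warm-up ramps linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warm-up phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


# 60M setting: 11,000 steps with peak learning rate 0.01.
for s in (0, 1100, 6050, 11000):
    print(s, round(lr_at_step(s, 11000, 0.01), 6))
```

Under these assumptions the learning rate reaches its peak at step 1,100 and ends at one tenth of the peak at step 11,000.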
Table 5: Pre-training hyperparameters for LLaMA architectures.

| Params | Hidden | Intermediate | Heads | Layers | Steps | Tokens (B) |
|---|---|---|---|---|---|---|
| 60M | 512 | 1376 | 8 | 8 | 11K | 1.4 |
| 130M | 768 | 2048 | 12 | 12 | 22K | 2.6 |
| 350M | 1024 | 2736 | 16 | 24 | 65K | 7.8 |
| 1B | 2048 | 5461 | 32 | 24 | 140K | 13.1 |

Table 6: Training hyperparameters used in our DLR experiments (C4 pre-training). Common settings across these runs: C4 dataset; sequence length 256; token batch size $512 \times 256 = 131{,}072$ tokens; tokenizer T5-base; optimizer AdamW with $(\beta_1, \beta_2) = (0.9, 0.999)$, weight decay $0.1$, and gradient clipping $0.5$; cosine learning-rate schedule with minimum LR ratio $0.1$; evaluation every 1,000 update steps; dtype bfloat16; seeds $\{41, 42, 43, 44, 45\}$.

| Model | World Size | Batch | Total Batch | $r$ | LR | Steps | Warmup | AdamW $\epsilon$ |
|---|---|---|---|---|---|---|---|---|
| 60M | 8 | 64 | 512 | 128 | 0.01 | 11,000 | 1,100 | $10^{-8}$ |
| 130M | 16 | 32 | 512 | 256 | 0.005 | 22,000 | 2,200 | $10^{-6}$ |
| 350M | 16 | 32 | 512 | 256 | 0.003 | 65,000 | 6,500 | $10^{-6}$ |
| 1B | 32 | 16 | 512 | 512 | 0.002 | 140,000 | 10,000 | $10^{-6}$ |

### Reference implementation of DLR
To make the definition of the expansion operator $\mathrm{Expand}_K(\cdot)$ concrete, we provide a short Python-style reference implementation of a DLR layer below. The paper writes the up-projection as $B \in \mathbb{R}^{d_{\mathrm{out}} \times r}$ and computes $Bz$; the code below uses the common PyTorch row-major convention $B_{\mathrm{code}} \in \mathbb{R}^{r \times d_{\mathrm{out}}}$ and computes `z @ B`, so $B_{\mathrm{code}}$ is the transpose of the mathematical $B$. The key operation is to replicate each latent coordinate $K = \lceil d_{\mathrm{out}}/r \rceil$ times along the last dimension and truncate to $d_{\mathrm{out}}$, matching Eq. [10](https://arxiv.org/html/2606.28932#S3.E10)–Eq. [11](https://arxiv.org/html/2606.28932#S3.E11).
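Listing 1: Python-style reference code for the DLR layer (Expand via repeat+truncate).

```python
import math

import torch


def expand_k(z, d_out):
    """Replicate each latent coordinate K times along the last dim, then truncate to d_out."""
    r = z.shape[-1]
    K = math.ceil(d_out / r)
    u = torch.repeat_interleave(z, K, dim=-1)
    return u[..., :d_out]


def dlr_layer(x, A, B, alpha, bias=None, act=None):
    # B uses the code layout (r x d_out); x @ A yields the rank-r latent z.
    z = x @ A
    if act is not None:
        z = act(z)
    y_lr = z @ B                            # standard low-rank output
    d_out = B.shape[-1]
    u = expand_k(z, d_out)                  # duplicated latent residual
    K = math.ceil(d_out / z.shape[-1])
    y = y_lr + (alpha / math.sqrt(K)) * u   # residual scaled by alpha / sqrt(K)
    if bias is not None:
        y = y + bias
    return y
```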
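As a small worked example of the repeat-and-truncate step, the sketch below mirrors the expansion on a plain Python list, so it runs without PyTorch. The helper name `expand_k_list` and the toy sizes ($r = 3$, $d_{\mathrm{out}} = 7$) are chosen only for illustration.

```python
import math


def expand_k_list(z, d_out):
    """Pure-Python stand-in for the expansion on a single latent vector (a list)."""
    r = len(z)
    K = math.ceil(d_out / r)
    # Repeat each coordinate K times, then truncate to d_out entries.
    u = [z_j for z_j in z for _ in range(K)]
    return u[:d_out], K


z = [1.0, 2.0, 3.0]                 # r = 3
u, K = expand_k_list(z, d_out=7)
print(K)   # 3
print(u)   # [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0]
```

Because $d_{\mathrm{out}} = 7$ is not a multiple of $r = 3$, the last latent coordinate keeps only one of its three copies after truncation.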
#### Reference implementation of the fold operation\.
Folding a trained DLR layer reduces to a single in-place update of $B_{\mathrm{code}}$ that absorbs the structured residual; equivalently, in the paper notation, it applies $B \leftarrow B + (\alpha/\sqrt{K})\,R^{\top}$ in Eq. [12](https://arxiv.org/html/2606.28932#S3.E12). The resulting checkpoint is a drop-in replacement for the underlying rank-$r$ decoder. The operation is exact (no learning, no approximation) and runs once at training termination.
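Listing 2: Python-style reference code for `fold()`: absorb the DLR residual into $B$ in closed form, see Eq. [12](https://arxiv.org/html/2606.28932#S3.E12).

```python
import math

import torch


@torch.no_grad()  # allows the in-place update when B is a parameter that requires grad
def fold_(B, alpha):
    """In-place fold: add alpha / sqrt(K) to row j of B (code layout, r x d_out) on group j."""
    r, d_out = B.shape
    K = math.ceil(d_out / r)
    scale = alpha / math.sqrt(K)
    for j in range(r):
        lo = j * K
        hi = min((j + 1) * K, d_out)
        B[j, lo:hi].add_(scale)  # empty slice (no-op) once lo >= d_out
    return B
```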
The loop is the transpose-layout version of the paper update $B_{\mathrm{math}} \mathrel{+}= (\alpha/\sqrt{K})\,R^{\top}$ in Eq. [12](https://arxiv.org/html/2606.28932#S3.E12). Equivalently, for the code layout $B_{\mathrm{code}} = B_{\mathrm{math}}^{\top}$, row $j$ of `B` receives the constant $\alpha/\sqrt{K}$ on the output coordinates in group $\mathcal{G}_j$. It runs in $\mathcal{O}(d_{\mathrm{out}})$ time and $\mathcal{O}(1)$ extra memory beyond the in-place buffer of $B^{\star}$. After folding, the original $\alpha$, $K$, the $\mathrm{Expand}_K$ branch, and any buffer storing $R$ are discarded; what remains is a vanilla rank-$r$ low-rank layer parameterized by $(A, B^{\star})$.
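The fold identity can be checked on a toy example: the training-time output (low-rank term plus scaled duplicated residual) should equal the output of a plain low-rank layer that uses the folded matrix. The sketch below does this with nested Python lists in the code layout ($r \times d_{\mathrm{out}}$). All helper names and toy values are hypothetical, and it returns a folded copy rather than updating in place.

```python
import math


def matvec_code(z, B):
    """Compute z @ B for a latent vector z (length r) and B in code layout (r x d_out)."""
    r, d_out = len(B), len(B[0])
    return [sum(z[j] * B[j][i] for j in range(r)) for i in range(d_out)]


def dlr_output(z, B, alpha):
    """Training-time path: low-rank output plus the scaled duplicated residual."""
    r, d_out = len(B), len(B[0])
    K = math.ceil(d_out / r)
    scale = alpha / math.sqrt(K)
    y_lr = matvec_code(z, B)
    # Output coordinate i receives a copy of latent coordinate i // K.
    return [y_lr[i] + scale * z[i // K] for i in range(d_out)]


def fold(B, alpha):
    """Return a folded copy of B: row j gains alpha / sqrt(K) on its output group."""
    r, d_out = len(B), len(B[0])
    K = math.ceil(d_out / r)
    scale = alpha / math.sqrt(K)
    B_star = [row[:] for row in B]
    for j in range(r):
        for i in range(j * K, min((j + 1) * K, d_out)):
            B_star[j][i] += scale
    return B_star


# Toy sizes chosen so that d_out is not a multiple of r (exercises truncation).
z = [0.5, -1.0, 2.0]
B = [[0.1 * (j + 1) * (i - 3) for i in range(7)] for j in range(3)]
alpha = 0.7

y_train = dlr_output(z, B, alpha)
y_folded = matvec_code(z, fold(B, alpha))
max_diff = max(abs(a - b) for a, b in zip(y_train, y_folded))
print(max_diff)
```

In exact arithmetic the two outputs coincide; in floating point they differ only by roundoff.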
#### Fold consistency on a trained checkpoint\.
We verify the closed-form fold operation on one trained 1B DLR+CoLA checkpoint used for this fold-consistency diagnostic. The checkpoint loads without missing or unexpected keys, and folding updates 168 DLR layers while leaving the parameter count unchanged (609.31M before and after folding). Under BF16 evaluation on C4 validation with the same 10M-token budget and sequence length 256 as our training-time evaluation, perplexity changes from 14.6295 before folding to 14.6301 after folding ($\Delta\mathrm{PPL} = +0.0006$). The folded and unfolded logits match within mean absolute error $1.91 \times 10^{-2}$ and safe mean relative error $1.44 \times 10^{-2}$, passing our BF16 tolerance. Folding also removes the training-only residual branch in our single-GPU diagnostic, reducing median forward latency from 29.59 ms to 21.18 ms with no change in peak memory. These results confirm that the fold identity in Eq. [12](https://arxiv.org/html/2606.28932#S3.E12) holds up to floating-point roundoff in trained checkpoints.
## Appendix B Additional ablations and long-horizon results
### B.1 Variance-preserving correction ablation
We ablate the variance/scale correction of the duplicated residual by training DLR+CoLA with the default scaling $\alpha/\sqrt{K}$ versus an uncorrected variant (no $1/\sqrt{K}$), across model scales. Tab. [7](https://arxiv.org/html/2606.28932#A2.T7) reports validation perplexity: the correction lowers perplexity at every scale, by 0.34 to 0.84 PPL, supporting the variance-preserving design used in Eq. [11](https://arxiv.org/html/2606.28932#S3.E11).
Table 7: Variance-preserving correction ablation (CoLA backbone). Validation perplexity across model scales for DLR+CoLA with vs. without the $1/\sqrt{K}$ correction; mean ± std across seeds $\{41, 42, 43, 44, 45\}$.

| DLR + CoLA | 60M | 130M | 350M | 1B |
|---|---|---|---|---|
| w/ correction | 32.96 ± 0.06 | 23.80 ± 0.40 | 18.38 ± 0.03 | 14.26 ± 0.02 |
| w/o correction | 33.80 ± 0.14 | 24.46 ± 0.05 | 19.02 ± 0.03 | 14.60 ± 0.02 |
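One way to read the $1/\sqrt{K}$ factor: when $d_{\mathrm{out}} = rK$ exactly, replicating every coordinate $K$ times multiplies the squared norm of $z$ by $K$, so dividing by $\sqrt{K}$ restores it. The sketch below checks this identity numerically. It illustrates that norm argument only, not the derivation behind Eq. 11, and the helper names and toy sizes are hypothetical.

```python
import math
import random


def expand(z, K):
    """Repeat each coordinate of z exactly K times (no truncation, d_out = r * K)."""
    return [z_j for z_j in z for _ in range(K)]


def sq_norm(v):
    return sum(x * x for x in v)


rng = random.Random(0)
r, K = 8, 4
z = [rng.gauss(0.0, 1.0) for _ in range(r)]

u_raw = expand(z, K)                            # uncorrected residual
u_scaled = [x / math.sqrt(K) for x in u_raw]    # with the 1/sqrt(K) correction

ratio_raw = sq_norm(u_raw) / sq_norm(z)         # approximately K
ratio_scaled = sq_norm(u_scaled) / sq_norm(z)   # approximately 1
print(ratio_raw, ratio_scaled)
```

Without the correction, the residual's energy grows linearly with $K$, which is larger for smaller ranks and wider outputs.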
### B.2 Initialization-offset ablation
From the folded perspective, DLR adds a structured block-constant offset to the learned decoder $B$. To separate this static initialization effect from the active residual path during training, we compare full DLR+CoLA with an *Init-offset CoLA* variant: at initialization, we fold the same $(\alpha/\sqrt{K})\,R^{\top}$ offset into $B$, then disable the DLR branch and train the resulting CoLA model normally. Tab. [8](https://arxiv.org/html/2606.28932#A2.T8) shows that the initialization offset is a useful structured prior, improving over CoLA, but it does not match full DLR training. The remaining gap indicates that the active training-time residual path provides additional optimization benefit beyond a static folded initialization of $B$.
Table 8: Initialization-offset ablation at LLaMA-1B. Validation perplexity on C4. Init-offset CoLA folds the DLR offset into $B$ at initialization and then trains without the DLR branch.

| Method | PPL ↓ |
|---|---|
| CoLA | 15.76 |
| Init-offset CoLA | 14.79 ± 0.04 |
| DLR+CoLA | 14.26 ± 0.02 |
### B.3 Target-rank sensitivity
To probe sensitivity to the target rank, we train DLR+CoLA models with reduced ranks $r \in \{0.5\,r_0,\ 0.75\,r_0\}$, where $r_0$ is the default rank used throughout the paper (Tab. [6](https://arxiv.org/html/2606.28932#A1.T6)). All settings follow the main recipe and use five random seeds $\{41, 42, 43, 44, 45\}$. Tab. [9](https://arxiv.org/html/2606.28932#A2.T9) shows a smooth, monotonic trade-off: decreasing $r$ degrades perplexity gradually rather than catastrophically. For instance, at 1B scale, reducing the rank from $r_0 = 512$ to $384$ ($0.75\,r_0$) increases PPL from 14.26 to 14.77, and further to 15.67 at $256$ ($0.5\,r_0$). Similar trends hold at 350M (18.38 → 19.18 → 20.59) and 130M (23.80 → 24.57 → 26.18), indicating that DLR remains effective under tighter rank budgets.
Table 9: Rank sensitivity for DLR+CoLA. Target rank $r$ is varied as a fraction of the default rank $r_0$ at each scale. We report mean ± std validation perplexity over seeds $\{41, 42, 43, 44, 45\}$ on C4.

| Model params | Rank ratio | $r$ | PPL ↓ |
|---|---|---|---|
| 60M | 0.50 | 64 | 37.26 ± 0.12 |
| 60M | 0.75 | 96 | 34.98 ± 0.08 |
| 60M | 1.00 | 128 | 32.96 ± 0.06 |
| 130M | 0.50 | 128 | 26.18 ± 0.11 |
| 130M | 0.75 | 192 | 24.57 ± 0.22 |
| 130M | 1.00 | 256 | 23.80 ± 0.40 |
| 350M | 0.50 | 128 | 20.59 ± 0.01 |
| 350M | 0.75 | 192 | 19.18 ± 0.03 |
| 350M | 1.00 | 256 | 18.38 ± 0.03 |
| 1B | 0.50 | 256 | 15.67 ± 0.02 |
| 1B | 0.75 | 384 | 14.77 ± 0.03 |
| 1B | 1.00 | 512 | 14.26 ± 0.02 |
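Lowering $r$ also changes the duplication factor $K = \lceil d_{\mathrm{out}}/r \rceil$. The sketch below tabulates $K$ for the ranks in Table 9, using the hidden and intermediate widths from Table 5. It is derived arithmetic, not a result reported in the paper, and it assumes a standard LLaMA block in which a projection's output width is either the hidden or the intermediate size; the appendix does not list per-projection shapes.

```python
import math

# Hidden and intermediate sizes from Table 5; default ranks r0 from Table 6.
configs = {
    "60M": {"hidden": 512, "intermediate": 1376, "r0": 128},
    "130M": {"hidden": 768, "intermediate": 2048, "r0": 256},
    "350M": {"hidden": 1024, "intermediate": 2736, "r0": 256},
    "1B": {"hidden": 2048, "intermediate": 5461, "r0": 512},
}


def duplication_factor(d_out, r):
    """K = ceil(d_out / r), the number of copies of each latent coordinate."""
    return math.ceil(d_out / r)


rows = []
for name, cfg in configs.items():
    for ratio in (0.5, 0.75, 1.0):
        r = int(cfg["r0"] * ratio)
        k_hidden = duplication_factor(cfg["hidden"], r)
        k_inter = duplication_factor(cfg["intermediate"], r)
        rows.append((name, ratio, r, k_hidden, k_inter))
        print(f"{name:>4}  ratio={ratio:<4}  r={r:<3}  "
              f"K(hidden)={k_hidden:<2}  K(intermediate)={k_inter}")
```

For the 1B model at the default rank, for example, this gives $K = 4$ for a hidden-width output and $K = 11$ for an intermediate-width output. Since $11 \times 512 > 5461$, the truncation step of the expansion is active in that case.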
### B.4 Scaling-law token budget at 1B
To complement the final validation perplexities in Tab. [1](https://arxiv.org/html/2606.28932#S4.T1), we plot the full training trajectory for the 1B setting at a scaling-law token budget (26B tokens) in Fig. [2](https://arxiv.org/html/2606.28932#A2.F2). The low-rank backbone equipped with DLR closely tracks the full-rank baseline throughout training and exhibits a small but consistent gap at convergence, matching the relative performance reported in Tab. [1](https://arxiv.org/html/2606.28932#S4.T1).
Figure 2: Scaling-law token budget at LLaMA-1B with 26B tokens. Validation loss trajectories on C4 for LLaMA-1B trained for 26B tokens, comparing full-rank pre-training to a low-rank backbone equipped with DLR ($r = 512$). The full-rank model converges to a slightly lower loss, consistent with the 1B perplexity ordering reported in Tab. [1](https://arxiv.org/html/2606.28932#S4.T1).
## Appendix C Gradient-path analysis results
We instrument the training loop to probe the gradient paths induced by the low-rank decoder and the DLR expansion operator at selected update steps. For each probed module output $y \in \mathbb{R}^{d_{\mathrm{out}}}$, we capture $y$ via a forward hook and compute $g_y = \partial\mathcal{L}/\partial y$ using `torch.autograd.grad`, avoiding any modification of parameter gradients. We then form the two latent-gradient components $B^{\top} g_y$ and $\beta R g_y$ (with $\beta = \alpha/\sqrt{K}$), where $R$ corresponds to the DLR "expand-then-sum" operator (implemented as a scatter-add over duplicated groups). Following our implementation, we reshape $g_y$ to $[N, d_{\mathrm{out}}]$ by flattening batch and sequence dimensions, compute token-wise $\ell_2$ norms, and report the token-average norms $\|B^{\top} g_y\|$ and $\|\beta R g_y\|$. We log $\rho(t) = \|\beta R g_y\| / \|B^{\top} g_y\|$ and the token-average cosine similarity $\cos_t = \langle B^{\top} g_y,\ \beta R g_y \rangle / (\|B^{\top} g_y\| \cdot \|\beta R g_y\|)$. Note that $\cos_t$ measures the *directional alignment of the two gradient components* (not the cosine between the parameter matrices $B$ and $R$).
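The two diagnostics can be sketched on toy data as follows. This is a pure-Python illustration, not the instrumentation used in the experiments. It takes $B$ in the paper's layout ($d_{\mathrm{out}} \times r$) and a list of per-token output gradients, and it assumes that $\rho$ is the ratio of the two token-averaged norms and that $\cos_t$ is the mean of per-token cosines. The function names, the `eps` guard, and the toy values are hypothetical.

```python
import math
from statistics import fmean


def latent_grad_components(B, g_y, alpha):
    """Return (B^T g_y, beta * R g_y) for one token; B is d_out x r (paper layout)."""
    d_out, r = len(B), len(B[0])
    K = math.ceil(d_out / r)
    beta = alpha / math.sqrt(K)
    decoder = [sum(B[i][j] * g_y[i] for i in range(d_out)) for j in range(r)]
    residual = [0.0] * r
    for i in range(d_out):
        # Scatter-add: output coordinate i belongs to the group of latent i // K.
        residual[i // K] += beta * g_y[i]
    return decoder, residual


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def rho_and_cos(B, grads, alpha, eps=1e-12):
    """Ratio of token-averaged norms and token-averaged cosine between the two components."""
    dec_norms, res_norms, cosines = [], [], []
    for g_y in grads:
        dec, res = latent_grad_components(B, g_y, alpha)
        n_dec, n_res = l2_norm(dec), l2_norm(res)
        dec_norms.append(n_dec)
        res_norms.append(n_res)
        dot = sum(a * b for a, b in zip(dec, res))
        cosines.append(dot / (n_dec * n_res + eps))
    rho = fmean(res_norms) / (fmean(dec_norms) + eps)
    return rho, fmean(cosines)


# Made-up toy values: d_out = 4, r = 2, so K = 2 and beta = alpha / sqrt(2).
B = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
grads = [[1.0, 0.0, 0.0, 1.0], [0.0, 2.0, 2.0, 0.0]]
rho, cos_t = rho_and_cos(B, grads, alpha=1.0)
print(rho, cos_t)
```

In this toy case the two components are exactly orthogonal, so `cos_t` is zero while `rho` is nonzero.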
Figure 3: Gradient-path diagnostics (LLaMA-350M): L0 self\_attn q\_proj. Panels: (a) ratio $\rho(t) = \|\beta R g_y\| / \|B^{\top} g_y\|$; (b) cosine $\cos_t = \cos(B^{\top} g_y,\ \beta R g_y)$. The DLR-induced component dominates early updates ($\rho(t) > 1$) and decays below $1$ as training progresses, while remaining nearly orthogonal to the baseline low-rank gradient ($\cos_t \approx 0$).

Figure 4: Gradient-path diagnostics (LLaMA-350M): L11 mlp\_up\_proj. Panels: (a) ratio $\rho(t)$; (b) cosine $\cos_t$. The ratio $\rho(t)$ decreases from above $1$ to below $1$ over training, while $\cos_t$ stays close to $0$, indicating a strong but complementary DLR gradient contribution.

Figure 5: Gradient-path diagnostics (LLaMA-350M): L23 mlp\_up\_proj. Panels: (a) ratio $\rho(t)$; (b) cosine $\cos_t$. We observe the same trend as in other layers: the DLR term dominates early ($\rho(t) > 1$) and then decays, while remaining nearly orthogonal to the baseline gradient ($\cos_t \approx 0$).
## Appendix D The impact of DLR on training dynamics
We provide additional training-dynamics evidence by plotting evaluation loss over pre-training for three variants: (i) DLR+B (default), (ii) DLR only (no learnable decoder branch $B$), and (iii) Random duplication + B (fixed random duplication mapping with the same variance correction $1/\sqrt{K}$). All three runs enable `torch.compile` and are executed on the same DGX machine with 8× A100 GPUs connected via NVLink. Figure [6](https://arxiv.org/html/2606.28932#A4.F6) shows that removing $B$ significantly worsens optimization and yields consistently higher evaluation loss. Meanwhile, DLR+B and Random duplication + B converge along nearly identical trajectories and reach comparable final losses, supporting the interpretation of DLR as an optimization/conditioning aid that complements (rather than replaces) the learnable low-rank decoder. Despite similar convergence, Random duplication is substantially less efficient: it achieves only 507,186 tok/s versus 1,445,134 tok/s for DLR+B, because it relies on indexing-based gathers instead of contiguous duplication. These results validate the design constraints used in the main method. First, keeping the learnable decoder $B$ is necessary because the duplicated residual is not intended to replace the low-rank up-projection. Second, the residual map should be fixed and linear in $z$: any such map, including the fixed random mapping tested here, can in principle be folded into a same-shape decoder, whereas data-dependent or nonlinear maps generally cannot. Finally, using contiguous `repeat_interleave` rather than indexed gathers is what lets `torch.compile` fuse the duplication into surrounding computation and recover the throughput profile of vanilla low-rank training.
Definition of random duplication. For each module, we sample a *fixed* index vector `idx` $\in \{0, \dots, r-1\}^{d_{\mathrm{out}}}$ once and reuse it throughout training, drawing it as `torch.randint(0, r, (d_out,))` under a deterministic per-module seed (base seed XOR $\mathrm{MD5}(\texttt{module\_name})$). We form the expanded residual by indexing $z_{\mathrm{exp}}[j] = z[\texttt{idx}[j]]$ (i.e., `gather`/`index_select`), which incurs irregular memory access and poorer fusion than deterministic `repeat_interleave`.
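The claim that a fixed random mapping is also foldable can be illustrated with a short sketch. It is written in plain Python, so `random.Random` stands in for `torch.randint`. The way the MD5 digest is reduced to an integer, the base seed, the module name, and all helper names are assumptions made for the example, not details taken from the released code.

```python
import hashlib
import random


def module_seed(base_seed, module_name):
    """Assumed reduction: XOR the base seed with the low 32 bits of MD5(module_name)."""
    digest = hashlib.md5(module_name.encode("utf-8")).hexdigest()
    return base_seed ^ (int(digest, 16) & 0xFFFFFFFF)


def sample_idx(r, d_out, seed):
    """Fixed index vector in {0, ..., r-1}^d_out, sampled once per module."""
    rng = random.Random(seed)
    return [rng.randrange(r) for _ in range(d_out)]


def random_duplication(z, idx):
    """z_exp[j] = z[idx[j]], i.e. a gather."""
    return [z[i] for i in idx]


def fold_random(B, idx, scale):
    """Fold the fixed gather into B (code layout, r x d_out): column j gains scale in row idx[j]."""
    B_star = [row[:] for row in B]
    for j, i in enumerate(idx):
        B_star[i][j] += scale
    return B_star


r, d_out, scale = 3, 8, 0.5
idx = sample_idx(r, d_out, module_seed(42, "layers.0.self_attn.q_proj"))
z = [1.0, -2.0, 0.5]
B = [[0.01 * (j + i) for i in range(d_out)] for j in range(r)]

z_exp = random_duplication(z, idx)
y_train = [sum(z[j] * B[j][i] for j in range(r)) + scale * z_exp[i] for i in range(d_out)]
B_star = fold_random(B, idx, scale)
y_folded = [sum(z[j] * B_star[j][i] for j in range(r)) for i in range(d_out)]
max_diff = max(abs(a - b) for a, b in zip(y_train, y_folded))
print(idx, max_diff)
```

The fold works because the gather is a fixed linear map of $z$. The difference from the contiguous variant is in how efficiently the training-time branch runs, not in whether it can be folded.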
Figure 6: Training dynamics on LLaMA-60M. Evaluation loss curves for (i) DLR+B (default), (ii) DLR only (removing the learnable decoder branch $B$), and (iii) Random duplication + B (replacing deterministic duplication with a fixed random mapping while keeping the same variance correction $1/\sqrt{K}$). Removing $B$ substantially degrades convergence, while DLR+B and Random duplication + B exhibit very similar trajectories.
## Appendix E Supervised fine-tuning details
For the downstream SFT evaluation in Tab. [4](https://arxiv.org/html/2606.28932#S4.T4), we fine-tune the 1B Full-Rank, CoLA, and folded DLR+CoLA checkpoints on Alpaca-cleaned (52K instructions) for 3 epochs using full-parameter SFT. All SFT runs use 16× H100 DDP, effective batch size 64, learning rate $2 \times 10^{-5}$ with cosine schedule, BF16, and maximum sequence length 512. We then evaluate ARC-Challenge, BoolQ, HellaSwag, MMLU, PIQA, and WinoGrande with `lm-evaluation-harness` in the zero-shot setting. For each method, the three SFT seeds are applied to the same fixed pre-trained checkpoint, so the reported standard deviations capture SFT-side data-ordering variability rather than pre-training initialization variance. BoolQ and MMLU collapse to constant chance-level accuracy with zero across-seed variance for all three checkpoints, consistent with the behavior of pre-training-only 1B models on these formats (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13); Glentis et al., [2025](https://arxiv.org/html/2606.28932#bib.bib14)); we therefore interpret the average mainly through the tasks where the checkpoints differentiate.
## Appendix F Zero-shot evaluation
In addition to validation perplexity on C4, we perform zero-shot evaluations of 1B-scale models on ARC-Challenge (ARC-C), BoolQ, HellaSwag, MMLU, PIQA, and WinoGrande, using standard zero-shot prompts (no instruction tuning). All low-rank variants use rank $r = 512$ and are trained with five random seeds $\{41, 42, 43, 44, 45\}$; we report mean ± std across seeds. As shown in Tab. [10](https://arxiv.org/html/2606.28932#A6.T10), DLR improves the average score over the pure low-rank baseline and largely closes the gap to full-rank; we treat these results as supplementary since several tasks remain near-chance at this pre-training-only stage.
Table 10: Zero-shot evaluation on 1B models. Each cell reports zero-shot accuracy as mean ± std across five pre-training seeds $\{41, 42, 43, 44, 45\}$; the Avg. column is the across-task mean of the per-task means. BoolQ and MMLU saturate at constant chance-level accuracy (std $= 0$) for all four checkpoints, consistent with the well-documented behavior of pre-training-only 1B models on these formats.

| Method (1B, $r = 512$) | ARC-C | BoolQ | HellaSwag | MMLU | PIQA | WinoGrande | Avg. |
|---|---|---|---|---|---|---|---|
| Full-Rank | 22.06 ± 0.57 | 37.83 ± 0.00 | 34.94 ± 0.18 | 22.95 ± 0.00 | 67.57 ± 0.63 | 51.10 ± 1.13 | 39.41 |
| Low-Rank | 19.78 ± 0.76 | 37.83 ± 0.00 | 27.47 ± 0.16 | 22.95 ± 0.00 | 60.28 ± 0.76 | 51.16 ± 1.21 | 36.58 |
| DLR+Low-Rank | 21.59 ± 0.65 | 37.83 ± 0.00 | 32.52 ± 1.03 | 22.95 ± 0.00 | 65.59 ± 0.68 | 50.69 ± 0.24 | 38.53 |
| DLR+CoLA | 22.29 ± 0.57 | 37.83 ± 0.00 | 34.20 ± 0.24 | 22.95 ± 0.00 | 66.81 ± 0.58 | 50.61 ± 1.57 | 39.12 |
## Appendix G Related work
Related work on efficient pre-training spans four complementary directions. First, low-rank adapters popularized by LoRA learn update-level low-rank corrections (Hu et al., [2021](https://arxiv.org/html/2606.28932#bib.bib7)); pre-training variants such as ReLoRA and SwitchLoRA recover higher effective ranks via periodic merges or frequent subspace switching (Lialin et al., [2023](https://arxiv.org/html/2606.28932#bib.bib8); Zhou et al., [2025](https://arxiv.org/html/2606.28932#bib.bib9)), while LORO directly optimizes a fixed-rank factorization on the low-rank manifold (Mo et al., [2025](https://arxiv.org/html/2606.28932#bib.bib16)). Second, memory-efficient gradient projection methods (GaLore and variants) project gradients into low-rank subspaces to reduce optimizer state, with Fira stabilizing training via norm-based scaling and a norm-growth limiter (Zhao et al., [2024](https://arxiv.org/html/2606.28932#bib.bib10); Zhang et al., [2024](https://arxiv.org/html/2606.28932#bib.bib26); Chen et al., [2024](https://arxiv.org/html/2606.28932#bib.bib11)). Third, sparse-plus-low-rank designs increase expressivity by adding sparse components: SLTrain employs a fixed unstructured support with low memory overhead (Han et al., [2024](https://arxiv.org/html/2606.28932#bib.bib13); Glentis et al., [2025](https://arxiv.org/html/2606.28932#bib.bib14)), and LOST co-designs complementary low-rank and channel-wise structured sparse components guided by an SVD initialization (Li et al., [2025](https://arxiv.org/html/2606.28932#bib.bib15)). A closely related latent-shaping line improves low-rank trainability without introducing an explicit sparse pathway: CoLA inserts a nonlinearity between factors to enforce low-rank structure in activations (Liu et al., [2025](https://arxiv.org/html/2606.28932#bib.bib12)).
LaX (Zhang et al., [2025](https://arxiv.org/html/2606.28932#bib.bib17)) designs inter-layer latent crossing to enhance the capacity of low-rank models by establishing information flow across the low-rank spaces of different layers. LoR2C (Zhao et al., [2025](https://arxiv.org/html/2606.28932#bib.bib18)) introduces low-rank residual connections within the model layers to address vanishing gradients during LoRA fine-tuning. ResLoRA (Wang et al., [2025b](https://arxiv.org/html/2606.28932#bib.bib20)) integrates residual paths into the LoRA fine-tuning framework to accelerate training convergence and improve performance, using merging strategies to ensure no additional computational cost during inference. Similarly, FST (Hu et al., [2024](https://arxiv.org/html/2606.28932#bib.bib36)) and SLoPe (Mozaffari et al., [2025](https://arxiv.org/html/2606.28932#bib.bib37)) optimize training for semi-structured (e.g., 2:4) or block-sparse patterns to leverage specialized kernels. Finally, post-training model folding clusters redundant channels/heads and merges them with variance-preserving corrections to compress models without data or fine-tuning (Wang et al., [2025a](https://arxiv.org/html/2606.28932#bib.bib3); Saukh et al., [2026](https://arxiv.org/html/2606.28932#bib.bib41)). DLR brings the folding insight *into training*: it introduces a *foldable* latent-space compensation within each low-rank (or CoLA-style) layer. Concretely, with a rank-$r$ latent $z = s\,\phi(A^{\top}x)$, DLR augments the decoder output by a structured duplicated residual, produced by a fixed replication matrix that maps each latent coordinate to a small group of output channels. Compared to the above, this design reduces FLOPs and optimizer state like low-rank methods, avoids sparse-kernel overheads, and empirically delivers favorable perplexity–efficiency trade-offs across LLaMA scales under matched token budgets.
Orthogonal to these four directions, a complementary line of work targets the activation-memory bottleneck by compressing activations stored for backpropagation. VeLoRA compresses intermediate activations via rank-1 sub-token projections and reconstructs them approximately during the backward pass (Miles et al., [2024](https://arxiv.org/html/2606.28932#bib.bib34)). CompAct stores low-rank, randomly projected activations and only decompresses gradients for the optimizer update, achieving 25–30% peak-memory savings for LLaMA pre-training and up to 50% for RoBERTa fine-tuning (Shamshoum et al., [2025](https://arxiv.org/html/2606.28932#bib.bib33)). Tensor-decomposition methods compress activation maps using (high-order) SVD with theoretical guarantees on gradient approximation error (Nguyen et al., [2024](https://arxiv.org/html/2606.28932#bib.bib35)).
Tab. [11](https://arxiv.org/html/2606.28932#A7.T11) is not intended as a complete taxonomy of efficient pre-training. Instead, it isolates residual-style interventions that are easy to conflate with DLR. ResNet (He et al., [2015](https://arxiv.org/html/2606.28932#bib.bib30)) is included only as a conceptual anchor for identity residuals, not as an efficient pre-training baseline. Within this narrower comparison, DLR differs by combining a pre-training setting, a parameter-free intra-layer latent residual, and a closed-form fold into the deployed low-rank graph.
Table 11: Where DLR sits among residual-style methods. We anchor against ResNet to remind readers that residual primitives predate efficient pre-training, then compare DLR with three pre-training / fine-tuning residual baselines. DLR is the only entry that combines (i) pre-training applicability, (ii) a parameter-free residual path, and (iii) an exact closed-form merge into the deployment graph (Eq. [12](https://arxiv.org/html/2606.28932#S3.E12)).

| Method | Setting | Residual space | Learnable? | Main role | Foldable? |
|---|---|---|---|---|---|
| ResNet | Full | Activation | No | Stabilize optimization | N/A (identity) |
| ResLoRA | FT | Adapter | Yes | Improve adaptation | Sometimes |
| LoR2C | FT | Correction | Yes | Improve capacity | Sometimes |
| LaX | Pretrain | Inter-layer latent | Partial | Cross-layer flow | No |
| DLR | Pretrain | Intra-layer latent | No | Decoder-independent gradient path | Yes |
## Appendix H Impact statement and existing assets
#### Impact statement\.
This work studies an efficiency technique for low\-rank language\-model pre\-training\. Its intended positive impact is to reduce the compute and memory required to train and deploy low\-rank models, which may make pre\-training experiments more accessible and reduce resource use\. The method does not introduce a new application, dataset, or user\-facing system, but improvements in pre\-training efficiency could also lower the cost of training models that may be misused if deployed without appropriate safety evaluation\. We therefore view DLR as a general\-purpose training method whose downstream risks are primarily those of the models and deployment settings to which it is applied\.
#### Existing assets and licenses\.
We use publicly available research assets and cite their original sources where they are introduced in the paper\. Pre\-training uses C4 \(Raffel et al\.,[2020](https://arxiv.org/html/2606.28932#bib.bib21)\) accessed through the Hugging Face Datasets interface; tokenization uses the T5\-base tokenizer \(Raffel et al\.,[2023](https://arxiv.org/html/2606.28932#bib.bib31)\); supervised fine\-tuning uses Alpaca\-cleaned; and downstream evaluation uses lm\-evaluation\-harness\. We do not redistribute the raw datasets or third\-party model checkpoints as part of this source package\. Users reproducing the experiments should obtain these assets from their original providers and comply with the licenses and terms listed on the corresponding dataset, model, and software package pages\. Our code release is adapted from the open\-source implementation of Glentis et al\. \([2025](https://arxiv.org/html/2606.28932#bib.bib14)\); the release will include attribution and license information for the reused code and experiment scripts\.
## Appendix I Use of Large Language Models
We primarily used ChatGPT \(https://chatgpt\.com/\) to correct grammatical errors in the manuscript and to fix minor compilation issues in Overleaf\. In addition, Cursor \(https://cursor\.com/\) with GPT\-5\.2\-High \(https://openai\.com/index/introducing\-gpt\-5/\) was employed to debug programming errors encountered during implementation\. Apart from these auxiliary uses, the research ideas, theoretical contributions, and the writing of this paper were entirely carried out by the authors\. The code implementation is adapted from the open\-source code of Glentis et al\. \([2025](https://arxiv.org/html/2606.28932#bib.bib14)\)\.