# PreFT: Prefill-only finetuning for inference efficiency
Source: [https://arxiv.org/html/2605.14217](https://arxiv.org/html/2605.14217)
Andrew Lanpouthakoun∗† Aryaman Arora∗† Zhengxuan Wu† Dhruv Pai‡ Ben Keigwin‡ Dan Jurafsky† Christopher Potts†
†Stanford University ‡Tilde Research
{andlanpo,aryamana}@stanford.edu
###### Abstract
Large language models can now be personalised efficiently at scale using parameter-efficient finetuning methods (PEFTs), but *serving* user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving we instead ought to optimise performance relative to *serving throughput*. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs (1.9× the throughput when serving 512 adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than that of PEFTs, but this can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy–throughput tradeoff than existing PEFTs for personalised serving.
Figure 1: PreFT adapters approach parity on a variety of tasks with much better throughput. Inference throughput vs. accuracy for PEFTs (LoRA) and PreFTs (LoRAP, DiReFTP) on a variety of tasks. Tasks besides GSM8K are on Llama 3.1 8B Base. To roughly match parameter count in these plots, LoRA/LoRAP is always rank-1 and DiReFTP is rank-16 (Tülu-3, LongBench-Write) or rank-8 otherwise. On Tülu-3 (SFT) and the RL tasks (GSM8K, MATH, HumanEval), score is evaluation accuracy on selected downstream benchmarks (see [section 5](https://arxiv.org/html/2605.14217#S5)). On LongBench-Write, score is length-following ($S_l$) on 4k–20k token outputs. Inference throughput (tokens/s) is measured on the Uniform Punica microbenchmark when serving 512 adapters on one GPU, simulating multi-user serving (see [section 4](https://arxiv.org/html/2605.14217#S4)).

### 1 Introduction
Personalising large language models for billions of users is an open research problem; besides memory and system-prompt personalisation, considerable effort has been dedicated to devising parameter-efficient finetuning methods (PEFTs) which reduce the memory footprint of finetuning. LoRA (Hu et al., [2022](https://arxiv.org/html/2605.14217#bib.bib19)), the most popular weight-based PEFT, is an attractive option here; it creates lightweight low-rank adapters which can be added to the model's existing weight matrices at inference time. In turn, as systems for serving LLMs have become increasingly optimised, many have sought to make serving thousands of LoRAs compatible with efficient LLM inference (Chen et al., [2024](https://arxiv.org/html/2605.14217#bib.bib6); Sheng et al., [2024](https://arxiv.org/html/2605.14217#bib.bib37), *inter alia*; see [section 2](https://arxiv.org/html/2605.14217#S2)).
When serving multiple LoRAs in a single inference batch, the biggest challenge is that each batch item demands a different set of LoRA weights; this requires (a) developing custom kernels (e.g. Punica; Chen et al., [2024](https://arxiv.org/html/2605.14217#bib.bib6)), since standard batched matrix multiplication is no longer applicable, as it assumes the same matrix is multiplied with each batch item, and (b) managing the memory overhead of keeping multiple LoRAs on GPU, such as by offloading to CPU and paging according to a scheduling algorithm (e.g. S-LoRA; Sheng et al., [2024](https://arxiv.org/html/2605.14217#bib.bib37)). Even with these highly custom optimisations, multi-LoRA serving harms inference throughput.
In this work, rather than optimising what already exists, we take a different approach: we design and implement PEFTs that are inherently suited for efficient LLM inference from the outset. Our key motivation is the distinction between *prefill* (processing the input prompt in parallel) and *decode* (generating outputs sequentially) in efficient LLM inference: prefill is compute-bound, so serving PEFTs at this stage does not harm compute utilisation; on the other hand, decode is memory-bound and therefore should batch across requests. However, serving many PEFTs in a single batch imposes significant memory overhead and thus erodes the gains from batching. In other words, per-request personalisation is more expensive at decode than at prefill.
To address this challenge, we create efficient *prefill-only* variants of two existing PEFTs, LoRA (Hu et al., [2022](https://arxiv.org/html/2605.14217#bib.bib19)) and ReFT (Wu et al., [2024b](https://arxiv.org/html/2605.14217#bib.bib40)), aptly named PreFTs. Whereas LoRA operates on weights, ReFT intervenes selectively on activations and is thus well-suited for interventions on selected tokens only. We implement efficient multi-adapter versions of these methods in a fork of the vLLM inference engine (Kwon et al., [2023](https://arxiv.org/html/2605.14217#bib.bib22)), and further add support for our inference backend in the trl reinforcement learning library. Our experimental results provide strong support for two key claims, which we summarise in [Figure 1](https://arxiv.org/html/2605.14217#S0.F1).
First, we show that PreFTs are efficient at inference ([section 4](https://arxiv.org/html/2605.14217#S4)). We benchmark our multi-PreFT serving system against existing multi-LoRA serving at all token positions in vLLM: without any custom kernels, prefill-only ReFT achieves 1.9× the throughput of multi-LoRA when serving up to 512 adapters on Llama 3.1 70B with tensor parallelism across 4 H100s, and prefill-only LoRA achieves 1.87×. Gains hold across model scales (2.21×/1.83× on Llama 3.1 8B, 2.39×/1.63× on Qwen 2.5 0.5B for DiReFTP / LoRAP, respectively).
Second, we show that PreFTs maintain strong performance ([section 5](https://arxiv.org/html/2605.14217#S5)). On SFT, PreFTs have higher evaluation loss than all-position PEFTs on Tülu-3 and OpenThoughts in data- and parameter-matched training, but we find no statistically significant difference in accuracy on downstream evaluations. On RL tasks, PreFTs approach PEFTs on math reasoning with GSM8K and MATH (Cobbe et al., [2021](https://arxiv.org/html/2605.14217#bib.bib9); Hendrycks et al., [2021b](https://arxiv.org/html/2605.14217#bib.bib17)) and on code generation trained with MBPP and evaluated on the HumanEval dataset (Austin et al., [2021](https://arxiv.org/html/2605.14217#bib.bib2); Chen et al., [2021](https://arxiv.org/html/2605.14217#bib.bib7)), though overall a small gap remains. On SFT for long-form text generation, trained on LongWriter and evaluated with LongBench-Write (Bai et al., [2024](https://arxiv.org/html/2605.14217#bib.bib4)), we find that LoRAP maintains length-control ability and quality similar to traditional LoRA.
Overall, we find that PreFTs are worth deploying over traditional all-position PEFTs in the multi-user personalisation setting: they maintain most or all of the performance of traditional PEFTs while significantly improving inference efficiency.
### 2 Related work
###### Parameter-efficient finetuning (PEFT).
A variety of PEFT techniques have been proposed in the literature; broadly, they freeze the original model and only train some lightweight parameters applied as additional operations. We focus on LoRA (Hu et al., [2022](https://arxiv.org/html/2605.14217#bib.bib19)) and ReFT (Wu et al., [2024b](https://arxiv.org/html/2605.14217#bib.bib40)); the former applies low-rank adapters to selected weight matrices, and the latter applies low-rank hidden-representation edits at selected token positions and layers. Other approaches add entire model components (Houlsby et al., [2019](https://arxiv.org/html/2605.14217#bib.bib18); Pfeiffer et al., [2020](https://arxiv.org/html/2605.14217#bib.bib31), *inter alia*) or learnable soft tokens (Li and Liang, [2021](https://arxiv.org/html/2605.14217#bib.bib24)). Additionally, many variants of LoRA (Liu et al., [2024](https://arxiv.org/html/2605.14217#bib.bib25); Kopiczko et al., [2024](https://arxiv.org/html/2605.14217#bib.bib21); Balazy et al., [2025](https://arxiv.org/html/2605.14217#bib.bib5); Morris et al., [2026](https://arxiv.org/html/2605.14217#bib.bib29)) and ReFT (Zeng et al., [2025](https://arxiv.org/html/2605.14217#bib.bib41); Wang et al., [2025](https://arxiv.org/html/2605.14217#bib.bib38)) have been proposed which modify parametrisation and/or initialisation. A few works apply masked LoRA on selected token positions (Greenewald et al., [2025](https://arxiv.org/html/2605.14217#bib.bib14); Dietz et al., [2026](https://arxiv.org/html/2605.14217#bib.bib11)); in their appendix, the latter even suggest a masked neuron-level adapter, which is similar in spirit to ReFT. However, none propose prefill-only application. We choose to focus on LoRA for its widespread adoption in serving infrastructure, and we choose ReFT because it is designed for token-level interventions.
###### Serving personalised LLMs.
Serving multiple finetunes of one model adds overhead to LLM inference. While LoRA adapters can be merged into and unmerged from model weights as in dLoRA (Wu et al., [2024a](https://arxiv.org/html/2605.14217#bib.bib39)), doing this naïvely reduces throughput and increases latency in multi-LoRA serving. Therefore, S-LoRA (Sheng et al., [2024](https://arxiv.org/html/2605.14217#bib.bib37)) and Punica (Chen et al., [2024](https://arxiv.org/html/2605.14217#bib.bib6)) proposed serving *unmerged* LoRA adapters with specialised kernels for multi-LoRA batch matrix multiplication, along with offloading LoRAs to CPU memory and using scheduling algorithms (e.g. paging in S-LoRA and dLoRA) to fetch them onto the GPU as needed while minimising latency and idle compute. Later work introduces further memory-management and compression improvements for multi-LoRA serving (Gabrielsson et al., [2025](https://arxiv.org/html/2605.14217#bib.bib13); Zhang et al., [2025](https://arxiv.org/html/2605.14217#bib.bib42); Zhu et al., [2025](https://arxiv.org/html/2605.14217#bib.bib45); Potdar et al., [2025](https://arxiv.org/html/2605.14217#bib.bib32); Ni et al., [2026](https://arxiv.org/html/2605.14217#bib.bib30)). PreFTs can make use of all of these optimisations.
### 3 Introducing PreFT
We now introduce PreFT as a simple modification of existing all-position adapters. For background, we first rigorously define the two PEFTs that we experiment with, LoRA (Hu et al., [2022](https://arxiv.org/html/2605.14217#bib.bib19)) and ReFT (Wu et al., [2024b](https://arxiv.org/html/2605.14217#bib.bib40)), under a unified structure. We then define our approach and elaborate on the efficiency motivation for prefill-only adapters.
#### 3.1 Adapter methods
We first define an adapter $\mathcal{A}_\phi$ on a frozen base model $\mathcal{M}$ as a tuple $\mathcal{A}_\phi = \langle \Phi_\phi, \mathcal{F}, \mathcal{P} \rangle$ consisting of:
- an *intervention function* $\Phi_\phi$ with trainable parameters $\phi$,
- a set of *target modules* $\mathcal{F}$ within $\mathcal{M}$ at which the intervention is applied,
- a set of *target positions* $\mathcal{P}$ within each input sequence at which the intervention is active.

Each target module $f \in \mathcal{F}$ is a computation within $\mathcal{M}$ that takes an input $\mathbf{x}$ and produces an output $f(\mathbf{x})$: one could imagine a weight matrix applied to its input, or the identity map on the residual stream at a given layer. The adapter replaces $f(\mathbf{x})$ with $\Phi_\phi(f(\mathbf{x}))$ at positions $i \in \mathcal{P}$:
$$f(\mathbf{x})_i \leftarrow \begin{cases} \Phi_\phi(f(\mathbf{x})_i) & \text{if } i \in \mathcal{P} \\ f(\mathbf{x})_i & \text{otherwise} \end{cases} \qquad (1)$$

The base parameters $\theta$ remain frozen; only $\phi$ is trained. Different adapter methods correspond to different choices of what kind of module $\mathcal{F}$ ranges over and what form $\Phi_\phi$ takes.
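To make the abstraction concrete, here is a minimal PyTorch sketch of Eq. (1); the function and argument names are ours for illustration, not the paper's implementation:

```python
import torch

def apply_adapter(f_x: torch.Tensor, phi, position_mask: torch.Tensor) -> torch.Tensor:
    """Eq. (1): replace f(x)_i with Phi_phi(f(x)_i) at positions i in P.

    f_x:           module output of shape (seq_len, d)
    phi:           the intervention function Phi_phi (any shape-preserving callable)
    position_mask: boolean tensor of shape (seq_len,), True where i is in P
    """
    out = f_x.clone()
    out[position_mask] = phi(f_x[position_mask])  # intervene only inside P
    return out
```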
###### LoRA.
LoRA (Hu et al., [2022](https://arxiv.org/html/2605.14217#bib.bib19)) targets frozen weight matrices: each $f \in \mathcal{F}$ is a linear map $f(\mathbf{x}) = W\mathbf{x}$ with $W \in \mathbb{R}^{n \times m}$. LoRA introduces trainable matrices $\mathbf{B} \in \mathbb{R}^{n \times r}$ and $\mathbf{A} \in \mathbb{R}^{r \times m}$ with $r \ll \min(n, m)$, and defines:

$$\Phi^{\text{LoRA}}_\phi(f(\mathbf{x})) = f(\mathbf{x}) + \alpha \mathbf{B}\mathbf{A}\mathbf{x} \qquad (2)$$

where $\alpha$ is a constant scaling prefactor and $\phi = \{(\mathbf{A}, \mathbf{B})\}_{f \in \mathcal{F}}$. We work in the experimental setting where the choices for $\mathcal{F}$ are the attention projections and MLP projections at all layers. In standard LoRA, $\mathcal{P}$ spans all token positions.
###### ReFT.
ReFT (Wu et al., [2024b](https://arxiv.org/html/2605.14217#bib.bib40)) targets residual streams: each $f \in \mathcal{F}$ is the identity map at a given layer $l$, so $f(\mathbf{x}) = \mathbf{h}^l$. We focus on the DiReFT parametrisation, which we found most effective:

$$\Phi^{\text{DiReFT}}_\phi(\mathbf{h}) = \mathbf{h} + \mathbf{B}^\top(\mathbf{A}\mathbf{h} + \mathbf{b}) \qquad (3)$$

where $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{r \times d}$ and $\mathbf{b} \in \mathbb{R}^r$, with $r < d$, and $\phi = \{(\mathbf{A}, \mathbf{B}, \mathbf{b})\}_{f \in \mathcal{F}}$. Since $\mathbf{A}$ is unconstrained, $\Phi^{\text{DiReFT}}$ adds a vector lying in an at-most rank-$r$ subspace to the residual stream. In the original ReFT formulation, $\mathcal{P}$ was a tunable hyperparameter consisting of $k$ prefix and suffix prompt positions; in this work, we instead let $\mathcal{P}$ span either all prompt tokens or all token positions, as defined next.
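The two intervention functions of Eqs. (2) and (3) can be sketched directly; this is an illustrative rendering under our notation, not the released code:

```python
import torch

def phi_lora(x: torch.Tensor, f_x: torch.Tensor, A: torch.Tensor,
             B: torch.Tensor, alpha: float) -> torch.Tensor:
    """Eq. (2): f(x) + alpha * B A x, with A in R^{r x m} and B in R^{n x r}."""
    return f_x + alpha * (x @ A.T) @ B.T

def phi_direft(h: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               b: torch.Tensor) -> torch.Tensor:
    """Eq. (3): h + B^T (A h + b), with A, B in R^{r x d} and b in R^r."""
    return h + (h @ A.T + b) @ B
```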
#### 3.2 PreFT
We now introduce PreFT as a single-line modification to existing adapters: we restrict the position set to the prompt tokens only,

$$\mathcal{P}^{\text{PreFT}} = \{1, \dots, p\}, \qquad (4)$$

where $p$ is the length of the input prompt. Applied to LoRA and DiReFT, this yields the variants LoRAP and DiReFTP, which we compare against their all-position counterparts LoRA and DiReFTA (with $\mathcal{P}^A = \{1, \dots, p+T\}$, where $T$ is the number of generated tokens).
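At serving time, this restriction amounts to applying the adapter during the prefill pass and skipping it at every decode step. A minimal sketch, ignoring chunked prefill (which the actual implementation handles; see Appendix D):

```python
import torch

def adapted_forward(f, phi, x: torch.Tensor, is_prefill: bool) -> torch.Tensor:
    """Prefill-only adapter (Eq. 4): intervene on prompt tokens only."""
    f_x = f(x)
    if is_prefill:
        # Every position in the prefill pass satisfies i <= p, i.e. lies in P^PreFT.
        return phi(f_x)
    # Decode steps generate positions i > p, outside P^PreFT: the adapter is
    # skipped entirely and the base model output is returned unchanged.
    return f_x
```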
###### Why prefill-only adapters?
Prefill and decode steps represent different inference workloads: prefill is compute-bound (since many tokens from a single request can be processed in parallel, occupying the available compute), and decode is memory-bound (since each request only computes a single new token and loads a potentially large KV cache from memory). The exact arithmetic intensity of each operation depends on implementation and hardware, but refer to e.g. Austin et al. ([2025](https://arxiv.org/html/2605.14217#bib.bib3)): "During prefill, all matrix multiplications are basically always compute-bound [...] During generation, the total token batch size must be greater than $B_{\text{crit}}$ to be compute-bound on the linear/feed-forward operations (240 for bf16 params on TPU v5e). Because generation happens serially, token-by-token, this requires us to batch multiple requests together, which is hard!" This distinction necessitates different kinds of optimisation in order to maximise efficiency on both types of workloads. As a result, disaggregating prefill and decode workers entirely is increasingly common (Zhong et al., [2024](https://arxiv.org/html/2605.14217#bib.bib43); Qin et al., [2025](https://arxiv.org/html/2605.14217#bib.bib33), [2026](https://arxiv.org/html/2605.14217#bib.bib34), *inter alia*).
To maximise throughput during decode, it is beneficial to batch multiple requests. However, if each request requires serving a unique PEFT adapter, this adds memory overhead, since (like per-request KV caches) each PEFT in the batch must be loaded into memory. Indeed, Zhu et al. ([2026](https://arxiv.org/html/2605.14217#bib.bib46)) show that LoRA's down projection (where $r$ is rank, $d$ is output dimension, and $b$ is batch size) is already *memory-bound* in half precision:
$$\mathbb{I}_{\text{LoRA,down}} = \frac{\text{FLOPs}}{\text{bytes}} = \frac{2bdr}{2(bd + dr + br)} = \frac{1}{\frac{1}{r} + \frac{1}{b} + \frac{1}{d}} \ll \mathbb{B} \qquad (5)$$

where $\mathbb{B}$ is the accelerator's compute-to-bandwidth ratio, e.g. 295 for FP16 on an NVIDIA H100. The down projection for DiReFT is identical and therefore also memory-bound. This adds significant overhead at already memory-bound decode.
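Plugging representative decode-time values into Eq. (5) shows how far below the ridge point this lands; the numbers here (batch 32, hidden size 4096, rank 8) are our own illustrative choices:

```python
def lora_down_intensity(b: int, d: int, r: int) -> float:
    """Arithmetic intensity of LoRA's down projection per Eq. (5):
    2*b*d*r FLOPs over 2*(b*d + d*r + b*r) bytes in half precision."""
    return 1.0 / (1.0 / r + 1.0 / b + 1.0 / d)

# ~6.4 FLOPs/byte: far below the H100 FP16 ridge point of ~295, so the
# operation stays memory-bound at any realistic decode batch size.
print(lora_down_intensity(b=32, d=4096, r=8))
```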
We therefore simply propose *not applying the memory-bound PEFT operations at decode steps*. This is in addition to existing serving optimisations such as custom multi-LoRA kernels and memory paging (Chen et al., [2024](https://arxiv.org/html/2605.14217#bib.bib6); Sheng et al., [2024](https://arxiv.org/html/2605.14217#bib.bib37)): PreFT removes the decode-time cost entirely rather than amortising it, and composes naturally with these techniques when they are applied to prefill. It remains for us to demonstrate that the resulting efficiency gains are worth any cost in downstream quality.
### 4 Efficient multi-PreFT inference
We now implement and benchmark multi-PreFT serving with LoRA and ReFT in a fork of the LLM inference engine vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.14217#bib.bib22)). We compare our approach against vLLM's built-in multi-LoRA serving, and show that prefill-only adapters are highly performant under a variety of inference workloads. For example, on Llama-3.1-70B we achieve 1.9× and 1.87× the inference throughput of multi-LoRA for prefill-only ReFT and LoRA, respectively. Notably, unlike for LoRA, we do not use specially designed kernels to optimise multi-ReFT serving, yet we still achieve significant efficiency gains.
(a) Throughput for LoRA, DiReFTP, and DiReFTA under 4 inference workloads with varying numbers of adapters.
(b) Throughput for LoRA vs. PreFTs in the Uniform setting, scaling up to 512 adapters with paging.
Figure 2: PreFTs maintain throughput more effectively than traditional PEFTs as the number of adapters increases. Inference throughput (tokens/s) on the Punica microbenchmark when comparing rank-1 LoRA (prefill-only vs. all positions) and rank-8 DiReFT (prefill-only vs. all positions) with varying numbers of adapters.

###### Implementation.
To make PreFT usable at inference time, we fork vLLM and integrate ReFT and LoRAP adapters directly into the serving engine. The fork also adds compatibility for both multi-adapter and single-adapter runs, and exposes a weight-sync interface that pushes updated adapter parameters to live inference workers in a single collective call, enabling tight training–inference loops for on-policy RL without restarting the engine. The implementation works transparently under chunked prefill and across the attention kernels vLLM ships. (Full details, including the position-mask and CUDA-graph considerations that make this work, are provided in Appendix [D](https://arxiv.org/html/2605.14217#A4).)
#### 4.1 Benchmarking
We benchmark the throughput of PreFTs against the multi-LoRA implementation in vLLM, using the microbenchmarks from Punica (Chen et al., [2024](https://arxiv.org/html/2605.14217#bib.bib6)). vLLM implements both multi-LoRA memory paging using the approach proposed in S-LoRA (Sheng et al., [2024](https://arxiv.org/html/2605.14217#bib.bib37)) and custom Triton kernels for heterogeneous-batch LoRA processing based on Punica (see [https://github.com/vllm-project/vllm/pull/5036](https://github.com/vllm-project/vllm/pull/5036)).
###### Single GPU, all adapters on-GPU.
The single-GPU microbenchmark in Punica evaluates throughput given 1,000 requests processed in a first-come/first-served manner (not offline). We replicate this microbenchmark on a single H100 80GB GPU for varying numbers of adapters. All adapters are kept in GPU memory in this setting, so as to limit the comparison to inference efficiency. The maximum batch size for the vLLM worker is set to 32. We test four types of workloads: *Identical* (all requests go to one adapter; overhead is from holding adapters in memory), *Uniform* (the adapter is chosen uniformly at random), *Distinct* (all requests go to different adapters), and *Skewed* (the adapter is sampled from a Zipfian distribution). We run this benchmark on Qwen2.5-0.5B Instruct and Llama-3.1-8B Instruct, comparing rank-1 LoRA, rank-1 LoRAP, rank-8 DiReFTP, and rank-8 DiReFTA. We vary the number of adapters in $\{1, 2, 4, 8, 16, 32\}$.
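A sketch of how the four workload patterns can be generated; the Zipf exponent is an assumed value, as the benchmark's exact skew parameter is not given here:

```python
import numpy as np

def sample_adapter_ids(workload: str, n_requests: int, n_adapters: int,
                       zipf_a: float = 1.5, seed: int = 0) -> np.ndarray:
    """Assign one adapter id per request under the four workload patterns."""
    rng = np.random.default_rng(seed)
    if workload == "identical":  # all requests hit a single adapter
        return np.zeros(n_requests, dtype=int)
    if workload == "uniform":    # adapter chosen uniformly at random
        return rng.integers(0, n_adapters, size=n_requests)
    if workload == "distinct":   # consecutive requests go to different adapters
        return np.arange(n_requests) % n_adapters
    if workload == "skewed":     # Zipfian popularity over adapters
        return (rng.zipf(zipf_a, size=n_requests) - 1) % n_adapters
    raise ValueError(f"unknown workload: {workload}")
```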
In [Figure 2(a)](https://arxiv.org/html/2605.14217#S4.F2.sf1), we plot inference throughput with varying numbers of adapters in all settings, with a grey dashed line indicating baseline throughput without any adapters (11,792 tok/s for Qwen2.5-0.5B, 3,606 tok/s for Llama-3.1-8B). We find that DiReFTP maintains near-baseline throughput even as the number of adapters is increased to 32; it achieves 2.08× the throughput of LoRA in the Uniform setting with Qwen2.5-0.5B and 1.93× with Llama-3.1-8B.
###### Single GPU, paged memory for moving adapters from CPU to GPU.
We replicate vLLM's multi-LoRA memory paging system (adapted from S-LoRA) for multi-ReFT, which allows keeping adapters on CPU and moving only a subset onto GPU as needed. We fix the maximum number of adapters on-GPU to 32, and scale up the Punica single-GPU benchmark in the *Uniform* setting to 512 adapters. [Figure 2(b)](https://arxiv.org/html/2605.14217#S4.F2.sf2) shows that DiReFTP continues to grow its throughput lead over LoRA even with paged adapter memory.
###### Multiple GPUs, paged memory.
Finally, we use tensor parallelism over 4 GPUs to benchmark a larger model, Llama-3.1-70B (dense). [Figure 2(b)](https://arxiv.org/html/2605.14217#S4.F2.sf2) includes this result and shows similar throughput trends.
### 5 The effectiveness of PreFTs
Figure 3: PreFTs underperform all-position PEFTs at matched parameter counts, but scale predictably with rank. Eval loss (best LR) for LoRAP and DiReFTP compared to their all-position counterparts, on Tülu-3 (Llama-3.2-1B and Llama-3.1-8B) and OpenThoughts (Llama-3.2-1B), as a function of trainable parameter count.

Having established the inference efficiency benefits of PreFTs in [section 4](https://arxiv.org/html/2605.14217#S4), we now turn to validating the performance of prefill-only ReFT and LoRA relative to their all-position counterparts. We perform the following experiments:
- SFT on instruction-following with Tülu-3 (Lambert et al., [2025](https://arxiv.org/html/2605.14217#bib.bib23)), with downstream evaluations on held-out benchmarks ([section 5.1](https://arxiv.org/html/2605.14217#S5.SS1));
- RLVR on math (GSM8K and MATH) and code reasoning (trained on MBPP, evaluated on HumanEval), with comparisons to SFT performance ([section 5.2](https://arxiv.org/html/2605.14217#S5.SS2); also see [appendix G](https://arxiv.org/html/2605.14217#A7));
- SFT on long-form text generation, trained on LongWriter and evaluated on LongBench-Write ([section 5.3](https://arxiv.org/html/2605.14217#S5.SS3)).
From the first two sets of experiments, we find that PreFTs trained with SFT have higher evaluation loss but achieve parity on downstream task performance compared to all-position ReFT and LoRA; additionally, on RLVR tasks, PreFTs achieve parity on MATH and HumanEval, but not GSM8K. We further consider that PreFT performance may degrade over longer generations; in experiments on LongWriter, we find that LoRAP maintains the adapted behaviour of its all-position counterpart, and we investigate why DiReFTP does not do the same.

Overall, we find that PreFTs approach the downstream performance of PEFTs on RL and SFT, and the gap on long outputs exists only for DiReFTP, not LoRAP. Therefore, PreFTs sacrifice little performance compared to PEFTs despite large efficiency gains.
#### 5.1 SFT: Tülu-3 and OpenThoughts
Table 1: PEFTs and PreFTs show no significant difference in downstream performance under SFT. SFT of base Llama models, evaluated on Tülu-3, gives similar downstream scores for parameter-matched prefill vs. all-position adapters. * indicates statistical significance.

We seek to compare performance between PreFT and all-position PEFTs under supervised finetuning. To do so, we replicate experiments from Schulman and Thinking Machines Lab ([2025](https://arxiv.org/html/2605.14217#bib.bib35)), which form a useful testbed for comparing adapter performance. We compare DiReFTP with DiReFTA, and LoRAP with LoRA.
###### PEFTs have better evaluation loss than PreFTs.
Initially, in keeping with Schulman and Thinking Machines Lab ([2025](https://arxiv.org/html/2605.14217#bib.bib35)), we finetune Llama-3.2-1B Instruct and Llama-3.1-8B Instruct on Tülu-3 (Lambert et al., [2025](https://arxiv.org/html/2605.14217#bib.bib23); a general-purpose instruction-tuning dataset) and OpenThoughts (Guha et al., [2025](https://arxiv.org/html/2605.14217#bib.bib15); a reasoning dataset with much longer generation lengths). We truncate Tülu-3 to 2,048 tokens and OpenThoughts to 16,384 tokens, and only train on completions; further details are in [section A.3](https://arxiv.org/html/2605.14217#A1.SS3).

In [Figure 3](https://arxiv.org/html/2605.14217#S5.F3), we show the results of our sweeps, plotting the evaluation loss of the best-performing LR for each method at each rank across the three experimental settings. In general, we find that PreFTs have *worse evaluation loss* than all-position PEFTs, but improve as rank is scaled up, with little effect on throughput.
###### PEFTs and PreFTs have similar downstream performance despite loss differences.
On Llama Instruct models, finetuning hardly changes downstream performance due to high starting performance, so we instead run SFT with Tülu-3 only on *base* Llama-3.2-1B and Llama-3.1-8B with a similar LR and rank sweep. We evaluate on IFEval, MMLU, and GSM8K, which test various downstream abilities; see [section A.3](https://arxiv.org/html/2605.14217#A1.SS3) for more information. In [Table 1](https://arxiv.org/html/2605.14217#S5.T1), we compare the best-overall-LR (by eval loss) parameter-matched PreFTs and PEFTs using the Cochran–Mantel–Haenszel test (the motivation and implementation details of this test are in [appendix B](https://arxiv.org/html/2605.14217#A2)), stratified over the evaluation benchmarks, and report per-comparison statistical significance. Then, we run the two-sided Wilcoxon signed-rank test on the per-comparison pairs, ultimately finding that *there is no statistically significant difference between PreFTs and PEFTs on downstream tasks under SFT* ($p = 0.083$).
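As a sketch of this second-level test, the paired comparison can be run with scipy; the accuracy pairs below are placeholders, not the paper's numbers (those live in Table 1):

```python
from scipy.stats import wilcoxon

# Hypothetical matched per-comparison accuracies (PreFT, PEFT).
preft_scores = [61.2, 44.0, 30.5, 70.1, 52.3, 33.8]
peft_scores  = [62.0, 43.5, 31.0, 70.4, 53.1, 34.0]

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(preft_scores, peft_scores, alternative="two-sided")
print(f"W = {stat}, p = {p_value:.3f}")
```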
#### 5.2 RLVR: Math reasoning and code generation
We now continue our comparison on reinforcement learning with verifiable rewards (RLVR) tasks, focusing initially on math reasoning and code generation. We use a fork of the trl library for all reinforcement learning experiments, with support for our vLLM fork as the inference backend. We use the GRPO (Shao et al., [2024](https://arxiv.org/html/2605.14217#bib.bib36)) algorithm to train Qwen2.5-0.5B and Llama-3.1-8B on the train sets of GSM8K, MATH, and MBPP, respectively; further details are in [section A.4](https://arxiv.org/html/2605.14217#A1.SS4).
###### Results.
[Table 2](https://arxiv.org/html/2605.14217#S5.T2) reports prefill-only and all-position evaluation accuracy across GSM8K, MATH, and HumanEval on Llama-3.1-8B and Qwen2.5-0.5B. On Qwen2.5-0.5B MATH evaluations, all six prefill cells fall within 2.08 pp of their all-position counterparts, and four favour prefill (two of which are statistically significant). HumanEval results look similar to MATH results and show parity between PreFTs and PEFTs. However, the $\Delta$s are much larger across GSM8K evaluations than in the other groups; we investigate this further in [appendix G](https://arxiv.org/html/2605.14217#A7) but ultimately do not have a clear explanation. Under the two-sided Wilcoxon signed-rank test, we do find a statistically significant difference between PreFTs and PEFTs, with $p = 0.014 < 0.05$. With a mean $\Delta = -2.04$, the difference is not large but is consistently in favour of all-position PEFTs.
Table 2: PreFTs nearly match PEFTs on downstream performance under RL. We report prefill vs. all-position performance of both DiReFT and LoRA on three RL tasks, along with deltas. We find that PreFTs approach PEFT performance on MATH and HumanEval but lag on GSM8K. * indicates statistical significance.
#### 5.3 Characterising limits of prefill-only attenuation
We hypothesise that the influence of prefill-only adapters attenuates with distance: they should produce coherent output at any length but progressively lose the ability to enforce behaviour as generation continues. We therefore evaluate the extent to which the prefill-only intervention's effect has faded and the base model's original behaviour reasserted itself. LongWriter (Bai et al., [2024](https://arxiv.org/html/2605.14217#bib.bib4)) is well-suited to test this hypothesis. The benchmark prompts models to produce text at specified target lengths, ranging from a few hundred to tens of thousands of tokens, and scores each generation along two independent axes: a quality score $S_q$ assessing the produced text, and a length-following score $S_l$ measuring whether the output matched the requested length. (Precise descriptions of the scoring methods and the training setup are in [appendix F](https://arxiv.org/html/2605.14217#A6).) We train PreFTs and PEFTs with SFT at rank 16 on Llama-3.1-8B Instruct and evaluate at four output-length brackets.
###### LoRA-prefill preserves length-following; DiReFT-prefill overshoots.
In [Figure 4](https://arxiv.org/html/2605.14217#S5.F4), LoRAP does not fade with decode distance; rather, its curves run parallel to LoRA throughout, rejecting our hypothesis. DiReFTP confirms the hypothesis but fails in the *opposite* direction: rather than reverting to base behaviour at long lengths, its conditional length-matching signal collapses into an unconditional *write long* bias. We conduct ablations on LoRAP specifically in [appendix H](https://arxiv.org/html/2605.14217#A8), finding that applying it to (entire or only parts of) MLPs still maintains high length-following capability, showing that in principle per-position operations are sufficient. We leave further investigation of length collapse in DiReFTP to future work. Overall, we find that LoRAP is sufficient for maintaining high performance on this task.

Figure 4: LoRAP matches LoRA on length-following without sacrificing quality; DiReFTP does not. LongWriter on Llama-3.1-8B-Instruct (rank 16), four required output-length brackets. The prefill and all-position LoRA variants are indistinguishable on both $S_l$ (left) and $S_q$ (middle) at every bracket. DiReFTP instead overshoots the target length, writing runaway generations that saturate the 32k-token decoding cap on most ≥2k requests (right) and dragging $S_q$ below Base at long brackets. DiReFTA lifts $S_l$ over base across all brackets while preserving quality.
### 6 Discussion
###### Performance vs. throughput as a new goal for PEFTs.
Parameter-efficient finetuning research optimises performance per trainable parameter, but our results suggest this is the wrong axis for personalised serving. Parameter count is a proxy for memory footprint at training time; what matters at *inference* time is whether adapter operations land in the compute-bound prefill phase or the memory-bound decode phase. PreFTs are the first to make this distinction, and the throughput gains we measure are the result of adapter operations being placed where they cost the least.
###### Relationship with KV cache-based adaptation techniques.
Besides standard adapter-style PEFTs (e.g. LoRA) and representation-editing PEFTs (e.g. ReFT), another approach to cheap adaptation modifies the KV cache, via learnable soft prompts (Li and Liang, [2021](https://arxiv.org/html/2605.14217#bib.bib24)) or by directly learning new KV cache entries (Eyuboglu et al., [2025](https://arxiv.org/html/2605.14217#bib.bib12)). These techniques also avoid loading in adapters, at the expense of having to store and compute attention over a larger KV cache (which does add overhead at decode steps). PreFTs do not add new entries to the KV cache, though both approaches disable prefix caching across different adapters. We do not benchmark against these methods directly; based on their KV-cache overhead at decode, we expect their throughput to fall between PEFTs and PreFTs, but confirming this empirically is left to future work. These approaches are thus complementary to ours and may occupy a novel position along the accuracy–throughput tradeoff.
###### Future work.
We emphasise that LoRA and DiReFT were originally designed assuming different position applications. In this work, we evaluate the prefill-only restriction using these existing adapter architectures unchanged; novel architectures purpose-built for prefill-only adapters remain unexplored. For example, an architecture that targets attention projections to write its effect into the KV cache, or one whose parametrisation is optimised for fixed-context-length operations, could plausibly close the remaining performance gaps we observe in [section 5](https://arxiv.org/html/2605.14217#S5). We leave the design of such adapters to future work.
### 7 Conclusion
We introduce PreFT, a family of prefill-only parameter-efficient finetuning methods designed for efficient multi-adapter serving. Our throughput gains (e.g. 1.9× the throughput of multi-LoRA when serving 512 adapters on Llama-3.1-70B) come from placing adapter operations in more optimal positions rather than from kernel optimisations. Across model scales from 0.5B to 70B parameters, prefill-only adaptation matches all-position adapters on SFT downstream evaluations and approaches them on RLVR tasks, while delivering significant throughput gains. For the workloads most likely to drive personalised serving at scale (e.g. style adaptation, per-user memory), PreFT sits on a more favourable point of the accuracy–throughput frontier than existing PEFTs.
### Acknowledgements
We thank Harshit Joshi, Dilara Soylu, Yanzhe Zhang, Nikil Roashan Selvam, Jiuding Sun, and members of the Stanford #weekly-interp-meeting for helpful discussion throughout the project. We thank the Palo Alto location of Molly Tea for enabling our learning rate sweeps.
### References
- Agarwal et al. [2024] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In *The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024*. OpenReview.net, 2024. URL [https://openreview.net/forum?id=3zKtaqxLhW](https://openreview.net/forum?id=3zKtaqxLhW).
- Austin et al. [2021] Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. *CoRR*, abs/2108.07732, 2021. URL [https://arxiv.org/abs/2108.07732](https://arxiv.org/abs/2108.07732).
- Austin et al. [2025] Jacob Austin, Sholto Douglas, Roy Frostig, Anselm Levskaya, Charlie Chen, Sharad Vikram, Federico Lebron, Peter Choy, Vinay Ramasesh, Albert Webson, and Reiner Pope. *How to Scale Your Model*. Google DeepMind, 2025. URL [https://jax-ml.github.io/scaling-book/](https://jax-ml.github.io/scaling-book/).
- Bai et al. [2024] Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongWriter: Unleashing 10,000+ word generation from long context LLMs, 2024. URL [https://arxiv.org/abs/2408.07055](https://arxiv.org/abs/2408.07055).
- Balazy et al. [2025] Klaudia Balazy, Mohammadreza Banaei, Karl Aberer, and Jacek Tabor. LoRA-XS: Low-rank adaptation with extremely small number of parameters. In Inês Lynce, Nello Murano, Mauro Vallati, Serena Villata, Federico Chesani, Michela Milano, Andrea Omicini, and Mehdi Dastani, editors, *ECAI 2025 - 28th European Conference on Artificial Intelligence, 25-30 October 2025, Bologna, Italy - Including 14th Conference on Prestigious Applications of Intelligent Systems (PAIS 2025)*, Frontiers in Artificial Intelligence and Applications, pages 3194–3201. IOS Press, 2025. doi: 10.3233/FAIA251185. URL [https://doi.org/10.3233/FAIA251185](https://doi.org/10.3233/FAIA251185).
- Chen et al. [2024] Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica: Multi-tenant LoRA serving. In Phillip B. Gibbons, Gennady Pekhimenko, and Christopher De Sa, editors, *Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024*. mlsys.org, 2024. URL [https://proceedings.mlsys.org/paper_files/paper/2024/hash/054de805fcceb78a201f5e9d53c85908-Abstract-Conference.html](https://proceedings.mlsys.org/paper_files/paper/2024/hash/054de805fcceb78a201f5e9d53c85908-Abstract-Conference.html).
- Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. *CoRR*, abs/2107.03374, 2021. URL [https://arxiv.org/abs/2107.03374](https://arxiv.org/abs/2107.03374).
- Chen et al. [2026] Nan Chen, Soledad Villar, and Soufiane Hayou. Learning rate scaling across LoRA ranks and transfer to full finetuning. *arXiv:2602.06204*, 2026. URL [https://arxiv.org/abs/2602.06204](https://arxiv.org/abs/2602.06204).
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. *CoRR*, abs/2110.14168, 2021. URL [https://arxiv.org/abs/2110.14168](https://arxiv.org/abs/2110.14168).
- Cochran [1954] William G. Cochran. Some methods for strengthening the common $\chi^2$ tests. *Biometrics*, 10(4):417–451, 1954. URL [https://www.jstor.org/stable/3001616](https://www.jstor.org/stable/3001616).
- Dietz et al. [2026] Florian Dietz, William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, and Dietrich Klakow. Split personality training: Revealing latent knowledge through alternate personalities. *arXiv:2602.05532*, 2026. URL [https://arxiv.org/abs/2602.05532](https://arxiv.org/abs/2602.05532).
- Eyuboglu et al. [2025] Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, and Christopher Re. Cartridges: Lightweight and general-purpose long context representations via self-study. *arXiv:2506.06266*, 2025. URL [https://arxiv.org/abs/2506.06266](https://arxiv.org/abs/2506.06266).
- Gabrielsson et al. [2025] Rickard Brüel Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan H. Greenewald, Mikhail Yurochkin, and Justin Solomon. Compress then serve: Serving thousands of LoRA adapters with little overhead. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, *Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025*, Proceedings of Machine Learning Research. PMLR / OpenReview.net, 2025. URL [https://proceedings.mlr.press/v267/gabrielsson25a.html](https://proceedings.mlr.press/v267/gabrielsson25a.html).
- Greenewald et al. [2025] Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, and David Cox. Activated LoRA: Fine-tuned LLMs for intrinsics. *arXiv:2504.12397*, 2025. URL [https://arxiv.org/abs/2504.12397](https://arxiv.org/abs/2504.12397).
- Guha et al. [2025] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, and Ludwig Schmidt. OpenThoughts: Data recipes for reasoning models. *arXiv:2506.04178*, 2025. URL [https://arxiv.org/abs/2506.04178](https://arxiv.org/abs/2506.04178).
- Hendrycks et al. [2021a] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In *9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021*. OpenReview.net, 2021a. URL [https://openreview.net/forum?id=d7KBjmI3GmQ](https://openreview.net/forum?id=d7KBjmI3GmQ).
- Hendrycks et al. [2021b] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. *CoRR*, abs/2103.03874, 2021b. URL [https://arxiv.org/abs/2103.03874](https://arxiv.org/abs/2103.03874).
- Houlsby et al. [2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, *Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA*, volume 97 of *Proceedings of Machine Learning Research*, pages 2790–2799. PMLR, 2019. URL [http://proceedings.mlr.press/v97/houlsby19a.html](http://proceedings.mlr.press/v97/houlsby19a.html).
- Hu et al. [2022] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In *The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022*. OpenReview.net, 2022. URL [https://openreview.net/forum?id=nZeVKeeFYf9](https://openreview.net/forum?id=nZeVKeeFYf9).
- Kalajdzievski [2023] Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with LoRA. *arXiv:2312.03732*, 2023. URL [https://arxiv.org/abs/2312.03732](https://arxiv.org/abs/2312.03732).
- Kopiczko et al. [2024] Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. VeRA: Vector-based random matrix adaptation. In *The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024*. OpenReview.net, 2024. URL [https://openreview.net/forum?id=NjNfLdxr3A](https://openreview.net/forum?id=NjNfLdxr3A).
- Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors, *Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023*, pages 611–626. ACM, 2023. doi: 10.1145/3600006.3613165. URL [https://doi.org/10.1145/3600006.3613165](https://doi.org/10.1145/3600006.3613165).
- Lambert et al. [2025] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. Tulu 3: Pushing frontiers in open language model post-training. *arXiv:2411.15124*, 2025. URL [https://arxiv.org/abs/2411.15124](https://arxiv.org/abs/2411.15124).
- Li and Liang [2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)*, pages 4582–4597, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL [https://aclanthology.org/2021.acl-long.353/](https://aclanthology.org/2021.acl-long.353/).
- Liu et al. [2024] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, *Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024*, Proceedings of Machine Learning Research, pages 32100–32121. PMLR / OpenReview.net, 2024. URL [https://proceedings.mlr.press/v235/liu24bn.html](https://proceedings.mlr.press/v235/liu24bn.html).
- Lu and Thinking Machines Lab [2025] Kevin Lu and Thinking Machines Lab. On-policy distillation. *Thinking Machines Lab: Connectionism*, 2025. URL [https://thinkingmachines.ai/blog/on-policy-distillation](https://thinkingmachines.ai/blog/on-policy-distillation).
- Mantel and Haenszel [1959] Nathan Mantel and William Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. *Journal of the National Cancer Institute*, 22(4):719–748, 1959. URL [https://academic.oup.com/jnci/article-abstract/22/4/719/900746](https://academic.oup.com/jnci/article-abstract/22/4/719/900746).
- McNemar [1947] Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. *Psychometrika*, 12(2):153–157, 1947.
- Morris et al. [2026] John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, and Saeed Mahloujifar. Learning to reason in 13 parameters. *arXiv:2602.04118*, 2026. URL [https://arxiv.org/abs/2602.04118](https://arxiv.org/abs/2602.04118).
- Ni et al. [2026] Yinan Ni, Xiao Yang, Yuqi Tang, Zhimin Qiu, Chen Wang, and Tingzhou Yuan. Predictive-LoRA: A proactive and fragmentation-aware serverless inference system for LLMs. In *Proceedings of the 2025 6th International Conference on Computer Science and Management Technology*, ICCSMT '25, pages 1267–1273, New York, NY, USA, 2026. Association for Computing Machinery. ISBN 9798400719981. doi: 10.1145/3795154.3795359. URL [https://doi.org/10.1145/3795154.3795359](https://doi.org/10.1145/3795154.3795359).
- Pfeiffer et al. [2020] Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pages 7654–7673, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.617. URL [https://aclanthology.org/2020.emnlp-main.617/](https://aclanthology.org/2020.emnlp-main.617/).
- Potdar et al. [2025] Nihal Potdar, Megha Agarwal, Hanlin Tang, Asfandyar Qureshi, Qi Zheng, Daya Khudia, Tianrun Li, Nikunj Gupta, and James Thomas. Fast PEFT serving at scale, 2025. URL [https://www.databricks.com/blog/fast-peft-serving-scale](https://www.databricks.com/blog/fast-peft-serving-scale).
- Qin et al. [2025] Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A KVCache-centric disaggregated architecture for LLM serving. *arXiv:2407.00079*, 2025. URL [https://arxiv.org/abs/2407.00079](https://arxiv.org/abs/2407.00079).
- Qin et al. [2026] Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, and Mingxing Zhang. Prefill-as-a-service: KVCache of next-generation models could go cross-datacenter. *arXiv:2604.15039*, 2026. URL [https://arxiv.org/abs/2604.15039](https://arxiv.org/abs/2604.15039).
- Schulman and Thinking Machines Lab [2025] John Schulman and Thinking Machines Lab. LoRA without regret. *Thinking Machines Lab: Connectionism*, 2025. doi: 10.64434/tml.20250929. URL [https://thinkingmachines.ai/blog/lora/](https://thinkingmachines.ai/blog/lora/).
- Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. *arXiv:2402.03300*, 2024. URL [https://arxiv.org/abs/2402.03300](https://arxiv.org/abs/2402.03300).
- Sheng et al. [2024] Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. S-LoRA: Serving thousands of concurrent LoRA adapters. *arXiv:2311.03285*, 2024. URL [https://arxiv.org/abs/2311.03285](https://arxiv.org/abs/2311.03285).
- Wang et al. [2025] Barry Wang, Avi Schwarzschild, Alexander Robey, Ali Payani, Charles Fleming, Mingjie Sun, and Daphne Ippolito. Command-V: Pasting LLM behaviors via activation profiles. *arXiv:2506.19140*, 2025. URL [https://arxiv.org/abs/2506.19140](https://arxiv.org/abs/2506.19140).
- Wu et al. [2024a] Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving. In Ada Gavrilovska and Douglas B. Terry, editors, *18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024*, pages 911–927. USENIX Association, 2024a. URL [https://www.usenix.org/conference/osdi24/presentation/wu-bingyang](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang).
- Wu et al. [2024b] Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, *Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10-15, 2024*, 2024b. URL [http://papers.nips.cc/paper_files/paper/2024/hash/75008a0fba53bf13b0bb3b7bff986e0e-Abstract-Conference.html](http://papers.nips.cc/paper_files/paper/2024/hash/75008a0fba53bf13b0bb3b7bff986e0e-Abstract-Conference.html).
- Zeng et al. [2025] Shenglai Zeng, Pengfei He, Kai Guo, Tianqi Zheng, Hanqing Lu, Yue Xing, and Hui Liu. Towards context-robust LLMs: A gated representation fine-tuning approach. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 10262–10276, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.506. URL [https://aclanthology.org/2025.acl-long.506/](https://aclanthology.org/2025.acl-long.506/).
- Zhang et al. [2025] Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, and Minyi Guo. Improving the serving performance of multi-LoRA large language models via efficient LoRA and KV cache management. *arXiv:2505.03756*, 2025. URL [https://arxiv.org/abs/2505.03756](https://arxiv.org/abs/2505.03756).
- Zhong et al. [2024] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Ada Gavrilovska and Douglas B. Terry, editors, *18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024*, pages 193–210. USENIX Association, 2024. URL [https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin).
- Zhou et al. [2023] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. *arXiv:2311.07911*, 2023. URL [https://arxiv.org/abs/2311.07911](https://arxiv.org/abs/2311.07911).
- Zhu et al. [2025] Ruidong Zhu, Ziyue Jiang, Zhi Zhang, Xin Liu, Xuanzhe Liu, and Xin Jin. Cannikin: No lagger of SLO in concurrent multiple LoRA LLM serving. *IEEE Trans. Parallel Distributed Syst.*, 36(9):1972–1984, 2025. doi: 10.1109/TPDS.2025.3590014. URL [https://doi.org/10.1109/TPDS.2025.3590014](https://doi.org/10.1109/TPDS.2025.3590014).
- Zhu et al. [2026] Zhanda Zhu, Qidong Su, Yaoyao Ding, Kevin Song, Shang Wang, and Gennady Pekhimenko. LoRAFusion: Efficient LoRA fine-tuning for LLMs. In Antonio Barbalace, Luo Mai, Roxana Geambasu, and Peter R. Pietzuch, editors, *Proceedings of the 21st European Conference on Computer Systems, EuroSys 2026, McEwan Hall/The University of Edinburgh, Edinburgh, Scotland, UK, April 27-30, 2026*, pages 586–604. ACM, 2026. doi: 10.1145/3767295.3769331. URL [https://doi.org/10.1145/3767295.3769331](https://doi.org/10.1145/3767295.3769331).
## Appendix
### Appendix A: Implementation details
#### A.1 LoRA
We use the peft library for our LoRA finetuning. We follow the parametrisation of LoRA used in Schulman and Thinking Machines Lab [[2025](https://arxiv.org/html/2605.14217#bib.bib35)]: we set the scaling prefactor to $\alpha/r$ with $\alpha := 32$, use Kaiming uniform initialisation for $\mathbf{A}$, and initialise $\mathbf{B}$ as zeroes. Unless otherwise noted, we apply LoRA to the attention projections $\{W_q, W_k, W_v, W_o\}$ and MLP projections $\{W_{\text{gate}}, W_{\text{up}}, W_{\text{down}}\}$ at all layers. We use a LoRA dropout of $0$ throughout, because in RL a non-zero dropout desyncs the training and rollout policies: vLLM does not see the dropout mask. For LoRA^P, we use the same parametrisation and initialisation as standard LoRA; the prefill-only restriction is enforced by position-routing logic. The implementation differs by training setup: the reinforcement learning path is described in [appendix D](https://arxiv.org/html/2605.14217#A4), and the SFT path in [section A.3](https://arxiv.org/html/2605.14217#A1.SS3).
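For concreteness, here is a minimal sketch of this parametrisation (our illustration, not the peft internals; the class name and constructor arguments are ours):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a rank-r LoRA delta scaled by alpha/r."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 32.0):
        super().__init__()
        self.base = base.requires_grad_(False)            # frozen W
        self.A = nn.Parameter(torch.empty(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # Kaiming uniform A
        self.scaling = alpha / rank                       # alpha := 32
        # B = 0 gives zero-delta initialisation: output unchanged at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T
```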
#### A.2 ReFT
We use a refactored version of the pyreft library, documented in [section D.1](https://arxiv.org/html/2605.14217#A4.SS1). Following the parametrisation we develop in [section C.1](https://arxiv.org/html/2605.14217#A3.SS1), we initialise $\mathbf{B}$ as a random matrix with orthonormal rows, and $\mathbf{A}$ and $\mathbf{b}$ as zeroes; this gives zero-delta initialisation (the untrained adapter applies the identity). We apply the scaling prefactor $s_r = 1/\sqrt{r}$ from [section C.2](https://arxiv.org/html/2605.14217#A3.SS2) to enable learning rate transfer across ranks. We always intervene on the residual stream at every layer, post-MLP. For prefill-only DiReFT^P, we use the same parametrisation and initialisation; only the position set $\mathcal{P}$ described in [section 3](https://arxiv.org/html/2605.14217#S3) differs. As with LoRA, the prefill-only restriction at training time is documented in [appendix D](https://arxiv.org/html/2605.14217#A4) for reinforcement learning experiments.
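A minimal sketch of this initialisation (our illustration; the class name and argument names are assumptions):

```python
import torch
import torch.nn as nn

class DiReFTIntervention(nn.Module):
    """Residual-stream intervention h -> h + s_r * B^T (A h + b)."""
    def __init__(self, hidden: int, rank: int):
        super().__init__()
        # B has orthonormal rows at init (no constraint afterwards);
        # A and b start at zero, so the untrained adapter is the identity.
        q, _ = torch.linalg.qr(torch.randn(hidden, rank))
        self.B = nn.Parameter(q.T.contiguous())           # (rank, hidden)
        self.A = nn.Parameter(torch.zeros(rank, hidden))
        self.b = nn.Parameter(torch.zeros(rank))
        self.s_r = rank ** -0.5                           # s_r = 1/sqrt(r)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.s_r * (h @ self.A.T + self.b) @ self.B
```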
#### A.3 Supervised finetuning
##### A.3.1 Infrastructure
For SFT, we use a patched Hugging Face `Trainer` class. Unless overridden by a specific experiment, all SFT runs share the following settings:
- **Optimiser.** AdamW with $(\beta_1, \beta_2) = (0.9, 0.999)$ and weight decay $0$.
- **Schedule.** Constant learning rate. We use no warmup for the reasoning sweeps (GSM8K, MATH) and a warmup ratio of $0.1$ for the long-context (LongBench-Write) sweeps.
- **Batch size.** $32$ sequences per optimiser step; this is achieved as per-device batch size $2$ with $16$ gradient accumulation steps on a single GPU, or per-device batch size $1$ with $16$ gradient accumulation steps on two GPUs.
- **Mixed precision.** bf16 throughout; we never run ReFT in fp16.
- **Sequence length.** $2{,}048$ tokens for reasoning, $32{,}768$ tokens for LongBench-Write.
- **Kernels.** Flash-Attention 2 and the Liger fused linear–cross-entropy kernel are enabled for every run; gradient checkpointing is enabled for all 8B runs and disabled at smaller scales.
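These shared settings correspond roughly to the following `transformers.TrainingArguments` (a sketch under the assumptions above; single-GPU reasoning-sweep values shown, and `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sft-run",
    optim="adamw_torch",             # AdamW; betas default to (0.9, 0.999)
    weight_decay=0.0,
    lr_scheduler_type="constant",    # constant LR
    warmup_ratio=0.0,                # 0.1 for the LongBench-Write sweeps
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # 2 x 16 = 32 sequences per step
    bf16=True,                       # never fp16 for ReFT
    gradient_checkpointing=True,     # 8B runs only
)
```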
##### A.3.2 Datasets
For Tülu-3 SFT, we evaluate on the following relatively simple (and thus suitable for our model scales) downstream tasks, which test the ability to follow instructions along with knowledge and reasoning:
- **GSM8K** [Cobbe et al., [2021](https://arxiv.org/html/2605.14217#bib.bib9)] consists of grade-school math word problems and usually requires multi-step reasoning to solve.
- **MMLU** [Hendrycks et al., [2021a](https://arxiv.org/html/2605.14217#bib.bib16)] contains several domains of multiple-choice world knowledge and problem-solving questions.
- **IFEval** [Zhou et al., [2023](https://arxiv.org/html/2605.14217#bib.bib44)] contains a variety of verifiable constraint-satisfying tasks.
#### A.4 Reinforcement learning with verifiable rewards
###### Learning algorithm.
We use the standard GRPO [Shao et al., [2024](https://arxiv.org/html/2605.14217#bib.bib36)] algorithm for RLVR, as implemented in trl. We sample $K = 4$ completions per group with temperature $1.0$ and top-$p = 1.0$, use a per-device batch size of $4$ prompts with $8$ gradient accumulation steps for an effective $32$ prompts per optimiser step, train for one epoch with a warmup ratio of $0.1$, and do not use a KL penalty (i.e. $\beta = 0$). Maximum prompt length is $512$ tokens for GSM8K and $2{,}048$ tokens for MATH; maximum completion length is $2{,}048$ tokens for GSM8K and $4{,}096$ for MATH.
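In trl this corresponds roughly to the following `GRPOConfig` (a sketch; field names follow current trl and may differ across versions, and `output_dir` is a placeholder):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-gsm8k",
    num_generations=4,              # K = 4 completions per group
    temperature=1.0,
    top_p=1.0,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective 32 prompts per step
    num_train_epochs=1,
    warmup_ratio=0.1,
    beta=0.0,                       # no KL penalty
    max_prompt_length=512,          # 2,048 for MATH
    max_completion_length=2048,     # 4,096 for MATH
)
```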
##### A.4.1 Reward function
Across all three tasks we use a binary accuracy reward of $1.0$ for correct answers and $0.0$ otherwise, with no format reward, intermediate reasoning credit, or length shaping. Train-time and eval-time scoring share the same code path, so reward magnitudes correspond directly to test accuracy. We describe the task-specific answer extraction for each dataset below.
###### MATH.
We extract the predicted answer with the math_verify library, which anchors on the last `\boxed{...}` in the completion and compares it to the gold answer for symbolic equivalence (which handles LaTeX formatting). On math_verify parse failures we fall back to a legacy `\boxed{...}`-aware extractor combined with a string match that strips `$`, commas, and whitespace.
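A minimal sketch of the primary path (our code, assuming the `parse`/`verify` entry points of the math_verify library; the legacy fallback extractor is omitted):

```python
from math_verify import parse, verify

def math_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 iff the last boxed answer matches the gold answer."""
    try:
        gold = parse(gold_answer)   # e.g. "\\boxed{\\frac{1}{2}}"
        pred = parse(completion)    # anchors on the last \boxed{...}
        return 1.0 if verify(gold, pred) else 0.0
    except Exception:
        return 0.0                  # parse failure: fall through (fallback omitted)
```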
###### GSM8K.
GSM8K answers are typically integers or simple decimals, so we use the canonical extractor: split on the `####` delimiter and take the first numeric match after it, falling back to the last number in the completion when the delimiter is absent.
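A sketch of this extractor (our implementation of the rule just described):

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*\.?\d*")

def extract_gsm8k_answer(completion: str) -> str | None:
    """First number after '####' if present, else the last number overall."""
    _, sep, tail = completion.partition("####")
    if sep:                                   # delimiter present
        match = NUMBER.search(tail)
        if match:
            return match.group().replace(",", "")
    matches = NUMBER.findall(completion)      # fallback: last number anywhere
    return matches[-1].replace(",", "") if matches else None
```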
###### HumanEval.
Here, the reward is unit-test based: we splice the model's completion into the function signature provided in the prompt, execute the resulting program against the task's hidden test cases in a sandboxed subprocess, and award $1.0$ reward if every test passes. Syntax errors, runtime exceptions, and timeouts all return $0.0$ reward. There is no partial credit for "code that compiles".
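A simplified sketch of such a unit-test reward (our code; a production harness needs stronger sandboxing than a bare subprocess):

```python
import subprocess
import sys
import tempfile

def humaneval_reward(prompt: str, completion: str, test_code: str,
                     timeout_s: float = 10.0) -> float:
    """1.0 iff the spliced program passes all asserts in a subprocess."""
    program = prompt + completion + "\n" + test_code  # signature + body + tests
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        # non-zero exit covers syntax errors, exceptions, failed asserts
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```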
#### A.5 Inference
For all benchmarking and serving experiments, we use a fork of vllm with multi-LoRA and multi-ReFT support. Implementation details of the position-mask buffer, CUDA-graph integration, and the weight-sync RPC for on-policy RL are in [sections D.2](https://arxiv.org/html/2605.14217#A4.SS2), [D.3](https://arxiv.org/html/2605.14217#A4.SS3) and [D.4](https://arxiv.org/html/2605.14217#A4.SS4).
#### A.6 Evaluation infrastructure
After training, we evaluate with greedy sampling (temperature $0.0$), generating a single completion per evaluation prompt and computing rewards identically to training. We evaluate every checkpoint on the full test split rather than a held-out subsample, except where noted in the experiment-specific appendices.
#### A.7 Benchmarking infrastructure
For all serving benchmarks, we replicate the synthetic workload introduced by Punica [Chen et al., [2024](https://arxiv.org/html/2605.14217#bib.bib6)]: prompt and output lengths are drawn from the lognormal/uniform distribution they propose, and each request is tagged with one of $N$ adapters under a chosen popularity mix (uniform unless noted, with skewed and identical mixes used in ablations). We drive our vllm fork directly so we can attribute wall time to the prefill and decode phases separately, admitting new requests into the engine one at a time up to a fixed concurrency $B$. Each run is preceded by a warmup pass over the same workload to absorb CUDA-graph capture and any one-off setup costs, so the timed numbers reflect steady-state serving rather than start-up overhead. We report per-token prefill and decode latency (median, with $p_{90}$ and $p_{99}$ tails) and end-to-end token throughput, all measured in bf16. The full workload, engine configuration, and per-figure sweep settings are documented in [appendix E](https://arxiv.org/html/2605.14217#A5).
#### A.8 Hardware
SFT and RLVR experiments on Llama-3.2-1B-Instruct and Qwen2.5-0.5B were run on a single H100 80GB GPU per job (training and the colocated vLLM engine share the GPU; vLLM is allocated $40\%$ of GPU memory and the trainer takes the remainder). Llama-3.1-8B RLVR runs use four H100 80GBs per job: vLLM serves the policy on two GPUs with tensor parallelism (TP=2) over an HTTP/NCCL bridge, and the trainer runs on the other two GPUs under DDP (per-device batch size $4$, gradient accumulation $8$, effective $32$ prompts per step over the two data-parallel ranks). Llama-3.1-8B SFT runs use two H100s with FSDP `FULL_SHARD` and transformer-based auto-wrap. Multi-GPU benchmarks for Llama-3.1-70B and Qwen3-30B-A3B use four H100s with tensor parallelism, as described in [appendix E](https://arxiv.org/html/2605.14217#A5).
### Appendix B: Statistical testing
###### CMH test.
To compare PreFTs and PEFTs that are the same architecture and parameter count, trained on the same dataset, applied to the same model, and evaluated on a set of $b$ downstream benchmarks, we use the Cochran–Mantel–Haenszel (CMH) test [Cochran, [1954](https://arxiv.org/html/2605.14217#bib.bib10), Mantel and Haenszel, [1959](https://arxiv.org/html/2605.14217#bib.bib27)].
McNemar's test [McNemar, [1947](https://arxiv.org/html/2605.14217#bib.bib28)] is a nonparametric test used to determine whether two binary outcomes (binary correctness on the same benchmark question, in our case) differ in their marginal proportions. CMH is a generalisation of McNemar's test to $K$ strata of such binary-outcome paired data (in our case, multiple benchmarks), which controls for stratum before testing for association.
###### Wilcoxon signed-rank test.
This is a standard nonparametric test for comparing paired differences between two populations, and thus suitable for our prefill vs. all-position comparisons in aggregate.
### Appendix C: ReFT without Regret
Two useful properties of LoRA make learning predictable: zero-delta initialisation (applying an untrained LoRA does not change the model output) and learning rate transfer across ranks. These were not previously established for ReFT, so we design an initialisation and parametrisation of both DiReFT and LoReFT with these properties, which we use in all our experiments.
#### C.1 Zero-delta initialisation
The original implementation of ReFT uses default Kaiming uniform initialisation for the weight matrices, and an orthonormal Householder parametrisation for $\mathbf{R}$. This means that at initialisation, ReFT applies a non-identity function and thus immediately displaces the model from its original behaviour; early training is wasted on recovering baseline performance.
As with LoRA, we therefore seek an initialisation under which both DiReFT and LoReFT apply the identity function. For LoReFT we set $\mathbf{W}^{(0)} := \mathbf{R}^{(0)}$ and $\mathbf{b} := \mathbf{0}$; for DiReFT we set $\mathbf{B}$ to have orthonormal rows at initialisation (with no constraint afterwards), $\mathbf{A} := \mathbf{0}$, and $\mathbf{b} := \mathbf{0}$.
Figure 5: SFT of Llama 3.2 1B Instruct and Llama 3.1 8B Instruct on Tülu-3, comparing old and new parametrisations of DiReFT and LoReFT across varying LR and rank.
#### C.2 Learning rate transfer
A useful property of LoRA is that initialisation can be set such that the optimal LR (approximately) transfers across ranks $r$, reducing the needed hyperparameter sweeps. This is achieved by applying a scaling prefactor $s_r \in \{1, 1/r, 1/\sqrt{r}\}$ on the LoRA, chosen depending on details of initialisation. (For LoRA, Hu et al. [[2022](https://arxiv.org/html/2605.14217#bib.bib19)] originally use $1/r$ as the scaling prefactor, adopted by Schulman and Thinking Machines Lab [[2025](https://arxiv.org/html/2605.14217#bib.bib35)] with the observation that the LR transfer is imperfect; Kalajdzievski [[2023](https://arxiv.org/html/2605.14217#bib.bib20)] later proposes $1/\sqrt{r}$. Chen et al. [[2026](https://arxiv.org/html/2605.14217#bib.bib8)] show that the appropriate prefactor depends on initialisation.) We introduce a scaling prefactor $s_r$ for ReFTs:

$$\Phi_{\mathsf{LoReFT}}(\mathbf{h}) = \mathbf{h} + s_r\,\mathbf{R}^\top(\mathbf{W}\mathbf{h} + \mathbf{b} - \mathbf{R}\mathbf{h}) \tag{6}$$

$$\Phi_{\mathsf{DiReFT}}(\mathbf{h}) = \mathbf{h} + s_r\,\mathbf{B}^\top(\mathbf{A}\mathbf{h} + \mathbf{b}) \tag{7}$$

To achieve LR transfer across ranks, we propose setting $s_r$ as below:
###### Theorem 1.
Given the following expression for DiReFT, with scaling prefactor $s_r$ a constant:

$$\delta(\mathbf{h}) = s_r\,\mathbf{B}^\top(\mathbf{A}\mathbf{h} + \mathbf{b}) \tag{8}$$

to ensure that the $\ell_2$-norm of the delta $\lVert\delta(\mathbf{h})\rVert$ at the first step of optimisation under Adam is invariant to the DiReFT or LoReFT rank $r$, the scaling prefactor ought to be:

$$\boxed{s_r \propto \frac{1}{\sqrt{r}}}$$
###### Proof.
Consider DiReFT at the first optimiser step under Adam. We have:

$$\delta(\mathbf{h}) = s_r\,\mathbf{B}^\top(\mathbf{A}\mathbf{h} + \mathbf{b}) \tag{9}$$

where $\mathbf{A}^{(0)} = \mathbf{0}$ and $\mathbf{b}^{(0)} = \mathbf{0}$ at initialisation, and $\mathbf{B}^{(0)}$ has orthonormal rows.

Under Adam, given a loss function $\mathcal{L}$, the first update to a weight $\mathbf{W}$ is:

$$\Delta\mathbf{W}^{(0)} = -\eta\,\frac{\nabla_{\mathbf{W}}\mathcal{L}}{\lvert\nabla_{\mathbf{W}}\mathcal{L}\rvert + \epsilon} \tag{10}$$

For notation, we now define:

$$\mathbf{g} = \nabla_{\delta}\mathcal{L}\,\big|_{\delta=0} \tag{11}$$

$$\mathbf{z}^{(i)} = \mathbf{B}^{(i)}\mathbf{g} \tag{12}$$

$$\mathbf{a}^{(i)} = \mathbf{A}^{(i)}\mathbf{h} + \mathbf{b}^{(i)} \tag{13}$$

Now, after the first optimisation step under Adam, before which $\delta^{(0)} = 0$ due to initialisation, we have:

$$\mathbf{A}^{(1)} = \mathbf{A}^{(0)} - \eta\,\frac{\nabla_{\mathbf{A}}\mathcal{L}}{\lvert\nabla_{\mathbf{A}}\mathcal{L}\rvert + \epsilon} = -\eta\,\frac{s_r\,\mathbf{z}^{(0)}\mathbf{h}^\top}{s_r\lvert\mathbf{z}^{(0)}\mathbf{h}^\top\rvert + \epsilon} \tag{14}$$

$$\mathbf{b}^{(1)} = \mathbf{b}^{(0)} - \eta\,\frac{\nabla_{\mathbf{b}}\mathcal{L}}{\lvert\nabla_{\mathbf{b}}\mathcal{L}\rvert + \epsilon} = -\eta\,\frac{s_r\,\mathbf{z}^{(0)}}{s_r\lvert\mathbf{z}^{(0)}\rvert + \epsilon} \tag{15}$$

$$\mathbf{B}^{(1)} = \mathbf{B}^{(0)} - \eta\,\frac{\nabla_{\mathbf{B}}\mathcal{L}}{\lvert\nabla_{\mathbf{B}}\mathcal{L}\rvert + \epsilon} = \mathbf{B}^{(0)} \tag{16}$$

(The last equality holds because $\nabla_{\mathbf{B}}\mathcal{L} = s_r\,\mathbf{a}^{(0)}\mathbf{g}^\top = \mathbf{0}$ at initialisation.) This means that the delta after the first step is:

$$\delta^{(1)}(\mathbf{h}) = s_r\,\mathbf{B}^{(0)\top}\left(-\eta\,\frac{s_r\,\mathbf{z}^{(0)}\mathbf{h}^\top}{s_r\lvert\mathbf{z}^{(0)}\mathbf{h}^\top\rvert + \epsilon}\,\mathbf{h} - \eta\,\frac{s_r\,\mathbf{z}^{(0)}}{s_r\lvert\mathbf{z}^{(0)}\rvert + \epsilon}\right) \tag{17}$$

We want to compute the $\ell_2$-norm of this expression. To simplify, we focus on the similarity scores $\mathbf{a}^{(i)} \in \mathbb{R}^r$. Coordinatewise, we have:

$$\mathbf{a}_i^{(1)} = (\mathbf{A}^{(1)}\mathbf{h})_i + \mathbf{b}_i^{(1)} \tag{18}$$

$$= \left(\sum_{j=1}^{d} \mathbf{A}_{ij}^{(1)}\mathbf{h}_j\right) + \mathbf{b}_i^{(1)} \tag{19}$$

$$= \left(\sum_{j=1}^{d} -\eta\,\frac{s_r\,\mathbf{z}_i^{(0)}\mathbf{h}_j^2}{s_r\lvert\mathbf{z}_i^{(0)}\mathbf{h}_j\rvert + \epsilon}\right) - \eta\,\frac{s_r\,\mathbf{z}_i^{(0)}}{s_r\lvert\mathbf{z}_i^{(0)}\rvert + \epsilon} \tag{20}$$

$$= -\eta\,s_r\,\mathbf{z}_i^{(0)}\left(\left(\sum_{j=1}^{d} \frac{\mathbf{h}_j^2}{s_r\lvert\mathbf{z}_i^{(0)}\mathbf{h}_j\rvert + \epsilon}\right) + \frac{1}{s_r\lvert\mathbf{z}_i^{(0)}\rvert + \epsilon}\right) \tag{21}$$

Now we assume that $\epsilon$ is negligible in the denominators, so the update reduces to SignSGD:

$$\mathbf{a}_i^{(1)} \approx -\eta\,s_r\,\mathbf{z}_i^{(0)}\left(\left(\sum_{j=1}^{d} \frac{\mathbf{h}_j^2}{s_r\lvert\mathbf{z}_i^{(0)}\mathbf{h}_j\rvert}\right) + \frac{1}{s_r\lvert\mathbf{z}_i^{(0)}\rvert}\right) \tag{23}$$

$$= -\eta\,\operatorname{sign}(\mathbf{z}_i^{(0)})\left(1 + \sum_{j=1}^{d}\lvert\mathbf{h}_j\rvert\right) \tag{24}$$

$$= -\eta\,\operatorname{sign}(\mathbf{z}_i^{(0)})\left(1 + \lVert\mathbf{h}\rVert_1\right) \tag{25}$$

Therefore, the $\ell_2$-norm of $\mathbf{a}^{(1)}$ is:

$$\lVert\mathbf{a}^{(1)}\rVert_2 = \eta\,(\lVert\mathbf{h}\rVert_1 + 1)\sqrt{r} \tag{26}$$

Using this and the fact that the rows of $\mathbf{B}^{(0)}$ are orthonormal at initialisation (and therefore $\mathbf{B}^{(0)\top}$ is an isometry on $\mathbb{R}^r$), we can compute the $\ell_2$-norm of the delta:

$$\lVert\delta^{(1)}(\mathbf{h})\rVert_2 = s_r\,\lVert\mathbf{B}^{(0)\top}\mathbf{a}^{(1)}\rVert_2 = s_r\,\lVert\mathbf{a}^{(1)}\rVert_2 = s_r\,\eta\,(\lVert\mathbf{h}\rVert_1 + 1)\sqrt{r} \tag{27–29}$$

Therefore, to make the norm of the first delta invariant to rank (modulo the assumption that the Adam epsilon is relatively small), we should set

$$\boxed{s_r \propto \frac{1}{\sqrt{r}}} \tag{30}$$

∎
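The theorem is easy to check numerically. The sketch below (our code, not from the paper) linearises the loss so that its gradient with respect to the delta is a fixed vector $\mathbf{g}$, takes one Adam step, and prints the delta norm, which should be roughly $\eta(1 + \lVert\mathbf{h}\rVert_1)$ for every rank when $s_r = 1/\sqrt{r}$:

```python
import torch

# Numeric check that s_r = 1/sqrt(r) makes the norm of the DiReFT delta
# after one Adam step roughly invariant to rank r.
torch.manual_seed(0)
d, eta = 64, 1e-3
h = torch.randn(d)
g = torch.randn(d)  # stand-in for the gradient of the loss w.r.t. delta

for r in (1, 4, 16, 64):
    s_r = r ** -0.5
    q, _ = torch.linalg.qr(torch.randn(d, r))
    B = q.T.clone().requires_grad_(True)       # orthonormal rows (r, d)
    A = torch.zeros(r, d, requires_grad=True)  # zero-delta initialisation
    b = torch.zeros(r, requires_grad=True)
    opt = torch.optim.Adam([A, b, B], lr=eta)
    loss = (s_r * B.T @ (A @ h + b)) @ g       # linearised loss: grad_delta = g
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta = s_r * B.T @ (A @ h + b)
        print(r, round(delta.norm().item(), 4))  # ~eta * (1 + ||h||_1), all r
```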
###### Validation on SFT.
To confirm that our changes enable optimal LR transfer across ranks for both DiReFT and LoReFT, we use the SFT evaluation from Schulman and Thinking Machines Lab [[2025](https://arxiv.org/html/2605.14217#bib.bib35)]: comparing DiReFT and LoReFT under both the original parametrisation and our new initialisation and scaling prefactor, we finetune Llama 3.2 1B Instruct and Llama 3.1 8B Instruct on the Tülu-3 instruction-tuning dataset [Lambert et al., [2025](https://arxiv.org/html/2605.14217#bib.bib23)]. We intervene on all positions (including decode steps) to ensure a fair comparison to LoRA. (Wu et al. [[2024b](https://arxiv.org/html/2605.14217#bib.bib40)] only intervene on some token positions in prefill; this is a useful property of ReFTs but artificially limits their learning capacity, making comparison to LoRA more difficult since LoRA modifies all steps.) We train on $50{,}000$ examples and evaluate language modelling loss on a held-out set of $500$ samples. We truncate sequences to $2{,}048$ tokens. We sweep learning rates in $\{10^{-5}, \ldots, 10^{-2}\}$ and ranks in $\{1, 2, 4, 8, 16, 32\}$. We train with the Adam optimiser, with an effective batch size of $32$ (batch size $2$ with $16$ gradient accumulation steps).
Our results in [Figure 5](https://arxiv.org/html/2605.14217#A3.F5) confirm that the optimal LR near-perfectly transfers across ranks with the new parametrisation ($10^{-3}$ generally, $3 \cdot 10^{-4}$ for DiReFT on 8B) but not the old one; our changes also fix a pathology in the old LoReFT where scaling rank did *not* monotonically improve performance, probably because the lack of zero-delta initialisation corrupts the model's starting performance.
### Appendix D: vLLM Integration
This appendix expands on Section [4](https://arxiv.org/html/2605.14217#S4). We describe how a trained ReFT adapter reaches a live vLLM worker, how the decoder layer applies the intervention without breaking `torch.compile` or CUDA graphs, and how adapter weights are refreshed during on-policy RL training. The end-to-end data flow is summarised in Figure [6](https://arxiv.org/html/2605.14217#A4.F6).
Figure 6: Architecture of the PreFT fork. The left column is the one-time load path from a trained `ReftModel` into the engine; the right column is the per-step training–inference sync path used during on-policy RL. Solid arrows carry adapter configuration or weights; the dashed arrow marks per-forward state (the position mask) written by the model runner outside the compiled region. Adapter parameters live at stable memory addresses inside the compiled decoder layer, so in-place updates propagate through subsequent graph replays without rebuilding the engine.
#### D.1 The refactored `ReftModel` interface
We refactored the user-facing pyreft API around a single `ReftModel` class modelled on PEFT's `PeftModel`. A `ReftModel` wraps a HuggingFace causal-LM, freezes its base weights, and injects intervention layers at the indices specified by a `ReftConfig`. The class exposes the interface a user would expect from a PEFT adapter: `save_pretrained`/`from_pretrained` for adapter-only checkpoints, `print_trainable_parameters` for diagnostics, and `enable_adapters`/`disable_adapters` for ablations. The refactor also transparently delegates unknown attribute lookups to the wrapped base model so that training frameworks (HF `Trainer`, TRL, `lm-eval`) work unmodified.
Three methods handle the vLLM integration path:
- `export_vllm_reft_spec()` extracts a serialisable blueprint of the adapter (class path, constructor kwargs, and CPU state dict) plus the set of intervention layer indices and the position schedule.
- `build_vllm(model_name, **kwargs)` instantiates a vLLM `LLM` with this blueprint attached to `VllmConfig` and synchronises trained weights into the engine.
- `build_vllm_from_checkpoint(ckpt_dir, model_name)` skips the HF model entirely, reading only the adapter config and weights from disk before handing them to vLLM. This is useful at inference time, where holding both an HF copy and a vLLM copy of an 8B+ parameter base model wastes GPU memory.
Together these three calls are the full public surface area between pyreft and the vLLM fork; everything else in this appendix is internal.
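A hypothetical end-to-end usage of these calls (method names follow this appendix; the checkpoint path, base-model name, and exact `from_pretrained` signature are placeholders):

```python
from pyreft import ReftModel  # the refactored interface described above

# Load a trained adapter onto a frozen HF base model (signature assumed
# to mirror PEFT's PeftModel.from_pretrained).
reft_model = ReftModel.from_pretrained(base_model, "ckpts/my-adapter")
spec = reft_model.export_vllm_reft_spec()               # serialisable blueprint
llm = reft_model.build_vllm("meta-llama/Llama-3.1-8B")  # engine + weight sync

# Inference-only path that never materialises the HF base model:
llm = ReftModel.build_vllm_from_checkpoint("ckpts/my-adapter",
                                           "meta-llama/Llama-3.1-8B")
```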
#### D.2 Engine integration
vLLM's `VllmConfig` carries per-engine configuration from the host process to every worker. We add a `reft_config` field to this dataclass that holds the serialised adapter blueprint. At model construction time, the causal-LM classes check for a non-empty `reft_config` and, if present, swap their stock decoder-layer class for a ReFT-aware subclass produced by a factory function. The subclass instantiates an adapter per target layer from the blueprint, and its `forward` invokes the base layer unchanged before adding the intervention delta to the residual stream. Because the subclass is generated from a shared blueprint rather than deserialised from a pickled module, worker processes can reconstruct adapters that contain orthogonal parametrisations (which cannot be pickled directly) with no special handling.
The engine's weight loader is also patched to mark adapter parameters as expected-present; otherwise vLLM's checkpoint validator would raise because the adapter weights are not in the base model's safetensors.
#### D.3 Graph-safe execution
vLLM compiles each decoder layer's forward pass with `torch.compile` and wraps it in CUDA graphs for decode. Naively intervening in this path would either break compilation (dynamic Python branching) or produce stale results (captured graphs referring to tensors that have since been freed). We use three design choices to stay compile- and capture-safe.
###### Adapter inlined, not custom-opped.
The delta $\Delta = \phi_{\text{adapter}}(h + r)$ is computed inline in the subclassed `forward`. We did not wrap it in a custom op, because doing so forced a graph split that cost more than the op saved on simple LoReFT/DiReFT interventions.
###### Position mask via pre-computed buffer.
ReFT applies its delta only at prefill positions (or at all tokens, depending on the position schedule). Computing the mask inside the compiled forward would require branching on `attn_metadata` fields whose concrete types differ across attention backends. We instead compute the mask *outside* the compiled region, in the model runner, before each forward: the runner reads `query_start_loc` from the attention metadata to find the prefill/decode split, computes one mask per unique position schedule, and writes it into a fixed-address `nn.Buffer` on each layer (`_reft_position_mask`). The compiled layer reads this buffer by slicing, which is a pure tensor op and therefore safe under capture. Deriving the split from `query_start_loc` rather than backend-specific fields makes the mask correct under both full and chunked prefill on every attention kernel vLLM ships. This entire process is skipped when all tokens are marked as decode tokens, speeding up computation.
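A minimal sketch of deriving such a mask (our reading of the mechanism, not the fork's actual code; the single-token-query heuristic and function name are assumptions):

```python
import torch

def prefill_position_mask(query_start_loc: torch.Tensor,
                          num_tokens: int) -> torch.Tensor:
    """query_start_loc[i]:query_start_loc[i+1] spans sequence i's query tokens.
    Heuristic: a sequence contributing more than one query token this step is
    treated as (chunked) prefill; a single-token query is treated as decode."""
    mask = torch.zeros(num_tokens, dtype=torch.bool,
                       device=query_start_loc.device)
    starts, ends = query_start_loc[:-1], query_start_loc[1:]
    for s, e in zip(starts.tolist(), ends.tolist()):
        if e - s > 1:          # multi-token query => prefill chunk
            mask[s:e] = True   # apply the adapter delta at these positions
    return mask

# The runner would then copy this into each layer's fixed-address buffer:
# layer._reft_position_mask[:num_tokens].copy_(mask)
```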
###### Graph-safe caches for derived tensors.
Some adapter quantities are expensive to compute but rarely change; for example, the Householder product $R$ that materialises the orthogonal rotation for LoReFT. We compute these once outside the compiled region and store them in `nn.Buffer`s at fixed addresses. The compiled forward reads the buffers directly; no SVD or matrix decomposition ever runs inside a CUDA graph. When adapter weights change (during RL) these caches are refreshed by a separate `collective_rpc` call (§[D.4](https://arxiv.org/html/2605.14217#A4.SS4)).
#### D.4 Weight synchronisation for on-policy RL
On-policy methods such as GRPO require the inference engine's adapter to track the training adapter step-for-step. Restarting the vLLM engine after every gradient update is prohibitive, so we expose a `collective_rpc("sync_reft_weights", ...)` RPC on the vLLM worker that accepts a `{layer_idx: state_dict}` mapping, loads each adapter's state dict in place, and refreshes the graph-safe caches of §[D.3](https://arxiv.org/html/2605.14217#A4.SS3). Because adapter parameters live at stable memory addresses, the next graph replay sees the new weights without recapture (similar to how LoRA updates propagate).
`ReftModel.sync_to_vllm(llm)` wraps this RPC on the training side: it extracts the current adapter state dicts from the HF model, ships them to the engine, and returns the per-worker sync counts for logging. A typical RL loop is therefore: generate rollouts with the vLLM engine, compute advantages and take a gradient step on the HF model, call `sync_to_vllm`, and repeat. Engine startup happens once.
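In code, that loop looks roughly like the following sketch (ours; `grpo_step` is a placeholder for the advantage/loss computation, and the other arguments are assumed to be constructed as in this appendix):

```python
def rl_loop(reft_model, llm, prompt_loader, optimizer, sampling_params):
    """One-engine on-policy loop: rollouts from vLLM, gradients on the HF
    model, updated weights pushed back through the sync RPC each step."""
    for batch in prompt_loader:
        rollouts = llm.generate(batch, sampling_params)  # vLLM rollouts
        loss = grpo_step(reft_model, rollouts)           # placeholder GRPO loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # collective_rpc("sync_reft_weights", ...) under the hood
        reft_model.sync_to_vllm(llm)
```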
#### D.5 `ReFTRequest`s
Our forked vLLM adds native support for ReFT adapters as a first-class serving primitive, analogous to how stock vLLM handles LoRA requests. At engine construction time, a *ReFT spec*, which maps transformer layer indices to adapter weights and the intervention position, is registered via `reft_config`. At inference time, callers attach a `ReFTRequest` to each generation call, specifying which adapter to apply; a single vLLM instance can thus serve multiple ReFT adapters by routing different requests to each adapter, just like LoRA serving.
#### D.6 Prefill-only LoRA
Standard vLLM LoRA applies the adapter at every forward pass regardless of generation phase. To support prefill-only LoRA, we extend `LoRARequest` with a `lora_position` field. When `lora_position="prefill"` is set, the vLLM LoRA kernel applies the delta only during the prefill phase and bypasses it during decode steps, so generated tokens see only the frozen base-model weights. On the training side (HF/PyTorch), we implement equivalent semantics.
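A sketch of issuing such a request (stock `LoRARequest` takes a name, integer ID, and path; `lora_position` is the fork's extension, and its exact spelling, the adapter path, and the surrounding `llm`/`prompts` objects are assumptions):

```python
from vllm.lora.request import LoRARequest

request = LoRARequest(
    lora_name="user-42",
    lora_int_id=42,
    lora_path="/adapters/user-42",
    lora_position="prefill",   # fork extension: apply delta during prefill only
)
outputs = llm.generate(prompts, sampling_params, lora_request=request)
```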
#### D.7 Scope and limitations
The fork currently supports Llama and Qwen2 family decoders; LoRA, DiReFT, and LoReFT adapters with either a *prefill* or *all-tokens* position schedule; and tensor-parallel execution on a single node. Speculative decoding, pipeline parallelism, and non-Llama/Qwen2 architectures have not been tested and are likely to need additional integration work along the lines of §[D.2](https://arxiv.org/html/2605.14217#A4.SS2).
### Appendix E: Benchmarking infrastructure
This appendix expands on [section 4](https://arxiv.org/html/2605.14217#S4). We describe the workload, the engine configuration, the FCFS scheduling loop, the metrics we report, and the sweeps that produce each figure in the main text.
#### E.1 Workload
We replicate the request distribution used by Punica [Chen et al., [2024](https://arxiv.org/html/2605.14217#bib.bib6)] so that our results sit on the same axes as the established multi-adapter serving literature.
###### Length distribution.
Prompt lengths are drawn i.i.d. from $\mathrm{Lognormal}(\sigma = 0.8,\ \mathrm{loc} = -1.0,\ \mathrm{scale} = 18.0)$ and clipped to $[1, L_{\max} - 2]$. Each request's total length (prompt + completion) is drawn uniformly from $[p + 2, L_{\max}]$; the output length is the difference. We use $L_{\max} = 2{,}048$ throughout unless otherwise noted. Sampling is deterministic for a given seed.
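A sketch of this sampler, reading the parameters as scipy's `lognorm` shape/loc/scale (our interpretation; variable names are ours):

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(seed=0)  # deterministic for a given seed
L_MAX = 2048

def sample_lengths(n: int):
    """Return (prompt_lengths, output_lengths) for n synthetic requests."""
    prompt = lognorm.rvs(s=0.8, loc=-1.0, scale=18.0, size=n, random_state=rng)
    prompt = np.clip(prompt.astype(int), 1, L_MAX - 2)
    total = rng.integers(prompt + 2, L_MAX + 1)  # uniform over [p+2, L_max]
    return prompt, total - prompt
```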
###### Adapter assignment.
Each request is tagged with an adapter index drawn from one of four mixes: *identical* (all requests share adapter 0), *uniform* (uniform over $N$ adapters; this is our default), *skewed* (Zipf with $\alpha = 1.0$ over the adapter catalogue), and *distinct* (request $i$ uses adapter $i \bmod N$, then shuffled). Most figures report uniform popularity; the skewed and distinct mixes are used for the popularity-sensitivity ablations.
###### Adapter weights.
For LoRA we materialise $N$ random adapters per layer at the requested rank and target modules; $\mathbf{A}$ and $\mathbf{B}$ are i.i.d. $\mathcal{N}(0, 0.01^2)$ in fp16. For ReFT we materialise $N$ adapters of the requested type (LoReFT, DiReFT) by constructing a CPU adapter at the requested rank and position and adding $\mathcal{N}(0, \sigma_{\text{perturb}}^2)$ noise ($\sigma_{\text{perturb}} = 0.1$) to the learned source weight and bias to break symmetry across adapters. These weights are not trained: benchmarks measure serving cost, not task quality.
###### Prompt content.
Prompt token IDs are drawn uniformly from $[100, 32000)$. Sampling is greedy ($T = 0$, $n = 1$) with `ignore_eos=True` and `max_tokens` fixed to the generated output length, so generations always run to the requested length and timing is not perturbed by stochastic stopping.
#### E.2 Engine configuration
We construct an `LLMEngine` from `EngineArgs` with prefix caching disabled, `max_num_seqs` set to the FCFS batch size, `max_model_len` $= L_{\max}$, and `gpu_memory_utilization=0.9`. We never use `enforce_eager` unless explicitly noted, so all numbers in the main text include CUDA-graph capture.
For LoRA runs we set `enable_lora=True`, `max_loras` $= M_{\text{active}}$ (the maximum number of distinct adapters that may be co-batched), `max_lora_rank` $= \max(8, r)$, and `max_cpu_loras` $= N + 1$ so the entire catalogue fits in CPU memory and only the active set occupies GPU memory at any time. For ReFT runs we set `enable_reft=True`, `max_refts` $= M_{\text{active}}$, and `max_cpu_refts` $= N + 1$, then call `collective_rpc("load_reft_adapter", ...)` once per adapter to push it through the public weight-sync path described in [section D.4](https://arxiv.org/html/2605.14217#A4.SS4). For the adapter-less baseline we leave both features disabled but otherwise reuse the same engine configuration and the same workload.
Tensor parallelism is set per model: TP = 1 for Qwen2.5-0.5B and Llama-3.1-8B benchmarks (single H100), and TP = 4 for the Llama-3.1-70B and Qwen3-30B-A3B benchmarks (four H100s on a single node). All inference uses bf16 weights.
#### E.3 Scheduling loop
The benchmark drives the engine through `LLMEngine.step()` rather than calling `LLM.generate`, so that we can attribute latency to the prefill (encode) and decode phases separately. The driver maintains a *workset* of at most $B$ in-flight requests and admits new requests one at a time whenever a slot frees up; admission is first-come-first-served over the synthetic stream. For each in-flight request we record:
- its *encode latency*, the wall time between `add_request` and the step in which the first output token is observed;
- its *decode latency*, the wall time between the first output token and the step that marks the request finished;
- its prompt and output lengths (from the workload, not from engine state).
The engine's own scheduler is responsible for prefill/decode chunking and continuous batching; we do not change vLLM's scheduling policy. A separate warmup run with the same workload (different seed) is executed before every timed run and discarded; this absorbs CUDA-graph capture, weight loading, and any one-off allocator behaviour so the timed run reflects steady-state serving cost. A simplified sketch of the driver loop follows.
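The sketch below is our simplified rendering of this loop (vLLM engine API details vary across versions; `engine` is a constructed `LLMEngine` and `workload` yields `(request_id, prompt, sampling_params)` tuples):

```python
import time

def run_fcfs(engine, workload, B=32):
    """FCFS driver: keep at most B requests in flight; split wall time into
    encode (add_request to first token) and decode (first token to finish)."""
    pending = list(workload)
    added_at, first_token_at, stats = {}, {}, {}
    while pending or engine.has_unfinished_requests():
        # admit new requests one at a time up to the concurrency cap B
        while pending and engine.get_num_unfinished_requests() < B:
            rid, prompt, params = pending.pop(0)
            added_at[rid] = time.perf_counter()
            engine.add_request(rid, prompt, params)
        for out in engine.step():  # one scheduler + forward iteration
            now = time.perf_counter()
            if out.request_id not in first_token_at and out.outputs[0].token_ids:
                first_token_at[out.request_id] = now
            if out.finished:
                stats[out.request_id] = (
                    first_token_at[out.request_id] - added_at[out.request_id],
                    now - first_token_at[out.request_id],
                )
    return stats  # request_id -> (encode_latency, decode_latency)
```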
#### E.4 Metrics
We report four quantities per run:
- Per-token encode latency, computed per request as encode latency divided by prompt length, then aggregated as the $p_{50}$, $p_{90}$, $p_{99}$, mean, and standard deviation across requests.
- Per-token decode latency, defined analogously over output tokens.
- End-to-end throughput, the total number of prompt and output tokens served divided by the total wall time of the timed run.
- Per-request encode and decode latency, with the same percentile breakdown, useful for tail-latency comparisons.
Unless otherwise noted, plots in the main text show median per-token latency with shaded $p_{10}$–$p_{90}$ bands and median throughput.
### Appendix F: LongWriter evaluation
We use the LongBench-Write benchmark of Bai et al. [[2024](https://arxiv.org/html/2605.14217#bib.bib4)]: 120 open-ended writing prompts paired with a target output length. We partition results into five required-length buckets ($<500$, $500$–$2$k, $2$k–$4$k, $4$k–$20$k, $20$k+ words) to gauge performance on each bucket. The headline metrics are a length-following score $S_l \in [0, 100]$ and a quality score $S_q \in [0, 100]$, each averaged over the prompts in a bucket. We additionally report the fraction of generations truncated at the decoding cap ([appendix F](https://arxiv.org/html/2605.14217#A6.SS0.SSS0.Px4)) to disentangle "can't write more" from "won't write more" failure modes.
###### Generation.
For every prompt $i$ with target length $\ell_i^{\text{req}}$, we sample a single response $y_i$ at temperature $0.5$ with a 32,768-token output cap. We measure produced length $\ell_i^{\text{gen}}$ in words using the LongWriter convention: each Chinese ideograph contributes one "word" and each maximal run of `[a-zA-Z]` contributes one English word; whitespace and punctuation are discarded.
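A sketch of this word-counting convention (our implementation; the exact CJK range used by LongWriter is an assumption):

```python
import re

CJK = r"[\u4e00-\u9fff]"            # CJK unified ideographs
WORD = re.compile(rf"{CJK}|[a-zA-Z]+")

def count_words(text: str) -> int:
    """Each ideograph is one word; each maximal [a-zA-Z]+ run is one word.
    Whitespace, digits, and punctuation are discarded."""
    return len(WORD.findall(text))
```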
###### Length-following score $S_l$.
We use the piecewise score from the LongWriter paper unchanged:

$$S_l(\ell^{\text{req}}, \ell^{\text{gen}}) = \begin{cases} 100\cdot\max\Big(0,\ 1 - \tfrac{1}{3}\big(\tfrac{\ell^{\text{gen}}}{\ell^{\text{req}}} - 1\big)\Big) & \text{if } \ell^{\text{gen}} > \ell^{\text{req}} \\ 100\cdot\max\Big(0,\ 1 - \tfrac{1}{2}\big(\tfrac{\ell^{\text{req}}}{\ell^{\text{gen}}} - 1\big)\Big) & \text{if } 0 < \ell^{\text{gen}} \leq \ell^{\text{req}} \\ 0 & \text{otherwise.} \end{cases} \tag{31}$$

The asymmetry penalises under-shooting more harshly than over-shooting: a $4\times$ overshoot is needed to reach $S_l = 0$, whereas a $3\times$ undershoot already does. Bucket-level $S_l$ is the unweighted mean over prompts whose required length falls in that bucket.
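A direct transcription of eq. (31) in Python:

```python
def length_score(l_req: float, l_gen: float) -> float:
    """LongWriter length-following score S_l in [0, 100], per eq. (31)."""
    if l_gen <= 0:
        return 0.0
    if l_gen > l_req:   # overshoot: slope 1/3, hits zero at a 4x overshoot
        return 100.0 * max(0.0, 1.0 - (l_gen / l_req - 1.0) / 3.0)
    # undershoot: slope 1/2, hits zero at a 3x undershoot
    return 100.0 * max(0.0, 1.0 - (l_req / l_gen - 1.0) / 2.0)
```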
###### Quality score $S_q$.
We follow the LLM-as-judge protocol of Bai et al. [[2024](https://arxiv.org/html/2605.14217#bib.bib4)], replacing their GPT-4 judge with gpt-5-mini with low reasoning effort, for cost. For each (prompt, response) pair the judge returns six integer scores in $\{1, \ldots, 5\}$ along the LongWriter rubric: Relevance, Accuracy, Coherence, Clarity, Breadth and Depth, and Reading Experience. We use the LongWriter judge prompt verbatim, including the instruction *not* to penalise length (so $S_l$ and $S_q$ measure orthogonal failure modes). The judge is instructed to emit JSON; we parse and clip each dimension to $[1, 5]$, dropping items whose JSON cannot be recovered. The per-item raw score $\bar{s}_i \in [1, 5]$ is the unweighted mean across the six dimensions (over those that parsed). We report two equivalent forms:

$$S_q^{\text{raw}} = \frac{1}{|B|}\sum_{i \in B} \bar{s}_i \in [1, 5], \qquad S_q = \frac{S_q^{\text{raw}} - 1}{4}\cdot 100 \in [0, 100],$$

where $B$ is the set of prompts in a given bucket. Throughout the main paper we plot the 0–100 form so $S_l$ and $S_q$ share a common scale.
###### Truncation rate.
Every generation reports a finish reason from the decoding backend (`stop` when an EOS / stop sequence is emitted, `length` when the 32,768-token cap is hit). The bucket-level truncation rate is

$$\tau_B = \frac{|\{i \in B : \text{finish\_reason}_i = \texttt{length}\}|}{|B|} \in [0, 1],$$

which we report as a percentage. A response that ends voluntarily inside the budget contributes $0$ to $\tau_B$; a response that would have continued past $32$k tokens contributes $1$. Truncation rate complements $S_l$ by separating two regimes: a low $S_l$ paired with low $\tau_B$ implies the model ended early on its own; a low $S_l$ paired with high $\tau_B$ implies the model wanted to continue but ran out of decoding budget.
### Appendix G: Comparing SFT and RL on GSM8K
Table 3: PreFTs lag PEFTs on GSM8K under both SFT and RL. GSM8K test accuracy on Llama-3.1-8B with rank-$8$ adapters; the best learning rate was selected for each (training, method, position) cell. An asterisk indicates statistical significance.
To investigate the gap on GSM8K under RL further, we conduct an ablation with identical training data under both SFT and RLVR, comparing PreFTs against their all-position counterparts. [Table 3](https://arxiv.org/html/2605.14217#A7.T3) reports final GSM8K accuracy for each (method, training regime) pair, with $\Delta$ columns showing the prefill-vs-all-position gap within each regime. In all cases, PreFTs underperform PEFTs by some margin. These results suggest that the restriction to prefill positions is particularly difficult for learning on GSM8K. This also agrees with prior comparisons on math reasoning for ReFT [Wu et al., [2024b](https://arxiv.org/html/2605.14217#bib.bib40)]; however, this phenomenon is not replicated on MATH, which is a more difficult benchmark. We leave further investigation of this gap to future work.
### Appendix H: Long output attenuation experiments
Figure 7: LoRA^P on MLP-only still outperforms DiReFT^P and DiReFT^A at high ranks. LongWriter on Llama-3.1-8B-Instruct (rank $16$), four required output-length brackets.
In the LongBench-Write experiments, we find that DiReFT^P runs fail while LoRA^P runs do not. We conduct two follow-up experiments to investigate why.
###### Is the failure mediated by attention?
Our initial hypothesis was that the asymmetry traces to *where* each adapter writes during prefill. LoRA on attention projections modifies the K and V vectors written to the cache, so every subsequent decode token attends to KV-cache contents directly shaped by the trained delta. DiReFT instead writes additively to the residual stream at prompt positions and reaches the cache only indirectly through the unmodified K/V projections. If this distinction is what allows LoRA^P to survive decode, then restricting LoRA's target modules to MLP only (`gate_proj`, `up_proj`, `down_proj`) should reproduce DiReFT^P's failure mode: neither variant would have direct access to K/V.
[Figure 7](https://arxiv.org/html/2605.14217#A8.F7) refutes this hypothesis. MLP-only LoRA^P performs on par with MLP-only LoRA and high-rank DiReFT^P and DiReFT^A at every length bracket on both $S_l$ and $S_q$, and avoids the runaway-truncation regime that DiReFT^P occupies. Removing attention-projection LoRA does not break the prefill-only setting.
###### Is the failure due to under-parameterisation?
A second possibility is that DiReFT^P simply lacks the capacity at rank 16 to encode the conditional length-matching signal LoRA manages at the same rank, given LoRA's larger effective parameter count across seven target modules. We re-train DiReFT^P at rank 128 and re-evaluate. The overshoot pathology persists: DiReFT^P at $r = 128$ produces $\geq 7$k-word generations across every length bracket and saturates the 32k decoding cap on more than 70% of $\geq 2$k requests. Increased capacity does not recover the prefill-only setting.
###### Implication.
Neither restricting LoRA to MLP modules nor scaling DiReFT to rank 128 changes the asymmetry observed in the main results. The difference between LoRA^P and DiReFT^P appears not to depend on which projections each adapter targets, nor on rank. We conjecture the failure is structural: LoRA's multiplicative perturbation reshapes how prompt tokens are encoded into hidden states, and that reshaping propagates through the KV cache to persist through decode; DiReFT's additive residual write at prompt positions injects a fixed signal that fails to translate into the conditional, target-dependent decoding behaviour LongWriter requires. A precise mechanistic characterisation is left to future work.
### Appendix I: On-policy distillation
We find a gap between PreFTs and all-position PEFTs when training with teacher-forced SFT, but no such gap exists in our RL experiments. Since on-policy training appears more effective for prefill-only adapters, we hypothesised that we might outperform SFT by doing *on-policy distillation* [Agarwal et al., [2024](https://arxiv.org/html/2605.14217#bib.bib1), Lu and Thinking Machines Lab, [2025](https://arxiv.org/html/2605.14217#bib.bib26)] from an all-positions teacher adapter into a PreFT.
###### Implementation.
We take the GRPO implementation in trl and override the reference model (which is used to compute the KL term in GRPO) to apply the teacher adapter. We pass dummy rewards of $0$, so that the loss consists of only the KL term. We set $\beta := 1$ and sweep learning rates.
###### Experiment 1: Distilling rank-$4$ LoRA on Llama 3.2 1B Instruct.
We take the best-performing rank-$4$ all-positions LoRA finetuned on Llama 3.2 1B Instruct on Tülu-3, and distill it into a prefill-only LoRA. We show results below for various LRs after $500$ steps of training. Lower KL loss (which is the distillation target) does correlate with lower evaluation loss on Tülu, indicating the distillation successfully transfers in-distribution learning from the teacher to the student.
Unfortunately, lower learning rates achieve worse distillation loss but better downstream performance, indicating that the training harms out-of-distribution evaluations. However, we later realised that the Llama Instruct models are generally too strong at instruction-following tasks for Tülu finetuning to matter: our teacher is actually worse than the untrained student! We therefore switch to the base model for the next experiment.
###### Experiment 2: Distilling rank-$1$ LoRA on Llama 3.2 1B (base).
Given that finetuning an already instruction-tuned model on Tülu-3 is not informative (i.e., the teacher is not better than the student at the start of training), we instead perform on-policy distillation from a teacher adapter trained on the *base* model, as in the main text ([section 5](https://arxiv.org/html/2605.14217#S5)). We take the best all-positions rank-$1$ LoRA on Llama 3.2 1B base and on-policy distill it into LoRA^P.
Below, we present our results. On-policy distillation does lift MMLU performance from below chance to slightly above chance and very slightly improves GSM8K, but actually hurts IFEval. This leads us to conclude that on-policy distillation is not particularly promising in this setting; we may require a better model to experiment on, better teachers, longer training, or different training/evaluation sets, all of which are beyond our computational budget.
### Appendix J: Broader Societal Impacts
###### Positive societal impacts.
PreFT improves the inference efficiency of multi-tenant adapter serving for personalized large language models, with throughput gains of up to $1.9\times$ at scale on Llama-3.1-70B. The most direct positive impact is a reduction in energy and compute costs per served request, which both lowers the environmental footprint of personalized LLM deployment and reduces the economic barrier to offering personalized applications. Cheaper multi-tenant serving particularly benefits use cases where per-user adaptation matters but serving costs have been prohibitive, including educational tools adapted to individual learners, accessibility applications tailored to specific users' communication patterns, and language- or dialect-specific adaptations for communities under-represented in base model training data. The per-adapter serving paradigm that PreFT supports is also structurally compatible with privacy-preserving personalization architectures, where user-specific weights remain isolated rather than aggregated into a shared model or passed through every inference as long-context input. By releasing our vLLM fork and refactored pyreft library, we additionally lower the engineering bar for academic and independent researchers studying multi-adapter serving, on-policy reinforcement learning with adapters, and related topics that previously required substantial systems infrastructure to investigate.
###### Negative societal impacts.
The negative impacts of PreFT are not specific to our method but apply to efficiency improvements in LLM serving generally. Lowering the cost of personalized LLM deployment increases the economic feasibility of harmful applications alongside beneficial ones, including automated disinformation, harassment, and manipulative personalization. PreFT does not introduce new model capabilities; it makes existing capabilities more efficient to serve. Concerns about misuse of personalized LLMs therefore apply to the underlying base models and to deployment decisions made by operators, rather than to our methodological contribution. We release a serving framework and adapter infrastructure; we do not release new model weights, datasets, or end-user applications that would amplify these concerns. We also note the standard Jevons-paradox observation that efficiency improvements tend to increase aggregate usage of the underlying technology, so even efficiency-driven reductions in per-request energy may be offset by increased deployment volume; this is a property of efficiency contributions in general and is not specific to PreFT.
### Appendix K: Detailed SFT results
#### K.1 Llama-3.2-1B, Tülu-3
Llama-3.2-1B.
#### K.2 Llama-3.2-1B, OpenThoughts
Llama-3.2-1B.
#### K.3 Llama-3.2-1B, GSM8K
Llama-3.2-1B.
#### K.4 Llama-3.2-1B-Base, Tülu-3
Llama-3.2-1B-Base.
#### K.5 Llama-3.2-1B-Base, OpenThoughts
Llama-3.2-1B-Base.
#### K.6 Llama-3.1-8B, Tülu-3
Llama-3.1-8B.
#### K.7 Llama-3.1-8B, GSM8K
Llama-3.1-8B.
#### K.8 Llama-3.1-8B-Base, Tülu-3
Llama-3.1-8B-Base.
### Appendix L: SFT (downstream performance)
#### L.1 Llama-3.2-1B, Tülu-3
All methods on Tülu-3, Llama-3.2-1B.
#### L.2 Llama-3.2-1B, OpenThoughts
All methods on OpenThoughts, Llama-3.2-1B.
#### L.3 Llama-3.2-1B-Base, Tülu-3
All methods on Tülu-3, Llama-3.2-1B-Base.
#### L.4 Llama-3.1-8B, Tülu-3
All methods on Tülu-3, Llama-3.1-8B.
#### L.5 Llama-3.1-8B-Base, Tülu-3
All methods on Tülu-3, Llama-3.1-8B-Base.
#### L.6 Llama-3.1-8B, GSM8K
All methods on GSM8K, Llama-3.1-8B (`eval/reward_full`).
### Appendix M: Detailed RL results
#### M.1 Qwen2.5-0.5B, GSM8K
Qwen2.5-0.5B.
#### M.2 Qwen2.5-0.5B, MATH
Qwen2.5-0.5B.
#### M.3 Qwen2.5-0.5B, HumanEval
Qwen2.5-0.5B.
#### M.4 Llama-3.1-8B, GSM8K
Llama-3.1-8B.
#### M.5 Llama-3.1-8B, MATH
Llama-3.1-8B.
#### M.6 Llama-3.1-8B, HumanEval
Llama-3.1-8B.
### Appendix N: Asset licenses and attribution
We use the following datasets, models, and software, listed with their licenses. All assets are used in accordance with their license terms.
###### Datasets.
- GSM8K [Cobbe et al., [2021](https://arxiv.org/html/2605.14217#bib.bib9)]: MIT.
- MATH [Hendrycks et al., [2021b](https://arxiv.org/html/2605.14217#bib.bib17)]: MIT.
- MBPP [Austin et al., [2021](https://arxiv.org/html/2605.14217#bib.bib2)]: CC-BY 4.0.
- HumanEval [Chen et al., [2021](https://arxiv.org/html/2605.14217#bib.bib7)]: MIT.
- LongBench-Write [Bai et al., [2024](https://arxiv.org/html/2605.14217#bib.bib4)]: Apache 2.0.
###### Base models.
- Llama-3.1, Llama-3.2: Llama 3.x Community License (Meta).
- Qwen-2.5: Tongyi Qianwen License (Alibaba).
###### Software.
- vLLM [Kwon et al., [2023](https://arxiv.org/html/2605.14217#bib.bib22)]: Apache 2.0. We extend vLLM with multi-ReFT serving support; our fork inherits the Apache 2.0 license.
- pyreft [Wu et al., [2024b](https://arxiv.org/html/2605.14217#bib.bib40)]: Apache 2.0. We refactor pyreft to support a unified `ReftModel` interface and vLLM integration; our refactor inherits the Apache 2.0 license.
- trl, peft, transformers: Apache 2.0 (Hugging Face).
###### LLM-as-judge.
We use OpenAI's gpt-5-mini model via the OpenAI API as the judge model for LongBench-Write quality scoring ([appendix F](https://arxiv.org/html/2605.14217#A6)), in accordance with OpenAI's API terms of service. The judge is used only for evaluation; no training data is generated using this model.