@TanejaPriyal: i wanted to understand LoRA beyond “adapters are cheaper than full fine-tuning.” so, i wrote a two-part series and ran …

Summary

The author benchmarks serving 1,000 LoRA adapters on one GPU using vLLM, finding that active adapter count and traffic shape are the real bottlenecks, and provides recommendations for tuning max_loras.

Cached at: 05/27/26, 03:18 AM

i wanted to understand LoRA beyond “adapters are cheaper than full fine-tuning.”

so, i wrote a two-part series and ran a benchmark: what happens when you serve 1,000 LoRA adapters on one GPU?

what i learned:

> total adapter count is not the real bottleneck. what matters is how many adapters are active together.

> traffic shape changes everything. at 1k adapters, evenly spread traffic got 884 tok/s; skewed traffic got 2,167 tok/s.

> vLLM’s max_loras is not “higher is better.” too low caused multi-second first-token delays; too high reduced throughput.

> multi-LoRA serving is about managing the active working set, not just storing lots of adapters.
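To make the "active working set" point concrete, here is a toy simulation, not the benchmark code. It assumes a fixed number of resident adapter slots with least-recently-used eviction and a Zipf-like popularity curve for skewed traffic. The function names, slot count, and skew exponent are all hypothetical. It only shows why skewed traffic keeps hitting adapters that are already loaded; it does not model tok/s.

```python
import random
from collections import OrderedDict


def make_weights(num_adapters, skew):
    """Popularity weight per adapter. skew=0 is uniform; larger is more skewed."""
    return [1.0 / (rank ** skew) for rank in range(1, num_adapters + 1)]


def simulate_hit_rate(num_adapters, slots, num_requests, skew, seed=0):
    """Fraction of requests whose adapter is already resident (toy LRU model)."""
    rng = random.Random(seed)
    weights = make_weights(num_adapters, skew)
    requests = rng.choices(range(num_adapters), weights=weights, k=num_requests)

    resident = OrderedDict()  # adapter id -> None, ordered oldest to newest use
    hits = 0
    for adapter in requests:
        if adapter in resident:
            hits += 1
            resident.move_to_end(adapter)  # mark as most recently used
        else:
            if len(resident) >= slots:
                resident.popitem(last=False)  # evict least recently used
            resident[adapter] = None
    return hits / num_requests


if __name__ == "__main__":
    # Assumed values for illustration: 1,000 adapters, 32 resident slots.
    for label, skew in [("uniform", 0.0), ("skewed", 1.2)]:
        rate = simulate_hit_rate(1000, 32, 20000, skew)
        print(f"{label:8s} hit rate with 32 slots: {rate:.2%}")
```

With the same total adapter count, the skewed run reuses resident adapters far more often than the uniform run, which is the intuition behind "total adapter count is not the real bottleneck."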

limitation: this uses synthetic adapters, so it focuses on serving mechanics, rather than model quality.
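For readers who have not touched the serving side, the knobs discussed above appear in vLLM's offline API roughly as sketched below. This is a minimal sketch under assumptions, not the benchmark's configuration: the model id, adapter name, adapter path, and all numeric values are placeholders, and the helper function names are hypothetical. The vLLM imports sit inside the functions so the settings can be inspected without a GPU.

```python
# Illustrative engine settings for multi-LoRA serving (placeholder values).
ENGINE_KWARGS = dict(
    model="BASE_MODEL_ID",  # placeholder, substitute a real base model
    enable_lora=True,       # turn on LoRA support in the engine
    max_loras=8,            # max adapters that can be active in a single batch
    max_lora_rank=16,       # largest adapter rank the engine will accept
    max_cpu_loras=64,       # adapters kept in CPU memory; must be >= max_loras
)


def build_engine():
    """Create a vLLM engine with LoRA enabled (requires vLLM and a GPU)."""
    from vllm import LLM

    return LLM(**ENGINE_KWARGS)


def generate_with_adapter(llm, prompt, adapter_name, adapter_id, adapter_path):
    """Run one prompt through a specific adapter; all arguments are caller-supplied."""
    from vllm import SamplingParams
    from vllm.lora.request import LoRARequest

    params = SamplingParams(max_tokens=64)
    request = LoRARequest(adapter_name, adapter_id, adapter_path)
    return llm.generate([prompt], params, lora_request=request)
```

Sweeping `max_loras` while holding the traffic pattern fixed is one way to look for the "too low" and "too high" regions the post describes.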

part 1, the mechanics of LoRA: adapters, rank, and multi-tenant serving: https://priyaltaneja.com/mechanics-of-lora…

part 2, multi-LoRA at scale: an empirical map of vLLM’s operating range: https://priyaltaneja.com/multi-lora-at-scale…

code, CSVs, figures: https://github.com/priyaltaneja/multi-lora-serving-benchmark…
