@TanejaPriyal: i wanted to understand LoRA beyond “adapters are cheaper than full fine-tuning.” so, i wrote a two-part series and ran …
Summary
The author benchmarks serving 1,000 LoRA adapters on one GPU using vLLM, finding that active adapter count and traffic shape are the real bottlenecks, and provides recommendations for tuning max_loras.
i wanted to understand LoRA beyond “adapters are cheaper than full fine-tuning.”
so, i wrote a two-part series and ran a benchmark: what happens when you serve 1,000 LoRA adapters on one GPU?
what i learned:
- total adapter count is not the real bottleneck. what matters is how many adapters are active together.
- traffic shape changes everything. at 1k adapters, evenly spread traffic got 884 tok/s; skewed traffic got 2,167 tok/s.
- vLLM’s max_loras is not “higher is better.” too low caused multi-second first-token delays; too high reduced throughput.
- multi-LoRA serving is about managing the active working set, not just storing lots of adapters.
limitation: this uses synthetic adapters, so it focuses on serving mechanics, rather than model quality.
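a rough way to see why traffic shape matters more than total adapter count: the toy simulation below is not the benchmark code and does not model vLLM's real scheduler. it assumes a fixed number of resident adapter slots (a hypothetical stand-in for a setting like max_loras), LRU eviction, and Zipf-like weights for the skewed case. all names and numbers are illustrative.

```python
import random
from collections import OrderedDict


def simulate_hit_rate(num_adapters, slots, weights, num_requests=20_000, seed=0):
    """Fraction of requests whose adapter is already resident.

    Assumption: residency is modeled as a plain LRU cache with `slots`
    entries. Real serving engines may batch, schedule, and evict differently.
    """
    rng = random.Random(seed)
    requests = rng.choices(range(num_adapters), weights=weights, k=num_requests)

    resident = OrderedDict()  # adapter_id -> True, ordered oldest to newest use
    hits = 0
    for adapter_id in requests:
        if adapter_id in resident:
            hits += 1
            resident.move_to_end(adapter_id)  # mark as most recently used
        else:
            if len(resident) >= slots:
                resident.popitem(last=False)  # evict least recently used
            resident[adapter_id] = True
    return hits / num_requests


NUM_ADAPTERS = 1000
SLOTS = 32  # hypothetical number of resident adapter slots

# Evenly spread traffic: every adapter is equally likely.
uniform_weights = [1.0] * NUM_ADAPTERS
# Skewed traffic: Zipf-like weights, so a few adapters get most requests.
skewed_weights = [1.0 / (rank + 1) for rank in range(NUM_ADAPTERS)]

uniform_hit_rate = simulate_hit_rate(NUM_ADAPTERS, SLOTS, uniform_weights)
skewed_hit_rate = simulate_hit_rate(NUM_ADAPTERS, SLOTS, skewed_weights)

print(f"uniform traffic hit rate: {uniform_hit_rate:.3f}")
print(f"skewed traffic hit rate:  {skewed_hit_rate:.3f}")
```

with the same 1,000 adapters and the same slot count, the skewed workload finds its adapter already resident far more often, which is the working-set effect in miniature. it says nothing about the throughput cost of raising the slot count; that part needs the real benchmark.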
part 1, the mechanics of LoRA: adapters, rank, and multi-tenant serving: https://priyaltaneja.com/mechanics-of-lora…
part 2, multi-LoRA at scale: an empirical map of vLLM’s operating range: https://priyaltaneja.com/multi-lora-at-scale…
code, CSVs, figures: https://github.com/priyaltaneja/multi-lora-serving-benchmark…