@TanejaPriyal: i wanted to understand LoRA beyond “adapters are cheaper than full fine-tuning.” so, i wrote a two-part series and ran …

X AI KOLs Timeline 05/26/26, 06:52 PM Papers

lora fine-tuning adapters vllm multi-tenant-serving benchmark serving

Summary

The author benchmarks serving 1,000 LoRA adapters on one GPU using vLLM, finding that active adapter count and traffic shape are the real bottlenecks, and provides recommendations for tuning max_loras.


Original Article


Cached at: 05/27/26, 03:18 AM

i wanted to understand LoRA beyond “adapters are cheaper than full fine-tuning.”

so, i wrote a two-part series and ran a benchmark: what happens when you serve 1,000 LoRA adapters on one GPU?

what i learned:

total adapter count is not the real bottleneck. what matters is how many adapters are active together.

traffic shape changes everything. at 1k adapters, evenly spread traffic got 884 tok/s; skewed traffic got 2,167 tok/s.

vLLM’s max_loras is not “higher is better.” too low caused multi-second first-token delays; too high reduced throughput.

multi-LoRA serving is about managing the active working set, not just storing lots of adapters.
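To make the "active working set" idea concrete, here is a toy simulation, not the author's benchmark and not vLLM's scheduler. It draws adapter IDs under evenly spread and skewed (Zipf-like) traffic, counts how many distinct adapters appear per batch, and tracks the hit rate of a small LRU set of resident adapters. The `slots` parameter is only a hypothetical stand-in for a limit like `max_loras`; the function names, batch size, and skew value are all assumptions chosen for illustration, and the output says nothing about real tok/s.

```python
import random
from collections import OrderedDict


def make_sampler(num_adapters, skew, rng):
    """Return a function drawing adapter IDs; skew=0 is uniform, larger is more Zipf-like."""
    weights = [1.0 / (rank ** skew) for rank in range(1, num_adapters + 1)]
    ids = list(range(num_adapters))
    return lambda k: rng.choices(ids, weights=weights, k=k)


def simulate(num_adapters, skew, slots, num_requests=20000, batch=32, seed=0):
    """Toy model: LRU set of `slots` resident adapters fed by batched requests."""
    rng = random.Random(seed)
    draw = make_sampler(num_adapters, skew, rng)
    resident = OrderedDict()  # adapters currently "loaded", oldest first
    hits = 0
    distinct_per_batch = []
    num_batches = num_requests // batch
    for _ in range(num_batches):
        requests = draw(batch)
        distinct_per_batch.append(len(set(requests)))
        for adapter_id in requests:
            if adapter_id in resident:
                hits += 1
                resident.move_to_end(adapter_id)  # mark as most recently used
            else:
                resident[adapter_id] = True
                if len(resident) > slots:
                    resident.popitem(last=False)  # evict least recently used
    return {
        "hit_rate": hits / (num_batches * batch),
        "mean_distinct_per_batch": sum(distinct_per_batch) / num_batches,
    }


if __name__ == "__main__":
    for label, skew in [("even", 0.0), ("skewed", 1.2)]:
        print(label, simulate(num_adapters=1000, skew=skew, slots=8))
```

With the same 1,000 adapters and the same number of slots, the skewed run touches fewer distinct adapters per batch and reuses resident adapters far more often, which is the qualitative effect the post describes.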

limitation: this uses synthetic adapters, so it focuses on serving mechanics, rather than model quality.
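Synthetic adapters are enough for a serving study because memory and compute cost depend on adapter shape (rank and layer dimensions), not on trained values. The sketch below estimates adapter size from shape alone. The hidden size, layer count, number of adapted matrices, and 2-byte precision are hypothetical values picked for illustration, not the configuration used in the benchmark.

```python
def lora_params(rank, d_in, d_out):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_bytes(rank, hidden, num_layers, matrices_per_layer, bytes_per_param=2):
    """Rough adapter size, assuming every adapted matrix is square (hidden x hidden)."""
    per_matrix = lora_params(rank, hidden, hidden)
    return per_matrix * matrices_per_layer * num_layers * bytes_per_param


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to show the arithmetic.
    one = adapter_bytes(rank=16, hidden=4096, num_layers=32, matrices_per_layer=4)
    print(f"one adapter: {one / 2**20:.1f} MiB")
    print(f"1,000 adapters: {1000 * one / 2**30:.1f} GiB")
```

Under these assumed shapes, a single adapter is small, but a thousand of them add up to tens of GiB, which is one way to see why only a working set of adapters can be active at a time.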

part 1, the mechanics of LoRA: adapters, rank, and multi-tenant serving: https://priyaltaneja.com/mechanics-of-lora…

part 2, multi-LoRA at scale: an empirical map of vLLM’s operating range: https://priyaltaneja.com/multi-lora-at-scale…

code, CSVs, figures: https://github.com/priyaltaneja/multi-lora-serving-benchmark…


Similar Articles

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis
