@CyrusHakha: One pattern we keep seeing with customers serving LLMs at scale: Prefill-decode disaggregation is often treated like a …
Summary
Discusses the nuanced reality of prefill-decode disaggregation in LLM serving at scale, based on customer patterns and validated on AMD with vLLM.
Cached at: 06/15/26, 11:08 PM
One pattern we keep seeing with customers serving LLMs at scale:
Prefill-decode disaggregation is often treated like a magic wand. But the reality is more nuanced.
So we wrote down the core insights for when PD helps, when it does not, and validated them on AMD + vLLM — where the PD path has been much less paved.
2/ In this post, we benchmark PD on AMD MI325X with Ray Serve + vLLM. Across Qwen3-235B and DeepSeek-V3 workloads, PD delivers up to 2.7x better goodput and up to 67% compute cost reduction. But only in the right regimes:
3/ First insight: PD does not make prefill faster. PD adds a KV transfer step between prefill and decode workers. That means TTFT can get worse, even when throughput improves. For strict TTFT SLAs, aggregated serving is often simpler and better.
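A rough way to see the TTFT effect is a simple additive model. This is a sketch under stated assumptions: queueing and prefill time are taken to be identical in both setups, and every number is illustrative, not taken from the benchmark.

```python
def ttft_aggregated(queue_s: float, prefill_s: float) -> float:
    """TTFT when prefill and decode share the same GPUs (toy model)."""
    return queue_s + prefill_s


def ttft_disaggregated(queue_s: float, prefill_s: float, kv_transfer_s: float) -> float:
    """TTFT with PD: same prefill work, plus moving the KV cache to a decode worker."""
    return queue_s + prefill_s + kv_transfer_s


# Illustrative values only (assumptions, not measurements).
agg = ttft_aggregated(queue_s=0.05, prefill_s=0.60)
pd = ttft_disaggregated(queue_s=0.05, prefill_s=0.60, kv_transfer_s=0.08)
print(f"aggregated TTFT:    {agg:.2f} s")
print(f"disaggregated TTFT: {pd:.2f} s")
```

In this model PD can never beat aggregated serving on TTFT unless it also reduces queueing or prefill time, which is the point of the first insight.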
4/ Second insight: PD’s real win is TPOT. In aggregated serving, prefill and decode share the same GPUs. As load increases, prefill work interrupts decode, and TPOT degrades. With PD, decode runs on dedicated GPUs, so TPOT stays much flatter under load.
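A toy model of that interference, not how any particular scheduler works: assume a fraction of decode iterations also carry a chunk of prefill work, and that each such chunk adds a fixed delay. The names `prefill_share` and `prefill_chunk_s` are hypothetical, and the numbers are illustrative.

```python
def tpot_s(decode_step_s: float, prefill_share: float, prefill_chunk_s: float) -> float:
    """Mean time per output token when some decode steps are slowed by prefill.

    prefill_share: fraction of decode iterations that also run prefill work
                   (0.0 on a dedicated decode worker in this toy model).
    """
    return decode_step_s + prefill_share * prefill_chunk_s


# Aggregated serving: more load means more iterations share the GPU with prefill.
for share in (0.0, 0.25, 0.5):
    ms = 1000 * tpot_s(decode_step_s=0.025, prefill_share=share, prefill_chunk_s=0.040)
    print(f"prefill share {share:.2f}: TPOT ~ {ms:.0f} ms")

# Dedicated decode workers: prefill share stays at 0, so TPOT stays at the base step.
```

Under these assumptions, aggregated TPOT climbs as the prefill share rises with load, while a decode-only pool stays at the base step time.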
5/ Third insight: TPOT savings compound over generation length. A 5–10ms/token improvement may look small. But over hundreds or thousands of output tokens, it turns into meaningful E2E latency and throughput gains. This matters for reasoning, agents, and long-form generation.
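The compounding is easy to check with the usual approximation E2E = TTFT + TPOT × (output tokens − 1). The sketch below uses made-up numbers: PD is assumed to cost 80 ms of extra TTFT and save 8 ms per token.

```python
def e2e_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """First token arrives at TTFT; each later token adds one TPOT."""
    return ttft_s + tpot_s * (output_tokens - 1)


# Illustrative values only (assumptions, not benchmark results).
AGG = {"ttft_s": 0.65, "tpot_s": 0.040}
PD = {"ttft_s": 0.73, "tpot_s": 0.032}

for n in (10, 100, 1000):
    agg = e2e_latency_s(output_tokens=n, **AGG)
    pd = e2e_latency_s(output_tokens=n, **PD)
    print(f"{n:>5} tokens: aggregated {agg:7.2f} s, PD {pd:7.2f} s")
```

With these numbers PD is slightly slower end to end at 10 output tokens and several seconds faster at 1,000, which is why short outputs are a poor fit.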
6/ Fourth insight: the P:D ratio is workload-dependent. Changing ISL/OSL, cache hit rate, or target QPS can change the optimal split. A bad ratio can make PD strictly worse than aggregated. Start with 1:1, then move GPUs toward the bottleneck. Better still, adjust the ratio dynamically at runtime as the workload profile changes.
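One way to picture "move GPUs toward the bottleneck" is a single rebalancing step. This is a hypothetical heuristic, not an API from Ray Serve or vLLM: the utilization inputs and the `margin` threshold are assumptions you would replace with your own metrics.

```python
def rebalance(
    prefill_workers: int,
    decode_workers: int,
    prefill_util: float,
    decode_util: float,
    margin: float = 0.10,
) -> tuple[int, int]:
    """Move one worker toward the busier pool; keep at least one worker in each."""
    if prefill_util - decode_util > margin and decode_workers > 1:
        return prefill_workers + 1, decode_workers - 1
    if decode_util - prefill_util > margin and prefill_workers > 1:
        return prefill_workers - 1, decode_workers + 1
    # Pools are roughly balanced, or the donor pool cannot shrink further.
    return prefill_workers, decode_workers


# Start at 1:1 (4 prefill, 4 decode); decode is the bottleneck in this example.
print(rebalance(4, 4, prefill_util=0.55, decode_util=0.95))
```

Calling this repeatedly against fresh measurements approximates the "start with 1:1, then shift" procedure. A real system would also need to account for the cost of draining and restarting workers.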
7/ Takeaway:
PD is not a universal win. It helps when the workload is TPOT/E2E-sensitive and generations are long enough for per-token savings to compound. It can lose when TTFT dominates, outputs are short, or the P:D ratio is wrong.
Full post with intuition, benchmarks, and reproducible AMD + Ray + vLLM setup: