@CyrusHakha: One pattern we keep seeing with customers serving LLMs at scale: Prefill-decode disaggregation is often treated like a …

X AI KOLs Following 06/15/26, 05:20 PM News

llm-serving prefill-decode disaggregation inference-optimization vllm amd scaling

Summary

Discusses the nuanced reality of prefill-decode disaggregation in LLM serving at scale, based on customer patterns and validated on AMD with vLLM.


Original Article


One pattern we keep seeing with customers serving LLMs at scale:

Prefill-decode disaggregation is often treated like a magic wand. But the reality is more nuanced.

So we wrote down the core insights for when PD helps, when it does not, and validated them on AMD + vLLM — where the PD path has been much less paved.

2/ In this post, we benchmark PD on AMD MI325X with Ray Serve + vLLM. Across Qwen3-235B and DeepSeek-V3 workloads, PD delivers up to 2.7x better goodput and up to 67% compute cost reduction. But only in the right regimes:

3/ First insight: PD does not make prefill faster. PD adds a KV cache transfer step between prefill and decode workers. That means TTFT (time to first token) can get worse, even when throughput improves. For strict TTFT SLAs, aggregated serving is often simpler and better.

4/ Second insight: PD’s real win is TPOT (time per output token). In aggregated serving, prefill and decode share the same GPUs. As load increases, prefill work interrupts decode, and TPOT degrades. With PD, decode runs on dedicated GPUs, so TPOT stays much flatter under load.

5/ Third insight: TPOT savings compound over generation length. A 5–10ms/token improvement may look small. But over hundreds or thousands of output tokens, it turns into meaningful E2E latency and throughput gains. This matters for reasoning, agents, and long-form generation.
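To make the compounding concrete, here is a minimal sketch that models E2E latency as TTFT plus one TPOT interval per remaining output token, then finds the output length at which PD's per-token saving outweighs its extra TTFT. The function names and every number below are hypothetical illustrations, not measurements from the benchmarks.

```python
import math


def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency: first token, then one TPOT interval per remaining token."""
    return ttft_ms + tpot_ms * max(output_tokens - 1, 0)


def break_even_output_tokens(ttft_penalty_ms: float, tpot_saving_ms: float) -> int:
    """Smallest output length at which PD is strictly faster end to end.

    PD wins once tpot_saving_ms * (n - 1) > ttft_penalty_ms.
    """
    if tpot_saving_ms <= 0:
        raise ValueError("PD needs a positive TPOT saving to ever break even")
    return math.floor(ttft_penalty_ms / tpot_saving_ms) + 2


# Hypothetical numbers: PD costs 200 ms of extra TTFT but saves 5 ms per token.
AGG_TTFT_MS, AGG_TPOT_MS = 300, 30
PD_TTFT_MS, PD_TPOT_MS = 500, 25

n_star = break_even_output_tokens(PD_TTFT_MS - AGG_TTFT_MS, AGG_TPOT_MS - PD_TPOT_MS)

for n in (10, n_star, 1000):
    agg = e2e_latency_ms(AGG_TTFT_MS, AGG_TPOT_MS, n)
    pd = e2e_latency_ms(PD_TTFT_MS, PD_TPOT_MS, n)
    print(f"output_tokens={n}: aggregated={agg} ms, PD={pd} ms")
```

Under these assumed numbers, short outputs favor aggregated serving, while at 1000 output tokens the same 5 ms/token saving adds up to several seconds.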

6/ Fourth insight: the P:D ratio is workload-dependent. Changing ISL/OSL (input/output sequence length), cache hit rate, or target QPS can change the optimal split. A bad ratio can make PD strictly worse than aggregated. Start with 1:1, then move GPUs toward the bottleneck. Even better: adjust the ratio dynamically at runtime as the workload profile changes.
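The "start with 1:1, then move GPUs toward the bottleneck" heuristic could be sketched roughly as follows. This is an illustration only: `rebalance`, its utilization inputs, and the `margin` threshold are hypothetical names and values, not part of Ray Serve or vLLM, and a real deployment would likely move whole replicas (for example, tensor-parallel groups) rather than single GPUs.

```python
def rebalance(
    prefill_workers: int,
    decode_workers: int,
    prefill_util: float,
    decode_util: float,
    margin: float = 0.10,
) -> tuple[int, int]:
    """Shift one worker toward whichever pool is the clearer bottleneck.

    Utilizations are fractions in [0, 1] measured by the operator (assumed
    inputs). Each pool always keeps at least one worker.
    """
    if prefill_util - decode_util > margin and decode_workers > 1:
        return prefill_workers + 1, decode_workers - 1
    if decode_util - prefill_util > margin and prefill_workers > 1:
        return prefill_workers - 1, decode_workers + 1
    # Within the margin, or no spare worker to move: keep the current split.
    return prefill_workers, decode_workers


# Start at 1:1 on 8 workers; prefill is saturated while decode has headroom.
split = rebalance(4, 4, prefill_util=0.95, decode_util=0.60)
print(split)
```

Calling this repeatedly against fresh measurements moves one worker per step toward the bottleneck and stops once the two pools are within `margin` of each other.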

7/ Takeaway:

PD is not a universal win. It helps when the workload is TPOT/E2E-sensitive and generations are long enough for per-token savings to compound. It can lose when TTFT dominates, outputs are short, or the P:D ratio is wrong.

Full post with intuition, benchmarks, and reproducible AMD + Ray + vLLM setup:

