@CyrusHakha: One pattern we keep seeing with customers serving LLMs at scale: Prefill-decode disaggregation is often treated like a …

Summary

Discusses the nuanced reality of prefill-decode disaggregation in LLM serving at scale, based on customer patterns and validated on AMD with vLLM.

Cached at: 06/15/26, 11:08 PM

One pattern we keep seeing with customers serving LLMs at scale:

Prefill-decode disaggregation is often treated like a magic wand. But the reality is more nuanced.

So we wrote down the core insights for when PD helps, when it does not, and validated them on AMD + vLLM — where the PD path has been much less paved.

2/ In this post, we benchmark PD on AMD MI325X with Ray Serve + vLLM. Across Qwen3-235B and DeepSeek-V3 workloads, PD delivers up to 2.7x better goodput and up to 67% compute cost reduction. But only in the right regimes:

3/ First insight: PD does not make prefill faster. PD adds a KV transfer step between prefill and decode workers. That means TTFT can get worse, even when throughput improves. For strict TTFT SLAs, aggregated serving is often simpler and better.
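To put a rough number on that extra KV transfer step, here is a back-of-the-envelope sketch. It assumes a standard multi-head or grouped-query attention cache layout (one K and one V tensor per layer); the function names, model dimensions, and link bandwidth are hypothetical placeholders, not measurements from the post. Models with compressed caches, such as DeepSeek-V3 with multi-head latent attention, would need a different size formula.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for a prompt, assuming a plain MHA/GQA layout.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit cache dtype (an assumption).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def kv_transfer_ms(kv_bytes, link_gbytes_per_s):
    """Idealized transfer time in milliseconds, ignoring protocol overhead."""
    return kv_bytes / (link_gbytes_per_s * 1e9) * 1e3


def pd_ttft_ms(queue_ms, prefill_ms, transfer_ms):
    """TTFT under PD: the same prefill work plus the KV handoff."""
    return queue_ms + prefill_ms + transfer_ms


if __name__ == "__main__":
    # Hypothetical model shape and link speed, for illustration only.
    size = kv_cache_bytes(num_tokens=8000, num_layers=60, num_kv_heads=8, head_dim=128)
    print(f"KV cache: {size / 1e9:.2f} GB")
    print(f"Transfer: {kv_transfer_ms(size, link_gbytes_per_s=50):.1f} ms")
```

The point of the sketch is only that the transfer term is additive: whatever it comes to on a given interconnect, it lands on top of prefill time rather than replacing any of it.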

4/ Second insight: PD’s real win is TPOT. In aggregated serving, prefill and decode share the same GPUs. As load increases, prefill work interrupts decode, and TPOT degrades. With PD, decode runs on dedicated GPUs, so TPOT stays much flatter under load.

5/ Third insight: TPOT savings compound over generation length. A 5–10ms/token improvement may look small. But over hundreds or thousands of output tokens, it turns into meaningful E2E latency and throughput gains. This matters for reasoning, agents, and long-form generation.
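The compounding effect can be checked with simple arithmetic. The sketch below uses the common approximation E2E = TTFT + (output tokens - 1) x TPOT; the function names and the TTFT penalty figure are illustrative assumptions, not numbers from the benchmarks.

```python
import math


def e2e_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """End-to-end latency: first token at TTFT, then one TPOT per remaining token."""
    return ttft_ms + max(output_tokens - 1, 0) * tpot_ms


def e2e_saving_ms(tpot_saving_ms, ttft_penalty_ms, output_tokens):
    """Net E2E gain from PD: per-token savings minus the one-time TTFT cost."""
    return max(output_tokens - 1, 0) * tpot_saving_ms - ttft_penalty_ms


def break_even_output_tokens(tpot_saving_ms, ttft_penalty_ms):
    """Smallest output length at which the per-token savings cover the TTFT penalty."""
    return math.ceil(ttft_penalty_ms / tpot_saving_ms) + 1


if __name__ == "__main__":
    # Hypothetical: PD saves 5 ms/token but adds 200 ms to TTFT.
    print(break_even_output_tokens(5.0, 200.0))
    print(e2e_saving_ms(5.0, 200.0, output_tokens=1001))
```

Under these assumed numbers, a 5 ms/token saving pays back a 200 ms TTFT penalty after 41 output tokens, and a 1,001-token generation finishes 4.8 seconds sooner. Short outputs never reach break-even, which is why PD can lose on them.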

6/ Fourth insight: the P:D ratio is workload-dependent. Changing input/output sequence lengths (ISL/OSL), cache hit rate, or target QPS can change the optimal split. A bad ratio can make PD strictly worse than aggregated. Start with 1:1, then move GPUs toward the bottleneck. Better still, adjust the ratio dynamically at runtime as the workload profile changes.
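The "start with 1:1, then move GPUs toward the bottleneck" heuristic can be sketched as a small rebalancing step. This is a toy illustration, not the tuning procedure from the full post: the `rebalance` function, the utilization inputs, and the tolerance threshold are all hypothetical, and a real deployment would move whole replicas and use whatever saturation signal its serving stack exposes.

```python
def rebalance(prefill_gpus, decode_gpus, prefill_util, decode_util, tolerance=0.1):
    """Shift one GPU toward whichever pool is the bottleneck.

    prefill_util / decode_util are assumed saturation signals in [0, 1]
    (hypothetical inputs). Each pool always keeps at least one GPU.
    """
    if prefill_util - decode_util > tolerance and decode_gpus > 1:
        # Prefill is the bottleneck: take a GPU from decode.
        return prefill_gpus + 1, decode_gpus - 1
    if decode_util - prefill_util > tolerance and prefill_gpus > 1:
        # Decode is the bottleneck: take a GPU from prefill.
        return prefill_gpus - 1, decode_gpus + 1
    # Pools are roughly balanced; leave the split alone.
    return prefill_gpus, decode_gpus


if __name__ == "__main__":
    split = (4, 4)  # start at 1:1
    split = rebalance(*split, prefill_util=0.9, decode_util=0.5)
    print(split)
```

Running the step periodically against live metrics is one way to approximate the dynamic ratio adjustment mentioned above, with the tolerance acting as a guard against oscillating between splits.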

7/ Takeaway:

PD is not a universal win. It helps when the workload is TPOT/E2E-sensitive and generations are long enough for per-token savings to compound. It can lose when TTFT dominates, outputs are short, or the P:D ratio is wrong.

Full post with intuition, benchmarks, and reproducible AMD + Ray + vLLM setup:
