@robertnishihara: Some intuition about PD disaggregation from the blog - PD doesn't speed up prefill and can actually hurt TTFT - PD's re…

X AI KOLs Following 06/17/26, 02:41 AM News

prefill-decode-disaggregation llm-serving ray vllm amd cost-savings goodput

Summary

This blog post from Anyscale explains the intuition behind Prefill-Decode (PD) disaggregation for LLM serving, showing how separating prefill and decode phases onto dedicated GPUs can achieve up to 2.7x better goodput and 67% cost savings when using Ray and vLLM on AMD MI325X, while also discussing when PD disaggregation does not help.

Some intuition about PD disaggregation from the blog - PD doesn't speed up prefill and can actually hurt TTFT - PD's real benefit is flat, stable TPOT under load - TPOT savings compound over output sequence length The optimal P:D ratio is dependent in particular on input lengths, output lengths, and cache hits. Meaningful optimizations are possible, but tuning can be sensitive. Benchmarks performed with @raydistributed + @vllm_project on AMD MI325X.

Original Article

View Cached Full Text

Cached at: 06/17/26, 03:44 AM

Some intuition about PD disaggregation from the blog

PD doesn’t speed up prefill and can actually hurt TTFT
PD’s real benefit is flat, stable TPOT under load
TPOT savings compound over output sequence length

The optimal P:D ratio is dependent in particular on input lengths, output lengths, and cache hits. Meaningful optimizations are possible, but tuning can be sensitive.

Benchmarks performed with @raydistributed + @vllm_project on AMD MI325X.

Achieving Up to 67% Cost Savings with Prefill-Decode Disaggregation Using Ray + vLLM on AMD MI325X | Anyscale

Source: https://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings In LLM serving, the optimization objective is deceptively simple: given a set of latency SLA targets – time to first token (TTFT), time per output token (TPOT), end-to-end latency (E2E) – maximize the queries per second (QPS) you can sustain, also known as the “goodput”.

One of the most powerful levers for breaking through the goodput ceiling isPrefill-Decode (PD) disaggregation. In this post, we cover how we used Ray Serve LLM to orchestrate PD disaggregation inference workloads on AMD, achieving up to2.7x better goodput.

The key advantage for using PD disaggregation is that instead of running both phases on the same GPUs – where they compete for compute, memory bandwidth, and scheduling budget – PD separates them onto dedicated hardware. Prefill GPUs handle prompt processing. Decode GPUs handle token generation. By eliminating mutual interference, each phase runs closer to its theoretical throughput, and the system as a whole serves more requests under the same SLA constraints.

However, this is no free lunch – PD adds operational complexity: KV cache must be transferred across nodes and the prefill-to-decode ratio must be tuned per workload. We show results where under the same GPU budget and SLA, Prefill-Decode disaggregation on Ray + vLLM can serve1.3x to 2.3x more QPSthan aggregated serving – depending on the workload (up to67% compute cost reduction) and also results where it does not help – so you can make the right decision for your workload.

PD vs Aggregated - Max Sustainable QPS Under SLA (Same GPU Count) PD vs Aggregated max sustainable QPS under SLA across 5 workload scenarios (Model, ISL/OSL and hit-rate variations). Validated on Qwen3-235B and DeepSeek-V3 on AMD MI325X.

PD vs Aggregated - Max Sustainable QPS Under SLA (Same GPU Count) We tested two large MoE models across a range of workloads – varying input/output lengths, KV cache hit rates, and P:D ratios – to find where PD disaggregation saves cost and where it doesn’t. This post walks through the core intuition behind why PD disaggregation helps, the AMD stack needed to enable PD disaggregation (RIXL for KV transfer), and how to set it up with Ray Serve. For a managed solution you can useAnyscalebringing similar cost savings to your workloads.

LinkCore Intuition – Why PD Works (and When It Doesn’t)https://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#core-intuition-%E2%80%93-why-pd-works-(and-when-it-doesn’t)

This section covers the four key insights you need to reason about PD disaggregation for any workload. Each insight includes data from our experiments, plus clear guidance on when PD disaggregation loses. These results are agnostic to hardware setup and should be taken as general guidelines and conclusions about prefill decode disaggregation.

LinkInsight 1: PD does NOT make prefill faster – it can actually hurt TTFThttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#insight-1:-pd-does-not-make-prefill-faster-%E2%80%93-it-can-actually-hurt-ttft

The most common misconception about PD is that it speeds up everything. It does not. On the metric that matters most for interactive responsiveness – time to first token – PD is consistently slower than aggregated serving on the same GPU footprint.

**Why aggregated TTFT is already good.**In vLLM’s scheduler, there is no separate “prefill phase” or “decode phase.” The scheduler runs all currently-active requests – both prefill and decode – before admitting new requests from the waiting queue. Chunked prefill is enabled by default for all decoder-only models, i.e. long prompts are split into chunks sized bymax\_num\_batched\_tokens(defaults to 8192). Each chunk runs as one scheduler iteration. Typically, decode steps consume trivially little budget per iteration. For example, a batch of 128 concurrent decode requests uses at most 128 tokens out of the 8192-token budget, leaving the vast majority of each iteration available for prefill tokens. In this case, when a new request gets added to the batch, the 8064 unused token budget gets allocated to the prefill of the new request. This implies inflation on the TPOT of that iteration but not much inflation on the TTFT. TTFT is therefore dominated by the raw compute time of the prefill forward pass (attention + MoE routing), not by contention with decode.

Aggregated Serving - vLLM Scheduler Timeline In aggregated serving, decode tokens consume only ~1.6% of the scheduler’s token budget per iteration — TTFT is dominated by prefill compute, not decode contention.

Aggregated Serving - vLLM Scheduler Timeline **What PD changes.**PD adds a KV cache transfer step after prefill completes. The prefill node sends KV data over the network (RDMA/RoCE) to the decode node. This transfer has inherent overhead that depends on model architecture, KV cache size, and network conditions. Under high load and kv-cache pressure, prefill nodes can also queue up, adding queuing delay on top of transfer overhead.

PD Serving - KV Transfer Adds TTFT Overhead In PD serving, the KV cache transfer step between prefill and decode nodes adds overhead that inflates TTFT.

PD Serving - KV Transfer Adds TTFT Overhead **The net effect.**On the same GPU footprint, aggregated consistently achieves equal or lower TTFT than PD.

Qwen3-235B on MI325X & DeepSeek-V3 on MI325X (30- Hit Rate) Figure 3: TTFT vs QPS — Agg TTFT stays flat while PD TTFT rises under load.

Qwen3-235B on MI325X & DeepSeek-V3 on MI325X (30- Hit Rate) If your SLA is measured purely on time-to-first-token (e.g., interactive search, auto-complete), aggregated will consistently beat PD. As Figure 3 shows, PD’s TTFT baseline on DeepSeek-V3 is ~330ms (due to KV transfer overhead), while Agg stays at ~260ms across all QPS levels:

Under aTTFT < 300msSLA, PDcannot serve any traffic(baseline TTFT exceeds the target), while Agg sustains7.0+ QPS.
Under aTTFT < 500msSLA, Agg sustains7.0+ QPSvs PD’s5.0 QPS– Agg wins by at least 1.4x.

Agg’s advantage here is structural: no KV transfer step, and prefill load is naturally distributed across replicas. If you needbothfast TTFT and fast TPOT, consider accepting a slightly relaxed TTFT target – even a small relaxation can unlock major TPOT and E2E improvements through PD.

**Bottom line:**If your SLA is strictly TTFT-limited, aggregated is the simpler and better choice.

LinkInsight 2: PD’s real win is flat, stable TPOT under loadhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#insight-2:-pd’s-real-win-is-flat,-stable-tpot-under-load

This is the core mechanism behind PD’s value. In aggregated serving, prefill and decode share the same GPU. Each scheduler iteration that includes prefill tokens is compute-heavier than a pure-decode iteration. As QPS rises, more prefill work stacks up, and decode tokens wait longer. TPOT degrades linearly (or worse) with increasing QPS, because every new prefill request steals compute from all in-flight decode steps.

PD eliminates this entirely. Decode runs on dedicated GPUs that never see a prefill token. TPOT stays nearly flat regardless of how much prefill work is happening on other nodes.

Qwen3-235B and DeepSeek-V3 Figure 4: TPOT vs QPS – Agg TPOT rises steeply while PD stays flat. Under the target SLA, PD sustains significantly more QPS than Agg on both models.

Qwen3-235B and DeepSeek-V3

LinkInsight 3: TPOT savings compound over output sequence lengthhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#insight-3:-tpot-savings-compound-over-output-sequence-length

PD’s per-token TPOT advantage looks modest in isolation (5-10ms). But it multiplies across every output token:

Total savings = TPOT delta × output_length.

This compounding is why PD wins on E2E latency despite losing on TTFT.

PD E2E Advantage Grows With Output Length A 5.1ms/token TPOT advantage compounds: PD’s E2E win grows from 12% at OSL=140 to 17% at OSL=4K (Qwen3-235B, 24 GPU, QPS=4).

PD E2E Advantage Grows With Output Length **When PD loses: short output.**For short-output workloads (classification, extraction, short QA), the savings don’t accumulate enough to justify the complexity. In these cases you should use aggregated.

LinkInsight 4: The optimal P:D ratio depends on your workloadhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#insight-4:-the-optimal-p:d-ratio-depends-on-your-workload

This is the most practical insight for practitioners. The P:D ratio determines how GPU resources are split between prefill and decode. Getting it wrong can make PDworsethan aggregated.

Key findings across workloads:

Workload

Cache Hit Rate

Bottleneck

Optimal Ratio

Long input, short output (ISL=16K, OSL=1K)

Prefill throughput

2P:1D

Long input, long output (ISL=16K, OSL=4K)

Decode throughput

1P:3D

Multi-turn with high cache reuse

80%

Decode throughput

1P:2D

Multi-turn with moderate cache reuse

30–60%

Mixed

1P:1Dto1P:2D

**Rule of thumb:**The marginal GPU should go to wherever the bottleneck is. High cache hit rates make prefill cheap – allocate more to decode. Low cache hit rates with long inputs – allocate more to prefill.

The most common PD pitfall: deploying with a ratio that does not match the workload. This can make PD strictly worse than aggregated on every metric.

Wrong P:D Ratio Can Be Worse Than Aggregated Wrong P:D ratio can be dramatically worse than aggregated. Always benchmark your specific workload. Start with 1:1, then adjust based on whether TTFT or TPOT hits the SLA first.

Wrong P:D Ratio Can Be Worse Than Aggregated

LinkWhat’s Special About AMD – RIXL and the KV Transfer Stackhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#what’s-special-about-amd-%E2%80%93-rixl-and-the-kv-transfer-stack

PD disaggregation requires high-bandwidth KV cache transfer between prefill and decode nodes. On NVIDIA hardware, this is handled byNIXL(NVIDIA Interconnect eXchange Library) over NVLink, InfiniBand, or EFA. On AMD, we useRIXL(ROCm Interconnect eXchange Library) – a plug-and-play replacement for NIXL that uses UCX transport over RDMA/RoCE InfiniBand.

The key point is that RIXL exposes the sameNixlConnectorinterface in vLLM. Zero code changes are needed in the serving layer. If your vLLM config sayskv\_connector: NixlConnector, it works on both NVIDIA (via NIXL) and AMD (via RIXL).

LinkContainer and Software Stackhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#container-and-software-stack

Our container image is built fromDockerfile\.v0\.18\.devwith the following components:

Component

Version / Source

Base image

anyscale/ray:nightly\-py312\-cu128(orrayproject/ray:nightly\-py312\-cu128for OSS)

ROCm

7.0 (rocm/dev\-ubuntu\-22\.04:7\.0\-complete for buildstages)

vLLM

0.18.0 viawheels\.vllm\.ai/rocm/0\.18\.0/rocm700prebuilt wheels

RIXL

Built from source (ROCm/RIXLcommitf33a5599)

UCX

Built from source (ROCm/ucx commitda3fac2a) with–with\-rocm,–with\-verbs,–enable\-mt

ROCm Triton

Built from source (ROCm/tritoncommitf9e5bf54)

Python

3.12

VLLM_ROCM_USE_AITER=1                        # AMD AI Tensor Engine Runtime
VLLM_ROCM_USE_AITER_MOE=1                    # Optimized MoE kernels
VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4     # Fast all-reduce with INT4 quantization
VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16=1   # BF16-to-FP16 cast for quick-reduce

LinkA Critical Operational Notehttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#a-critical-operational-note

Network transport quality is everything.With proper UCX/RDMA configuration, cross-node KV transfer performs comparably to intra-node. With TCP fallback, throughput degrades catastrophically – we observed up to19x degradationin testing. Always validate the RDMA transport layer before benchmarking PD.

The UCX transport configuration in our deployments:

UCX_TLS=rc,sm,self,rocm_copy,rocm_ipc
UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1

This configures 8x Mellanox ConnectX interfaces for RoCE fabric, using reliable connected (RC) transport with shared memory and ROCm GPU direct paths. Hardware tested: AMD MI325X with 288GB HBM3e, 8 GPUs per node.

LinkHow to Reproducehttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#how-to-reproduce

Everything needed to reproduce these results is consolidated in a single repository: Dockerfile, serve configs, and benchmark scripts. The instructions assume you have a Ray cluster running on AMD MI325X nodes – viaAnyscale,KubeRay, or bare metal. All artifacts are availablehere.

LinkConclusionhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#conclusion

PD disaggregation on Ray + vLLM delivers1.3 –2.3x more QPSunder the same GPU budget and SLA – up to67% cost reductionon AMD MI325X. To summarize the main insights:

PD winswhen your SLA is TPOT- or E2E latency-sensitive and output is long enough for per-token savings to compound.
Aggregated winswhen TTFT is the binding constraint, output is short, or cache hit rates are high enough to eliminate prefill-decode contention.
**The P:D ratio matters.**Given the workload the optimal value could be different from case to case.

LinkGet Startedhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#get-started

**Reproduce our results.**Clone therepo, deploy a config, and run the benchmark CLI against your workload.
Need managed solutions:Anyscalegives you managed Ray services. You can deploy Anyscale on k8s and from then on you don’t have to manage Ray clusters.
**Talk to us.**Found a workload where PD behaves differently? We want to hear about it – reach out on theRay communityorRay’s slack.

@robertnishihara: Some intuition about PD disaggregation from the blog - PD doesn't speed up prefill and can actually hurt TTFT - PD's re…

Achieving Up to 67% Cost Savings with Prefill-Decode Disaggregation Using Ray + vLLM on AMD MI325X | Anyscale

LinkCore Intuition – Why PD Works (and When It Doesn’t)https://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#core-intuition-%E2%80%93-why-pd-works-(and-when-it-doesn’t)

LinkInsight 1: PD does NOT make prefill faster – it can actually hurt TTFThttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#insight-1:-pd-does-not-make-prefill-faster-%E2%80%93-it-can-actually-hurt-ttft

LinkInsight 2: PD’s real win is flat, stable TPOT under loadhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#insight-2:-pd’s-real-win-is-flat,-stable-tpot-under-load

LinkInsight 3: TPOT savings compound over output sequence lengthhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#insight-3:-tpot-savings-compound-over-output-sequence-length

LinkInsight 4: The optimal P:D ratio depends on your workloadhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#insight-4:-the-optimal-p:d-ratio-depends-on-your-workload

LinkWhat’s Special About AMD – RIXL and the KV Transfer Stackhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#what’s-special-about-amd-%E2%80%93-rixl-and-the-kv-transfer-stack

LinkContainer and Software Stackhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#container-and-software-stack

LinkA Critical Operational Notehttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#a-critical-operational-note

LinkHow to Reproducehttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#how-to-reproduce

LinkConclusionhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#conclusion

LinkGet Startedhttps://www.anyscale.com/blog/ray-vllm-prefill-decode-disaggregation-amd-mi325x-67-percent-savings#get-started

Similar Articles

@CyrusHakha: One pattern we keep seeing with customers serving LLMs at scale: Prefill-decode disaggregation is often treated like a …

@kazukifujii: Sakura Internet's Michishita-san's article comprehensively summarizes LLM Inference and comes highly recommended. It fe…

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

@raydistributed: Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput…

Submit Feedback

Similar Articles

@CyrusHakha: One pattern we keep seeing with customers serving LLMs at scale: Prefill-decode disaggregation is often treated like a …

@kazukifujii: Sakura Internet's Michishita-san's article comprehensively summarizes LLM Inference and comes highly recommended. It fe…

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

@raydistributed: Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput…