@kazukifujii: Sakura Internet's Michishita-san's article comprehensively summarizes LLM Inference and comes highly recommended. It fe…
Summary
This article summarizes a presentation by Junda Chen on disaggregated inference for LLMs, explaining why goodput (throughput that meets latency SLOs) matters more than raw throughput, and how separating the prefill and decode phases improves performance. It also notes the work's influence on NVIDIA Dynamo.
Sakura Internet’s Michishita-san’s article comprehensively summarizes LLM Inference and comes highly recommended.
It features easy-to-understand diagrams explaining the motivation behind Chunked-Prefill, and it also touches on PD Disaggregation. (The explanation of the transition from Continuous Batching to Chunked-Prefill is written in a very clear and accessible way, which is great.)
PD Disaggregation (separating Prefill and Decode) should be a pretty hot topic, but I don’t hear much discussion of it within Japan, so I think this article, which explains it carefully, is quite valuable.
For a straightforward, concrete example of Disaggregation, see the talk by the author of DistServe (https://youtu.be/tIPDwUepXcA?si=4YcNpkGnDrxC-jfq…); for grasping the concept itself, I feel this one article alone should suffice.
Target article:
TL;DR
Separating the prefill and decode phases onto different GPU pools improves goodput (throughput that meets latency Service Level Objectives, or SLOs), which is a more meaningful target than raw aggregate throughput.
Introduction
In GPU Mode Session #58 (video shared by @kazukifujii), Junda Chen, a second-year PhD student at UC San Diego’s Hao AI Lab, presents disaggregated inference for Large Language Models (LLMs). He argues that goodput is a better metric than simple throughput for evaluating LLM serving systems, because it accounts for per-request latency constraints. Chen’s work has influenced industry frameworks like NVIDIA Dynamo, featured in Jensen Huang’s GTC 2025 keynote.
Why Goodput Matters More Than Throughput
Service‑Level Objectives (SLOs)
Different applications (chatbots, search engines, code agents) impose different SLOs on two key latencies:
- TTFT (Time To First Token): e.g., 1–2 seconds for chat; 100–200 ms for code agents.
- TPOT (Time Per Output Token): the time between successive generated tokens.
Providers monitor these SLOs to keep the service healthy. Historically, serving cost was judged by raw throughput (requests/second or tokens/second), but raw throughput ignores whether users actually experience acceptable latency.
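As a minimal sketch of how the two latencies above could be measured, the snippet below derives TTFT and TPOT from per-token timestamps. The `RequestTrace` record and its field names are hypothetical, introduced only for illustration; real serving stacks expose this data in their own formats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request log (all times in seconds)."""
    arrival: float            # when the request reached the server
    token_times: List[float]  # emission time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time To First Token: delay from arrival to the first generated token."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Time Per Output Token: mean gap between successive tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single-token response has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


# Illustrative trace: first token after 0.5 s, then one token every 0.25 s.
trace = RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
print(ttft(trace), tpot(trace))
```

Here TPOT is taken as the mean inter-token gap; a provider could just as well track a tail percentile, which is a stricter reading of the same SLO.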
The Goodput Concept
Goodput is the number of requests completed per unit time that satisfy both the TTFT and TPOT SLOs. For example:
- A system handling 10 requests/second may drop to 3 good requests/second if constraints (TTFT ≤ 200 ms, TPOT ≤ 50 ms) are enforced.
- High raw throughput does not guarantee high goodput; an overloaded system delivers many “bad” requests that waste GPU compute and harm user experience.
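The definition above can be sketched in a few lines. The function name `goodput` and the sample measurements are illustrative assumptions; the default thresholds reuse the example constraints quoted above (TTFT ≤ 200 ms, TPOT ≤ 50 ms).

```python
from typing import List, Tuple


def goodput(requests: List[Tuple[float, float]], window_s: float,
            ttft_slo_s: float = 0.2, tpot_slo_s: float = 0.05) -> float:
    """Requests per second that meet BOTH the TTFT and the TPOT SLO.

    `requests` holds one (ttft_seconds, tpot_seconds) pair per request
    completed inside an observation window of `window_s` seconds.
    """
    good = sum(1 for t, p in requests if t <= ttft_slo_s and p <= tpot_slo_s)
    return good / window_s


# Made-up measurements for 10 requests completed in a 1-second window.
sample = (
    [(0.15, 0.04)] * 3    # meet both SLOs
    + [(0.40, 0.04)] * 4  # TTFT too slow
    + [(0.15, 0.08)] * 3  # TPOT too slow
)

raw_throughput = len(sample) / 1.0
print(raw_throughput, goodput(sample, window_s=1.0))
```

With these made-up numbers the raw throughput is 10 requests/second while the goodput is 3 requests/second, mirroring the example in the first bullet.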
Chen highlights a graph from GTC 2025 (GB200 NVL72 with FP4) that shows a trade-off between aggregate throughput and per-user token rate. Picking the right operating point (e.g., 200 tokens/s/user) maximizes goodput rather than raw throughput.
Prefill vs. Decode: Different Workload Characteristics
Prefill (Compute‑Bound)
- Processes the entire input prompt (system prompt + user input) in parallel.
- Computes KV‑cache and generates the first output token.
- Even a single prefill request can saturate GPU compute (e.g., a 1K-token prompt can saturate an A100 GPU).
Decode (Memory‑Bound)
- Generates tokens one by one (autoregressive).
- Dominated by moving model parameters (and the KV cache) from GPU memory to on-chip SRAM at every step.
- Becomes compute-saturated only with very large batch sizes, which often exhaust GPU memory and cause out-of-memory errors.
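A rough roofline-style estimate illustrates why the two phases differ. This is a sketch under stated assumptions, not a measurement: it models a single hidden × hidden linear layer in FP16, ignores attention and KV-cache traffic, assumes a hidden size of 5120 (the Llama 13B value), and uses approximate published A100 (40 GB) peak figures.

```python
def arithmetic_intensity(num_tokens: int, hidden: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one hidden x hidden linear layer over num_tokens tokens."""
    flops = 2 * num_tokens * hidden * hidden              # multiply + add per weight per token
    weight_bytes = bytes_per_elem * hidden * hidden       # weights are read once per batch
    act_bytes = bytes_per_elem * 2 * num_tokens * hidden  # read inputs, write outputs
    return flops / (weight_bytes + act_bytes)


# Assumptions: approximate A100 (40 GB) peaks and the Llama 13B hidden size.
PEAK_FLOPS = 312e12   # dense FP16 tensor-core FLOP/s
PEAK_BW = 1.555e12    # memory bandwidth in bytes/s
HIDDEN = 5120

# Intensity above this ridge point means compute-bound; below it, memory-bound.
ridge = PEAK_FLOPS / PEAK_BW

decode_single = arithmetic_intensity(1, HIDDEN)     # one request, one token per step
decode_batched = arithmetic_intensity(64, HIDDEN)   # 64 requests decoding together
prefill_1k = arithmetic_intensity(1024, HIDDEN)     # one 1K-token prompt

for name, value in [("decode x1", decode_single),
                    ("decode x64", decode_batched),
                    ("prefill 1K", prefill_1k)]:
    regime = "compute-bound" if value > ridge else "memory-bound"
    print(f"{name}: {value:.1f} FLOPs/byte ({regime}, ridge {ridge:.1f})")
```

Under these assumptions a single 1K-token prefill sits well above the ridge point, while decode stays below it even at a batch size of 64, which is consistent with the compute-bound and memory-bound labels above.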
The Problem: Mixed Batching Hurts Decode Latency
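Most serving systems batch prefill and decode work together on the same GPUs, often using chunked prefill (splitting a prompt into chunks that are batched alongside ongoing decode steps). Chen demonstrates the resulting interference with a Llama 13B model on an A100: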
- One prefill (128 tokens): ~14 ms.
- One decode: ~7.5 ms.
- When batched together: decode latency jumps to ~14 ms (1.8× increase).
- For longer prompts (1K+ tokens), the interference grows to a 12× decode slowdown.
This happens because a new prefill is computationally heavy, and every decode request batched with it must wait for the whole step to finish. Each ongoing decode is delayed, so goodput drops sharply.
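The arithmetic behind these numbers can be checked with a toy model. The two timings come from the figures quoted above; the assumption that a step carrying a prefill always takes the full mixed-batch time, and the `mean_tpot_ms` helper itself, are illustrative simplifications rather than part of Chen's analysis.

```python
DECODE_ALONE_MS = 7.5   # one decode step on its own (quoted above)
MIXED_STEP_MS = 14.0    # decode step batched with a 128-token prefill (quoted above)


def mean_tpot_ms(prefill_fraction: float) -> float:
    """Mean decode step time when a given fraction of steps also carry a prefill."""
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be in [0, 1]")
    return (1.0 - prefill_fraction) * DECODE_ALONE_MS + prefill_fraction * MIXED_STEP_MS


slowdown = MIXED_STEP_MS / DECODE_ALONE_MS
print(f"per-step slowdown: {slowdown:.2f}x")

for fraction in (0.0, 0.5, 1.0):
    print(f"prefill in {fraction:.0%} of steps -> mean TPOT {mean_tpot_ms(fraction):.2f} ms")
```

Because the quoted timings are approximate, the computed ratio lands slightly above the rounded 1.8× figure. The more often new prompts arrive, the closer every ongoing request's TPOT moves to the mixed-batch step time.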
Disaggregated Inference as a Solution
Separating prefill and decode onto dedicated GPU pools avoids this interference. Prefill GPUs handle compute-heavy prompt processing, then hand the resulting KV cache to decode GPUs, which handle memory-bound token generation. Each pool can then be sized and batched independently, which improves goodput.
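The control flow of such a split can be sketched with a toy, single-process simulation. All class and function names here (`Request`, `PrefillWorker`, `DecodeWorker`, `serve`) are hypothetical and do not correspond to the API of DistServe, NVIDIA Dynamo, or any other framework; the KV cache is a placeholder string, and the two pools run in turn rather than in parallel.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class Request:
    req_id: int
    prompt_tokens: int
    max_new_tokens: int
    kv_cache: Optional[str] = None   # placeholder for the real KV tensors
    generated: int = 0
    history: List[str] = field(default_factory=list)


class PrefillWorker:
    """Prefill pool: processes the whole prompt, builds the KV cache, emits token 1."""

    def run(self, req: Request) -> Request:
        req.kv_cache = f"kv[{req.req_id}:{req.prompt_tokens}]"
        req.generated = 1
        req.history.append("prefill")
        return req


class DecodeWorker:
    """Decode pool: only runs autoregressive steps on requests that arrive with a KV cache."""

    def __init__(self) -> None:
        self.running: List[Request] = []

    def admit(self, req: Request) -> None:
        if req.kv_cache is None:
            raise ValueError("decode needs the KV cache produced by prefill")
        self.running.append(req)

    def step(self) -> List[Request]:
        """One decode iteration: every running request emits exactly one token."""
        for req in self.running:
            req.generated += 1
            req.history.append("decode")
        finished = [r for r in self.running if r.generated >= r.max_new_tokens]
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]
        return finished


def serve(requests: List[Request]) -> List[Request]:
    prefill, decode = PrefillWorker(), DecodeWorker()
    queue: Deque[Request] = deque(requests)
    done: List[Request] = []
    while queue or decode.running:
        if queue:
            req = prefill.run(queue.popleft())
            if req.generated >= req.max_new_tokens:
                done.append(req)      # single-token answers never reach decode
            else:
                decode.admit(req)     # stands in for the KV-cache transfer
        done.extend(decode.step())
    return done


done = serve([Request(0, 128, 3), Request(1, 1024, 2)])
for r in done:
    print(r.req_id, r.history)
```

The point of the sketch is the separation of duties: a decode step never contains prompt processing, so a long incoming prompt cannot stretch the step time of requests that are already generating.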
NVIDIA Dynamo (announced at GTC 2025) provides a framework for such disaggregation. Chen notes that the industry has quickly adopted this approach; most major companies now have their own disaggregated serving features.
Conclusion
Aggregate throughput can mislead: a system that maximizes tokens per second may still deliver a poor user experience. Goodput (throughput that meets SLOs) is a better cost metric. Disaggregating prefill and decode is a practical way to achieve high goodput with fewer GPUs, as shown by recent research and industry adoption.