@kazukifujii: Sakura Internet's Michishita-san's article comprehensively summarizes LLM Inference and comes highly recommended. It fe…

X AI KOLs Timeline 06/18/26, 07:04 AM News

llm inference pd-disaggregation goodput serving latency nvidia-dynamo

Summary

This article summarizes a presentation by Junda Chen on disaggregated inference for LLMs. It explains why goodput (throughput that meets latency SLOs) matters more than raw throughput, and how separating the prefill and decode phases improves performance. It also notes the work's influence on NVIDIA Dynamo.

Sakura Internet's Michishita-san's article comprehensively summarizes LLM Inference and comes highly recommended.

It features easy-to-understand diagrams explaining the motivation behind Chunked-Prefill, and it also touches on PD Disaggregation. (The explanation of the transition from Continuous Batching to Chunked-Prefill is written in a very clear and accessible way, which is great.)

PD Disaggregation (separating Prefill and Decode) should be a pretty hot topic, but I don't hear much discussion about it domestically in Japan, so I think this article, which carefully explains it, is quite valuable.

For a straightforward concrete example of Disaggregation, just check out the talk by the author of DistServe (https://youtu.be/tIPDwUepXcA?si=4YcNpkGnDrxC-jfq…), and I feel like this one article alone should suffice for grasping the concept.

Target article:

TL;DR

Separating the prefill and decode phases onto different GPU pools improves goodput (throughput that meets latency Service Level Objectives, or SLOs), which matters more than raw aggregate throughput.

Introduction

In GPU Mode Session #58 (video shared by @kazukifujii), Junda Chen, a second-year PhD student at UC San Diego’s Hao AI Lab, presents disaggregated inference for Large Language Models (LLMs). He argues that goodput is a better metric than raw throughput for evaluating LLM serving systems because it accounts for per‑request latency constraints. Chen’s work has influenced industry frameworks such as NVIDIA Dynamo, which was featured in Jensen Huang’s GTC 2025 keynote.

Why Goodput Matters More Than Throughput

Service‑Level Objectives (SLOs)

Different applications (chatbots, search engines, code agents) impose different SLOs on two key latencies:

TTFT (Time To First Token): e.g., 1–2 seconds for chat; 100–200 ms for code agents.
TPOT (Time Per Output Token): the time between successive generated tokens.

Providers monitor these SLOs to keep service healthy. Historically, cost per request was measured by raw throughput (requests/second or tokens/second). But raw throughput ignores whether users actually experience acceptable latency.

The Goodput Concept

Goodput is the number of requests completed per second that satisfy both the TTFT and TPOT SLOs. For example:

A system handling 10 requests/second may drop to 3 good requests/second if constraints (TTFT ≤ 200 ms, TPOT ≤ 50 ms) are enforced.
High raw throughput does not guarantee high goodput; an overloaded system delivers many “bad” requests that waste GPU compute and harm user experience.
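The distinction can be made concrete with a minimal sketch that counts requests against both SLOs. The `RequestStats` record and the per-request latency values below are hypothetical, chosen only to reproduce the 10 versus 3 requests/second example above; they are not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    ttft_ms: float  # time to first token
    tpot_ms: float  # mean time per output token


def throughput_and_goodput(requests, window_s, ttft_slo_ms, tpot_slo_ms):
    """Return (requests/s, SLO-compliant requests/s) over a time window."""
    good = sum(
        1
        for r in requests
        if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms
    )
    return len(requests) / window_s, good / window_s


# Hypothetical sample: ten requests completed in one second.
observed = (
    [RequestStats(150, 40)] * 3    # meets both SLOs
    + [RequestStats(450, 40)] * 4  # TTFT too slow
    + [RequestStats(150, 90)] * 3  # TPOT too slow
)

# SLOs from the example: TTFT <= 200 ms, TPOT <= 50 ms.
throughput, goodput = throughput_and_goodput(observed, 1.0, 200, 50)
print(throughput, goodput)
```

A request counts toward goodput only if it passes both checks, so a fast first token followed by slow streaming still counts as a "bad" request.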

Chen highlights a graph from GTC 2025 (GB200 NVL72 with FP4) that shows a trade‑off between aggregate throughput and per‑user token rate. Picking the right operating point (e.g., 200 tokens/s/user) maximizes goodput, not raw throughput.

Prefill vs. Decode: Different Workload Characteristics

Prefill (Compute‑Bound)

Processes the entire input prompt (system prompt + user input) in parallel.
Computes KV‑cache and generates the first output token.
Even a single prefill request can saturate a GPU’s compute (e.g., a 1K‑token prompt can saturate an A100).

Decode (Memory‑Bound)

Generates tokens one by one (autoregressive).
Dominated by moving model weights and the KV‑cache from GPU memory (HBM) to on‑chip SRAM at every step.
Becomes compute‑saturated only at very large batch sizes, which often run out of memory first.
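A rough roofline-style estimate illustrates why the two phases differ. This is a simplified sketch, not a profile: it counts only weight traffic for one dense layer (ignoring activations and the KV-cache), and the peak figures are assumed, approximate A100 numbers (about 312 TFLOP/s dense FP16 and about 2.0 TB/s of HBM bandwidth).

```python
def arithmetic_intensity(tokens_per_step, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for one dense layer.

    A (tokens x d) @ (d x d) matmul costs 2 * tokens * d * d FLOPs and
    reads d * d weights, so d cancels out. Activation and KV-cache
    traffic are ignored, which is a simplification.
    """
    return 2 * tokens_per_step / bytes_per_weight


def is_compute_bound(tokens_per_step, peak_flops, mem_bandwidth):
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte where the roofline bends
    return arithmetic_intensity(tokens_per_step) >= ridge


# Assumed approximate A100 peaks, used for illustration only.
PEAK_FLOPS = 312e12  # dense FP16 FLOP/s
MEM_BW = 2.0e12      # HBM bytes/s

ridge = PEAK_FLOPS / MEM_BW
prefill_1k = is_compute_bound(1024, PEAK_FLOPS, MEM_BW)  # 1K-token prompt
decode_b1 = is_compute_bound(1, PEAK_FLOPS, MEM_BW)      # one token, batch 1
decode_b64 = is_compute_bound(64, PEAK_FLOPS, MEM_BW)    # 64 decodes batched
print(ridge, prefill_1k, decode_b1, decode_b64)
```

Under these assumptions a step needs on the order of 150 tokens before compute becomes the limit. A 1K-token prefill clears that bar on its own, while decode must batch many requests to reach it.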

The Problem: Mixed Batching Hurts Decode Latency

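Most systems batch prefill and decode together on the same GPUs, typically using chunked prefill (splitting a prompt into chunks that share a batch with ongoing decode steps). Chen demonstrates the cost with a Llama 13B model on an A100: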

One prefill (128 tokens): ~14 ms.
One decode: ~7.5 ms.
When batched together: decode latency jumps to ~14 ms (1.8× increase).
For longer prompts (1K+ tokens), the interference grows to a 12× decode slowdown.

This happens because a new prefill request is computationally heavy, so every decode request sharing its batch must wait for the prefill to finish before emitting its next token. Goodput plummets.
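The effect on TPOT can be sketched with a toy model that uses the latencies quoted above (7.5 ms for a decode-only step, 14 ms for a step that also carries a 128-token prefill). The linear mixing model and the `prefill_fraction` parameter are assumptions for illustration, not something taken from the talk.

```python
def mean_tpot_ms(decode_ms, mixed_ms, prefill_fraction):
    """Mean per-token latency when a fraction of steps also carry a prefill."""
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be in [0, 1]")
    return (1 - prefill_fraction) * decode_ms + prefill_fraction * mixed_ms


DECODE_MS = 7.5  # decode-only step (Llama 13B on an A100, per the talk)
MIXED_MS = 14.0  # step that also carries a 128-token prefill

# Worst-case slowdown for a single step that shares a batch with a prefill.
slowdown = MIXED_MS / DECODE_MS

for fraction in (0.0, 0.25, 0.5, 1.0):
    print(fraction, mean_tpot_ms(DECODE_MS, MIXED_MS, fraction))
```

The per-step ratio works out to just under 1.9×, in line with the roughly 1.8× figure quoted above. As more steps carry a prefill, the mean TPOT drifts toward the mixed-step latency, and requests with a tight TPOT SLO start to fail it.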

Disaggregated Inference as a Solution

Separating prefill and decode onto dedicated GPU pools avoids this interference. Prefill GPUs handle compute‑heavy prompt processing; decode GPUs handle memory‑bound token generation. This allows each pool to be independently sized and batched, improving goodput.
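Independent sizing can be illustrated with a back-of-the-envelope calculation. This is a minimal sketch: the `size_pools` helper and the per-GPU capacities are hypothetical, and a real deployment would also account for KV-cache transfer between pools and for bursty arrivals.

```python
import math


def size_pools(request_rate, prefill_rate_per_gpu, decode_rate_per_gpu):
    """GPUs needed in each pool so both keep up with the arrival rate.

    Rates are requests/second that one GPU can serve while still meeting
    its SLO (TTFT for the prefill pool, TPOT for the decode pool).
    """
    prefill_gpus = math.ceil(request_rate / prefill_rate_per_gpu)
    decode_gpus = math.ceil(request_rate / decode_rate_per_gpu)
    return prefill_gpus, decode_gpus


# Hypothetical capacities, not measurements from the talk.
prefill_gpus, decode_gpus = size_pools(
    request_rate=20.0,
    prefill_rate_per_gpu=5.0,
    decode_rate_per_gpu=8.0,
)
print(prefill_gpus, decode_gpus)
```

Because each pool is sized against its own SLO, a workload with long prompts and short outputs can add prefill GPUs without over-provisioning decode, and vice versa.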

NVIDIA Dynamo (announced at GTC 2025) provides a framework for such disaggregation. Chen notes that the industry has quickly adopted this approach; most major companies now have their own disaggregated serving features.

Conclusion

Aggregate throughput can mislead: a system that maximizes tokens per second may still deliver a poor user experience. Goodput (throughput that meets SLOs) is a better cost metric. Disaggregating prefill and decode is a practical way to achieve high goodput with fewer GPUs, as demonstrated by recent research and industry adoption.

Source: GPU Mode Session #58 – Disaggregated Inference
