# Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
Source: [https://arxiv.org/html/2604.15039](https://arxiv.org/html/2604.15039)
Ruoyu Qin¹,², Weiran He¹, Yaoyu Wang¹, Zheming Li¹, Xinran Xu¹, Yongwei Wu², Weimin Zheng², Mingxing Zhang² (¹Moonshot AI, ²Tsinghua University)

###### Abstract

Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates enormous KVCache traffic that keeps prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, a smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization.

We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.

## 1 Introduction

Prefill-decode (PD) disaggregation has become the dominant deployment paradigm for large-scale LLM serving because it separates two fundamentally different phases of inference: prefill is primarily compute-intensive, whereas decode is primarily memory-bandwidth-intensive. At Moonshot AI, Mooncake [[22](https://arxiv.org/html/2604.15039#bib.bib8)] helped push this shift into practice by treating KVCache as a first-class systems resource, a direction that has since propagated across the broader serving ecosystem via our collaboration with open frameworks such as vLLM [[29](https://arxiv.org/html/2604.15039#bib.bib15)], SGLang [[7](https://arxiv.org/html/2604.15039#bib.bib16)], and Dynamo [[20](https://arxiv.org/html/2604.15039#bib.bib17)]. In principle, PD disaggregation should also unlock a more ambitious goal: heterogeneous serving, in which prefill runs on compute-dense accelerators and decode runs on bandwidth-optimized accelerators. Hardware roadmaps are already moving in this direction. NVIDIA's Rubin CPX [[19](https://arxiv.org/html/2604.15039#bib.bib18)] explicitly targets high-throughput long-context prefill, while architectures such as Groq's Language Processing Unit (LPU) [[1](https://arxiv.org/html/2604.15039#bib.bib24), [8](https://arxiv.org/html/2604.15039#bib.bib25)] emphasize the extreme memory bandwidth required for decode.

In practice, however, this heterogeneous vision remains difficult to realize because current PD disaggregation still relies on a strong networking assumption: when prefill and decode are placed on different nodes, the KVCache produced by prefill must be transferred quickly enough to avoid stalling computation. In conventional deployments, this effectively confines both phases to the same tightly coupled, high-bandwidth network domain, typically a single datacenter-scale RDMA fabric, where PD disaggregation works well inside a homogeneous cluster.

The difficulty is that this single-datacenter paradigm does not extend naturally to heterogeneous serving. Accelerator resources are usually pooled by both chip type and physical location, so compute-oriented and inference-oriented hardware are often unavailable within the same tightly coupled domain. This creates a strong incentive to separate prefill from decode across cluster boundaries, which can reduce costs and latency for long-context requests by moving prefill to faster compute hardware. However, this advantage materializes only if KVCache transfer remains sufficiently cheap. Once prefill and decode no longer share the same high-bandwidth fabric, the generated KVCache must traverse a slower inter-cluster link. If that transfer cost is too high, it erases the prefill-side gain and becomes the new bottleneck. Even when two clusters are geographically close, requiring them to share a single RDMA-scale fabric is operationally rigid and often unrealistic. Worse, once heterogeneous deployment is forced into one tightly coupled cluster, its prefill-to-decode hardware ratio becomes difficult to adapt as traffic patterns evolve. As a result, current PD deployments still fall short of the flexibility that heterogeneous disaggregation is supposed to provide.

The central obstacle, therefore, is KVCache transfer. Recent hybrid-attention architectures change this picture in an important way. Emerging models [[26](https://arxiv.org/html/2604.15039#bib.bib14), [27](https://arxiv.org/html/2604.15039#bib.bib20), [31](https://arxiv.org/html/2604.15039#bib.bib27), [3](https://arxiv.org/html/2604.15039#bib.bib21), [11](https://arxiv.org/html/2604.15039#bib.bib28), [2](https://arxiv.org/html/2604.15039#bib.bib29), [18](https://arxiv.org/html/2604.15039#bib.bib36)] interleave a small number of full-attention layers with a larger number of linear-complexity or bounded-state layers, such as Kimi Delta Attention (KDA) [[26](https://arxiv.org/html/2604.15039#bib.bib14)], Sliding Window Attention (SWA) [[5](https://arxiv.org/html/2604.15039#bib.bib33)], and related mechanisms. This design substantially reduces KVCache growth relative to dense-attention architectures, often by an order of magnitude, thereby making cross-datacenter KVCache transfer plausible. But plausibility is not yet practicality: a naive design that externalizes all prefills would still suffer from bursty arrivals, skewed request lengths, uneven prefix-cache distribution, and fluctuating inter-cluster bandwidth. Hybrid architectures relax the KVCache bottleneck, but they do not eliminate the need for system design; rather, they create the opportunity that system design must exploit.

![Refer to caption](https://arxiv.org/html/2604.15039v1/x1.png)

Figure 1: Comparison of two deployment paradigms for PD-disaggregated LLM serving. (a) Status quo: tightly coupled single-cluster inference. (b) PrfaaS: multi-cluster disaggregated inference via cross-datacenter KVCache.

This observation leads to the key design principle of this paper: Prefill-as-a-Service (PrfaaS) via cross-datacenter KVCache. As illustrated in Figure [1](https://arxiv.org/html/2604.15039#S1.F1), instead of forcing heterogeneous accelerators into a single RDMA island, PrfaaS constructs standalone clusters dedicated to long-context prefill using inexpensive, high-throughput compute. Moreover, rather than fully separating every request, PrfaaS offloads only long uncached prefills to these compute-dense prefill clusters, while short requests remain on the local PD path. The resulting KVCache is then transferred over commodity Ethernet to decode-capable PD clusters. This design reflects the underlying systems reality: the motivation to split prefill is strong, but the reduced KVCache of hybrid models is still not cheap enough to justify indiscriminate transfer. What makes the design feasible is selective offloading, which concentrates cross-cluster transfer on the requests for which prefill acceleration matters most, while avoiding the inefficiency of sending short requests through a bandwidth-constrained path.

Making this design work requires scheduling and cache management that explicitly address the remaining systems challenges. Because bandwidth remains constrained even after KVCache reduction, PrfaaS uses length-based threshold routing to offload only sufficiently long requests, a bandwidth-aware scheduler to react to fluctuating link conditions before congestion accumulates, and a global KVCache manager with a hybrid prefix-cache pool to account jointly for request length, cache placement, and available cross-cluster bandwidth. These mechanisms make cross-cluster heterogeneous serving viable in realistic environments: they preserve the benefits of PD disaggregation without requiring heterogeneous accelerators to share the same low-latency RDMA fabric, and they allow compute-oriented prefill capacity and bandwidth-oriented decode capacity to scale independently across loosely coupled clusters, datacenters, or regions.

This flexibility directly addresses deployment constraints that are otherwise difficult to resolve in practice, including non-colocated accelerator classes, hardware asymmetry across cloud regions, and opportunistic remote capacity. We evaluate this idea through a case study using an internal 1T-parameter hybrid model following the Kimi Linear architecture [[26](https://arxiv.org/html/2604.15039#bib.bib14)]. With a heterogeneous deployment consisting of a standalone PrfaaS cluster for long-context prefill and a conventional PD cluster for decode and short prefills, the system achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth per machine. These results show that KVCache-efficient model architectures are necessary but not sufficient for cross-datacenter heterogeneous serving. What makes the deployment practical is the combination of model-side KVCache reduction with system-side selective offloading and bandwidth-aware scheduling. Together, they turn cross-datacenter PD disaggregation from an appealing idea into a realistic serving architecture.

## 2 Background

### 2.1 The Bandwidth Wall in Conventional PD Disaggregation

Prefill-decode (PD) disaggregation has become the standard systems abstraction for large-scale LLM serving because it cleanly separates two fundamentally different phases of inference: prefill is dominated by arithmetic throughput, while decode is dominated by memory bandwidth. That separation improves utilization and enables phase-specific optimization. But it does not come for free. Once prefill and decode are placed on different nodes, every request must export its KVCache from the prefill side to the decode side, turning what was previously an on-device state transition into a cross-node transport problem. In practice, that transport requirement is what keeps today's PD deployments confined to a single datacenter and attached to RDMA-class scale-out networks.

Table 1: Configurations of representative models. Type A denotes the linear-complexity block, and Type B denotes the quadratic-complexity full-attention block.

Under latency-sensitive serving constraints, KVCache produced by prefill instances can be shipped asynchronously to maximize compute utilization. To avoid GPU idling, the egress bandwidth $B_{\text{out}}$ of a prefill cluster must exceed the aggregate rate at which the cluster produces KVCache. Because this aggregate rate scales linearly with the number of instances, the binding constraint reduces to the *KV throughput* of a single model instance, which we define as

$$\Phi_{\text{kv}}(l) = \frac{S_{\text{kv}}(l)}{T_{\text{prefill}}(l)}, \tag{1}$$

where $S_{\text{kv}}(l)$ is the KVCache size for a request of length $l$ and $T_{\text{prefill}}(l)$ is the corresponding prefill latency. The value of this metric depends largely on model architecture. Table [1](https://arxiv.org/html/2604.15039#S2.T1) summarizes the configurations considered in this paper. For conventional dense-attention architectures, this transport demand is a dominant systems constraint. In standard Transformer-style attention, KVCache grows linearly with context length and can reach tens of gigabytes. Figure [2](https://arxiv.org/html/2604.15039#S2.F2) shows the KV throughput of MiniMax-M2.5, a representative dense model with GQA, at various input lengths. The bottleneck is stark: for a 32K-token request, a single MiniMax-M2.5 instance produces KVCache at roughly 60 Gbps, requiring egress bandwidth that far exceeds the cross-datacenter Ethernet capacity of a typical machine. This is precisely why conventional PD disaggregation remains operationally tied to tightly integrated network domains. The network budget is so large that moving prefill and decode across looser interconnects, let alone across data centers, is simply not realistic.
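To make Equation (1) concrete, the short sketch below computes $\Phi_{\text{kv}}$ from a measured KVCache size and prefill latency. The two input values are illustrative assumptions chosen to reproduce the roughly 60 Gbps figure quoted above for MiniMax-M2.5 at 32K tokens; they are not measurements from this paper.

```python
def kv_throughput_gbps(kv_bytes: float, prefill_seconds: float) -> float:
    """Eq. (1): Phi_kv = S_kv / T_prefill, expressed in Gbps."""
    return kv_bytes * 8 / prefill_seconds / 1e9

# Hypothetical 32K-token data point consistent with the ~60 Gbps figure
# quoted for MiniMax-M2.5 (both values are assumed, not measured here):
s_kv = 30e9       # ~30 GB of KVCache for a 32K-token request (assumed)
t_prefill = 4.0   # ~4 s prefill latency on one 8-GPU instance (assumed)
print(f"Phi_kv = {kv_throughput_gbps(s_kv, t_prefill):.1f} Gbps")  # -> 60.0
```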

That network coupling also prevents heterogeneous serving from scaling cleanly. Specialized chips already exist for each phase: hardware such as Rubin CPX targets prefill throughput, while LPU-style designs target decode bandwidth. Yet high-performance interconnects remain tightly coupled to machine form factors and deployment environments, so connecting unlike hardware at RDMA-class bandwidth generally requires bespoke engineering. Worse, once heterogeneous hardware is forced into a single tightly coupled cluster, the system inherits a fixed prefill-to-decode hardware ratio. In production traffic, request mix, request volume, and prefix-cache hit rate fluctuate continuously, so one side of the pipeline inevitably becomes overprovisioned while the other becomes the bottleneck. In a homogeneous cluster, any machine can be dynamically reassigned between prefill and decode roles as load shifts. A heterogeneous cluster offers no such flexibility: a chip specialized for prefill cannot serve decode and vice versa, leading to severe load imbalance and stranded capacity. The result is higher operational complexity and limited real-world adoption of heterogeneous PD beyond bespoke or low-throughput scenarios.

![Refer to caption](https://arxiv.org/html/2604.15039v1/x2.png)

Figure 2: KV throughput of MiniMax-M2.5 on an 8×H200 instance at various input lengths.

Table 2: Prefill latency and KV throughput characteristics of different attention mechanisms. Lower is better for both metrics.
### 2.2 Hybrid Attention Changes the PD Deployment Boundary

Table 3: KV throughput $\Phi_{\text{kv}}$ (Gbps) at various input lengths. All models are benchmarked on 8×H200 with SGLang v0.5.9 [[7](https://arxiv.org/html/2604.15039#bib.bib16)].

What changes this picture is not a new scheduler alone, but a shift in model architecture. As LLMs move toward longer contexts, the cost of conventional MHA becomes increasingly untenable, prompting a broad transition toward KVCache-friendly designs. Table [2](https://arxiv.org/html/2604.15039#S2.T2) categorizes mainstream attention improvements along two dimensions: prefill latency ($T_{\text{prefill}}$) and KV throughput ($\Phi_{\text{kv}}$). Under long-context workloads, full-attention mechanisms such as GQA and MLA retain quadratic complexity, resulting in high prefill cost. Sparse attention [[6](https://arxiv.org/html/2604.15039#bib.bib35)] reduces the amount of computation and can lower prefill latency, but it still requires transferring sequence-length-dependent KVCache to decode instances, leaving KV throughput as the dominant bottleneck. By contrast, linear attention and SWA maintain linear computation cost, while their bounded state size substantially reduces KV throughput.

A growing number of flagship open-source models adopt linear attention [[26](https://arxiv.org/html/2604.15039#bib.bib14), [27](https://arxiv.org/html/2604.15039#bib.bib20), [3](https://arxiv.org/html/2604.15039#bib.bib21), [18](https://arxiv.org/html/2604.15039#bib.bib36)] or SWA [[31](https://arxiv.org/html/2604.15039#bib.bib27), [11](https://arxiv.org/html/2604.15039#bib.bib28), [2](https://arxiv.org/html/2604.15039#bib.bib29)], combining these mechanisms into hybrid stacks that interleave a small number of full-attention layers with a larger number of linear-complexity layers. Representative examples include Qwen3.5-397B at a 3:1 linear-to-full ratio, MiMo-V2-Flash at a 5:1 SWA-to-full ratio, and Ring-2.5-1T at a 7:1 linear-to-full ratio. In these architectures, only the full-attention layers produce KVCache that scales with sequence length, while the linear-complexity layers maintain fixed-size recurrent state whose footprint becomes negligible in the long-context regime.

Equation ([1](https://arxiv.org/html/2604.15039#S2.E1)) quantifies how model architecture governs the bandwidth demand of PD disaggregation. Table [3](https://arxiv.org/html/2604.15039#S2.T3) benchmarks $\Phi_{\text{kv}}$ for several recent open-source dense and hybrid models. Compared with dense models of similar size, hybrid models exhibit a sharp reduction in $\Phi_{\text{kv}}$, meaning that each unit of compute generates far less state that must later traverse the network. At 32K tokens, MiMo-V2-Flash achieves a KV throughput of 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13× reduction. Qwen3.5-397B reaches 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4× reduction. For Ring-2.5-1T, MLA contributes roughly a 4.5× compression over GQA, while the 7:1 hybrid ratio contributes another ~8× reduction, yielding an overall KV memory saving of roughly 36×.

The key systems implication is not merely lower inference cost, but a reduced KV throughput that shifts the deployable network boundary of PD disaggregation from RDMA-class fabrics to commodity Ethernet. In dense-attention models, prefill emits too much state too quickly, so the network becomes the hard coupling between phases. In hybrid models, prefill still performs substantial work, but the emitted KVCache is dramatically smaller. This does not make cross-datacenter PD free enough for indiscriminate transfer. Rather, it opens a qualitatively different operating regime in which cross-cluster KVCache transport becomes plausible for selected requests and therefore worth optimizing at the system level.

### 2.3 From Intra-Datacenter PD to the Prefill-as-a-Service Paradigm

In intra-datacenter PD deployments, prefill and decode nodes communicate over high-bandwidth, fully meshed networks such as RDMA, so the network is far from being a bottleneck relative to prefill computation. In cross-cluster PD, however, the relationship between inter-cluster bandwidth and model KV throughput directly determines whether cross-datacenter KVCache is feasible. The cluster-level bandwidth requirement follows from the per-instance KV throughput. For an $N$-GPU prefill cluster, the minimum egress bandwidth is

$$B_{\text{out}} = \frac{N}{P}\cdot\frac{\mathbb{E}[S_{\text{kv}}]}{\mathbb{E}[T_{\text{prefill}}]} \approx \frac{N}{P}\cdot\Phi_{\text{kv}}(L_{\text{avg}}), \tag{2}$$

where $P$ is the parallelism degree (GPUs per instance) and $L_{\text{avg}}$ is the average *uncached* input length of the requests actually offloaded to the PrfaaS cluster. Notably, $B_{\text{out}}$ depends not only on $\Phi_{\text{kv}}$, which is governed by model architecture and hardware, but also on $L_{\text{avg}}$, which is shaped by the request-length distribution, the prefix-cache hit rate, and the routing policy. This dependence is exactly why hybrid models alone are not enough. On the model side, adopting KVCache-friendly architectures reduces $\Phi_{\text{kv}}$. On the system side, selective offloading and a bandwidth-aware scheduler that accounts for bandwidth constraints and KVCache locality (§[3.4](https://arxiv.org/html/2604.15039#S3.SS4)) determine which requests consume the cross-datacenter budget in the first place, keeping $B_{\text{out}}$ within the available inter-datacenter bandwidth envelope.

To make the analysis concrete, we consider a prefill cluster comprising 512 H200 GPUs with $L_{\text{avg}} = 32\text{K}$. Under this configuration, MiniMax-M2.5 and Qwen3 respectively require 3.8 Tbps and 2.1 Tbps of egress bandwidth, which effectively locks deployment into a tightly integrated single-cluster fabric. By contrast, models that employ hybrid architectures see their KV throughput drop by an order of magnitude, bringing the bandwidth demand to a level that modern inter-datacenter links can sustain: Ring-2.5-1T requires roughly 170 Gbps of dedicated line capacity. Moreover, by routing only even longer requests (e.g., 128K tokens) to the PrfaaS cluster for processing, the bandwidth demand falls further, to below 100 Gbps. Even at the scale of a 10,000-GPU datacenter, the aggregate egress bandwidth totals only about 1.8 Tbps, well within the capacity of physical inter-datacenter links.
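A back-of-envelope check of Equation (2) for this example, as a short sketch. The per-instance $\Phi_{\text{kv}}$ values for the two dense models come from the figures quoted in §2.2; the Ring-2.5-1T value is an assumption back-solved from the roughly 170 Gbps total quoted above.

```python
def cluster_egress_gbps(n_gpus: int, gpus_per_instance: int,
                        phi_kv_gbps: float) -> float:
    """Eq. (2): B_out ~= (N / P) * Phi_kv(L_avg)."""
    return n_gpus / gpus_per_instance * phi_kv_gbps

# Per-instance Phi_kv at L_avg = 32K (Gbps). Dense values are quoted in
# Sec. 2.2; the Ring-2.5-1T value is inferred from the ~170 Gbps total.
for model, phi in [("MiniMax-M2.5", 59.93), ("Qwen3-235B", 33.35),
                   ("Ring-2.5-1T", 2.7)]:
    b_out = cluster_egress_gbps(512, 8, phi)
    print(f"{model}: {b_out/1000:.2f} Tbps" if b_out >= 1000
          else f"{model}: {b_out:.0f} Gbps")
# MiniMax-M2.5 -> ~3.84 Tbps, Qwen3-235B -> ~2.13 Tbps, Ring-2.5-1T -> ~173 Gbps
```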

That is the systems turning point described before. Once KV throughput falls far enough, heterogeneous serving no longer has to be implemented solely as an awkward co-location of unlike accelerators behind the same RDMA island. Instead, prefill can be selectively externalized into standalone, compute-dense Prefill-as-a-Service clusters, while decode remains in conventional bandwidth-optimized PD clusters. The problem then shifts from "how do we force heterogeneous hardware into one tightly coupled deployment?" to "how do we identify the requests worth offloading and transport their KVCache across loosely coupled clusters efficiently?" In that sense, KVCache-friendly model architectures, together with selective offloading and bandwidth-aware scheduling, jointly enable heterogeneous PD disaggregation to scale beyond a single datacenter.

## 3 Disaggregation over Cross-Datacenter KVCache

### 3.1 Overview

![Refer to caption](https://arxiv.org/html/2604.15039v1/x3.png)

Figure 3: Deployment topology of the PrfaaS-PD architecture.

The core idea of cross-datacenter KVCache is not to externalize every prefill, but to selectively extend disaggregated LLM serving beyond the boundary of a single cluster when remote prefill acceleration is worth the transfer cost. We realize this vision through the PrfaaS-PD architecture, which leverages cross-datacenter KVCache to decouple prefill and decode across loosely connected clusters for requests whose long uncached prefills benefit most from faster compute. As shown in Figure [3](https://arxiv.org/html/2604.15039#S3.F3), dedicated PrfaaS clusters perform compute-intensive long-context prefill on cost-effective, high-throughput accelerators and stream the resulting KVCache to local PD clusters via commodity Ethernet, while short or bandwidth-unfriendly requests remain on the local PD path.

The PrfaaS-PD architecture comprises three subsystems: compute, network, and storage. The compute subsystem consists of multiple clusters, each containing only homogeneous hardware, as different chip types are typically difficult to co-locate within the same facility. Clusters fall into two categories. *Local PD clusters* perform PD-disaggregated serving and can complete inference for a request end to end. *PrfaaS clusters* provide selective remote prefill capacity for requests whose incremental uncached length exceeds a routing threshold; after prefill, the resulting KVCache is transferred to a local PD cluster for decode. The network subsystem spans two tiers: intra-cluster networks use RDMA for latency-sensitive collective communication and PD KVCache transfers, while inter-cluster links rely on VPC peering or dedicated lines for cross-datacenter KVCache transfer. The storage subsystem resides within each cluster, building a distributed hybrid prefix cache pool (§[3.2](https://arxiv.org/html/2604.15039#S3.SS2)) across all machines. A global KVCache manager maintains KVCache metadata across all clusters. On top of these infrastructure components, a global scheduler routes requests to clusters and nodes based on request characteristics, network conditions, and cache distribution, maximizing end-to-end throughput under cross-cluster bandwidth constraints, as detailed in §[3.4](https://arxiv.org/html/2604.15039#S3.SS4).

### 3.2 Hybrid Prefix Cache Pool

Prefix cache pools allow the serving system to offload KVCache to distributed host memory and SSDs within a cluster, substantially increasing the prefix-cache hit rate. Conventional prefix cache pools, however, are designed for a single KVCache type and perform matching and eviction at the token or block level. In hybrid models, the recurrent states of linear-attention or SWA layers are request-level: their size is independent of input length, and they can only be reused when the cached length matches exactly. In contrast, the KVCache of full-attention layers is block-level: it grows linearly with input length and supports partial prefix matching. This heterogeneity challenges the conventional all-layer-uniform KVCache storage paradigm.

Based on vLLM's hybrid KVCache manager [[28](https://arxiv.org/html/2604.15039#bib.bib34)], we build a hybrid prefix cache system tailored for cross-cluster KVCache transfer, as shown in Figure [4](https://arxiv.org/html/2604.15039#S3.F4). Linear states and full-attention KVCache are managed by separate KVCache groups with aligned block sizes, allowing all groups to allocate and free blocks from a shared KVCache pool. On top of this shared pool, we partition cache blocks into two categories: prefix-cache blocks and transfer-cache blocks. Prefix-cache blocks must be fully populated before they can be reused across requests. Transfer-cache blocks hold the KVCache produced at the tail of a prefill request for PD-disaggregated transfer, and the cache pool discards them once the transfer completes.
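A minimal sketch of the two block categories described above. The paper builds on vLLM's hybrid KVCache manager; this is only a schematic of the prefix-cache versus transfer-cache distinction, with hypothetical names and fields, not that implementation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class BlockKind(Enum):
    PREFIX = auto()    # reusable across requests once fully populated
    TRANSFER = auto()  # tail-of-prefill KVCache, discarded after transfer

@dataclass
class CacheBlock:
    block_id: int
    kind: BlockKind
    filled_tokens: int = 0
    capacity_tokens: int = 64  # aligned block size shared by all KV groups

    def reusable(self) -> bool:
        # Prefix-cache blocks must be full before cross-request reuse.
        return (self.kind is BlockKind.PREFIX
                and self.filled_tokens == self.capacity_tokens)

@dataclass
class HybridKVPool:
    """Shared pool; linear-state and full-attention groups allocate from it."""
    free_ids: list = field(default_factory=list)

    def release_after_transfer(self, block: CacheBlock) -> None:
        # Transfer-cache blocks are dropped once the PD transfer completes.
        if block.kind is BlockKind.TRANSFER:
            self.free_ids.append(block.block_id)
```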

When a new request arrives, the global KVCache manager computes prefix-match information for every cluster, and the request router uses this information to select the prefill cluster and the cache-affine node within it. Beyond routing, the KVCache manager also performs cache rebalancing to mitigate hotspots. When sufficient inter-cluster bandwidth is available, cross-cluster cache transfer is feasible as well, as discussed in §[3.4.3](https://arxiv.org/html/2604.15039#S3.SS4.SSS3).

![Refer to caption](https://arxiv.org/html/2604.15039v1/x4.png)

Figure 4: Hybrid prefix cache pool. Linear states and full-attention KVCache are managed by separate groups backed by a unified block pool. Blocks are categorized as prefix-cache (intra-cluster only, block-aligned) or transfer-cache (cross-cluster, discarded after transfer).
### 3.3 PrfaaS-PD Disaggregation

Building on conventional intra-cluster PD disaggregation, we introduce PrfaaS clusters as a selective extension that increases deployment scalability and lowers cost without forcing all requests onto the cross-cluster path. A PrfaaS-PD system may contain several PrfaaS and local PD clusters, whose sizes and ratios are determined by hardware capabilities, network bandwidth, and request traffic.

Each PrfaaS cluster functions as a stateless KVCache producer whose effective throughput equals the minimum of its prefill computation rate and its network egress bandwidth. Not all requests benefit equally from being offloaded to a PrfaaS cluster. Short-context prefill is typically memory- or communication-bound rather than compute-bound, resulting in low arithmetic utilization that cannot fully exploit the compute-dense accelerators in PrfaaS clusters. We therefore retain prefill nodes within the local PD cluster and apply a length-based routing policy. Let $l$ denote the incremental prefill length of a request (excluding any cached prefix) and $t$ a routing threshold. When $l > t$, the request is routed to a PrfaaS cluster, and the resulting KVCache is transferred to a decode node upon completion. When $l \leq t$, the request is handled by a prefill node within the PD cluster. With the growing adoption of agentic workloads, the majority of requests are incremental prefills with prefix-cache hits. For such requests, the global KVCache manager tracks the storage locations of all cached entries, ensuring that only the incremental portion is transferred across clusters. The scheduler must jointly consider cache affinity and cluster bandwidth when making routing decisions, as discussed in §[3.4.3](https://arxiv.org/html/2604.15039#S3.SS4.SSS3).
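The length-based rule is simple enough to state as code. The sketch below uses the threshold $t = 19.4$K found in the case study of §4; the function name and return labels are illustrative.

```python
def route_request(total_len: int, cached_prefix_len: int,
                  threshold: int = 19_400) -> str:
    """Route on incremental prefill length l = total - cached prefix."""
    l = total_len - cached_prefix_len
    return "prfaas" if l > threshold else "pd_local"

# A 60K-token request with a 50K cached prefix stays local (l = 10K <= t),
# while the same request with no cache hit is offloaded (l = 60K > t).
assert route_request(60_000, 50_000) == "pd_local"
assert route_request(60_000, 0) == "prfaas"
```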

Realizing PrfaaS-PD disaggregation in practice requires stable, high-throughput Ethernet transport. Although hybrid model architectures substantially reduce the nominal bandwidth requirement, bursty traffic and uneven link utilization can still cause congestion, increasing queuing delay and degrading KVCache delivery efficiency. Our design therefore aims to smooth transfer traffic and sustain high link utilization without inducing persistent congestion. To this end, we combine layer-wise prefill pipelining to overlap KVCache generation with transmission, multi-connection TCP transport to fully utilize the available bandwidth, and congestion monitoring integrated with the scheduler to detect loss and retransmission signals early and prevent congestion accumulation.
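The sketch below illustrates the layer-wise pipelining idea schematically: a sender thread streams each layer's KVCache while later layers are still being computed. The `compute_layer_kv` and `send` callables are assumed interfaces standing in for the GPU prefill step and the multi-connection transport; this is not the paper's transport stack.

```python
import queue
import threading

def prefill_with_pipelined_transfer(num_layers: int, compute_layer_kv, send):
    """Overlap per-layer KVCache generation with network transmission."""
    pending: queue.Queue = queue.Queue()

    def sender():
        while True:
            item = pending.get()
            if item is None:          # sentinel: prefill finished
                return
            layer_idx, kv_bytes = item
            send(layer_idx, kv_bytes)  # e.g. striped over several TCP conns

    tx = threading.Thread(target=sender)
    tx.start()
    for layer in range(num_layers):
        kv = compute_layer_kv(layer)   # blocking prefill for this layer
        pending.put((layer, kv))       # hand off; sender streams it out
    pending.put(None)
    tx.join()
```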

### 3.4 Modeling and Scheduling

A PrfaaS-PD system comprises three roles: PrfaaS prefill, PD-P (the prefill nodes within a PD cluster), and PD-D (the decode nodes within a PD cluster). To derive scheduling policies, we construct an analytical throughput model that captures the compute and bandwidth constraints of each role and use it to guide both short-term routing and long-term resource allocation.

#### 3.4.1 Throughput Model

Table 4: Notation used in the PrfaaS-PD throughput model.

We model the steady-state throughput of the PrfaaS-PD system using the notation in Table [4](https://arxiv.org/html/2604.15039#S3.T4). For tractability, we approximate all PrfaaS requests by a representative length $l_{\text{long}} = \mathbb{E}[L \mid L > t]$ with service time $T_{\text{prefill}}(l_{\text{long}})$, and all PD-P requests by $l_{\text{short}} = \mathbb{E}[L \mid L \leq t]$ with service time $T_{\text{prefill}}(l_{\text{short}})$. Requests arrive uniformly at aggregate rate $\Lambda$. A fraction $p = P(L > t)$ of requests is routed to PrfaaS, and the remaining $1-p$ is handled by PD-P.

Each PrfaaS request undergoes two pipelined phases: prefill computation and KVCache transfer. Through layer-wise prefill pipelining, the PrfaaS cluster throughput is determined by the slower of compute and egress transfer:

$$\Theta_{\text{prfaas}} = \min\!\left(\frac{N_{\text{prfaas}}}{T_{\text{prefill}}(l_{\text{long}})},\; \frac{B_{\text{out}}}{S_{\text{kv}}(l_{\text{long}})}\right). \tag{3}$$

Since intra-cluster RDMA bandwidth is not a bottleneck, PD-P throughput is determined solely by compute capacity:

$$\Theta_{\text{pd-p}} = \frac{N_{p}}{T_{\text{prefill}}(l_{\text{short}})}. \tag{4}$$

The decode-stage (PD-D) throughput is

$$\Theta_{\text{pd-d}} = \frac{N_{d}\cdot \mathit{BS}_{\max}}{T_{\text{decode}}\cdot L_{\text{out}}}, \tag{5}$$

where $\mathit{BS}_{\max}$ and $T_{\text{decode}}$ are treated as SLO-governed constants.

PrfaaS and PD-P serve as upstream producers, each prefilling a disjoint subset of requests (fractions $p$ and $1-p$, respectively), while PD-D is the sole downstream consumer. Together they form a converging pipeline: requests split at the router, with the PrfaaS prefill path (fraction $p$) feeding PD-D decode over Ethernet and the PD-P prefill path (fraction $1-p$) feeding PD-D over intra-cluster RDMA.

The end-to-end system throughput is limited by the slowest stage, accounting for the routing split:

$$\Lambda_{\max} = \min\!\left(\frac{\Theta_{\text{prfaas}}}{p},\; \frac{\Theta_{\text{pd-p}}}{1-p},\; \Theta_{\text{pd-d}}\right). \tag{6}$$
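Equations (3) to (6) translate directly into a small calculator. The sketch below assumes instance-level units (instance counts rather than GPU counts, KVCache in gigabytes, times in seconds); all argument names are illustrative.

```python
def system_throughput(n_prfaas, n_p, n_d, p,
                      t_prefill_long, t_prefill_short,
                      s_kv_long_gb, b_out_gbps,
                      bs_max, t_decode, l_out):
    """Eqs. (3)-(6): end-to-end PrfaaS-PD throughput in requests/s."""
    theta_prfaas = min(n_prfaas / t_prefill_long,
                       b_out_gbps / (s_kv_long_gb * 8))           # Eq. (3)
    theta_pd_p = n_p / t_prefill_short                            # Eq. (4)
    theta_pd_d = n_d * bs_max / (t_decode * l_out)                # Eq. (5)
    return min(theta_prfaas / p,
               theta_pd_p / (1 - p),
               theta_pd_d)                                        # Eq. (6)
```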

#### 3.4.2 Throughput-Optimal Configuration

Given fixed hardware resources ($N_{\text{prfaas}}$, $N_p + N_d$) and network bandwidth $B_{\text{out}}$, we seek two decision variables that maximize $\Lambda_{\max}$: the routing threshold $t$ (which determines $p$, $l_{\text{long}}$, and $l_{\text{short}}$) and the PD-cluster prefill-to-decode ratio $N_p/N_d$.

The threshold $t$ governs the trade-off between PrfaaS and PD-P load. Increasing $t$ restricts PrfaaS to longer requests, for which $T_{\text{prefill}}(l)$ grows faster than $S_{\text{kv}}(l)$ (near-quadratic prefill versus linear KVCache size). This lowers the per-instance KV throughput and eases bandwidth pressure. Conversely, decreasing $t$ floods PrfaaS with shorter requests whose high KV throughput is more likely to trigger the bandwidth bottleneck. The optimal $t$ balances PrfaaS and PD-P throughput so that both stages approach saturation simultaneously:

$$\frac{\Theta_{\text{prfaas}}}{p} = \frac{\Theta_{\text{pd-p}}}{1-p}. \tag{7}$$

For a fixed cluster size $N_p + N_d$ (the number of machines in a datacenter is constant in the short term), the ratio $N_p/N_d$ should balance the aggregate producer throughput against the consumer throughput. Too few prefill nodes starve the decode stage of KVCache, while too many leave prefill capacity idle. The optimal ratio satisfies

$$\Theta_{\text{prfaas}} + \Theta_{\text{pd-p}} = \Theta_{\text{pd-d}}. \tag{8}$$

Equations ([7](https://arxiv.org/html/2604.15039#S3.E7)) and ([8](https://arxiv.org/html/2604.15039#S3.E8)) constrain the two unknowns $t$ and $N_p/N_d$. Because $\Theta_{\text{prfaas}}/p$ decreases monotonically with $p$ (and hence with decreasing $t$) while $\Theta_{\text{pd-p}}/(1-p)$ increases, a grid search over $t$ and $N_p/N_d$ efficiently finds the optimal operating point.
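A hedged sketch of this two-dimensional grid search, reusing the `system_throughput` sketch from §3.4.1. Here `prefill_time` and `kv_size_gb` stand in for profiling-based fits and `request_lengths` for a traffic sample; all of these are assumptions, not the paper's solver.

```python
import statistics

def optimize(request_lengths, np_nd_total, n_prfaas, b_out_gbps,
             prefill_time, kv_size_gb, bs_max, t_decode, l_out, thresholds):
    """Grid search over (t, Np) maximizing Lambda_max from Eq. (6)."""
    best = (0.0, None)
    for t in thresholds:
        long_reqs = [l for l in request_lengths if l > t]
        short_reqs = [l for l in request_lengths if l <= t]
        if not long_reqs or not short_reqs:
            continue
        p = len(long_reqs) / len(request_lengths)
        l_long = statistics.mean(long_reqs)
        l_short = statistics.mean(short_reqs)
        for n_p in range(1, np_nd_total):      # Np instances; Nd = total - Np
            lam = system_throughput(
                n_prfaas, n_p, np_nd_total - n_p, p,
                prefill_time(l_long), prefill_time(l_short),
                kv_size_gb(l_long), b_out_gbps, bs_max, t_decode, l_out)
            if lam > best[0]:
                best = (lam, (t, n_p))
    return best  # (Lambda_max, (t*, Np*))
```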

#### 3.4.3 Dual-Timescale Scheduling

Scheduling is central to making cross-datacenter disaggregation practical rather than merely architecturally possible. In theory, hybrid model architectures reduce KV throughput enough that commodity Ethernet links can sustain cross-cluster transfers, and the steady-state analysis above can maximize cluster throughput by optimizing $t$ and $N_p/N_d$. In practice, however, traffic variations and burstiness can cause transient congestion and queue buildup at the PrfaaS egress. Moreover, clusters maintain large prefix caches whose bulk transfers can further strain cross-cluster links. To handle this dynamic environment, we design a dual-timescale scheduling algorithm that treats cross-cluster bandwidth and throughput as the primary constraints, separating the factors governing request routing and resource allocation into short-term and long-term categories with a dedicated strategy for each.

##### Short-term: bandwidth- and cache-aware routing.

The PrfaaS cluster has a bandwidth-imposed throughput ceiling $B_{\text{out}}/S_{\text{kv}}(l_{\text{long}})$. As the cluster approaches this ceiling, congestion builds at the egress link. The scheduler therefore continuously monitors the PrfaaS egress utilization and request queue depth. When utilization approaches a threshold or queuing surges, a short-term routing adjustment is triggered.

At initialization or upon a policy update, the scheduler profiles current compute capacity and egress bandwidth, then searches for the optimal threshold $t$ based on the incremental prefill-length distribution after prefix matching. By routing only sufficiently long requests to PrfaaS, the scheduler reduces per-request bandwidth demand and avoids congestion near the bandwidth ceiling.

For requests with prefix cache hits, the scheduler must jointly consider cache affinity and bandwidth availability. Let $l_{\text{total}}$ denote the total input length of a request, and let $l_{\text{prfaas}}$ and $l_{\text{pd}}$ denote the cached prefix lengths in the PrfaaS and PD clusters, respectively. The routing depends on whether bandwidth or compute is the binding constraint. When bandwidth is scarce, prefix caches in each cluster are evaluated independently: if $l_{\text{total}} - l_{\text{pd}} \leq t$, the request is prefilled locally in PD-P, and otherwise offloaded to PrfaaS. When bandwidth is abundant, compute becomes the scarce resource, and cross-cluster cache transfer can reduce redundant computation. The scheduler considers the best cache across all clusters by letting $l_{\text{prefix}} = \max(l_{\text{prfaas}}, l_{\text{pd}})$; if $l_{\text{total}} - l_{\text{prefix}} \leq t$, the request is prefilled in PD-P; otherwise it goes to PrfaaS. When the cluster with the longer cache differs from the compute cluster, a cross-cluster cache transfer is performed.
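A minimal sketch of this decision rule; the `bandwidth_scarce` flag stands in for the egress-utilization signal described above, and all names are illustrative.

```python
def route_with_cache(l_total: int, l_prfaas: int, l_pd: int,
                     t: int, bandwidth_scarce: bool):
    """Cache- and bandwidth-aware routing for prefix-hit requests.

    Returns (target cluster, whether a cross-cluster cache transfer is needed).
    """
    if bandwidth_scarce:
        # Evaluate each cluster's cache independently; never transfer caches.
        target = "pd_local" if l_total - l_pd <= t else "prfaas"
        return target, False
    # Bandwidth abundant: use the best cache anywhere, transferring if needed.
    l_prefix = max(l_prfaas, l_pd)
    target = "pd_local" if l_total - l_prefix <= t else "prfaas"
    best_cluster = "prfaas" if l_prfaas >= l_pd else "pd_local"
    needs_transfer = l_prefix > 0 and best_cluster != target
    return target, needs_transfer
```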

##### Long-term: traffic-driven allocation re-optimization.

Over longer timescales, shifts in traffic volume create persistent imbalances between pipeline stages. When $\Theta_{\text{prfaas}} + \Theta_{\text{pd-p}} \ll \Theta_{\text{pd-d}}$, prefill is the system bottleneck; when $\Theta_{\text{prfaas}} + \Theta_{\text{pd-p}} \gg \Theta_{\text{pd-d}}$, decode is. The scheduler identifies the binding constraint by monitoring queue depth and utilization at each stage. Since traffic variations at this timescale are gradual and often periodic, the scheduler periodically re-evaluates load balance and converts nodes between prefill and decode roles within the PD cluster, adjusting $N_p$ and $N_d$ to restore the optimality conditions of Equations ([7](https://arxiv.org/html/2604.15039#S3.E7)) and ([8](https://arxiv.org/html/2604.15039#S3.E8)). As instance counts change, the routing threshold $t$ is re-optimized accordingly.

## 4 Case Study: Bandwidth Demand of the PrfaaS-PD Architecture

In this section, we use an internal 1T-parameter hybrid-architecture model as a case study to evaluate whether selective PrfaaS offloading can keep cross-datacenter KVCache transfer within a realistic bandwidth budget while improving system throughput under realistic hardware and deployment settings. Following the notation in Table [4](https://arxiv.org/html/2604.15039#S3.T4), we apply the profiling-based throughput model from §[3.4](https://arxiv.org/html/2604.15039#S3.SS4) to solve for the two parameters that maximize system throughput: the routing threshold $t$ and the prefill-to-decode instance ratio $N_p/N_d$ within the local PD cluster. Under the resulting configuration, the heterogeneous PrfaaS-PD deployment achieves 54% higher throughput and 64% lower P90 TTFT than a homogeneous PD-only baseline, and 32% higher throughput than a naive heterogeneous deployment without scheduling. The average PrfaaS cluster egress is only 13 Gbps, well within Ethernet capacity.

### 4.1 Setup

Table 5: KVCache size $S_{\text{kv}}$, prefill latency $T_{\text{prefill}}$, and KV throughput $\Phi_{\text{kv}}$ of the internal 1T hybrid model at various input lengths. Prefill latency is benchmarked on 8×H200 with in-house vLLM [[29](https://arxiv.org/html/2604.15039#bib.bib15)].

We deploy two clusters connected by a VPC network, providing an aggregate cross-cluster bandwidth of approximately 100 Gbps. The PrfaaS cluster consists of 32 H200 GPUs with higher compute throughput, dedicated to long-context prefill requests with $L > t$. The local PD cluster consists of 64 H20 GPUs operating in conventional PD-disaggregated mode with 800 Gbps RDMA interconnect per node, where the prefill-to-decode ratio can be adjusted according to request traffic. Note that although H200 and H20 have different price points, the PrfaaS cluster only requires high compute throughput; in production deployments, cost-effective accelerators with comparable compute capability can serve as substitutes. As a baseline, we use a homogeneous PD cluster of 96 H20 GPUs.

To reflect realistic serving requirements, the workload uses an internal hybrid model with 1T parameters whose architecture follows Kimi Linear [[26](https://arxiv.org/html/2604.15039#bib.bib14)], employing an interleaved KDA:MLA layer structure at a 3:1 ratio. The model is deployed at 8 GPUs per instance and profiled separately for prefill and decode using in-house vLLM. Table [5](https://arxiv.org/html/2604.15039#S4.T5) lists the KVCache size $S_{\text{kv}}$, prefill latency $T_{\text{prefill}}$, and KV throughput $\Phi_{\text{kv}}$ for this model at various input lengths.

Request input lengths follow a truncated log-normal distribution ($\mu = 9.90$, $\sigma = 1.00$, truncated to $[128, 128\text{K}]$) with a mean of approximately 27K tokens, reflecting the long-context distribution characteristic of real-world workloads. The output length is fixed at 1024 tokens, and the SLO (excluding speculative decoding) is set to 40 tokens/s. All throughput and bandwidth results reported below are derived by feeding the measured profiling data into the throughput model of §[3.4](https://arxiv.org/html/2604.15039#S3.SS4).
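This workload distribution is straightforward to reproduce. The sketch below samples it with rejection against the truncation bounds and checks two statistics quoted in this section (mean of roughly 27K tokens; offload fraction of roughly 50% at the $t = 19.4$K threshold found in §4.2). Seed and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Truncated log-normal: mu = 9.90, sigma = 1.00, support [128, 128K] tokens.
samples = rng.lognormal(mean=9.90, sigma=1.00, size=1_000_000)
samples = samples[(samples >= 128) & (samples <= 128_000)]

print(f"mean length  ~ {samples.mean() / 1000:.1f}K tokens")  # ~27K
print(f"P(L > 19.4K) ~ {(samples > 19_400).mean():.3f}")      # ~0.50
```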

### 4.2 Throughput Modeling and Solution

![Refer to caption](https://arxiv.org/html/2604.15039v1/x5.png)

(a) Search over prefill/decode allocation.

![Refer to caption](https://arxiv.org/html/2604.15039v1/x6.png)

(b) Search over routing threshold $t$.

Figure 5: Illustration of the grid-search process for the two optimization variables. (a) fixes $t$ at the optimum and searches over the prefill/decode instance split within the local PD cluster. (b) fixes $N_p = 3$, $N_d = 5$ and searches over $t$.

Using the PrfaaS-PD throughput model from §[3.4](https://arxiv.org/html/2604.15039#S3.SS4), we optimize the routing threshold $t$ and the PD-cluster prefill/decode allocation to maximize $\Lambda_{\max}$, solving by exhaustive two-dimensional grid search. The optimal configuration is listed in the second column of Table [6](https://arxiv.org/html/2604.15039#S4.T6).

Figure [5](https://arxiv.org/html/2604.15039#S4.F5) illustrates the search process, fixing one variable and searching over the other. Figure [5(a)](https://arxiv.org/html/2604.15039#S4.F5.sf1) shows that, for a fixed threshold $t$ (and hence fixed $\Theta_{\text{prfaas}}$), the system throughput peaks when prefill and decode throughput are approximately balanced. The optimal allocation in the local PD cluster is $N_p = 3$ and $N_d = 5$. Figure [5(b)](https://arxiv.org/html/2604.15039#S4.F5.sf2) fixes $N_p = 3$ and $N_d = 5$, so the overall $\Lambda_{\max}$ is determined by $\min(\Theta_{\text{prfaas}}/p,\; \Theta_{\text{pd-p}}/(1-p))$. The optimum occurs at the intersection of the two curves, yielding $t = 19.4$K. At this operating point, approximately 50% of all requests (the longer ones) are offloaded to the PrfaaS cluster, fully utilizing the high-compute-throughput accelerators.

Table 6: Comparison of optimal configurations across PrfaaS-PD, homogeneous PD, and naive heterogeneous PD deployments.
### 4.3 Result Analysis

#### 4.3.1 Cross-Datacenter Bandwidth Utilization

A key prerequisite for the PrfaaS-PD architecture is that the inter-cluster network link can sustain the KVCache transfer demand. We evaluate this by measuring the KV throughput of the PrfaaS cluster under the deployed workload distribution.

With the routing threshold set to $t = 19.4$K, 49.6% of requests are routed to PrfaaS, and the offloaded subset has $\mathbb{E}[L \mid L > t] \approx 44$K tokens. The aggregate PrfaaS egress load is therefore approximately 13 Gbps, consuming only 13% of the Ethernet link. This confirms that, with proper scheduling, the KVCache of a hybrid-architecture model can be transported over commodity Ethernet with substantial headroom. By contrast, conventional Transformer models with full KV heads would produce significantly larger KVCache volumes, potentially requiring RDMA-class interconnects.
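To make the arithmetic explicit: 32 H200 GPUs at 8 GPUs per instance give four prefill instances. Assuming a per-instance KV throughput of roughly $\Phi_{\text{kv}}(44\text{K}) \approx 3.3$ Gbps at the offloaded average length (a value interpolated from Table 5, not quoted in the text), Equation (2) yields

$$B_{\text{out}} \approx \frac{N}{P}\cdot\Phi_{\text{kv}}(l_{\text{long}}) = \frac{32}{8}\times 3.3\ \text{Gbps} \approx 13\ \text{Gbps},$$

consistent with the reported 13% utilization of the 100 Gbps link.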

#### 4.3.2 Comparison with Homogeneous PD

We apply the same throughput-modeling methodology to the homogeneous PD baseline of 96 H20 GPUs. The results are shown in the third column of Table [6](https://arxiv.org/html/2604.15039#S4.T6). In the homogeneous cluster, the system throughput is also maximized when prefill and decode throughput are balanced, yielding an optimal allocation of 9 prefill instances and 3 decode instances.

Thanks to the superior compute throughput of the PrfaaS cluster, the PrfaaS-PD configuration requires two fewer prefill instances in the local PD cluster, freeing capacity for additional decode slots. The overall system throughput improves by 54%.

Another key benefit of PrfaaS-PD is the reduction in TTFT, particularly for long-context requests. In the homogeneous baseline, long requests compete with short requests for prefill capacity, inflating queuing delays. In PrfaaS-PD, long requests are offloaded to the dedicated high-throughput PrfaaS cluster, where prefill completes significantly faster than in the PD cluster even after accounting for cross-cluster transfer latency. As shown in Table [6](https://arxiv.org/html/2604.15039#S4.T6), the mean and P90 TTFT are reduced by 50% and 64%, respectively, compared to the homogeneous baseline.

#### 4.3.3 Comparison with Naive Heterogeneous PD

The comparison with homogeneous PD demonstrates the fundamental performance advantage of heterogeneous deployment; the comparison with naive heterogeneous PD further highlights the importance of scheduling in heterogeneous systems. In the naive heterogeneous PD configuration, no scheduling optimization is applied: all prefill is assigned to H200 GPUs and all decode to H20 GPUs, without length-based routing or load balancing. As shown in Table [6](https://arxiv.org/html/2604.15039#S4.T6), naive heterogeneous PD achieves only 1.16× the throughput of the homogeneous baseline, 25% lower than PrfaaS-PD. This degradation stems from the severe imbalance between prefill and decode throughput, and more fundamentally from treating heterogeneous prefill as a universal path rather than selectively offloading only the requests that benefit most from PrfaaS.

### 4.4 Summary

This case study demonstrates that for hybrid-architecture models, cross-datacenter KVCache becomes practical when it is paired with selective PrfaaS offloading rather than applied indiscriminately.

Regarding feasibility, the KVCache transfer of a hybrid model consumes only 13% of a 100 Gbps Ethernet link. This is well within reach of commodity Ethernet infrastructure and far below the bandwidth demands of RDMA interconnects, confirming the viability of cross-datacenter KVCache transfer.

Regarding effectiveness, the PrfaaS-PD configuration (32 H200 GPUs for PrfaaS, 64 H20 GPUs for local PD) achieves 54% higher throughput and 64% lower P90 TTFT than a 96-H20 homogeneous PD-only baseline; at equal cost, the throughput gain is approximately 15%. These gains stem from offloading compute-intensive long-context prefill to high-throughput PrfaaS accelerators. We note that H200 and H20 serve here as a representative hardware pair, not the sole viable combination; cost-effective prefill-specialized chips can further reduce deployment cost in production.

Moreover, the PrfaaS cluster is currently compute-bound with ample bandwidth headroom. Under larger-scale deployments or higher-bandwidth dedicated lines, the PrfaaS cluster can be further expanded to yield additional throughput gains. For example, in IDC-scale deployments with thousands of PrfaaS GPUs, the aggregate egress bandwidth required for KVCache transfer remains on the order of 1 Tbps, well within the capacity of modern datacenter fabrics, enabling further improvements in throughput and resource efficiency.

## 5 Discussion

Cross-datacenter KVCache extends PD disaggregation from a single tightly coupled cluster to loosely connected heterogeneous clusters. Its practicality depends on coordinated progress across model architecture, system design, and hardware. In this section, we discuss how these trends reinforce one another and what they imply for next-generation LLM serving systems.

##### KVCache-friendly model architecture.

As context windows continue to grow, KVCache increasingly dominates inference cost in storage and transfer. Architectural techniques such as MLA, sliding window attention, and linear attention have already shown that KVCache size can be reduced substantially without sacrificing model capability. Going forward, model co-design is likely to optimize not only FLOPs but also KVCache transfer volume. These improvements directly reduce the latency and bandwidth cost of cross-datacenter KVCache, expanding the range of deployments in which PrfaaS-PD is cost-effective.

##### KVCache compression and reuse.

Beyond architectural innovations, a growing body of work reduces LLM inference cost through KVCache compression and reuse at the algorithm or system level. Methods such as H2O [[35](https://arxiv.org/html/2604.15039#bib.bib13)] and KIVI [[14](https://arxiv.org/html/2604.15039#bib.bib11)] selectively evict or quantize KVCache entries to shrink memory footprints. CacheGen [[13](https://arxiv.org/html/2604.15039#bib.bib9)] applies conventional compression techniques to reduce KVCache transfer volume, while CacheBlend [[34](https://arxiv.org/html/2604.15039#bib.bib10)] and FusionRAG [[30](https://arxiv.org/html/2604.15039#bib.bib37)] enable reuse of approximately matched KVCache across requests. Together, these techniques complement KVCache-friendly model design by reducing effective memory pressure and network traffic, thereby making cross-datacenter KVCache more robust under production workloads.

##### Phase-specialized inference hardware.

The disaggregation between prefill and decode is also becoming visible in hardware design. Prefill is compute-intensive, whereas decode is dominated by memory bandwidth. Recent chip roadmaps are increasingly phase-specialized: NVIDIA Rubin CPX [[19](https://arxiv.org/html/2604.15039#bib.bib18)] emphasizes compute throughput for prefill, while chips such as the LPU [[8](https://arxiv.org/html/2604.15039#bib.bib25)] and Taalas HC1 [[25](https://arxiv.org/html/2604.15039#bib.bib26)] feature extremely high memory bandwidth for fast decode. Cross-datacenter KVCache fits this trend naturally: it removes the requirement that heterogeneous chips be deployed within the same tightly coupled network domain, allowing operators to size prefill and decode clusters independently and to deploy each phase on the hardware best suited to it.

## 6 Related Work

System optimization for online LLM inference has gradually shifted from monolithic single-cluster engines toward disaggregated, heterogeneity-aware, and KVCache-centric pipelines. On one hand, disaggregated serving splits each request into a compute-intensive prefill phase and a memory-bandwidth-intensive decode phase to reduce inter-phase interference and allow independent scaling. Splitwise [[21](https://arxiv.org/html/2604.15039#bib.bib1)] and DistServe [[37](https://arxiv.org/html/2604.15039#bib.bib2)] formulate PD disaggregation from cost/power and goodput perspectives, respectively, with corresponding deployment, scheduling, and placement strategies, showing that PD disaggregation can simultaneously improve throughput and reduce cost under appropriate SLO and hardware constraints. On the other hand, as cluster hardware and interconnects become increasingly heterogeneous, Helix [[15](https://arxiv.org/html/2604.15039#bib.bib3)], Hetis [[17](https://arxiv.org/html/2604.15039#bib.bib4)], and LLM-PQ [[36](https://arxiv.org/html/2604.15039#bib.bib5)] incorporate heterogeneous GPUs/networks and phase-level differences into the optimization space to achieve throughput or latency gains through phase-specialized hardware placement. Meanwhile, DynamoLLM [[24](https://arxiv.org/html/2604.15039#bib.bib6)] and FREESH [[9](https://arxiv.org/html/2604.15039#bib.bib7)] emphasize system-level resource reconfiguration and cross-domain scheduling from the perspective of energy, cost, and carbon efficiency while meeting serving SLOs. More critically, KVCache has evolved from per-request ephemeral state into a first-class system resource: Mooncake [[22](https://arxiv.org/html/2604.15039#bib.bib8)] introduces a global KVCache pool to improve cross-node and cross-request reuse; CacheBlend [[34](https://arxiv.org/html/2604.15039#bib.bib10)] and FusionRAG [[30](https://arxiv.org/html/2604.15039#bib.bib37)] significantly reduce TTFT through non-prefix KVCache fusion reuse; KIVI [[14](https://arxiv.org/html/2604.15039#bib.bib11)], KVQuant [[10](https://arxiv.org/html/2604.15039#bib.bib12)], and H2O [[35](https://arxiv.org/html/2604.15039#bib.bib13)] further shrink KVCache volume via quantization or importance-based eviction to improve long-context serviceability. However, no prior work jointly optimizes cross-datacenter prefill offloading, heterogeneous deployment, and bandwidth/cache-aware scheduling within a unified system. This paper approaches the problem from a cross-datacenter KVCache perspective, combining disaggregated inference with heterogeneous resource orchestration to build a low-cost, high-throughput serving system.

## 7 Conclusion

To address the practical deployment challenges of heterogeneous disaggregated inference, we propose the concept of cross-datacenter KVCache, extending disaggregated serving from single homogeneous clusters to cross-cluster heterogeneous deployments. On this basis, we design the PrfaaS-PD disaggregation architecture, which augments system serving throughput at low cost through heterogeneous PrfaaS clusters connected via commodity Ethernet. We envision that the cross-datacenter KVCache paradigm will co-evolve with next-generation models, hardware, and networks to enable highly efficient LLM serving at scale.

## References

- [1] D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker, T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell, et al. (2020) Think fast: a tensor streaming processor (TSP) for accelerating deep learning workloads. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 145–158.
- [2] S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025) gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.
- [3] I. AI (2026) Ring-2.5-1T. [https://github.com/inclusionAI/Ring-V2.5](https://github.com/inclusionAI/Ring-V2.5)
- [4] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023) GQA: training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901.
- [5] I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
- [6] R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- [7] L. Corp. (2026) SGLang. [https://github.com/sgl-project/sglang](https://github.com/sgl-project/sglang)
- [8] Groq (2025) What is a language processing unit? [https://groq.com/blog/the-groq-lpu-explained](https://groq.com/blog/the-groq-lpu-explained)
- [9] X. He, Z. Fang, J. Lian, D. H. Tsang, B. Zhang, and Y. Chen (2025) FREESH: fair, resource- and energy-efficient scheduling for LLM serving on heterogeneous GPUs. arXiv preprint arXiv:2511.00807.
- [10] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami (2024) KVQuant: towards 10 million context length LLM inference with KV cache quantization. Advances in Neural Information Processing Systems 37, pp. 1270–1303.
- [11] A. Huang, A. Li, A. Kong, B. Wang, B. Jiao, B. Dong, B. Wang, B. Chen, B. Li, B. Ma, et al. (2026) Step 3.5 Flash: open frontier-level intelligence with 11B active parameters. arXiv preprint arXiv:2602.10604.
- [12] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. (2024) DeepSeek-V2: a strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.
- [13] Y. Liu, H. Li, Y. Cheng, S. Ray, Y. Huang, Q. Zhang, K. Du, J. Yao, S. Lu, G. Ananthanarayanan, et al. (2024) CacheGen: KV cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pp. 38–56.
- [14] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu (2024) KIVI: a tuning-free asymmetric 2-bit quantization for KV cache. In International Conference on Machine Learning, pp. 32332–32344.
- \[15\]Y\. Mei, Y\. Zhuang, X\. Miao, J\. Yang, Z\. Jia, and R\. Vinayak\(2025\)Helix: serving large language models over heterogeneous gpus and network via max\-flow\.InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1,pp\. 586–602\.Cited by:[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
- \[16\]Minimax\(2026\)MiniMax m2\.5: built for real\-world productivity\.Note:[https://www\.minimax\.io/news/minimax\-m25](https://www.minimax.io/news/minimax-m25)Cited by:[Table 1](https://arxiv.org/html/2604.15039#S2.T1.4.6.5.1)\.
- \[17\]Z\. Mo, J\. Liao, H\. Xu, Z\. Zhou, and C\. Xu\(2025\)Hetis: serving llms in heterogeneous gpu clusters with fine\-grained and dynamic parallelism\.InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis,pp\. 1710–1724\.Cited by:[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
- \[18\]NVIDIA\(2025\)Nemotron 3 Nano: open, efficient mixture\-of\-experts hybrid Mamba\-Transformer model for Agentic reasoning\.Note:Technical reportExternal Links:[Link](https://arxiv.org/abs/2512.20848)Cited by:[§1](https://arxiv.org/html/2604.15039#S1.p3.1),[§2\.2](https://arxiv.org/html/2604.15039#S2.SS2.p2.1)\.
- \[19\]NVIDIA\(2025\)NVIDIA rubin cpx accelerates inference performance and efficiency for 1m\+ token context workloads\.Note:[https://developer\.nvidia\.com/blog/nvidia\-rubin\-cpx\-accelerates\-inference\-performance\-and\-efficiency\-for\-1m\-token\-context\-workloads/](https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/)Cited by:[§1](https://arxiv.org/html/2604.15039#S1.p1.1),[§5](https://arxiv.org/html/2604.15039#S5.SS0.SSS0.Px3.p1.1)\.
- \[20\]NVIDIA\(2026\)Dynamo\.Note:[https://github\.com/ai\-dynamo/dynamo](https://github.com/ai-dynamo/dynamo)Cited by:[§1](https://arxiv.org/html/2604.15039#S1.p1.1)\.
- \[21\]P\. Patel, E\. Choukse, C\. Zhang, A\. Shah, Í\. Goiri, S\. Maleki, and R\. Bianchini\(2024\)Splitwise: efficient generative llm inference using phase splitting\.In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture \(ISCA\),pp\. 118–132\.Cited by:[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
- \[22\]R\. Qin, Z\. Li, W\. He, J\. Cui, H\. Tang, F\. Ren, T\. Ma, S\. Cai, Y\. Zhang, M\. Zhang,et al\.\(2024\)Mooncake: a kvcache\-centric disaggregated architecture for llm serving\.ACM Transactions on Storage\.Cited by:[§1](https://arxiv.org/html/2604.15039#S1.p1.1),[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
- \[23\]Z\. Qin, W\. Sun, D\. Li, X\. Shen, W\. Sun, and Y\. Zhong\(2024\)Lightning attention\-2: a free lunch for handling unlimited sequence lengths in large language models\.External Links:2401\.04658Cited by:[Table 1](https://arxiv.org/html/2604.15039#S2.T1.4.5.4.2)\.
- \[24\]J\. Stojkovic, C\. Zhang, Í\. Goiri, J\. Torrellas, and E\. Choukse\(2025\)Dynamollm: designing llm inference clusters for performance and energy efficiency\.In2025 IEEE International Symposium on High Performance Computer Architecture \(HPCA\),pp\. 1348–1362\.Cited by:[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
- \[25\]Taalas\(2025\)Taalas hc1\.Note:[https://taalas\.com/products](https://taalas.com/products)Cited by:[§5](https://arxiv.org/html/2604.15039#S5.SS0.SSS0.Px3.p1.1)\.
- \[26\]K\. Team, Y\. Zhang, Z\. Lin, X\. Yao, J\. Hu, F\. Meng, C\. Liu, X\. Men, S\. Yang, Z\. Li,et al\.\(2025\)Kimi linear: an expressive, efficient attention architecture\.arXiv preprint arXiv:2510\.26692\.Cited by:[§1](https://arxiv.org/html/2604.15039#S1.p3.1),[§1](https://arxiv.org/html/2604.15039#S1.p6.1),[§2\.2](https://arxiv.org/html/2604.15039#S2.SS2.p2.1),[Table 1](https://arxiv.org/html/2604.15039#S2.T1.4.2.1.1),[Table 1](https://arxiv.org/html/2604.15039#S2.T1.4.2.1.2),[§4\.1](https://arxiv.org/html/2604.15039#S4.SS1.p2.3)\.
- \[27\]Q\. Team\(2026\)Qwen3\.5: towards native multimodal agents\.Note:[https://qwen\.ai/blog?id=qwen3\.5](https://qwen.ai/blog?id=qwen3.5)Cited by:[§1](https://arxiv.org/html/2604.15039#S1.p3.1),[§2\.2](https://arxiv.org/html/2604.15039#S2.SS2.p2.1),[Table 1](https://arxiv.org/html/2604.15039#S2.T1.4.4.3.1)\.
- \[28\]vLLM Team at IBM\(2025\)Hybrid models as first\-class citizens in vLLM\.Note:[https://pytorch\.org/blog/hybrid\-models\-as\-first\-class\-citizens\-in\-vllm/](https://pytorch.org/blog/hybrid-models-as-first-class-citizens-in-vllm/)PyTorch Blog, November 2025Cited by:[§3\.2](https://arxiv.org/html/2604.15039#S3.SS2.p2.1)\.
- \[29\]vLLM Team\(2026\)VLLM\.Note:[https://github\.com/vllm\-project/vllm](https://github.com/vllm-project/vllm)Cited by:[§1](https://arxiv.org/html/2604.15039#S1.p1.1),[Table 5](https://arxiv.org/html/2604.15039#S4.T5),[Table 5](https://arxiv.org/html/2604.15039#S4.T5.8.4)\.
- \[30\]J\. Wang, W\. Xie, M\. Zhang, B\. Zhang, J\. Dong, Y\. Zhu, C\. Lin, J\. Tang, Y\. Han, Z\. Ai,et al\.\(2026\)From prefix cache to fusion rag cache: accelerating llm inference in retrieval\-augmented generation\.arXiv preprint arXiv:2601\.12904\.Cited by:[§5](https://arxiv.org/html/2604.15039#S5.SS0.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
- \[31\]B\. Xiao, B\. Xia, B\. Yang, B\. Gao, B\. Shen, C\. Zhang, C\. He, C\. Lou, F\. Luo, G\. Wang,et al\.\(2026\)Mimo\-v2\-flash technical report\.arXiv preprint arXiv:2601\.02780\.Cited by:[§1](https://arxiv.org/html/2604.15039#S1.p3.1),[§2\.2](https://arxiv.org/html/2604.15039#S2.SS2.p2.1),[Table 1](https://arxiv.org/html/2604.15039#S2.T1.4.3.2.1)\.
- \[32\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[Table 1](https://arxiv.org/html/2604.15039#S2.T1.4.7.6.1)\.
- \[33\]S\. Yang, J\. Kautz, and A\. Hatamizadeh\(2025\)Gated delta networks: improving mamba2 with delta rule\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Table 1](https://arxiv.org/html/2604.15039#S2.T1.4.4.3.2)\.
- \[34\]J\. Yao, H\. Li, Y\. Liu, S\. Ray, Y\. Cheng, Q\. Zhang, K\. Du, S\. Lu, and J\. Jiang\(2025\)Cacheblend: fast large language model serving for rag with cached knowledge fusion\.InProceedings of the twentieth European conference on computer systems,pp\. 94–109\.Cited by:[§5](https://arxiv.org/html/2604.15039#S5.SS0.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
- \[35\]Z\. Zhang, Y\. Sheng, T\. Zhou, T\. Chen, L\. Zheng, R\. Cai, Z\. Song, Y\. Tian, C\. Ré, C\. Barrett,et al\.\(2023\)H2o: heavy\-hitter oracle for efficient generative inference of large language models\.Advances in Neural Information Processing Systems36,pp\. 34661–34710\.Cited by:[§5](https://arxiv.org/html/2604.15039#S5.SS0.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
- \[36\]J\. Zhao, B\. Wan, C\. Wu, Y\. Peng, and H\. Lin\(2024\)Llm\-pq: serving llm on heterogeneous clusters with phase\-aware partition and adaptive quantization\.InProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming,pp\. 460–462\.Cited by:[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
- \[37\]Y\. Zhong, S\. Liu, J\. Chen, J\. Hu, Y\. Zhu, X\. Liu, X\. Jin, and H\. Zhang\(2024\)DistServe: disaggregating prefill and decoding for goodput\-optimized large language model serving\.In18th USENIX Symposium on Operating Systems Design and Implementation \(OSDI 24\),pp\. 193–210\.Cited by:[§6](https://arxiv.org/html/2604.15039#S6.p1.1)\.
