CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation

arXiv cs.AI Papers

Summary

CONCORD is an asynchronous sparse aggregation framework for retrieval-augmented generation (RAG) under document isolation in device-cloud setups. It improves throughput and reduces communication by orders of magnitude by treating the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator.

arXiv:2606.15179v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) has emerged as a pivotal technique for improving language models by incorporating external knowledge at inference time. As device-cloud collaborative inference makes it feasible to deploy small language models on edge devices, a new setting arises in which private documents remain on the device and public knowledge resides in the cloud. Privacy and policy constraints often forbid raw document exchange, creating a document-isolated dual-end RAG setting. However, existing methods rely on frequent remote synchronization and dense evidence transfer, limiting throughput under realistic latency and bandwidth conditions. To address this issue, we propose CONCORD, an asynchronous sparse aggregation framework for dual-end RAG under document isolation. CONCORD treats the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator. Specifically, we introduce waiting debt control to decide whether each decoding step should continue waiting for remote participation based on the observed return of waiting. We also design a certificate-guided minimal supplementation mechanism that requests only the remote evidence needed to determine the current greedy decision. Steps that consult the cloud preserve the same greedy token as dense dual-end aggregation, while the remaining steps commit locally without remote evidence. Experiments on Natural Questions and WikiText-2 show that CONCORD improves end-to-end throughput over baselines by $1.66\times$ and $2.15\times$, respectively, while reducing per-token communication by over two orders of magnitude and maintaining comparable answer quality and perplexity.

Cached at: 06/16/26, 11:44 AM

# CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation
Source: [https://arxiv.org/html/2606.15179](https://arxiv.org/html/2606.15179)
Xuedong Hu$^{1,2}$, Zhiqing Tang$^{2,1}$ ✉, Zhi Yao$^{3,2}$, Tian Wang$^{2,5,6}$, and Weijia Jia$^{2,4}$

Email: {xuedonghu, yaozhi}@mail.bnu.edu.cn, {zhiqingtang, tianwang, jiawj}@bnu.edu.cn

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62302048, Grant U25A20436, and Grant 62272050; in part by Guangdong Higher Education Association under Grant 24GQN97; in part by the Guangdong Provincial Higher Education Institutions under Grant 2024KTSCX219; and in part by Beijing Normal University at Zhuhai Education Reform Project under Grant jx2025037. (Corresponding author: Zhiqing Tang.)

###### Abstract

Retrieval-augmented generation (RAG) has emerged as a pivotal technique for improving language models by incorporating external knowledge at inference time. As device-cloud collaborative inference makes it feasible to deploy small language models on edge devices, a new setting arises in which private documents remain on the device and public knowledge resides in the cloud. Privacy and policy constraints often forbid raw document exchange, creating a document-isolated dual-end RAG setting. However, existing methods rely on frequent remote synchronization and dense evidence transfer, limiting throughput under realistic latency and bandwidth conditions. To address this issue, we propose CONCORD, an asynchronous sparse aggregation framework for dual-end RAG under document isolation. CONCORD treats the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator. Specifically, we introduce waiting debt control to decide whether each decoding step should continue waiting for remote participation based on the observed return of waiting. We also design a certificate-guided minimal supplementation mechanism that requests only the remote evidence needed to determine the current greedy decision. Steps that consult the cloud preserve the same greedy token as dense dual-end aggregation, while the remaining steps commit locally without remote evidence. Experiments on Natural Questions and WikiText-2 show that CONCORD improves end-to-end throughput over baselines by $1.66\times$ and $2.15\times$, respectively, while reducing per-token communication by over two orders of magnitude and maintaining comparable answer quality and perplexity.

## I Introduction

Retrieval\-augmented generation \(RAG\) has become an effective way to improve compact language models by incorporating external evidence at inference time rather than relying only on parametric memory\[[22](https://arxiv.org/html/2606.15179#bib.bib1),[11](https://arxiv.org/html/2606.15179#bib.bib2),[20](https://arxiv.org/html/2606.15179#bib.bib3),[15](https://arxiv.org/html/2606.15179#bib.bib4),[4](https://arxiv.org/html/2606.15179#bib.bib5),[32](https://arxiv.org/html/2606.15179#bib.bib6),[2](https://arxiv.org/html/2606.15179#bib.bib7),[16](https://arxiv.org/html/2606.15179#bib.bib8)\]\. In parallel, recent advances in device\-cloud collaborative inference have made it increasingly feasible to deploy small language models \(SLMs\) on edge devices while selectively relying on the cloud for additional computation or knowledge access\[[19](https://arxiv.org/html/2606.15179#bib.bib9),[41](https://arxiv.org/html/2606.15179#bib.bib10),[23](https://arxiv.org/html/2606.15179#bib.bib11),[9](https://arxiv.org/html/2606.15179#bib.bib12),[12](https://arxiv.org/html/2606.15179#bib.bib13),[31](https://arxiv.org/html/2606.15179#bib.bib14),[18](https://arxiv.org/html/2606.15179#bib.bib15),[29](https://arxiv.org/html/2606.15179#bib.bib16),[36](https://arxiv.org/html/2606.15179#bib.bib17),[34](https://arxiv.org/html/2606.15179#bib.bib37),[37](https://arxiv.org/html/2606.15179#bib.bib38)\]\. When these two trends converge, a new question arises: how should the device and the cloud jointly organize retrieval\-augmented generation when each side holds its own documents?

This question is particularly relevant in personal\-data applications, where the device serves local and privacy\-sensitive context while the cloud provides access to broader public knowledge\. Both sides may hold documents relevant to the next\-token decision, yet privacy, policy, or system constraints often forbid raw document exchange across the two ends\[[26](https://arxiv.org/html/2606.15179#bib.bib18)\], forming a document\-isolated dual\-end RAG setting\. The central problem is therefore not only how to use retrieval, but how to coordinate retrieval and generation across the device and the cloud under document isolation\.

Existing approaches fall into two streams\. The first reduces the dual\-end problem to single\-side generation by transferring remote documents or representations to one side\[[11](https://arxiv.org/html/2606.15179#bib.bib2),[22](https://arxiv.org/html/2606.15179#bib.bib1),[15](https://arxiv.org/html/2606.15179#bib.bib4),[4](https://arxiv.org/html/2606.15179#bib.bib5),[32](https://arxiv.org/html/2606.15179#bib.bib6),[2](https://arxiv.org/html/2606.15179#bib.bib7),[16](https://arxiv.org/html/2606.15179#bib.bib8),[17](https://arxiv.org/html/2606.15179#bib.bib19),[27](https://arxiv.org/html/2606.15179#bib.bib20)\], but such transfers either rebuild prefix\-dependent states or incur growing communication overhead\[[17](https://arxiv.org/html/2606.15179#bib.bib19),[27](https://arxiv.org/html/2606.15179#bib.bib20),[26](https://arxiv.org/html/2606.15179#bib.bib18)\]\. The second stream keeps both sides jointly involved and performs exact aggregation online\[[26](https://arxiv.org/html/2606.15179#bib.bib18)\]\. Both streams assume that remote evidence must be made fully available—either by transferring it to one side upfront or by exchanging dense representations at every decoding step\. An approach that sparsifies remote participation over both time and communication is therefore needed\. To this end, two coupled challenges must be addressed\.

![Refer to caption](https://arxiv.org/html/2606.15179v1/x1.png)

Figure 1: Teaser comparison between exact synchronous aggregation and CONCORD.

The first challenge is how to decide whether a decoding step should continue waiting for remote participation. Existing dual-end methods use speculative scheduling to overlap communication with decoding, keeping the remote side on or near the critical path at nearly every step \[[26](https://arxiv.org/html/2606.15179#bib.bib18)\]. Hiding latency through speculation alone does not remove the cost of frequent rollback and realignment when remote drafts are rejected \[[21](https://arxiv.org/html/2606.15179#bib.bib28),[6](https://arxiv.org/html/2606.15179#bib.bib29)\]. In realistic deployments, the arrival delay of remote evidence reflects not only network conditions but also these residual rollback effects. As this delay accumulates, the local side increasingly has enough evidence to decide on its own, so continued waiting mainly adds latency without changing the output.

The second challenge is how to determine the minimal remote evidence needed to resolve the current greedy decision. When a step does consult the remote side, both centralized and dual-end methods transfer full evidence representations, whether as complete KV states \[[19](https://arxiv.org/html/2606.15179#bib.bib9),[23](https://arxiv.org/html/2606.15179#bib.bib11),[9](https://arxiv.org/html/2606.15179#bib.bib12),[12](https://arxiv.org/html/2606.15179#bib.bib13)\] or dense remote score vectors \[[26](https://arxiv.org/html/2606.15179#bib.bib18)\]. Yet language-model output distributions are typically concentrated \[[21](https://arxiv.org/html/2606.15179#bib.bib28),[6](https://arxiv.org/html/2606.15179#bib.bib29)\]: a small number of candidates occupy most of the probability mass, and only these candidates can affect the final decision. Transferring evidence for the remaining candidates wastes bandwidth without changing the committed token.

These two challenges are tightly coupled: reducing waiting alone is insufficient if consultation still triggers large remote transfers, while reducing payload alone is insufficient if the remote side is still invoked too frequently\. The key issue is how to jointly control when remote participation should occur and how much evidence should be revealed once consultation begins\. Figure[1](https://arxiv.org/html/2606.15179#S1.F1)illustrates this contrast\.

In this paper, we present CONCORD, an asynchronous sparse aggregation framework for dual\-end RAG under document isolation\. CONCORD introduces two coupled mechanisms\. First, waiting debt control uses the observed return of waiting to decide whether the current decoding step should continue waiting for remote participation, and shortens waiting when further waiting is unlikely to pay off\. Second, certificate\-guided minimal supplementation requests only the token\-level remote evidence that may still change the current greedy decision, and stops as soon as sufficient evidence is available\. In this way, CONCORD treats the cloud not as a continuously synchronized co\-generator, but as an asynchronously arriving source of remote evidence\. Unlike speculative scheduling, which keeps the remote side on the critical path and hides latency through parallelism, CONCORD removes the remote side from the critical path entirely on steps where local evidence suffices\. Under greedy decoding, once a step enters the supplementation pipeline, certificate success or exact fallback returns the same greedy token as dense consulted\-step aggregation, so the communication\-layer exactness target is preserved on consulted steps\. Together, these two components decouple the temporal decision of whether remote participation should occur from the communication decision of how much remote evidence should be revealed once consultation begins\.

The main contributions are summarized as follows\.

1. Dual-end sparse aggregation formulation and CONCORD framework. We formulate dual-end RAG under document isolation as jointly minimizing generation loss and device-cloud interaction cost along two dimensions: the temporal frequency of remote participation and the per-consultation communication payload. Based on this formulation, we propose CONCORD, an asynchronous sparse aggregation framework that reformulates the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator.
2. Waiting debt control and certificate-guided minimal supplementation. We design two coupled mechanisms to address the two challenges identified above. Waiting debt control tracks the observed return of remote participation through a lightweight debt queue and adaptively shortens waiting when further blocking is unlikely to change the output, with only $O(1)$ additional computation per token. Certificate-guided minimal supplementation uses an upper-bound test to certify the current greedy decision early, and requests only the ambiguity-critical token ids when certification fails, preserving the dense consulted-step greedy result through certificate success or exact fallback.
3. Evaluation on Natural Questions and WikiText-2. Experiments including both controlled single-machine and real two-machine deployments show that CONCORD improves end-to-end throughput over DRAGON by about $1.66\times$ and $2.15\times$, respectively, while reducing per-token communication by over two orders of magnitude on both tasks and maintaining comparable answer quality and perplexity.

The remainder of this paper is organized as follows\. Section[II](https://arxiv.org/html/2606.15179#S2)reviews related work\. Section[III](https://arxiv.org/html/2606.15179#S3)introduces the preliminaries and problem formulation\. Section[IV](https://arxiv.org/html/2606.15179#S4)presents the methodology of CONCORD\. Section[V](https://arxiv.org/html/2606.15179#S5)reports experimental results\. Finally, Section[VI](https://arxiv.org/html/2606.15179#S6)concludes the paper\.

## II Related Works

RAG with external memory and multiple documents. Early retrieval-augmented language models such as REALM and kNN-LM attached non-parametric memory or nearest-neighbor retrieval to language-model inference and training \[[11](https://arxiv.org/html/2606.15179#bib.bib2),[20](https://arxiv.org/html/2606.15179#bib.bib3)\]. Subsequent work, including RAG, FiD, RETRO, REPLUG, SELF-RAG, and Atlas, strengthened multi-document reasoning through retrieval-conditioned decoding or training \[[22](https://arxiv.org/html/2606.15179#bib.bib1),[15](https://arxiv.org/html/2606.15179#bib.bib4),[4](https://arxiv.org/html/2606.15179#bib.bib5),[32](https://arxiv.org/html/2606.15179#bib.bib6),[2](https://arxiv.org/html/2606.15179#bib.bib7),[16](https://arxiv.org/html/2606.15179#bib.bib8)\]. RAGCache and Block-Attention further reduced the serving overhead of long retrieved context through cache reuse and more efficient prefilling \[[17](https://arxiv.org/html/2606.15179#bib.bib19),[27](https://arxiv.org/html/2606.15179#bib.bib20)\]. All these methods assume that retrieved evidence can be centralized into a single decoding context, either by merging documents directly or by transferring cached states. Under document isolation, however, raw cross-end document exchange is forbidden, so the aggregation must happen in the output space rather than in the input context.

Edge-cloud collaborative inference and service systems. Collaborative intelligence systems have explored how to split DNN execution across the device, edge, and cloud for latency, energy, or privacy goals \[[19](https://arxiv.org/html/2606.15179#bib.bib9),[41](https://arxiv.org/html/2606.15179#bib.bib10),[23](https://arxiv.org/html/2606.15179#bib.bib11),[9](https://arxiv.org/html/2606.15179#bib.bib12),[12](https://arxiv.org/html/2606.15179#bib.bib13),[14](https://arxiv.org/html/2606.15179#bib.bib21)\]. Service-level work further studied edge-cloud coordination under QoS and resource constraints \[[33](https://arxiv.org/html/2606.15179#bib.bib22),[35](https://arxiv.org/html/2606.15179#bib.bib23),[30](https://arxiv.org/html/2606.15179#bib.bib24)\]. More recently, CE-CoLLM, CEED, C2F, EdgeLLM, and VELO have revisited collaborative inference for large or foundation models, addressing partitioning, context-aware offloading, QoS optimization, or model-side acceleration \[[18](https://arxiv.org/html/2606.15179#bib.bib15),[7](https://arxiv.org/html/2606.15179#bib.bib25),[39](https://arxiv.org/html/2606.15179#bib.bib26),[10](https://arxiv.org/html/2606.15179#bib.bib27),[36](https://arxiv.org/html/2606.15179#bib.bib17),[37](https://arxiv.org/html/2606.15179#bib.bib38),[38](https://arxiv.org/html/2606.15179#bib.bib39),[24](https://arxiv.org/html/2606.15179#bib.bib40),[28](https://arxiv.org/html/2606.15179#bib.bib41)\]. DRAGON extended this line to distributed RAG, formulating exact output aggregation across device-side and cloud-side corpora \[[26](https://arxiv.org/html/2606.15179#bib.bib18)\]. None of these methods, however, addresses the question of when remote participation is worth its cost or how much remote evidence is actually needed per consultation. They either assume continuous joint execution or focus on one-time partitioning decisions, leaving the sparse-participation regime unexplored.

Speculative decoding and LLM serving. Speculative decoding accelerates autoregressive generation by letting a draft process run ahead of a verifier and then accepting or correcting the draft \[[21](https://arxiv.org/html/2606.15179#bib.bib28),[6](https://arxiv.org/html/2606.15179#bib.bib29),[5](https://arxiv.org/html/2606.15179#bib.bib30),[3](https://arxiv.org/html/2606.15179#bib.bib31)\]. Recent variants such as recurrent drafters, EAGLE, and AdaSpec further improved drafting efficiency or SLO-aware serving \[[8](https://arxiv.org/html/2606.15179#bib.bib32),[25](https://arxiv.org/html/2606.15179#bib.bib33),[13](https://arxiv.org/html/2606.15179#bib.bib34)\]. Serving systems such as DistServe and Sarathi-Serve optimized prefill-decode scheduling and throughput-latency tradeoffs for centralized LLM stacks \[[40](https://arxiv.org/html/2606.15179#bib.bib35),[1](https://arxiv.org/html/2606.15179#bib.bib36)\]. These ideas informed our use of rollback, preemption, and asynchronous execution, but they target single-model or centralized pipelines and do not address the dual-end question of when to consult the remote side and how much evidence to transfer per consultation.

In summary, existing RAG methods assume centralized evidence access, edge\-cloud collaborative systems assume continuous joint execution, and speculative decoding targets single\-model pipelines\. CONCORD bridges these lines by introducing sparse, asynchronous remote participation for exact dual\-end aggregation under document isolation\.

## III Preliminaries

### III-A Retrieval-Augmented Generation

In retrieval-augmented generation (RAG), a language model incorporates external documents retrieved from a corpus at inference time rather than relying only on parametric memory \[[11](https://arxiv.org/html/2606.15179#bib.bib2),[22](https://arxiv.org/html/2606.15179#bib.bib1),[15](https://arxiv.org/html/2606.15179#bib.bib4),[4](https://arxiv.org/html/2606.15179#bib.bib5),[32](https://arxiv.org/html/2606.15179#bib.bib6),[2](https://arxiv.org/html/2606.15179#bib.bib7),[16](https://arxiv.org/html/2606.15179#bib.bib8)\]. Given the current prefix $x_{<t}$ and a retrieved document set $D$, each document $d \in D$ induces a document-conditioned next-token distribution $p(x_t \mid d, x_{<t})$. Following the output-aggregation view of multi-document RAG \[[22](https://arxiv.org/html/2606.15179#bib.bib1),[32](https://arxiv.org/html/2606.15179#bib.bib6)\], the target next-token distribution is

$$p(x_t \mid x_{<t}) = \sum_{d \in D} \omega_t(d)\, p(x_t \mid d, x_{<t}), \tag{1}$$

where $\omega_t(d)$ is the relevance weight of document $d$ at step $t$. In practice, the summation runs over a top-$k$ retrieved subset. Equation ([1](https://arxiv.org/html/2606.15179#S3.E1)) shows that the aggregation operates on document-conditioned output distributions, or equivalently on their vocabulary-level scores. This view suits distributed settings because it separates document-local inference from the final decision over the shared vocabulary.
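The output-space mixture in Equation (1) can be illustrated with a minimal sketch. The function name `aggregate_next_token`, the toy three-token vocabulary, and the normalisation of the relevance weights to sum to one are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def aggregate_next_token(doc_probs, weights):
    """Mix per-document next-token distributions as in Equation (1).

    doc_probs: array of shape (num_docs, vocab), one distribution per document.
    weights:   array of shape (num_docs,), relevance weights omega_t(d).
    """
    doc_probs = np.asarray(doc_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Assumption: weights are normalised so the mixture is a distribution.
    weights = weights / weights.sum()
    return weights @ doc_probs

# Toy example: two retrieved documents, vocabulary of three tokens.
doc_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.6, 0.3]]
weights = [0.75, 0.25]
mixed = aggregate_next_token(doc_probs, weights)
greedy_token = int(np.argmax(mixed))
```

Each document contributes only its own output distribution, so the per-document inference can run wherever the document lives; only vocabulary-level scores need to meet at the aggregation point.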

### III-B Device-Cloud Distributed RAG

To support personal-data applications, we consider a dual-end setting in which retrieved evidence is naturally split across the device and the cloud. Let $D^{\text{device}}$ denote the device-side private document set and $D^{\text{cloud}}$ denote the cloud-side public document set, with

$$D = D^{\text{device}} \cup D^{\text{cloud}}.$$

Each side runs its own retrieval and document-conditioned inference without exchanging raw documents. At decoding step $t$, the device and cloud produce vocabulary-level scores from $D^{\text{device}}$ and $D^{\text{cloud}}$, respectively. The final distribution is obtained by aggregating contributions from both sides in the output space rather than by merging documents into one centralized context.

This workflow is a distributed version of Equation ([1](https://arxiv.org/html/2606.15179#S3.E1)). Working in the log domain, we write the device-side and cloud-side log-scores as $\ell_L^t(v)$ and $\ell_R^t(v)$ for each token $v \in V$. Let $\pi_L^t = \sum_{d \in D^{\text{device}}} \omega_t(d)$ and $\pi_R^t = \sum_{d \in D^{\text{cloud}}} \omega_t(d)$ be the per-side relevance weights. The aggregated log-score $\ell_T^t(v)$ is then obtained by combining both sides through a weighted log-sum-exp operation; the exact formula is given in Section [IV-C](https://arxiv.org/html/2606.15179#S4.SS3). Greedy decoding returns

$$y_t = \arg\max_{v \in V} \ell_T^t(v).$$

Unlike centralized RAG, the two sides cannot exchange documents and rerun a single-model pipeline. How to coordinate this process efficiently under document isolation is the problem we formulate next.
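As a rough sketch of the dense dual-end target, the weighted log-sum-exp combination and the greedy decision can be written as follows. The function name `dual_end_greedy` and the toy scores are hypothetical, and the sketch assumes each side's log-scores are already normalised log-probabilities.

```python
import numpy as np

def dual_end_greedy(log_l, log_r, pi_l, pi_r):
    """Combine device and cloud log-scores and return the greedy token.

    log_l, log_r: per-token log-scores from the device and the cloud.
    pi_l, pi_r:   per-side relevance weights.
    """
    # Weighted log-sum-exp: log(pi_l * exp(log_l) + pi_r * exp(log_r)).
    total = np.logaddexp(np.log(pi_l) + log_l, np.log(pi_r) + log_r)
    return int(np.argmax(total)), total

# Toy example where the cloud evidence overturns the local choice.
log_l = np.log([0.6, 0.3, 0.1])
log_r = np.log([0.1, 0.8, 0.1])
local_token = int(np.argmax(log_l))
dense_token, total = dual_end_greedy(log_l, log_r, pi_l=0.5, pi_r=0.5)
```

In this toy case the device alone would pick token 0, while the aggregated score favours token 1. Steps like this one are the positions where remote evidence matters.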

### III-C Problem Formulation

The goal of dual-end aggregation is to preserve decision quality while reducing the system cost induced by device-cloud interaction. Let $F = (F^{\text{device}}, F^{\text{cloud}})$ denote the joint generation strategy of the two sides. At each decoding step, the system may exchange auxiliary information $\mathcal{M}_t$ between the two sides and then produce a distribution over the next token. The optimization objective is to jointly minimize generation loss and interaction cost:

$$\min_{F} \; \frac{1}{T} \sum_{t=1}^{T} \Big(\mathcal{L}_t + \lambda\, C_t(\mathcal{M}_t, F)\Big),$$

where $\mathcal{L}_t$ is the task loss at step $t$, $C_t(\mathcal{M}_t, F)$ is the system cost from communication and waiting, and $\lambda$ controls the quality-efficiency tradeoff. Exact dual-end methods such as DRAGON \[[26](https://arxiv.org/html/2606.15179#bib.bib18)\] reduce this cost mainly by hiding synchronization delay through speculative scheduling. We instead aim to reduce the frequency of remote participation and the volume of remote evidence transferred per consultation.

This formulation highlights two dimensions that drive the rest of our design\. Along the temporal dimension, the system must decide whether the current step should continue waiting for remote participation\. Along the communication dimension, it must determine how much remote evidence to reveal before the greedy decision is fixed\. CONCORD addresses both dimensions jointly while preserving the output\-aggregation target defined above for consulted steps\.

## IV Methodology

### IV-A Framework Overview

As illustrated in Figure [2](https://arxiv.org/html/2606.15179#S4.F2), CONCORD consists of a local decision branch and a remote evidence branch. The device continues decoding with private evidence, while the cloud prepares public evidence asynchronously. Following the optimization objective in Section [III](https://arxiv.org/html/2606.15179#S3), CONCORD reduces end-to-end system cost along both the temporal and communication dimensions while preserving the output-aggregation target $y_t = \arg\max_{v \in V} \ell_T^t(v)$ on steps that enter remote consultation. It does so through two coupled mechanisms: waiting control and communication control.

![Refer to caption](https://arxiv.org/html/2606.15179v1/x2.png)

Figure 2: Overview of CONCORD.

### IV-B Waiting Control

The design of waiting control is motivated by an empirical observation that we verify in Section[V](https://arxiv.org/html/2606.15179#S5)\. In most decoding steps, the local decision already matches the dense dual\-end result, and only a small number of positions remain sensitive to remote evidence\. The cloud therefore serves as a*proofreading*source rather than a continuously synchronized co\-generator\. Instead of aggregating remote evidence at every step, the remote side intervenes only where public evidence may still overturn the local decision\.

We therefore replace dual\-end synchronous progression with asynchronous progression\. The device keeps generating with private evidence, while the cloud prepares proofreading evidence in the background and sends it whenever ready\. If the received evidence is sufficient to confirm the current local decision, the device keeps the local token and moves on; the concrete certificate rule is described in Section[IV\-C](https://arxiv.org/html/2606.15179#S4.SS3)\. If the local and remote drafts disagree, the system evaluates the aggregated distribution before committing\. The losing side then rolls back and realigns to the committed token\.

In this asynchronous framework, the main throughput bottleneck is rollback\-induced waiting\. Each rollback is not only a cost for the current step; it also propagates delay to later remote arrivals because the losing side must realign before preparing useful evidence again\. Frequent rejection of local drafts indicates that the remote side is providing useful corrections, so preparing evidence ahead of time is worthwhile\. Conversely, frequent rejection of remote drafts means the device waits for evidence that is ultimately discarded\. Rejection behavior is therefore a direct signal of the value of waiting\. We design a waiting controller that adapts the waiting budget according to the observed long\-run return of remote participation\.

Let $T_t^L$ be the time when the local side enters the decision at step $t$, and let $T_t^R$ be the time when the current remote evidence becomes consumable. The remote arrival delay is

$$D_t = T_t^R - T_t^L.$$

This delay reflects both the current network and computation gap and the residual effect of earlier rollback. When $D_t$ exceeds the current waiting budget $\tau_t$, the step times out. As timeouts accumulate, the local side is increasingly likely to have enough evidence to decide on its own, so the marginal value of waiting declines. To suppress long-run low-return waiting, we introduce a waiting debt queue

$$q_{t+1} = \min\Big(Q_{\max}, \big[q_t + \xi_t - \rho_a \alpha_t\big]_+\Big),$$

where $\xi_t \in \{0,1\}$ indicates that the current step times out or fails to obtain useful remote evidence, $\alpha_t \in \{0,1\}$ indicates that remote evidence is effectively absorbed, and $\rho_a > 0$ controls the repayment rate. Timeout or unproductive waiting increases debt, while effective proofreading repays it. The system then selects one of two waiting budgets:

$$\tau_t = \begin{cases} \min\{\tau_{\mathrm{base}}, \tau_{\mathrm{short}}(q_t)\}, & q_t \geq Q, \\ \tau_{\mathrm{base}}, & q_t < Q. \end{cases}$$

Algorithm [1](https://arxiv.org/html/2606.15179#algorithm1) summarizes the resulting discrete rule.

Algorithm 1: Waiting Debt Control

Input: debt $q_t$, waiting budgets $\tau_{\mathrm{base}}$ and $\tau_{\mathrm{short}}(\cdot)$, threshold $Q$, debt cap $Q_{\max}$, repayment factor $\rho_a$, local and remote drafts.

Output: committed token and updated debt $q_{t+1}$.

1. If $q_t \geq Q$, set $\tau_t \leftarrow \min\{\tau_{\mathrm{base}}, \tau_{\mathrm{short}}(q_t)\}$; otherwise set $\tau_t \leftarrow \tau_{\mathrm{base}}$.
2. Wait up to $\tau_t$ for the remote draft.
3. If the wait times out: commit the local token without remote certification; set $q_{t+1} \leftarrow \min(Q_{\max}, q_t + 1)$; clear the remote cache and advance the remote epoch; return.
4. Aggregate the local and remote drafts, and obtain $y_t$ and $\alpha_t$.
5. If $\alpha_t = 1$, set $q_{t+1} \leftarrow \max(0, q_t - \rho_a)$.
6. Otherwise: submit the local token and realign the remote state; set $q_{t+1} \leftarrow \min(Q_{\max}, q_t + 1)$; clear the remote cache and advance the remote epoch.

When debt is low, the system uses the base waiting budget and gives the cloud normal opportunities to participate. Once debt crosses the threshold $Q$, the system shortens waiting and lets the device move forward earlier. If remote evidence later becomes useful again, debt is repaid automatically and the controller relaxes back toward the base regime.
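The debt update and budget selection described above can be sketched as a small controller. This is an illustration under assumptions, not the authors' implementation: the class name, the default parameter values, and in particular the form of `tau_short` (the text leaves $\tau_{\mathrm{short}}(\cdot)$ unspecified) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class WaitingDebtController:
    tau_base: float = 0.050   # base waiting budget in seconds (hypothetical)
    threshold: float = 3.0    # Q: debt level that triggers short waiting
    debt_cap: float = 8.0     # Q_max
    repay: float = 1.0        # rho_a: repayment per absorbed remote draft
    debt: float = 0.0         # q_t

    def tau_short(self, debt: float) -> float:
        # Hypothetical shortening rule: shrink the budget as debt grows.
        return self.tau_base / (1.0 + debt)

    def budget(self) -> float:
        """Select tau_t from the current debt."""
        if self.debt >= self.threshold:
            return min(self.tau_base, self.tau_short(self.debt))
        return self.tau_base

    def update(self, timed_out: bool, absorbed: bool) -> float:
        """Apply q_{t+1} = min(Q_max, [q_t + xi_t - rho_a * alpha_t]_+)."""
        alpha = 1 if (absorbed and not timed_out) else 0
        xi = 1 - alpha  # timeout or unproductive waiting adds debt
        self.debt = min(self.debt_cap,
                        max(0.0, self.debt + xi - self.repay * alpha))
        return self.debt

# Three consecutive timeouts push the controller into the short regime.
ctrl = WaitingDebtController()
for _ in range(3):
    ctrl.update(timed_out=True, absorbed=False)
short_budget = ctrl.budget()

# One absorbed remote draft repays debt and restores the base budget.
ctrl.update(timed_out=False, absorbed=True)
relaxed_budget = ctrl.budget()
```

The controller keeps a single scalar of state and does constant work per step, which matches the $O(1)$ per-token overhead claimed for waiting debt control.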

### IV-C Communication Control

Language\-model output distributions are typically concentrated on a small number of candidates that occupy most of the probability mass\. Full remote transfer is therefore usually unnecessary; only the candidates competitive enough to overturn the current greedy decision matter\. We design the communication layer around this residual ambiguity and apply progressive supplementation guided by an upper\-bound certificate, transmitting only the ambiguity\-relevant remote evidence while preserving the dense consulted\-step greedy result\.

For any consulted step, let $\Omega_t = V$ denote the complete remote reference representation and let

$$y_t^{\mathrm{dense}} = \arg\max_{v \in \Omega_t} \ell_T^t(v)$$

be the greedy token under dense aggregation. The local side first receives a known subset $K_t \subseteq \Omega_t$ with exact cloud scores $\hat{\ell}_R^t(v)$ for $v \in K_t$ and a tail bound $b_t$ on the unrevealed remainder. Using the per-side weights $\pi_L^t$ and $\pi_R^t$ defined in Section [III](https://arxiv.org/html/2606.15179#S3) and writing $\operatorname{LSE}(a, b) = \log(e^{a} + e^{b})$, the exact aggregated score for every revealed candidate $v \in K_t$ is

$$\ell_T^t(v) = \operatorname{LSE}\big(\log \pi_L^t + \ell_L^t(v),\ \log \pi_R^t + \hat{\ell}_R^t(v)\big).$$

Let the current revealed maximum be

$$M_t = \max_{v \in K_t} \ell_T^t(v).$$

For any unrevealed candidate $v \in \Omega_t \setminus K_t$, the most optimistic score it can still attain under the current tail bound is

$$U_t(v) = \operatorname{LSE}\big(\log \pi_L^t + \ell_L^t(v),\ \log \pi_R^t + b_t\big).$$

If

$$M_t > \max_{v \in \Omega_t \setminus K_t} U_t(v) + \delta,$$

where $\delta \geq 0$ is a certification tolerance, then the current incumbent is already certified and

$$\arg\max_{v \in K_t} \ell_T^t(v) = y_t^{\mathrm{dense}}.$$

Otherwise we collect the still-competitive candidates in the ambiguity set

$$A_t = \left\{ v \in \Omega_t \setminus K_t :\ U_t(v) \geq M_t - \delta \right\}.$$

We then rank $A_t$ by normalized local mass

$$\tilde{p}_t(v) = \frac{\exp(\ell_L^t(v))}{\sum_{u \in A_t} \exp(\ell_L^t(u))}, \qquad v \in A_t,$$

and choose the minimal supplementation set

$$S_t^{\star} = \arg\min_{S \subseteq A_t} |S| \quad \text{s.t.} \quad \sum_{v \in S} \tilde{p}_t(v) \geq 1 - \epsilon,$$

where $\epsilon \in (0, 1)$ is a coverage threshold that controls how much local mass must be covered before supplementation stops.
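To make the definitions concrete, the following is a minimal NumPy sketch of one certification round: it computes $M_t$, the upper bounds $U_t(v)$, the ambiguity set $A_t$, and a smallest set covering $1-\epsilon$ of the normalized local mass. The function name, argument layout, and default values are assumptions for illustration and are not taken from the paper's code. Picking candidates greedily by descending mass is one way to obtain a minimum-cardinality covering set.

```python
import numpy as np


def certificate_step(local, remote_known, known_ids, tail_bound,
                     log_pi_l, log_pi_r, delta=0.0, eps=0.05):
    """One round: return (certified, incumbent_id, supplementation_ids).

    local        : local log-scores over the full candidate set Omega_t
    remote_known : exact remote log-scores for known_ids (same order)
    known_ids    : non-empty list of revealed candidate ids (K_t)
    tail_bound   : upper bound b_t on every unrevealed remote log-score
    """
    known_ids = np.asarray(known_ids)
    # Exact aggregated scores on K_t via LSE(a, b) = log(e^a + e^b).
    agg_known = np.logaddexp(log_pi_l + local[known_ids], log_pi_r + remote_known)
    best = int(np.argmax(agg_known))
    m_t = agg_known[best]
    incumbent = int(known_ids[best])

    # Optimistic scores U_t(v) for unrevealed candidates.
    unrevealed = np.setdiff1d(np.arange(local.size), known_ids)
    u = np.logaddexp(log_pi_l + local[unrevealed], log_pi_r + tail_bound)
    if unrevealed.size == 0 or m_t > u.max() + delta:
        return True, incumbent, np.array([], dtype=int)

    # Ambiguity set A_t and normalized local mass over it.
    amb = unrevealed[u >= m_t - delta]
    w = np.exp(local[amb] - local[amb].max())
    p = w / w.sum()

    # Smallest prefix (by descending mass) whose mass reaches 1 - eps.
    order = np.argsort(-p)
    csum = np.cumsum(p[order])
    n = min(int(np.searchsorted(csum, 1.0 - eps)) + 1, amb.size)
    return False, incumbent, amb[order[:n]]


# Five-token toy example with equal weights pi_L = pi_R = 0.5.
local = np.log(np.array([0.5, 0.3, 0.1, 0.05, 0.05]))
log_half = np.log(0.5)
certified, token, supp = certificate_step(
    local, np.log(np.array([0.6])), [1], np.log(0.2), log_half, log_half)
print(certified, token, supp)
```

In the toy call, token 1 is revealed with remote probability 0.6 and the tail bound is $\log 0.2$, so every unrevealed candidate is dominated and the incumbent is certified without further transfer.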

Each consulted step therefore proceeds through up to four stages: certificate check, token-id query, sparse-chunk supplementation, and forced completion. Algorithm [2](https://arxiv.org/html/2606.15179#algorithm2) summarizes the procedure.

Input: partial state $(K_t, b_t)$, candidate set $\Omega_t$, local scores $\ell_L^t(\cdot)$, weights $(\pi_L^t, \pi_R^t)$, query budget $k$, round budget $R_{\max}$;

Output: exact greedy token;

$r \leftarrow 0$;

while $r < R_{\max}$ do

Use $\ell_L^t(\cdot)$ and $(\pi_L^t, \pi_R^t)$ to compute $M_t$ on $K_t$ and upper bounds $U_t(v)$;

if *the certificate is valid* then

return $\arg\max_{v \in K_t} \ell_T^t(v)$;

end if

$A_t \leftarrow \{v \in \Omega_t \setminus K_t : U_t(v) \geq M_t - \delta\}$;

Rank $A_t$ by $\tilde{p}_t(v) \propto \exp(\ell_L^t(v))$;

if *query enabled and $A_t \neq \emptyset$* then

Request up to $k$ top-ranked token ids from $A_t$ and merge replies into $K_t$;

else

Request the next sparse chunk and update $(K_t, b_t)$;

end if

$r \leftarrow r + 1$;

end while

Force completion of the remaining $\Omega_t$;

return exact $\arg\max_{v \in \Omega_t} \ell_T^t(v)$;

Algorithm 2: Certificate-Guided Supplementation

The communication layer therefore requests only the remote evidence necessary to fix the current decision. Candidates outside $A_t$ are too weak to change the outcome, so their exact remote scores are never transmitted. If the certificate holds, the device commits immediately; otherwise, it queries a few ambiguity-critical token ids first and resorts to larger sparse chunks only when needed. As a result, the transmitted payload scales with the residual ambiguity rather than with the full vocabulary size.
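The control flow of Algorithm 2 can be simulated end to end in a few lines. The sketch below is an illustration under stated assumptions rather than the paper's implementation: the cloud is replaced by a full score vector held in memory, the function name and the `chunk` size are hypothetical, sparse chunks are assumed to arrive in descending remote-score order, and the tail bound is recomputed as the maximum over unrevealed positions in every round (a simplification of a bound that the cloud would send).

```python
import numpy as np


def agg(local, remote, log_pi_l, log_pi_r):
    """Aggregated score LSE(log pi_L + l_L, log pi_R + l_R)."""
    return np.logaddexp(log_pi_l + local, log_pi_r + remote)


def supplement(local, remote_full, log_pi_l, log_pi_r, init_ids,
               k=2, r_max=4, chunk=4, delta=0.0, query_enabled=True):
    """Return (greedy token, number of remote scores revealed)."""
    n = local.size
    known = np.zeros(n, dtype=bool)
    known[list(init_ids)] = True            # initial known set K_t (non-empty)
    chunk_order = np.argsort(-remote_full)  # assumed chunk delivery order

    for _ in range(r_max):
        hidden = ~known
        if not hidden.any():
            break
        b_t = remote_full[hidden].max()     # tail bound over unrevealed positions
        scores = agg(local[known], remote_full[known], log_pi_l, log_pi_r)
        m_t = scores.max()
        u = agg(local, b_t, log_pi_l, log_pi_r)

        # Certificate check: no unrevealed candidate can overtake the incumbent.
        if m_t > u[hidden].max() + delta:
            ids = np.flatnonzero(known)
            return int(ids[np.argmax(scores)]), int(known.sum())

        amb = np.flatnonzero(hidden & (u >= m_t - delta))
        if query_enabled and amb.size > 0:
            # Token-id query: reveal the k candidates with the largest local mass.
            known[amb[np.argsort(-local[amb])[:k]]] = True
        else:
            # Sparse-chunk supplementation: reveal the next chunk of remote scores.
            nxt = [i for i in chunk_order if not known[i]][:chunk]
            known[nxt] = True

    # Forced completion: fall back to the dense decision.
    full = agg(local, remote_full, log_pi_l, log_pi_r)
    return int(np.argmax(full)), n


local = np.log(np.array([0.5, 0.3, 0.1, 0.05, 0.05]))
remote = np.log(np.array([0.1, 0.6, 0.2, 0.05, 0.05]))
log_half = np.log(0.5)
print(supplement(local, remote, log_half, log_half, init_ids=[0]))
```

Starting from the known set $\{0\}$, the first round fails certification, the token-id query reveals tokens 1 and 2, and the second round certifies token 1 after three of the five remote scores have been revealed. Because the tail bound never underestimates an unrevealed score, a certified return always agrees with the dense argmax in this simulation.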

### IV-D Unified View

Waiting control and communication control can be viewed under a single lens. The waiting budget $\tau_t$ governs whether a step keeps the remote side on the critical path, while the supplementation set $S_t^{\star}$ governs how much evidence is transmitted once consultation begins. This yields the following unified objective:

$$\begin{aligned} \min_{\{\tau_t, S_t^{\star}\}} \quad & \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\lambda_1\,\mathrm{Latency}_t(\tau_t) + \lambda_2\,|S_t^{\star}| + \lambda_3\,\mathrm{Fallback}_t\right] \\ \text{s.t.} \quad & \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\xi_t - \rho_a \alpha_t\big] \leq 0. \end{aligned}$$

The objective balances three costs: waiting latency, the size of remote evidence transmitted per consultation, and the cost of fallback to full completion when certification fails. The constraint prevents long-run accumulation of unproductive waiting by bounding the expected net debt growth.

### IV-E Theoretical Analysis

Time and Communication Complexity. Dense consultation is a special case of CONCORD in which every step keeps the remote side on the critical path and communication control always falls through to forced completion. CONCORD therefore does not incur a worse worst-case cost than the dense baseline. If waiting remains worthwhile and no certificate can terminate early, the procedure degenerates to full consultation.

Along the time dimension, the waiting controller adds only constant overhead per token. At each step it reads $q_t$, compares it with $Q$, selects $\tau_t$, and updates $q_{t+1}$ from the binary outcomes $(\xi_t, \alpha_t)$, so its computation and state complexity are both $O(1)$. When remote evidence is repeatedly useful, fewer rollbacks amortize the waiting cost; when remote participation is repeatedly rejected, the controller shortens waiting and shifts the system toward local commitment.

Along the token dimension, let $|\Omega_t|$ be the size of the full remote reference representation at step $t$, $|K_t|$ the currently known subset, and $|S_t^{\star}|$ the minimal supplementation size returned by the ambiguity rule. In the worst case, repeated certificate failure triggers forced completion, so the transmitted evidence remains $O(|\Omega_t|)$, matching dense consultation. In the typical sparse case, only ambiguity-critical candidates need to be transmitted. The effective communication cost per consulted step then becomes $O(|K_t^{(0)}| + |S_t^{\star}|)$, where $|K_t^{(0)}|$ is the initial known-set size and $|S_t^{\star}|$ is the cumulative supplementation across all rounds. When $|S_t^{\star}| \ll |\Omega_t|$, the cost is governed by unresolved ambiguity rather than by the full remote score vector.

Taken together, CONCORD preserves the dense consulted\-step cost bound in the worst case while reducing average waiting and transmission whenever disagreement is sparse over time and probability mass is concentrated on a few competitive candidates\. The gain is largest when only a small fraction of steps require remote correction and, within those steps, only a few token ids can overturn the greedy decision\.

## V Experiments

### V-A Setup

Testbed. We evaluate CONCORD in a three-process dual-end deployment consisting of a device-side generation process, a cloud-side generation process, and a retrieval service. All controlled experiments first run on a single machine to isolate the effect of remote-participation control. The CPU is an Intel Xeon Platinum 8352V (dual socket, 72 physical cores / 144 threads), the memory is 125 GB, and the GPU setup is 4$\times$ RTX 4090D (24 GB). The device-side generation process runs on cuda:3, the cloud-side generation process on cuda:1, and the retrieval service on cuda:0. The software stack is Python 3.10, PyTorch 2.5.1, and Transformers 4.57.3. Both ends communicate through TCP sockets with latency injected directly into the transport path for reproducibility. Unless otherwise specified, RTT is set to 50 ms.

Datasets. We evaluate on two tasks. For answer generation, we use the Natural Questions (NQ) development set, randomly sampling 500 examples and repeating evaluation with 3 random seeds; each example generates 10 new tokens. For long-form continuation, we use the WikiText-2 test set, sampling 100 examples with the first 64 tokens as the prefix and the following 48 tokens as the target. Under teacher-forcing evaluation with a fixed model, per-example PPL variance is low, so this subset size suffices to distinguish methods that share the same aggregation target (Table [I](https://arxiv.org/html/2606.15179#S5.T1)). Mechanism-dissection experiments follow the same NQ pipeline but use 100 examples with 3 random seeds to keep repeated instrumentation affordable.

Metrics. For NQ, we report F1 and EM as quality metrics. For WikiText-2 long-form continuation, we report teacher-forcing perplexity (PPL) on the same examples used in the system experiments. For system efficiency, NQ reports end-to-end queries per second (QPS), while WikiText-2 reports generation throughput in tok/s. Communication is measured uniformly as transmitted and received bytes per generated token.
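For readers unfamiliar with these quality metrics, the sketch below shows one common way to compute them. The normalization (lowercasing, stripping punctuation and English articles) follows the widely used SQuAD-style convention and is an assumption here, since the exact evaluation script is not described; perplexity is computed as the exponential of the mean negative log-likelihood over the target tokens.

```python
import math
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, split on whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    p, g = normalize(pred), normalize(gold)
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision = common / len(p)
    recall = common / len(g)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs: list[float]) -> float:
    """Teacher-forcing perplexity: exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(f1_score("Paris France", "Paris"))
print(perplexity([math.log(0.25)] * 4))
```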

Models and baselines. We use Qwen3-1.7B as the generation model on both ends. Retrieval is always dual-end and sharded: the device side and the cloud side connect to different local retrieval instances, so they access non-overlapping evidence shards. Each query retrieves four documents in total, with two retained on the device side and two on the cloud side. We evaluate the following methods throughout the paper:

- CRCG: centralized generation augmented with retrieval from the device-side corpus only, using the context-aggregation strategy.
- DRCG: on-device generation augmented with evidence retrieved from the distributed corpus spanning both the device and the cloud, using the context-aggregation strategy.
- DRDG/TW: distributed retrieval-augmented generation using the output-aggregation strategy with token-wise synchronization during decoding.
- DRDG/SW: distributed retrieval-augmented generation using the output-aggregation strategy with sequence-wise synchronization, i.e., one-time aggregation of independently generated sequences from the device and the cloud.
- DRAGON: a speculative dual-end output-aggregation method that reduces token-wise synchronization waiting through parallel generation and scheduling.
- CONCORD: our method, which reduces remote participation through waiting debt control and certificate-guided supplementation.
- CONCORD-NoSystem: CONCORD with waiting control removed while keeping the supplementation pipeline.
- CONCORD-NoComm: CONCORD with the communication design removed while keeping waiting control.

Among these methods, DRAGON is the only existing approach that performs exact output-space aggregation under document isolation. Other edge-cloud methods such as CE-CoLLM \[[18](https://arxiv.org/html/2606.15179#bib.bib15)\] and EdgeLLM \[[36](https://arxiv.org/html/2606.15179#bib.bib17)\] target model partitioning or offloading without cross-end retrieval aggregation, so they are not directly comparable. DRDG/TW and DRDG/SW already represent two extreme synchronization strategies (per-token and per-sequence), while DRAGON adds speculative scheduling. The ablation variants NoSys and NoComm further serve as simpler alternatives: NoSys applies sparse communication without adaptive waiting, and NoComm applies adaptive waiting without sparse communication.

Implementation details. Within each end, retrieved evidence is first merged into a single intra-end representation before inter-end aggregation, so the comparison isolates coordination overhead rather than intra-end fusion details. On the communication path, CONCORD transmits compact remote evidence packets instead of full dense score vectors. A fixed-length message header is sent over TCP sockets, and payloads are serialized as binary tensors and compressed before transmission. When a step remains uncertified, the communication layer first queries ambiguity-critical token ids and resorts to sparse chunks only when necessary. The cloud retains the current-step logit vector in GPU memory and responds to each token-ID query by indexing the requested entries, so the per-query overhead is a GPU gather plus one TCP round trip. The tail bound $b_t$ is initialized to the global maximum of the remote log-score vector and is tightened after each sparse-chunk reply to the maximum over unrevealed positions. On the decoding side, generation is preemptible. Each model layer checks for a stop event so that rejected drafts can be interrupted early; the caller then rolls back KV-cache state and resumes from the corrected token.
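As an illustration of the packet format described above, the sketch below frames a sparse reply (token ids plus their remote log-scores) behind a fixed-length header and compresses the payload. The wire format is not specified in the text, so the header layout (`!BII`: message type, step index, payload length), the int32/float32 encoding, and the use of zlib are all hypothetical choices made only for this example.

```python
import struct
import zlib

import numpy as np

# Hypothetical fixed-length header: type (uint8), step (uint32), payload length (uint32).
HEADER = struct.Struct("!BII")


def pack_message(msg_type: int, step: int, ids, scores) -> bytes:
    """Serialize ids and scores as binary tensors, compress, and prepend the header."""
    ids = np.ascontiguousarray(ids, dtype="<i4")
    scores = np.ascontiguousarray(scores, dtype="<f4")
    body = struct.pack("!I", ids.size) + ids.tobytes() + scores.tobytes()
    payload = zlib.compress(body)
    return HEADER.pack(msg_type, step, len(payload)) + payload


def unpack_message(buf: bytes):
    """Inverse of pack_message: parse the header, decompress, and rebuild the arrays."""
    msg_type, step, length = HEADER.unpack_from(buf, 0)
    body = zlib.decompress(buf[HEADER.size:HEADER.size + length])
    (n,) = struct.unpack_from("!I", body, 0)
    ids = np.frombuffer(body, dtype="<i4", count=n, offset=4)
    scores = np.frombuffer(body, dtype="<f4", count=n, offset=4 + 4 * n)
    return msg_type, step, ids, scores


msg = pack_message(1, 7, [3, 42], [-0.5, -2.25])
print(len(msg), unpack_message(msg))
```

Because the header length is fixed, a receiver can read exactly `HEADER.size` bytes from the socket, learn the payload length, and then read the remainder, which keeps message boundaries unambiguous on a TCP stream.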

### V-B Main Results

Performance. We first examine whether CONCORD preserves task quality while reducing the system path. Figure [3](https://arxiv.org/html/2606.15179#S5.F3) summarizes the main NQ results. CONCORD remains matched with DRAGON in answer quality: F1 is $0.267 \pm 0.006$ vs. $0.268 \pm 0.008$, and EM is $0.167 \pm 0.005$ vs. $0.169 \pm 0.005$ (CONCORD vs. DRAGON). The gap stays within across-seed variation. Among the other output-aggregation baselines, DRDG/TW shares the same token-wise exact aggregation target and therefore achieves comparable quality, whereas DRDG/SW aggregates two independently generated sequences only once, trading per-token exactness for lower synchronization frequency. Table [I](https://arxiv.org/html/2606.15179#S5.T1) shows the same pattern on WikiText-2. Methods that share the same output-aggregation form (DRDG/TW, DRAGON, CONCORD) all achieve 17.016 PPL, while CRCG and DRCG yield higher perplexity because they use different evidence organizations.

![Refer to caption](https://arxiv.org/html/2606.15179v1/x3.png)

Figure 3: Main results on Natural Questions. All six methods are compared on answer quality (F1, EM), throughput (QPS), and communication cost (bytes/token).

TABLE I: Quality results on WikiText-2.

![Refer to caption](https://arxiv.org/html/2606.15179v1/x4.png)

Figure 4: System efficiency on WikiText-2. All six methods are compared on generation throughput (tok/s) and communication cost (bytes/token).

Efficiency. The efficiency gap is much larger. On NQ (Figure [3](https://arxiv.org/html/2606.15179#S5.F3)), CONCORD raises throughput from $0.782 \pm 0.006$ to $1.295 \pm 0.004$ QPS relative to DRAGON (about $1.66\times$), while communication drops from $253479.7 \pm 1748.7$ to $282.7 \pm 25.3$ bytes per token, a reduction of about 99.9%. Figure [4](https://arxiv.org/html/2606.15179#S5.F4) shows the same trend on WikiText-2: under the same aggregation target, throughput rises from 9.29 to 20.00 tok/s (about $2.15\times$) and per-token communication falls from 243640.3 to 160.9 bytes (again about 99.9%). Among the other baselines, DRDG/TW transmits a full dense score vector at every token and blocks until the cloud replies, yielding the highest communication cost and the lowest throughput. DRDG/SW avoids per-token synchronization but pays for it with degraded quality. CRCG and DRCG involve no inter-end score exchange, yet they cannot match the quality of output-aggregation methods. These results confirm that the same quality target can be reached at a much lower system cost when remote participation is sparsified along both dimensions.
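The headline ratios follow directly from the reported means. The short check below recomputes them from the numbers quoted in this subsection; the helper names are ours, and only the mean values (not the across-seed deviations) are used.

```python
def speedup(baseline: float, ours: float) -> float:
    """Throughput ratio of CONCORD over the baseline."""
    return ours / baseline


def reduction_pct(baseline: float, ours: float) -> float:
    """Percentage reduction in bytes per token relative to the baseline."""
    return 100.0 * (1.0 - ours / baseline)


# Natural Questions: QPS and bytes/token (DRAGON -> CONCORD).
nq_speedup = speedup(0.782, 1.295)
nq_reduction = reduction_pct(253479.7, 282.7)

# WikiText-2: tok/s and bytes/token (DRAGON -> CONCORD).
wt_speedup = speedup(9.29, 20.00)
wt_reduction = reduction_pct(243640.3, 160.9)

print(f"NQ: {nq_speedup:.2f}x throughput, {nq_reduction:.2f}% less communication")
print(f"WikiText-2: {wt_speedup:.2f}x throughput, {wt_reduction:.2f}% less communication")
```

Both communication ratios exceed a factor of 100 (roughly 897 on NQ and 1514 on WikiText-2), which is consistent with the "over two orders of magnitude" claim in the abstract.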

### V-C Ablation Studies

To isolate the contribution of each component, we conduct ablation studies on 100 NQ development examples with 3 random seeds at RTT = 50 ms. The four compared variants are DRAGON, CONCORD-NoSystem (NoSys), CONCORD-NoComm (NoComm), and CONCORD (Figure [5](https://arxiv.org/html/2606.15179#S5.F5)). All variants preserve answer quality (F1 and EM remain within across-seed variation), so the discussion below focuses on efficiency.

Waiting control. Waiting control mainly affects time-domain sparsification. Removing it pushes the remote participation ratio back to 1.000, meaning the remote side again remains on the critical path at nearly every step. Throughput then falls from 1.286 to 1.123 QPS even though communication stays low at 331.0 bytes/token. This mirrors the causal chain described in Section [IV](https://arxiv.org/html/2606.15179#S4): without debt-aware waiting, the system still performs sparse communication but keeps paying unnecessary blocking cost.

Communication control. Communication control determines how much evidence each consulted step transmits. When it is disabled, throughput decreases from 1.286 to 1.083 QPS, but the larger change is in communication cost, which rises from 261.0 to 4981.6 bytes/token, a $19.1\times$ increase. The system still avoids some ineffective waiting, but consulted steps now consume much larger remote payloads.

Unified design. Neither component alone recovers the full benefit. Relative to DRAGON, CONCORD increases throughput from 0.765 to 1.286 QPS while reducing communication from 256517.7 to 261.0 bytes/token. The two partial variants each recover only one side of this gain. NoSystem keeps communication small but leaves too much waiting on the critical path; NoComm retains some time-domain sparsity but loses the minimal-evidence advantage. The best operating point comes from combining both mechanisms.

![Refer to caption](https://arxiv.org/html/2606.15179v1/x5.png)

Figure 5: Ablation results on Natural Questions. DRAGON, NoSys, NoComm, and CONCORD are compared on throughput (QPS), remote participation ratio, and communication cost (bytes/token).

### V-D Other Analyses

Case study. Beyond aggregate throughput, CONCORD also reshapes how individual remote interactions happen. Figure [6](https://arxiv.org/html/2606.15179#S5.F6) summarizes this shift from both percentile and distributional perspectives. In the left panel, DRAGON exhibits TTFT P50/P90 of 456.6/576.7 ms versus 96.1/152.2 ms for CONCORD, while ITL P50/P90 drops from 78.5/151.6 ms to 35.2/92.0 ms. The middle and right panels show the same trend in CDF form.
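For reference, time-to-first-token (TTFT) and inter-token latency (ITL) percentiles of this kind can be derived from per-request timestamps as sketched below. The function name and the input layout (request start times and per-token arrival times in seconds) are assumptions for illustration; the measurement harness itself is not described in the text, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np


def latency_percentiles(request_start, token_times):
    """Return TTFT and ITL at P50/P90 in milliseconds.

    request_start : list of request submission times (seconds)
    token_times   : list of per-request token arrival times (seconds, ascending)
    """
    # TTFT: delay between submitting a request and receiving its first token.
    ttft = np.array([(t[0] - s) * 1e3 for s, t in zip(request_start, token_times)])
    # ITL: gaps between consecutive tokens, pooled over all requests.
    itl = np.concatenate([np.diff(np.asarray(t)) * 1e3 for t in token_times])
    return {
        "ttft_p50": float(np.percentile(ttft, 50)),
        "ttft_p90": float(np.percentile(ttft, 90)),
        "itl_p50": float(np.percentile(itl, 50)),
        "itl_p90": float(np.percentile(itl, 90)),
    }


stats = latency_percentiles(
    request_start=[0.0, 0.0],
    token_times=[[0.10, 0.15, 0.20], [0.30, 0.32, 0.36]],
)
print(stats)
```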

On the communication side, average messages per answer slightly increase from 23.24 to 28.77, yet bytes per token collapse from $2.04 \times 10^{5}$ to $2.70 \times 10^{2}$. Large coarse transfers are replaced by smaller ambiguity-targeted requests, consistent with certificate-guided supplementation.

![Refer to caption](https://arxiv.org/html/2606.15179v1/x6.png)

Figure 6: Latency comparison between DRAGON and CONCORD. Left: TTFT and ITL at P50/P90 percentiles. Middle: TTFT cumulative distribution. Right: ITL cumulative distribution.

Deployment robustness. We further validate the trend under a real two-machine deployment. The cloud machine uses an Intel Xeon Platinum 8352V CPU with an RTX 4090D GPU, while the device machine uses an Intel Core i9-10900K CPU with an RTX 2070 SUPER GPU. Although the device machine is more powerful than a typical mobile device, the setup creates meaningful compute asymmetry (the device GPU has roughly one-third the throughput of the cloud GPU), and the gains stem from reduced waiting and communication rather than absolute hardware speed. Two link conditions are tested: direct LAN (*Native*) and the same link with injected delay and jitter (*Delay+Jitter*). As Figure [7](https://arxiv.org/html/2606.15179#S5.F7) shows, the relative advantage of CONCORD persists across both tasks and both conditions. On NQ, CONCORD improves throughput by $1.66\times$ (Native) and $1.36\times$ (Delay+Jitter); on WikiText-2, the corresponding gains are $2.27\times$ and $1.85\times$. Communication drops from roughly 198k–252k bytes/token for DRAGON to 168.6–353.5 bytes/token for CONCORD across all four settings.

![Refer to caption](https://arxiv.org/html/2606.15179v1/x7.png)

Figure 7: Two-machine deployment results. DRAGON and CONCORD are compared under Native and Delay+Jitter link conditions on both NQ and WikiText-2, measured by throughput (tok/s) and communication cost (bytes/token).

Statistical analysis. To quantify the exactness boundary, Table [II](https://arxiv.org/html/2606.15179#S5.T2) reports consultation-path diagnostics from an instrumented 500-example, 3-seed run. About $21.66 \pm 1.55\%$ of generated tokens are committed locally without consultation, confirming that CONCORD does not require remote participation at every step. Among consulted steps, the remote side is often consequential: the final decision differs from the local top-1 on $22.29 \pm 1.33\%$ of them. Certificate success dominates ($99.98 \pm 0.03\%$) and forced fallback is almost absent ($0.017 \pm 0.030\%$); the residual fraction corresponds to steps still in progress at sequence termination. These numbers confirm the intended role separation: waiting control decides when consultation is worthwhile, while the communication layer remains effectively exact once consultation begins.

TABLE II: Consultation-path diagnostics.

## VI Conclusion

This paper presented CONCORD, a sparse aggregation framework for dual\-end RAG under document isolation\. By treating the cloud as an asynchronous evidence source, CONCORD couples waiting debt control with certificate\-guided supplementation so that remote participation is reduced along both the temporal and communication dimensions\. On consulted steps the greedy decision matches dense dual\-end aggregation exactly, while steps that time out commit locally without remote evidence\. Experiments on Natural Questions and WikiText\-2 showed that CONCORD preserved answer quality and perplexity, improved throughput over DRAGON by about 66% and 115%, and reduced per\-token communication by about 99\.9%\. The current study is limited to greedy decoding in a two\-end setting with a single model\. For sampling\-based decoding, the certificate mechanism can be extended to a probabilistic test that bounds the distance between the sparse and dense output distributions\. For multi\-node topologies, the debt controller can operate independently on each device\-cloud link, since it relies only on local waiting outcomes\. More broadly, the asynchronous sparse participation model may apply to other collaborative inference services where partial remote evidence arrives with variable delay\.

## References

- \[1\] (2024) Taming throughput-latency tradeoff in llm inference with sarathi-serve. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pp. 117–134.
- \[2\] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024) Self-rag: learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.
- \[3\] N. Bhendawade, I. Belousova, Q. Fu, H. Mason, M. Rastegari, and M. Najibi (2024) Speculative streaming: fast llm inference without auxiliary models. arXiv preprint arXiv:2402.11131.
- \[4\] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. Van Den Driessche, J. Lespiau, B. Damoc, A. Clark, et al. (2022) Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pp. 2206–2240.
- \[5\] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao (2024) Medusa: simple llm inference acceleration framework with multiple decoding heads. In International Conference on Machine Learning, pp. 5209–5235.
- \[6\] C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, and J. Jumper (2023) Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.
- \[7\] Y. Chen, Z. Niu, M. Roveri, and G. Casale (2025) Ceed: collaborative early exit neural network inference at the edge. In IEEE INFOCOM 2025-IEEE Conference on Computer Communications, pp. 1–10.
- \[8\] Y. Cheng, A. Zhang, X. Zhang, C. Wang, and Y. Wang (2024) Recurrent drafter for fast speculative decoding in large language models. arXiv preprint arXiv:2403.09919.
- \[9\] A. E. Eshratifar, M. S. Abrishami, and M. Pedram (2021) JointDNN: an efficient training and inference engine for intelligent mobile cloud computing services. IEEE transactions on mobile computing 20 (2), pp. 565–576.
- \[10\] Z. Feng, L. Lu, Q. Li, Y. Chai, Z. Zhang, Y. Zhang, Y. Teng, and D. Guo (2025) Distributed inference optimization for large language model in edge-cloud collaborative networks. In ICC 2025-IEEE International Conference on Communications, pp. 6161–6166.
- \[11\] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) Retrieval augmented language model pre-training. In International conference on machine learning, pp. 3929–3938.
- \[12\] L. Hu, G. Sun, and Y. Ren (2020) CoEdge: exploiting the edge-cloud collaboration for faster deep learning. IEEE Access 8, pp. 100533–100541.
- \[13\] K. Huang, H. Wu, Z. Shi, H. Zou, M. Yu, and Q. Shi (2025) AdaSpec: adaptive speculative decoding for fast, slo-aware large language model serving. In Proceedings of the 2025 ACM Symposium on Cloud Computing, pp. 361–374.
- \[14\] Y. Huang, X. Qiao, P. Ren, L. Liu, C. Pu, S. Dustdar, and J. Chen (2022) A lightweight collaborative deep neural network for the mobile web in edge cloud. IEEE Transactions on Mobile Computing 21 (7), pp. 2289–2305.
- \[15\] G. Izacard and E. Grave (2021) Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume, pp. 874–880.
- \[16\] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave (2023) Atlas: few-shot learning with retrieval augmented language models. Journal of Machine Learning Research 24 (251), pp. 1–43.
- \[17\] C. Jin, Z. Zhang, X. Jiang, F. Liu, S. Liu, X. Liu, and X. Jin (2025) Ragcache: efficient knowledge caching for retrieval-augmented generation. ACM Transactions on Computer Systems 44 (1), pp. 1–27.
- \[18\] H. Jin and Y. Wu (2025) Ce-collm: efficient and adaptive large language models through cloud-edge collaboration. In 2025 IEEE International Conference on Web Services (ICWS), pp. 316–323.
- \[19\] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang (2017) Neurosurgeon: collaborative intelligence between the cloud and mobile edge. ACM SIGARCH Computer Architecture News 45 (1), pp. 615–629.
- \[20\] U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis (2020) Generalization through memorization: nearest neighbor language models. In International Conference on Learning Representations.
- \[21\] Y. Leviathan, M. Kalman, and Y. Matias (2023) Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286.
- \[22\] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33, pp. 9459–9474.
- \[23\] E. Li, Z. Zhou, and X. Chen (2018) Edge intelligence: on-demand deep learning model co-inference with device-edge synergy. In Proceedings of the 2018 workshop on mobile edge communications, pp. 31–36.
- \[24\] Y. Li, J. Guo, Z. Tang, X. Ding, J. Wang, T. Wang, and W. Jia (2025) Cloud-edge system for scheduling unpredictable llm requests with combinatorial bandit. IEEE Transactions on Services Computing.
- \[25\] Y. Li, F. Wei, C. Zhang, and H. Zhang (2024) Eagle: speculative sampling requires rethinking feature uncertainty. In International Conference on Machine Learning, pp. 28935–28948.
- \[26\] S. Liu, Z. Zheng, X. Huang, F. Wu, G. Chen, and J. Wu (2025) DRAGON: enhancing on-device model performance with distributed retrieval-augmented generation. In Proceedings of the Twenty-sixth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pp. 221–230.
- \[27\] D. Ma, Y. Wang, and L. Tian (2025) Block-attention for efficient prefilling. In International Conference on Learning Representations.
- \[28\] F. Mou, Z. Tang, W. Jia, and W. Zhao (2026) Adaptive request scheduling and load balancing for edge deployed large language models. IEEE Transactions on Services Computing.
- \[29\] A. Narayan, D. Biderman, S. Eyuboglu, A. May, S. Linderman, J. Zou, and C. Re (2025) Cost-efficient collaboration between on-device and cloud language models. In Forty-second International Conference on Machine Learning.
- \[30\] L. Nkenyereye, K. Baeg, and W. Chung (2023) Deep reinforcement learning for containerized edge intelligence inference request processing in iot edge computing. IEEE Transactions on Services Computing 16 (6), pp. 4328–4344.
- \[31\] P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini (2024) Splitwise: efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 118–132.
- \[32\] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W. Yih (2024) Replug: retrieval-augmented black-box language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8371–8384.
- \[33\] M. Sun and Z. Zhou (2020) IoT services configuration in edge-cloud collaboration networks. In 2020 IEEE International Conference on Web Services (ICWS), pp. 468–472.
- \[34\]X\. Wang, Z\. Tang, J\. Guo, T\. Meng, C\. Wang, T\. Wang, and W\. Jia\(2025\)Empowering edge intelligence: a comprehensive survey on on\-device ai models\.ACM Computing Surveys57\(9\),pp\. 1–39\.Cited by:[§I](https://arxiv.org/html/2606.15179#S1.p1.1)\.
- \[35\]Z\. Xiang, S\. Deng, Y\. Zheng, D\. Wang, J\. Tehari, and Z\. Zheng\(2021\)Energy\-effective iot services in balanced edge\-cloud collaboration systems\.In2021 IEEE International Conference on Web Services \(ICWS\),pp\. 219–229\.Cited by:[§II](https://arxiv.org/html/2606.15179#S2.p2.1)\.
- \[36\]D\. Xu, W\. Yin, H\. Zhang, X\. Jin, Y\. Zhang, S\. Wei, M\. Xu, and X\. Liu\(2025\)Edgellm: fast on\-device llm inference with speculative decoding\.IEEE Transactions on Mobile Computing24\(4\),pp\. 3256–3273\.Cited by:[§I](https://arxiv.org/html/2606.15179#S1.p1.1),[§II](https://arxiv.org/html/2606.15179#S2.p2.1),[§V\-A](https://arxiv.org/html/2606.15179#S5.SS1.p4.2)\.
- \[37\]Z\. Yao, Z\. Tang, J\. Lou, P\. Shen, and W\. Jia\(2024\)VELO: a vector database\-assisted cloud\-edge collaborative llm qos optimization framework\.In2024 IEEE International Conference on Web Services \(ICWS\),pp\. 865–876\.Cited by:[§I](https://arxiv.org/html/2606.15179#S1.p1.1),[§II](https://arxiv.org/html/2606.15179#S2.p2.1)\.
- \[38\]Z\. Yao, Z\. Tang, W\. Yang, and W\. Jia\(2025\)Enhancing llm qos through cloud\-edge collaboration: a diffusion\-based multi\-agent reinforcement learning approach\.IEEE Transactions on Services Computing18\(3\),pp\. 1412–1427\.Cited by:[§II](https://arxiv.org/html/2606.15179#S2.p2.1)\.
- \[39\]M\. Zhao, J\. Shi, Z\. Zhang, Y\. Ling, G\. Zhu, D\. Zhao, and H\. Ma\(2025\)C 2 f: enabling context\-aware edge\-cloud collaborative inference for foundation models\.InIEEE INFOCOM 2025\-IEEE Conference on Computer Communications,pp\. 1–10\.Cited by:[§II](https://arxiv.org/html/2606.15179#S2.p2.1)\.
- \[40\]Y\. Zhong, S\. Liu, J\. Chen, J\. Hu, Y\. Zhu, X\. Liu, X\. Jin, and H\. Zhang\(2024\)DistServe: disaggregating prefill and decoding for goodput\-optimized large language model serving\.In18th USENIX Symposium on Operating Systems Design and Implementation \(OSDI 24\),pp\. 193–210\.Cited by:[§II](https://arxiv.org/html/2606.15179#S2.p3.1)\.
- \[41\]Z\. Zhou, X\. Chen, E\. Li, L\. Zeng, K\. Luo, and J\. Zhang\(2019\)Edge intelligence: paving the last mile of artificial intelligence with edge computing\.Proceedings of the IEEE107\(8\),pp\. 1738–1762\.Cited by:[§I](https://arxiv.org/html/2606.15179#S1.p1.1),[§II](https://arxiv.org/html/2606.15179#S2.p2.1)\.
