CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

arXiv cs.AI Papers

Summary

CoMIC is a cloud-edge framework for LLM agents that uses collaborative memory and insight circulation to improve long-horizon task performance without requiring parameter updates, achieving gains in progress rate and action grounding across multiple tasks.

arXiv:2606.00756v1 Announce Type: new Abstract: Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource-constrained edge models often struggle with long-horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine-tuning edge models after deployment is costly and difficult to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and growing prompt context. We propose CoMIC, a parameter-update-free cloud-edge framework for Collaborative Memory and Insights Circulation. CoMIC follows a Centralized Reflection, Decentralized Execution design: edge agents execute locally using subgoal-oriented hierarchical memory and selective re-expansion of relevant histories, while a cloud-side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross-agent guidance keyed by semantic subgoal identifiers. Across five long-horizon agent tasks spanning symbolic planning and text interaction, CoMIC improves progress rate and action grounding for weak edge agents and yields task-dependent success-rate gains without updating model parameters.

Cached at: 06/02/26, 03:48 PM

# CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems
Source: [https://arxiv.org/html/2606.00756](https://arxiv.org/html/2606.00756)
Yannan Wang (Beijing Jiaotong University, yannanwang@bjtu.edu.cn), Longli Yang (Beijing Jiaotong University, longli_yang@163.com), Zhen Liu (Beijing Jiaotong University, zhliu@bjtu.edu.cn), Abhishek Kumar (The Alan Turing Institute, akumar@turing.ac.uk), Carsten Maple (University of Warwick, CM@warwick.ac.uk)

Author notes: Equal contribution. Visiting PhD student at WMG, University of Warwick. Corresponding author. Also with the Alan Turing Institute, London, UK.

###### Abstract

Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource-constrained edge models often struggle with long-horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine-tuning edge models after deployment is costly and difficult to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and growing prompt context. We propose CoMIC, a parameter-update-free cloud-edge framework for Collaborative Memory and Insights Circulation. CoMIC follows a Centralized Reflection, Decentralized Execution design: edge agents execute locally using subgoal-oriented hierarchical memory and selective re-expansion of relevant histories, while a cloud-side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross-agent guidance keyed by semantic subgoal identifiers. Across five long-horizon agent tasks spanning symbolic planning and text interaction, CoMIC improves progress rate and action grounding for weak edge agents and yields task-dependent success-rate gains without updating model parameters.

## 1 Introduction

While Large Language Model (LLM)-based agents have demonstrated significant potential in autonomous decision-making \[[2](https://arxiv.org/html/2606.00756#bib.bib3), [24](https://arxiv.org/html/2606.00756#bib.bib33)\], their reliance on classic memory paradigms, which concatenate entire interaction histories into prompts, becomes highly inefficient for long-horizon tasks due to context explosion and high token consumption \[[19](https://arxiv.org/html/2606.00756#bib.bib31), [6](https://arxiv.org/html/2606.00756#bib.bib9)\]. Recent studies mitigate this by decomposing memory into cross-trial and in-trial (working) memories to improve utilization efficiency \[[17](https://arxiv.org/html/2606.00756#bib.bib27), [23](https://arxiv.org/html/2606.00756#bib.bib36), [7](https://arxiv.org/html/2606.00756#bib.bib11)\]. However, maintaining and retrieving these hierarchical memories introduces considerable computational and storage overheads. Consequently, such architectures are primarily designed for resource-rich environments and are often unsuitable for deployment on resource-constrained platforms.

Deploying lightweight LLM agents on edge servers has emerged as a practical way to bring autonomous capabilities closer to users \[[11](https://arxiv.org/html/2606.00756#bib.bib20)\]. However, as illustrated in Appendix [B](https://arxiv.org/html/2606.00756#A2) (Figure [4](https://arxiv.org/html/2606.00756#A2.F4)), edge deployment is constrained by limited computation and context capacity and often compromises reasoning in complex, long-horizon tasks \[[20](https://arxiv.org/html/2606.00756#bib.bib30)\]. While continuously fine-tuning models based on task experience is a conventional approach, performing frequent parameter updates across highly diverse tasks and heterogeneous edge nodes incurs substantial computational overhead, making it difficult to scale in practice \[[21](https://arxiv.org/html/2606.00756#bib.bib37)\]. Therefore, optimizing text-based memory interaction mechanisms without parameter updates represents a more scalable alternative \[[22](https://arxiv.org/html/2606.00756#bib.bib32)\]. Nevertheless, purely local execution restricts resource-bounded edge agents to isolated experiences, without the cross-agent sharing and unified modeling needed to support complex tasks efficiently.

The constraints faced by edge-deployed agents closely parallel a fundamental strategy in human cognition \[[15](https://arxiv.org/html/2606.00756#bib.bib25)\]: under limited attention and time, individuals often rely on lightweight heuristics for immediate action while reserving deeper deliberation for offline review. Inspired by this observation, and adhering to the paradigm of not updating model parameters, we propose a cloud-edge collaborative framework for memory-enhanced edge LLM agents, Collaborative Memory and Insights Circulation (CoMIC). Operating under a "Centralized Reflection, Decentralized Execution" paradigm, this framework offloads the computationally intensive tasks of cross-trial long-term memory maintenance and complex logical reflection to the cloud. Meanwhile, edge nodes maintain only lightweight, selectively expandable memories for immediate decision-making, which enables progress on complex long-horizon tasks within resource-constrained environments. In summary, our contributions are as follows.

- To the best of our knowledge, CoMIC is the first collaborative framework designed to enhance the long-horizon decision-making capabilities of lightweight edge LLM agents. Edge agents execute decisions autonomously, driven by subgoals. They employ an asynchronous trajectory upload mechanism to ensure unblocked independent execution, and dynamically combine local hierarchical memory with global guidance from the cloud.
- The cloud LLM serves as a global critic to achieve cross-edge experience summarization. It independently evaluates individual trajectories asynchronously uploaded by edge agents to distill high-quality experiences through reflection. Subsequently, by utilizing the semantic identifiers of subtasks for cross-edge indexing, it critiques evaluated experiences of compatible subgoals from different edge agents, thereby obtaining selected global guidance for matching edge contexts.
- Extensive experiments on multiple long-horizon decision-making benchmarks demonstrate that CoMIC outperforms state-of-the-art memory-augmented baselines. It significantly improves task success rates and execution capabilities while effectively bounding context token consumption, all without requiring any parameter updates.

## 2 Preliminary

### 2.1 Task Setting

We consider a set of edge agents $\mathcal{E}=\{e^{1},\cdots,e^{N}\}$ deployed in resource-constrained environments. For a given global task $g$, each edge agent interacts with the environment through a sequence of decision steps. Because long-horizon tasks typically require multiple dependent decisions, the execution process is organized as a sequence of subgoal episodes rather than a flat history.

At step $t$, the edge agent $e^{i}$ maintains the current subgoal $g_{t}^{i}$, receives an observation $o_{t}^{i}$, executes an action $a_{t}^{i}$, and obtains the corresponding outcome $r_{t}^{i}$. The interaction trajectory of this subgoal episode $t$ is defined as

$$\hat{\tau}^{i}_{t}=(g,g_{t}^{i},o_{t}^{i},a_{t}^{i},r_{t}^{i}), \tag{1}$$

where $t\in\{1,\dots,T_{g}\}$, and $T_{g}$ denotes the total number of subgoal episodes into which the global task $g$ can be divided. This formulation models the execution of long-horizon tasks as a structured progression of subgoals, providing the fundamental unit for subsequent memory organization and cloud coordination.
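As an illustration of Eq. (1), the subgoal episode can be written down as a small record type. This is only a sketch: the class name `SubgoalEpisode`, the string-typed fields, and the Blocksworld-style example values are assumptions made for illustration and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubgoalEpisode:
    """One subgoal episode (g, g_t, o_t, a_t, r_t) of a single edge agent."""
    task: str         # global task g
    subgoal: str      # active subgoal g_t^i
    observation: str  # observation o_t^i
    action: str       # action a_t^i
    outcome: str      # outcome r_t^i


# Hypothetical example data: the ordered episodes t = 1..T_g of one task.
trajectory: List[SubgoalEpisode] = [
    SubgoalEpisode("stack A on B", "clear B", "C is on B", "unstack C B", "holding C"),
    SubgoalEpisode("stack A on B", "place A on B", "B is clear", "stack A B", "A is on B"),
]
T_g = len(trajectory)  # total number of subgoal episodes for this task
```

Keeping the global task inside every episode mirrors the tuple in Eq. (1) and lets a single episode be uploaded and interpreted on its own.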

### 2.2 Memory-Centric Cloud-Edge Paradigm

Within this setting, the edge layer is responsible for online execution\. Each edge agent maintains a working memory that preserves the local context required for immediate decision\-making\. This context includes the active subgoal and the recent interaction history most relevant to the current step\. The cloud layer operates outside this execution loop, enhancing future decisions through reusable memory derived from past experiences\.

##### Trajectory Evaluation

The first mechanism evaluates a completed trajectory or subgoal episode\. Through this mechanism, the cloud analyzes the local execution record and produces trajectory\-grounded reflections tied to the corresponding context\. This pathway supports cloud\-side evidence admission and later reuse without interrupting online execution at the edge\.

##### Global Guidance

The second feedback mechanism operates at a global level. Instead of focusing solely on a single trajectory, it distills reusable guidance from admitted experiences accumulated across compatible agents and tasks, generating higher-level knowledge applicable beyond a single episode. Consequently, the edge focuses on timely actions, while the cloud exposes selected Global Guidance as the single advisory channel for future decisions. The subsequent section instantiates this memory-centric cloud-edge paradigm within CoMIC.

## 3 Methodology

In this section, we formally present CoMIC, a memory-centric cloud-edge collaborative framework. This framework improves the long-horizon decision-making capabilities of lightweight edge LLM agents by summarizing cross-agent experiences in the cloud and returning selected guidance to the edge agents.

### 3.1 Overview of System Design

Figure [1](https://arxiv.org/html/2606.00756#S3.F1) illustrates the workflow of CoMIC. Edge nodes handle local decision-making and environment interaction, while the cloud performs experience reflection and cross-edge knowledge aggregation. This design enables asynchronous execution at the edge with cloud-assisted global reflection.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x1.png)

Figure 1: System architecture and workflow of CoMIC. The edge organizes long-horizon tasks into subgoal episodes, interacts with the environment, and maintains hierarchical local memory. Completed trajectories are uploaded asynchronously to the cloud for evaluation and aggregation. The cloud uses trajectory-level reflections for evidence admission and returns selected Global Guidance as the single advisory channel for later episodes without interrupting ongoing execution.

The workflow initiates at the end layer with a task request issued by a user. At the edge, the local agent organizes this request as a sequence of subgoal episodes, interacts with the environment, and records the resulting traces into structured trajectories. To ensure real-time responsiveness, trajectories are uploaded asynchronously without blocking local execution.
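One common way to realize a non-blocking upload of this kind is a background worker that drains a queue. The sketch below assumes that pattern and is not the paper's implementation: the names `upload_queue`, `uploader`, and `received` are hypothetical, and appending to a list stands in for the actual network transfer to the cloud.

```python
import queue
import threading

upload_queue = queue.Queue()  # unbounded, so put() never blocks the agent
received = []                 # stand-in for what the cloud side receives


def uploader():
    """Background worker: forwards completed episodes until it sees None."""
    while True:
        item = upload_queue.get()
        if item is None:          # sentinel that stops the worker
            upload_queue.task_done()
            break
        received.append(item)     # placeholder for the real upload call
        upload_queue.task_done()


worker = threading.Thread(target=uploader, daemon=True)
worker.start()

# The edge agent enqueues each finished episode and continues immediately.
for episode in ["episode-1", "episode-2"]:
    upload_queue.put(episode)

upload_queue.put(None)  # shut down once the task is finished
upload_queue.join()     # wait for pending uploads (only at shutdown)
worker.join()
```

The edge decision loop only pays for an in-memory `put`, so a slow or unreachable cloud does not stall local execution.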

Upon receiving the uploaded trajectories, the cloud processes them using an LLM acting as a global critic\. The critic evaluates trajectory\-level evidence and aggregates admitted experiences into reusable global guidance\.

Finally, selected Global Guidance is returned to the edge and asynchronously assembled into later decision prompts to improve subsequent decisions. Detailed designs of the edge memory framework, the cloud critic, and the dynamic collaboration mechanism are presented in §[3.2](https://arxiv.org/html/2606.00756#S3.SS2), §[3.3](https://arxiv.org/html/2606.00756#S3.SS3), and §[3.4](https://arxiv.org/html/2606.00756#S3.SS4), respectively.

### 3.2 Edge Framework

The edge agent of CoMIC follows a plan-then-execute mode for local tasks, built on subgoal-oriented working memory. It keeps the key interactive steps entirely on the edge and offloads global information processing to the cloud.

#### 3.2.1 Subgoal-Driven Execution and Trajectory

For a global task $g$, the edge agent $e^{i}$ decomposes it into subgoal-driven episodes. By strictly aligning each local decision with the active subgoal, the agent structurally reduces the reasoning burden on the edge. Formally, CoMIC represents the global trajectory $\tau_{g}^{i}$ as an ordered set of all subgoal episodes:

$$\tau_{g}^{i}=\{\hat{\tau}_{1}^{i},\hat{\tau}_{2}^{i},\dots,\hat{\tau}_{T_{g}}^{i}\}, \tag{2}$$

where each $\hat{\tau}_{t}^{i}$ records the interaction experiences associated with a specific subgoal.

#### 3.2.2 Hierarchical Memory and Summarizer

The edge agent's local memory employs a hierarchical structure. For the currently active episode, it maintains detailed action-observation pairs to inform precise local decisions. For completed earlier episodes, trajectory segments are compressed into abstract summaries when suitable; in compact symbolic PDDL domains such as Blocksworld and Gripper, concise predicate-level states are preserved to avoid losing object grounding. To retrieve detailed trajectories without overflowing context limits, the system reconstructs historical contexts via selective re-expansion. Formally, given an index set $\mathcal{I}$ of previously summarized subgoals relevant to the current decision, the agent reconstructs the historical context $H_{t}^{i}$:

$$H_{t}^{i}(\mathcal{I})=\left\{\text{Summary}(\hat{\tau}_{t}^{i})\;\middle|\;t\notin\mathcal{I}\right\}\cup\left\{\hat{\tau}_{t}^{i}\;\middle|\;t\in\mathcal{I}\right\} \tag{3}$$

Completed subgoals are compressed into summaries, while reusable cloud guidance is kept as a separate prompt-level advisory signal.
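The selective re-expansion in Eq. (3) can be sketched as follows. The function name `reconstruct_history`, the toy summarizer `first_clause`, and the example episode strings are hypothetical; in the paper the summaries come from the edge agent's summarizer rather than from string truncation.

```python
from typing import Callable, Dict, List, Set


def reconstruct_history(
    episodes: Dict[int, str],
    expand: Set[int],
    summarize: Callable[[str], str],
) -> List[str]:
    """Full text for episode indices in `expand`, summaries for all others."""
    return [
        episodes[t] if t in expand else summarize(episodes[t])
        for t in sorted(episodes)  # keep episodes in execution order
    ]


def first_clause(text: str) -> str:
    # Toy summarizer (assumption): keep only the text before the first ';'.
    return text.split(";")[0]


episodes = {
    1: "clear B; unstack C B; holding C",
    2: "free hand; put-down C; hand empty",
}
# Index set I = {2}: episode 2 is re-expanded, episode 1 stays summarized.
history = reconstruct_history(episodes, expand={2}, summarize=first_clause)
```

With an empty index set the context contains summaries only, which is what keeps the prompt bounded as the number of completed subgoals grows.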

### 3.3 LLM-based Cloud Critic Framework

The Cloud Critic operates asynchronously outside the immediate execution loop of the current edge episode, directly processing the uploaded interaction trajectories into structured reflections and reusable knowledge to guide future decisions at the edge.

#### 3.3.1 Single-Trajectory Canonicalization and Evaluation

Cloud evaluation operates on individual subgoal trajectories independently. Uploaded trajectories are buffered in short-term memory without blocking edge execution. The immediate input to the cloud is therefore a single trajectory $\hat{\tau}_{t}^{i}$ coupled with its contextual features (e.g., task and subgoal identifiers), which let the cloud identify the subgoal episode to which the trajectory belongs.

Before evaluation, the contextual features of each trajectory are canonicalized into unified semantic identifiers to align heterogeneous edge inputs.

After canonicalization, the cloud constructs a trajectory-level textual record by combining the episode trajectory with its normalized features:

$$x_{t}^{i}=\mathrm{Serialize}\!\left(\hat{\tau}_{t}^{i},\tilde{m}_{t}^{i}\right), \tag{4}$$

where $\tilde{m}_{t}^{i}$ denotes the canonicalized feature set and $x_{t}^{i}$ is the serialized evaluation record. This serialization formats the completed episode into a standard representation for cloud evaluation.

The cloud critic performs trajectory-level evaluation on $x_{t}^{i}$. To avoid redundant LLM invocations, previously processed trajectories are identified through cache identity. When no exact match is found, related prior experiences and synthesized global insights enrich the evaluation context, enabling the critic to assess the trajectory with reference to accumulated knowledge. The cloud critic prompt is shown in Fig. [2(a)](https://arxiv.org/html/2606.00756#S3.F2.sf1).
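The paper does not specify how the cache identity is computed. One plausible sketch, under the assumption that the identity is a hash of the serialized record from Eq. (4), is shown below. The JSON layout, the SHA-256 key, the `CachedCritic` class, and the stub evaluator are all illustrative choices rather than the authors' implementation.

```python
import hashlib
import json
from typing import Callable, Dict


def serialize(trajectory: dict, features: dict) -> str:
    """Deterministic text record x = Serialize(trajectory, canonical features)."""
    return json.dumps({"trajectory": trajectory, "features": features}, sort_keys=True)


def cache_key(record: str) -> str:
    # Assumed cache identity: SHA-256 digest of the serialized record.
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


class CachedCritic:
    """Wraps an evaluation function so identical records are evaluated once."""

    def __init__(self, evaluate: Callable[[str], dict]):
        self._evaluate = evaluate
        self._cache: Dict[str, dict] = {}
        self.calls = 0  # number of real (uncached) evaluations

    def __call__(self, record: str) -> dict:
        key = cache_key(record)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._evaluate(record)
        return self._cache[key]


# Stub evaluator standing in for the LLM critic call.
critic = CachedCritic(lambda record: {"summary": "stub", "self_reported_confidence": 0.5})
x = serialize(
    {"subgoal": "clear B", "action": "unstack C B"},
    {"task_id": "blocksworld", "subgoal_id": "clear_block"},
)
first = critic(x)
second = critic(x)  # served from the cache, no second evaluation
```

Sorting keys during serialization matters here: two uploads with the same content but different field order should map to the same cache entry.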

#### 3.3.2 Critic-Guided Experience Distillation

Based on the evaluation of the trajectory, the Cloud Critic further acts as a gated reflection module. The evaluation output $C_{t}^{i}$ forms a reflective record comprising a summary, insights, suggestions, and a self-reported confidence score $s_{\mathrm{self}}$ from the LLM. To assess the actual reliability of this reflection, we define an admission score for each evaluated record:

$$s_{\mathrm{adm}}(x_{t}^{i})=\mathrm{clip}_{[0,1]}\left(s_{\mathrm{self}}-\lambda_{\mathrm{ctx}}\,\mathbb{I}_{\mathrm{retry}}-\lambda_{\mathrm{short}}\,\mathbb{I}_{\mathrm{short}}\right), \tag{5}$$

where $s_{\mathrm{self}}$ denotes the self-reported confidence produced by the LLM, $\mathbb{I}_{\mathrm{retry}}$ indicates that the evaluation relied on a retry with retrieved context, $\mathbb{I}_{\mathrm{short}}$ indicates that the trajectory is too short to provide sufficient evidence, and $\lambda_{\mathrm{ctx}}$ and $\lambda_{\mathrm{short}}$ weight the two corresponding penalties.
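Eq. (5) translates directly into a few lines of code. In the sketch below the function name `admission_score` and the default penalty weights `lam_ctx=0.2` and `lam_short=0.3` are placeholders chosen for illustration; the values used by the authors are not given in this section.

```python
def admission_score(
    s_self: float,
    used_retry: bool,
    too_short: bool,
    lam_ctx: float = 0.2,    # hypothetical penalty for a retry with retrieved context
    lam_short: float = 0.3,  # hypothetical penalty for a too-short trajectory
) -> float:
    """Clip(s_self - lam_ctx * I_retry - lam_short * I_short) to [0, 1]."""
    raw = s_self - lam_ctx * float(used_retry) - lam_short * float(too_short)
    return min(1.0, max(0.0, raw))


clean = admission_score(0.9, used_retry=False, too_short=False)
retried = admission_score(0.9, used_retry=True, too_short=False)
weak = admission_score(0.2, used_retry=True, too_short=True)
```

The two indicator terms can only lower the score, so the gate is never more optimistic than the critic's own self-reported confidence.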

#### 3.3.3 Memory Organization

Memory within the cloud is organized into two components: a short-term buffering layer and a cloud knowledge base.

##### Short-term Memory

The cloud maintains a short-term buffer for recently uploaded trajectories and intermediate evaluations:

$$\mathcal{M}_{\mathrm{STM}}=\mathcal{T}_{\mathrm{STM}}\cup\mathcal{E}_{\mathrm{STM}}, \tag{6}$$

where $\mathcal{T}_{\mathrm{STM}}$ contains recently uploaded subgoal episodes $\hat{\tau}_{t}^{i}$ together with their basic metadata, and $\mathcal{E}_{\mathrm{STM}}$ contains their temporary evaluation records for a short duration.

##### Cloud Knowledge Base

The cloud knowledge base comprises experience memory and global guidance. For experience memory, when a trajectory's admission score $s_{\mathrm{adm}}$ satisfies the threshold criteria, its evaluation is distilled into generalized experience units $\mathcal{M}_{\mathrm{exp}}=\left\{\langle O,A,R\rangle\right\}$, where each tuple records an observation, an action, and the outcome. The admitted experience set is defined as

$$\mathcal{M}_{\mathrm{exp}}^{+}=\left\{m\in\mathcal{M}_{\mathrm{exp}}\;\middle|\;s_{\mathrm{adm}}(m)\geq\gamma_{\mathrm{kb}}\right\}, \tag{7}$$

where $\gamma_{\mathrm{kb}}$ denotes the knowledge-base admission threshold.

These experiences form the foundation for experience aggregation in the cloud. For a grouping identifier $u$ of a given subgoal episode, the system extracts the corresponding experience group by indexing the admitted experience set $\mathcal{M}_{\mathrm{exp}}^{+}$ within the experience memory:

$$\mathcal{G}(u)=\left\{m\in\mathcal{M}_{\mathrm{exp}}^{+}\;\middle|\;\kappa(m)=u\right\}, \tag{8}$$

where $\kappa(m)$ denotes the grouping identity associated with experience item $m$. The retrieved group $\mathcal{G}(u)$ supplies the historical experiences used to enrich the evaluation context and also forms the foundation for subsequent experience aggregation in the cloud.
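Eqs. (7) and (8) amount to a threshold filter followed by a group-by on the identifier. A minimal sketch, assuming each experience carries its admission score and grouping identity as plain fields, could look like this. The `Experience` class, the threshold value 0.6, and the example items are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Experience:
    observation: str  # O
    action: str       # A
    outcome: str      # R
    s_adm: float      # admission score of the source record
    group_id: str     # grouping identity kappa(m)


def admit(memory: List[Experience], gamma_kb: float) -> List[Experience]:
    """Eq. (7): keep experiences whose admission score reaches the threshold."""
    return [m for m in memory if m.s_adm >= gamma_kb]


def group_by_identifier(admitted: List[Experience]) -> Dict[str, List[Experience]]:
    """Eq. (8): index admitted experiences by their grouping identity."""
    groups: Dict[str, List[Experience]] = defaultdict(list)
    for m in admitted:
        groups[m.group_id].append(m)
    return dict(groups)


memory = [
    Experience("C is on B", "unstack C B", "holding C", 0.8, "clear_block"),
    Experience("hand full", "pick-up A", "action failed", 0.3, "clear_block"),
    Experience("B is clear", "stack A B", "A is on B", 0.9, "stack_block"),
]
groups = group_by_identifier(admit(memory, gamma_kb=0.6))  # 0.6 is a placeholder
```

Filtering before grouping means low-confidence reflections never reach the aggregation step, regardless of how many agents produced them.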

The guidance store is formally defined as:

$$\mathcal{M}=\mathcal{M}_{\mathrm{manual}}\cup\mathcal{M}_{\mathrm{global}}. \tag{9}$$

Here, $\mathcal{M}_{\mathrm{manual}}$ represents predefined general task rules, while $\mathcal{M}_{\mathrm{global}}$ represents reusable guidance generated by aggregating the experience groups above.

### 3.4 Cloud-Edge Collaboration Mechanism

The cloud-edge collaboration mechanism operates in two stages: cross-edge experience aggregation in the cloud and collaborative context assembly at the edge agent.

#### 3.4.1 Cross-Edge Aggregation

The cloud periodically aggregates admitted experiences to synthesize global guidance:

$$G(u)=f_{\mathrm{agg}}\!\left(\mathcal{G}(u)\right), \tag{10}$$

where the identifier $u$ prioritizes exact semantic subgoal matching, falling back to task-level alignment when necessary.

The synthesized guidance $G(u)$ comprises a reusable execution summary, empirical insights, prospective suggestions, and an aggregated credibility score. It is committed to the global knowledge base $\mathcal{M}_{\mathrm{global}}$ and dynamically dispatched to edge agents addressing matching subgoals.
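The matching rule described above (exact subgoal match first, task-level fallback second) can be illustrated with a simple two-step lookup. The key layout `(task_id, subgoal_id)`, the use of `None` for task-level entries, and the example guidance strings are assumptions made for this sketch, not the paper's storage format.

```python
from typing import Dict, Optional, Tuple

GuidanceKey = Tuple[str, Optional[str]]  # (task id, subgoal id or None)


def select_guidance(
    store: Dict[GuidanceKey, str], task_id: str, subgoal_id: str
) -> Optional[str]:
    """Prefer guidance for the exact subgoal, else fall back to the task level."""
    exact = store.get((task_id, subgoal_id))
    if exact is not None:
        return exact
    return store.get((task_id, None))  # None when no guidance exists at all


store: Dict[GuidanceKey, str] = {
    ("blocksworld", "clear_block"): "Unstack the top block before moving the target.",
    ("blocksworld", None): "Check valid actions after any failed action.",
}
exact_hit = select_guidance(store, "blocksworld", "clear_block")
fallback_hit = select_guidance(store, "blocksworld", "unseen_subgoal")
miss = select_guidance(store, "gripper", "pick_ball")
```

Returning `None` on a miss lets the edge agent omit the guidance block entirely instead of injecting unrelated advice.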

You are a cloud\-side trajectory evaluation assistant\.

Your job is to evaluate one execution trajectory and return exactly one JSON object\.

Rules:

1\. Use only the trajectory and optional context\.

2\. Do not invent facts or unstated outcomes\.

3\. If evidence is incomplete, lower self\_reported\_confidence\.

4\. Keep summary, insights, and suggestions trajectory\-grounded\.

5\. Output JSON only\.

Return keys:

\{"summary", "insights", "suggestions", "self\_reported\_confidence"\}

Task / Subgoals / Metadata: \{trajectory\_text\}

Optional Context: \{optional\_context\}

(a) Cloud critic prompt.

Note: A subgoal is a milestone goal toward the final goal.

If an unfinished subgoal exists, output "Action: \{action\}"\.

If the previous subgoal has been completed, output

"Subgoal: \{subgoal\}\\nAction: \{action\}"\.

Instructions:

1\. Do not output two consecutive subgoals\.

2\. Subgoal must be one line\.

3\. If an action fails, use "check valid actions"\.

4\. Use "retrieve\(subgoal\_id\)" only when hidden detailed history is needed\.

\{examples\}

Goal: \{goal\}

Global Guidance: \{global SOP / insight / suggestion\}

\{serialized\_history\}

Action:

(b) Edge-agent prompt.

Figure 2: Excerpted prompt templates aligned with the current implementation. Left: the cloud critic prompt condenses the trajectory-evaluation system prompt and user template. Right: the edge-agent prompt condenses the instruction block and the single Global Guidance section assembled before action generation.

#### 3.4.2 Collaborative Context Assembly on Edge Agent

Upon receiving selected global guidance from the cloud, the edge agent renders it into the unique Global Guidance prompt block. When initiating a new subgoal episode at time step $t+1$, the agent employs a prompt assembly function $\mathcal{A}$ to assemble the complete decision prompt $P_{t+1}^{i}$ strictly based on the global task $g$, the current subgoal $g_{t+1}^{i}$, the immediate observation $o_{t+1}^{i}$, the reconstructed history $H_{t+1}^{i}(\mathcal{I}_{t+1})$, and the cloud guidance matched via the task identifier $u_{t+1}$:

$$P_{t+1}^{i}=\mathcal{A}\left(g,g_{t+1}^{i},o_{t+1}^{i},H_{t+1}^{i}(\mathcal{I}_{t+1}),G(u_{t+1})\right) \tag{11}$$

where $H_{t+1}^{i}(\mathcal{I}_{t+1})$ is the historical context reconstructed from $\mathcal{I}_{t+1}$, and $G(u_{t+1})$ denotes the guidance selected from the global knowledge base for the current stage; the runtime update procedure is summarized in Appendix [H](https://arxiv.org/html/2606.00756#A8).
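A rough sketch of the assembly function in Eq. (11), loosely following the field order of the edge-agent prompt in Figure 2(b), is given below. The function name `assemble_prompt`, the `Subgoal:` and `Observation:` line labels, and the example inputs are assumptions; the instruction block and few-shot examples of the real template are omitted.

```python
from typing import List, Optional


def assemble_prompt(
    goal: str,
    subgoal: str,
    observation: str,
    history: List[str],
    guidance: Optional[str] = None,
) -> str:
    """Build a decision prompt from task, guidance, history, and current state."""
    parts = [f"Goal: {goal}"]
    if guidance:  # the single advisory block is included only when matched
        parts.append(f"Global Guidance: {guidance}")
    parts.extend(history)  # summaries and re-expanded episodes, in order
    parts.append(f"Subgoal: {subgoal}")
    parts.append(f"Observation: {observation}")
    parts.append("Action:")  # the edge model continues from here
    return "\n".join(parts)


prompt = assemble_prompt(
    goal="stack A on B",
    subgoal="place A on B",
    observation="B is clear",
    history=["clear B (done)"],
    guidance="Unstack the top block before moving the target.",
)
prompt_without_guidance = assemble_prompt(
    "stack A on B", "place A on B", "B is clear", ["clear B (done)"]
)
```

Because guidance enters through one optional block, the same function covers both the local-only setting and the cloud-assisted one.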

## 4 Experiments

### 4.1 Experimental Setup

All experiments are conducted without any parameter fine-tuning for both cloud and edge LLMs. The edge-side simulations are implemented using Python 3.8 and PyTorch 2.0.0, and are conducted on a Linux workstation equipped with a single NVIDIA GeForce RTX 4090 GPU and an Intel Xeon Gold 6430 CPU. Detailed environments are provided in Appendix [C](https://arxiv.org/html/2606.00756#A3).

##### Baseline

We select widely used LLM agent frameworks as our baselines for comparison. These are a general-purpose memory-based agent framework adapted from AgentBoard \[[3](https://arxiv.org/html/2606.00756#bib.bib4)\], denoted as Standard, and HiAgent \[[6](https://arxiv.org/html/2606.00756#bib.bib9)\], denoted as Local, a state-of-the-art baseline specifically designed for long-horizon tasks that features subtask-partitioning memory. Furthermore, to reflect the constraints of resource-constrained edge nodes in cloud-edge collaborative environments, we use the same lightweight Mistral-7b \[[4](https://arxiv.org/html/2606.00756#bib.bib5)\] that serves as our edge agent as the backbone model of HiAgent, rather than the GPT-4 (gpt-4-turbo) \[[10](https://arxiv.org/html/2606.00756#bib.bib19)\] used in its original paper.

##### Evaluation Tasks & Cloud-Edge Configuration

We evaluate the models on five classic long-horizon tasks that typically require more than 20 execution steps: Blocksworld, Gripper, Tyreworld, Barman, and Jericho. Employing deepseek-chat as the cloud-side LLM critic, we design two edge deployment scenarios: Scenario A (Homogeneous Lightweight Edge utilizing Mistral-7b) and Scenario B (Heterogeneous Mixed Edge using both GPT-4 (gpt-4-turbo) and Mistral-7b).

Across all configurations, the edge agent's memory budget is fixed at 100 items, and the maximum exploration horizon is capped at 30 decision steps per subgoal episode.

##### Evaluation Protocol and Metrics

Evaluation metrics are categorized into edge-side metrics (Progress Rate (PR), Success Rate (SR), Grounding Accuracy (GA), Steps, and Context) and cloud-side metrics (Coordination, Critic Funnel, and Overhead metrics). Detailed definitions are provided in Appendix [D](https://arxiv.org/html/2606.00756#A4).

### 4.2 Main Results

The experimental results demonstrate that CoMIC enhances the long-horizon execution capabilities of weak edge agents by incorporating cloud-side reflection, all without requiring updates to the lightweight edge models' parameters. The results of Scenario A indicate that, compared to purely local execution, CoMIC substantially improves Success Rate (SR), Progress Rate (PR), and Grounding Accuracy (GA). Furthermore, the results of Scenario B reveal that CoMIC can leverage the cloud critic to summarize higher-quality global guidance from the high-quality trajectories generated by stronger edge agents. Although the absolute performance gains are bounded by the foundational capabilities of the weak base model, this high-quality global guidance still effectively drives targeted behavioral improvements in weak edge agents. Ultimately, cloud-side processing effectively advances task progression and action grounding.

### 4.3 Analysis

This section analyzes the experimental performance of CoMIC under different settings.

#### 4.3.1 Edge-side Analysis

##### Scenario A

Table [1](https://arxiv.org/html/2606.00756#S4.T1) compares the Standard Mistral-7b edge agent with the weak edge agents under the homogeneous CoMIC deployment.

Table 1: Edge-side analysis for Scenario A. We compare the local-only Mistral-7B edge agent with the weak edge under the dual-weak CoMIC deployment. Each collaborative row reports the absolute value together with the change relative to the local baseline.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x2.png)
(a) Scenario A.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x3.png)
(b) Scenario B.

Figure 3: Progress Rate vs. Context Token Consumption. The plots show context token consumption across datasets against the corresponding progress rates, where asterisks denote averages over all environments. Scenario A highlights that CoMIC improves task progression while reducing context cost relative to the Standard baseline, whereas Scenario B shows that trajectories generated by a stronger edge agent yield selective gains without uniformly reducing context usage.

As detailed in Table [1](https://arxiv.org/html/2606.00756#S4.T1), averaged across all tasks, CoMIC increases the Success Rate (SR) from 0.00 to 6.50, improves the Progress Rate (PR) from 5.87 to 15.99, and raises Grounding Accuracy (GA) from 32.53 to 73.94 (detailed visualizations for PR vs. Steps and GA are provided in Appendix [F](https://arxiv.org/html/2606.00756#A6)). These improvements are accompanied by a reduction in interaction costs. As shown in Figure [3](https://arxiv.org/html/2606.00756#S4.F3), the average number of execution steps decreases from 30.00 to 28.90, and context token consumption is reduced by 27.88% relative to the Standard baseline. This indicates that CoMIC decouples task progression from excessive token accumulation.

At the individual task level, the most substantial gains appear in Blocksworld and Tyreworld. Conversely, tasks like Gripper and Jericho exhibit non-uniform improvements across metrics due to differing environmental characteristics; however, the setting successfully bounds context consumption across all domains.

##### Scenario B

The experimental results of Scenario B (detailed in Appendix [E](https://arxiv.org/html/2606.00756#A5)) indicate that collaboration between heterogeneous edge agents effectively enhances the quality of cloud-side reflection. High-quality trajectories generated by stronger edge agents enable the cloud critic to summarize more effective global guidance for the entire cluster. However, the improvement in Success Rate (SR) for weak agents remains constrained by the foundational capabilities of the Mistral-7b model. Specifically, for complex long-horizon tasks, the weak base model's limited ability to decompose tasks into subgoals and to comprehend global guidance restricts its overall task completion. Nevertheless, the measurable gains in PR and GA demonstrate that CoMIC substantially strengthens the execution capability of weak edge agents at the subgoal level.

#### 4\.3\.2Cloud\-side Analysis

The cloud layer achieves significant memory reuse and stable guidance delivery\. Under Scenario A, the cloud achieves an averageHitrate of 45\.12% across all tasks, with an average response latency \(Lat\.\) of 68\.68 s and an acknowledgement rate \(ACK\) of 71\.00%\. This indicates that the selected guidance is not only successfully generated but also proves acceptable to the edge nodes, demonstrating the cloud\-edge collaboration ofCoMIC\. However, due to agent heterogeneity, there is a significant performance gap between different collaboration scenarios\.

Detailed statistics on cloud coordination, analytical insights regarding this heterogeneity, and system\-level resource overheads are thoroughly documented in Appendix[G](https://arxiv.org/html/2606.00756#A7)\.

#### 4\.3\.3 Component Ablation

To isolate cloud\-side processing, we compare three weak\-edge settings using the same Mistral\-7b backbone: Local \(w/o Cloud\), Scenario A \(w/ Cloud\), and Scenario B \(w/ Hetero\. Cloud\)\. Table [2](https://arxiv.org/html/2606.00756#S4.T2) shows the strongest task\-level case, while full results are provided in Appendix [J](https://arxiv.org/html/2606.00756#A10)\.

Table 2: Ablation study of CoMIC on Blocksworld\. “w/o Cloud” corresponds to Local, “w/ Cloud” to Scenario A, and “w/ Hetero\. Cloud” to Scenario B\. Deltas are relative to w/o Cloud\.

As shown in Table [2](https://arxiv.org/html/2606.00756#S4.T2), the purely local baseline Local \(w/o Cloud\) fails entirely on Blocksworld \(0\.00 SR\), whereas both cloud\-enabled configurations achieve 20\.00 SR\. Correspondingly, PR rises from 6\.67 to 32\.22 \(w/ Cloud\) and 33\.33 \(w/ Hetero\. Cloud\), alongside a reduction in interaction steps\. Although context token consumption slightly exceeds that of w/o Cloud, this outcome shows that CoMIC first prioritizes task progress and successful completion, and only then minimizes context token usage\.

The ablation results indicate that cloud\-side reflection partially mitigates the limitations of local memory, especially for subgoal\-level progress and valid action selection\. However, the results do not show that cloud reflection fully overcomes weak\-model limitations: as detailed in Appendix [J](https://arxiv.org/html/2606.00756#A10), several tasks retain low or zero success rate, and heterogeneous guidance provides only selective improvements\. This suggests that CoMIC’s effectiveness depends on three conditions: the edge model must generate viable subgoals, the cloud critic must admit reliable experiences, and the dispatched guidance must be concise enough for the weak edge model to use\.

## 5 Limitations

While CoMIC improves the execution capabilities of weak edge agents, its end\-to\-end task success remains fundamentally constrained by the base model’s inherent planning ability\. An excessively weak backbone struggles to generate viable subgoals or fully comprehend the global guidance from the cloud critic\. Furthermore, the achieved context savings depend heavily on specific comparison settings and normalization, necessitating careful memory admission, precise guidance dispatch, and rigorous evaluation\.

## 6 Conclusion

In this paper, we propose CoMIC, a cloud\-edge collaborative framework designed to address the long\-horizon decision\-making challenges faced by lightweight Large Language Model \(LLM\) agents on resource\-constrained edge devices\. The framework lets edge agents execute subtasks in a decentralized manner to achieve long\-horizon goals, while a cloud critic centrally reflects on the interaction trajectories of all edge agents and returns feedback, thereby improving edge agent performance without requiring any parameter updates\. Extensive experiments demonstrate that, compared to existing baselines, CoMIC improves progress and grounding, with modest and task\-dependent success\-rate gains\.

Future work will explore optimizing prompt engineering to refine the granularity of cloud guidance, as well as integrating reinforcement learning to dynamically adjust the confidence thresholds for cloud reflections\.

## References

- \[1\] \(2025\) Introducing Apple’s on\-device and server foundation models\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px1.p1.1)\.
- \[2\] A\. Borzilov, A\. Skrynnik, and A\. Panov \(2025\) CoSMAC: a benchmark for evaluating communication and coordination in LLM\-based agents\. In LLM\-based Multi\-Agent Systems: Towards Responsible, Reliable, and Scalable Agentic Systems\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p1.1)\.
- \[3\] M\. Chang, J\. Zhang, Z\. Zhu, C\. Yang, Y\. Yang, Y\. Jin, Z\. Lan, L\. Kong, and J\. He \(2024\) AgentBoard: an analytical evaluation board of multi\-turn LLM agents\. Advances in Neural Information Processing Systems 37, pp\. 74325–74362\. Cited by: [§4\.1](https://arxiv.org/html/2606.00756#S4.SS1.SSS0.Px1.p1.1)\.
- \[4\] A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\.\-A\. Lachaux, P\. Stock, T\. Le Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. El Sayed \(2023\) Mistral 7B\. arXiv preprint arXiv:2310\.06825\. Cited by: [§4\.1](https://arxiv.org/html/2606.00756#S4.SS1.SSS0.Px1.p1.1.5)\.
- \[5\] C\. Ding, Z\. Lu, F\. Juefei\-Xu, V\. N\. Boddeti, Y\. Li, and J\. Cao \(2022\) Towards transmission\-friendly and robust CNN models over cloud and device\. IEEE Transactions on Mobile Computing 22\(10\), pp\. 6176–6189\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px1.p1.1)\.
- \[6\] M\. Hu, T\. Chen, Q\. Chen, Y\. Mu, W\. Shao, and P\. Luo \(2025\) HiAgent: hierarchical working memory management for solving long\-horizon agent tasks with large language model\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 32779–32798\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p1.1), [§4\.1](https://arxiv.org/html/2606.00756#S4.SS1.SSS0.Px1.p1.1)\.
- \[7\] Y\. Li, C\. Qian, Y\. Xia, R\. Shi, Y\. Dang, Z\. Xie, Z\. You, W\. Chen, C\. Yang, W\. Liu, et al\. \(2025\) Cross\-task experiential learning on LLM\-based multi\-agent collaboration\. arXiv preprint arXiv:2505\.23187\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p1.1)\.
- \[8\] Z\. Lin, S\. Bi, and Y\. A\. Zhang \(2021\) Optimizing AI service placement and resource allocation in mobile edge intelligence systems\. IEEE Transactions on Wireless Communications 20\(11\), pp\. 7257–7271\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px1.p2.1)\.
- \[9\] Z\. Liu, C\. Zhao, F\. Iandola, C\. Lai, Y\. Tian, I\. Fedorov, Y\. Xiong, E\. Chang, Y\. Shi, R\. Krishnamoorthi, et al\. \(2024\) MobileLLM: optimizing sub\-billion parameter language models for on\-device use cases\. In Forty\-first International Conference on Machine Learning\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px1.p3.1)\.
- \[10\] OpenAI \(2023\) GPT\-4 technical report\. arXiv preprint arXiv:2303\.08774\. Cited by: [§4\.1](https://arxiv.org/html/2606.00756#S4.SS1.SSS0.Px1.p1.1)\.
- \[11\] G\. Qu, Q\. Chen, W\. Wei, Z\. Lin, X\. Chen, and K\. Huang \(2025\) Mobile edge intelligence for large language models: a contemporary survey\. IEEE Communications Surveys & Tutorials\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p2.1)\.
- \[12\] N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. Advances in Neural Information Processing Systems 36, pp\. 8634–8652\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px2.p1.1)\.
- \[13\] Gemini Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican, et al\. \(2023\) Gemini: a family of highly capable multimodal models\. arXiv preprint arXiv:2312\.11805\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px1.p3.1)\.
- \[14\] L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, et al\. \(2024\) A survey on large language model based autonomous agents\. Frontiers of Computer Science 18\(6\), pp\. 186345\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px2.p1.1)\.
- \[15\] D\. M\. Wegner \(1987\) Transactive memory: a contemporary analysis of the group mind\. In Theories of group behavior, pp\. 185–208\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p3.1)\.
- \[16\] Z\. Xi, W\. Chen, X\. Guo, W\. He, Y\. Ding, B\. Hong, M\. Zhang, J\. Wang, S\. Jin, E\. Zhou, et al\. \(2025\) The rise and potential of large language model based agents: a survey\. Science China Information Sciences 68\(2\), pp\. 121101\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px2.p1.1)\.
- \[17\] Z\. Xi, Y\. Ding, W\. Chen, B\. Hong, H\. Guo, J\. Wang, X\. Guo, D\. Yang, C\. Liao, W\. He, et al\. \(2025\) AgentGym: evaluating and training large language model\-based agents across diverse environments\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 27914–27961\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p1.1)\.
- \[18\] M\. Xu, H\. Du, D\. Niyato, J\. Kang, Z\. Xiong, S\. Mao, Z\. Han, A\. Jamalipour, D\. I\. Kim, X\. Shen, et al\. \(2024\) Unleashing the power of edge\-cloud generative AI in mobile networks: a survey of AIGC services\. IEEE Communications Surveys & Tutorials 26\(2\), pp\. 1127–1170\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px1.p1.1)\.
- \[19\] S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2022\) ReAct: synergizing reasoning and acting in language models\. In The Eleventh International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p1.1)\.
- \[20\] Z\. Yao, Z\. Tang, W\. Yang, and W\. Jia \(2025\) Enhancing LLM QoS through cloud\-edge collaboration: a diffusion\-based multi\-agent reinforcement learning approach\. IEEE Transactions on Services Computing\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p2.1)\.
- \[21\] A\. Zeng, M\. Liu, R\. Lu, B\. Wang, X\. Liu, Y\. Dong, and J\. Tang \(2024\) AgentTuning: enabling generalized agent abilities for LLMs\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 3053–3077\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p2.1)\.
- \[22\] Z\. Zhang, Q\. Dai, X\. Bo, C\. Ma, R\. Li, X\. Chen, J\. Zhu, Z\. Dong, and J\. Wen \(2025\) A survey on the memory mechanism of large language model\-based agents\. ACM Transactions on Information Systems 43\(6\), pp\. 1–47\. Cited by: [Appendix A](https://arxiv.org/html/2606.00756#A1.SS0.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2606.00756#S1.p2.1)\.
- \[23\] Z\. Zhang, Q\. Dai, R\. Li, X\. Bo, X\. Chen, and Z\. Dong \(2025\) Learn to memorize: optimizing LLM\-based agents with adaptive memory framework\. arXiv preprint arXiv:2508\.16629\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p1.1)\.
- \[24\] L\. Zhu, X\. Huang, and J\. Sang \(2025\) A LLM\-based controllable, scalable, human\-involved user simulator framework for conversational recommender systems\. In Proceedings of the ACM on Web Conference 2025, pp\. 4653–4661\. Cited by: [§1](https://arxiv.org/html/2606.00756#S1.p1.1)\.

## Appendix A Related Work

##### LLM Agents in Cloud\-Edge Systems\.

Cloud\-based large language models \(LLMs\) can leverage abundant computational and storage resources to support large\-scale models and sophisticated reasoning capabilities\[[18](https://arxiv.org/html/2606.00756#bib.bib28)\]\. However, achieving such sophisticated reasoning capabilities demands prohibitive computational and memory resources, rendering the direct deployment of these large\-parameter models on edge servers practically infeasible\[[5](https://arxiv.org/html/2606.00756#bib.bib6)\]\. While edge\-deployed lightweight LLMs are undergoing rapid development, the implementation of LLM\-based agents at the network edge remains heavily constrained by stringent resource limitations\[[1](https://arxiv.org/html/2606.00756#bib.bib2)\]\. Specifically, the bounded computational, memory, and storage resources on edge servers significantly restrict the model scale of edge agents, thereby limiting their reasoning depth when tackling complex, long\-horizon tasks\.

Mobile Edge Intelligence \(MEI\) sits between on\-device AI and cloud\-based AI, featuring a modest scale of computing resources located close to users, which is more capable than edge devices yet less powerful than cloud centers\[[8](https://arxiv.org/html/2606.00756#bib.bib14)\]\. Edge servers provide an optimal platform to host models capable of reasoning, ensuring both the responsiveness and the necessary cognitive capacity required for agentic workflows\.

Due to prohibitive resource footprints, current industrial solutions primarily focus on sub\-10B parameter models\[[9](https://arxiv.org/html/2606.00756#bib.bib13)\]\. For instance, Google’s Gemini Nano \(ranging from 1\.8B to 3\.25B parameters\) utilizes 4\-bit quantization but is confined to basic features such as text summarization and smart replies\[[13](https://arxiv.org/html/2606.00756#bib.bib24)\]\. However, as task complexity escalates, there is an inevitable demand for edge\-deployed LLM agents to possess advanced capabilities for planning, learning, and reflection in long\-horizon scenarios\. This necessity has become a fundamental driver for the development of next\-generation edge intelligence\.

##### LLM Agent Memory\.

Memory constitutes the definitive hallmark of an LLM\-based agent, distinguishing it from traditional large language models\[[22](https://arxiv.org/html/2606.00756#bib.bib32)\]\. It plays a critical role in how an agent accumulates knowledge, processes historical experiences, and retrieves pertinent information to support complex decision\-making\[[14](https://arxiv.org/html/2606.00756#bib.bib26)\]\. From the perspective of cognitive science, working memory enables an individual to maintain and process information in real time, providing the essential foundation for sophisticated cognitive tasks such as reasoning, comprehension, and learning\[[16](https://arxiv.org/html/2606.00756#bib.bib29)\]\. The memory of an LLM\-based agent can be categorized into two dimensions\[[12](https://arxiv.org/html/2606.00756#bib.bib23)\]: in a narrow sense, it refers to the core historical information necessary to complete a current task, primarily the sequence of actions and observations within a single trial; in a broader sense, it encompasses the comprehensive collection of all relevant knowledge, including successful and failed experiences across multiple trials as well as external auxiliary information\. Collectively, these dimensions empower the agent to accumulate knowledge, optimize decision\-making strategies, and avoid repeating errors\.

However, on edge\-deployed LLM agents, limited computational resources hinder efficient reasoning and reflection, while constrained storage resources restrict the retention of long\-term, multi\-trial, and detailed memories\. As a result, edge LLM agents cannot accumulate experience and learn in the same way as typical LLM agents\. These limitations highlight the urgent need for a cloud\-edge collaborative LLM agent framework\.

## Appendix B More Details on Background

Figure [4](https://arxiv.org/html/2606.00756#A2.F4) illustrates the detailed cloud\-edge\-end system architecture\.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x4.png)

Figure 4: Cloud\-edge\-end system\. Deploying lightweight LLM agents at the edge servers enables localized services for end users\. However, constrained by scale\-out deployment costs, widely distributed edge nodes cannot host massive models with parameter scales comparable to cloud servers, which strictly limits the reasoning capabilities, memory capacities, and computational resources of edge agents\. When facing the demands of complex long\-horizon tasks, these resource\-bounded and isolated edge agents are unable to extract sufficient prior experiences solely from their limited local memories, highlighting the critical gap between localized resource constraints and complex task requirements\.

## Appendix C Runtime Environments

Table [3](https://arxiv.org/html/2606.00756#A3.T3) summarizes the runtime environments used for different experimental components\. The edge\-side experiments and baselines are implemented in Python 3\.8 with PyTorch 2\.0\.0\. We keep all LLM parameters fixed throughout the experiments and do not perform parameter fine\-tuning\.

Table 3: Runtime environments for different experimental components\.

## Appendix D Detailed Evaluation Metrics

The evaluation metrics used in our study are categorized into edge\-side metrics, cloud\-side metrics, and cloud\-edge collaboration metrics, aligning with our system analysis structure\.

Edge\-side metrics include: \(i\) Progress Rate \(PR\), evaluating the degree of task completion; \(ii\) Success Rate \(SR\), measuring the percentage of successfully completed tasks; \(iii\) Grounding Accuracy \(GA\), quantifying the proportion of valid executed actions in the environment; \(iv\) Steps, counting the average execution steps required to finish or terminate a task; and \(v\) Context, measuring the scale of prompt tokens consumed during execution\.
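The edge-side metrics above can be illustrated with a minimal Python sketch. The `Episode` record and its field names are hypothetical, not taken from the paper's code. The sketch also assumes that GA is pooled over all attempted actions, that Steps equals the number of attempted actions, and that Context is the mean prompt-token count per episode; the paper may aggregate differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    """One evaluated task run (hypothetical record layout)."""
    progress: float       # fraction of the task completed, in [0, 1]
    success: bool         # True if the task was fully completed
    valid_actions: int    # actions accepted by the environment
    total_actions: int    # all actions the agent attempted
    prompt_tokens: int    # prompt tokens consumed during the run


def edge_metrics(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate PR, SR, GA (in percent), mean Steps, and mean Context."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    attempted = sum(e.total_actions for e in episodes)
    valid = sum(e.valid_actions for e in episodes)
    return {
        "PR": 100.0 * sum(e.progress for e in episodes) / n,
        "SR": 100.0 * sum(e.success for e in episodes) / n,
        # Assumption: GA is pooled over all attempted actions.
        "GA": (100.0 * valid / attempted) if attempted else 0.0,
        "Steps": attempted / n,
        "Context": sum(e.prompt_tokens for e in episodes) / n,
    }
```

Pooling GA over actions weights long episodes more heavily than averaging per-episode ratios would; either choice is defensible, so any comparison should state which one it uses.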

Cloud\-side metrics evaluate the cloud’s response capability and delivery effectiveness\. These include Cache Hit Rate \(Hit\), Confidence Score \(Conf\.\), cloud response latency \(Lat\.\), generated Insights Count \(Ins\.\), and feedback ACK Rate \(ACK\)\.

Cloud\-edge collaboration metrics evaluate the internal processing and resource consumption of the cloud memory driven by edge\-cloud interaction, comprising two components: \(i\) Cloud Critic Metrics evaluate the experience critic process\. This category encompasses cloud pipeline metrics, which track the edge agent’s experience Uploads, evaluated Responses, and admitted rules \(KB Admit\), and cloud guidance metrics, which trace the guidance lifecycle through generated Insights, acknowledged guidance \(ACK\), and successfully executed actions \(Adopt\); and \(ii\) Cloud Overhead Metrics quantify the total cost of resource consumption \(Cloud Total\), which comprises the distillation pipeline cost \(Pipeline Total\) and the guidance retrieval cost \(Guidance Total\)\.
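As a rough illustration of how the collaboration counters relate, the sketch below derives stage-to-stage ratios for the two funnels (Uploads, Responses, KB Admit; Insights, ACK, Adopt) and the overhead split. The field names and the choice of denominators are assumptions for illustration only. The one relation taken from the text is that Cloud Total is the sum of Pipeline Total and Guidance Total.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CloudCounters:
    """Raw counts and costs for one task (hypothetical field names)."""
    uploads: int           # experiences uploaded by edge agents
    responses: int         # uploads evaluated by the cloud critic
    kb_admit: int          # experiences admitted as reusable rules
    insights: int          # guidance items generated
    ack: int               # guidance items acknowledged by the edge
    adopt: int             # acknowledged items that led to executed actions
    pipeline_total: float  # distillation pipeline cost
    guidance_total: float  # guidance retrieval cost


def ratio(num: float, den: float) -> float:
    """Safe division that returns 0.0 for an empty stage."""
    return num / den if den else 0.0


def collaboration_report(c: CloudCounters) -> Dict[str, float]:
    cloud_total = c.pipeline_total + c.guidance_total
    return {
        # Assumed denominators: each stage relative to the one before it.
        "response_rate": ratio(c.responses, c.uploads),
        "admit_rate": ratio(c.kb_admit, c.responses),
        "ack_rate": ratio(c.ack, c.insights),
        "adopt_rate": ratio(c.adopt, c.ack),
        "cloud_total": cloud_total,
        "pipeline_share": ratio(c.pipeline_total, cloud_total),
        "guidance_share": ratio(c.guidance_total, cloud_total),
    }
```

Reading the funnel stage by stage makes it easier to see where experience is lost, for example whether few uploads are admitted or whether admitted guidance is rarely adopted.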

## Appendix E Detailed Results for Scenario B

Table [4](https://arxiv.org/html/2606.00756#A5.T4) reports the detailed experimental results under the Heterogeneous Mixed Edge setting \(Scenario B\)\.

Table 4: Edge\-side analysis for Scenario B, comparing the weak edge in Scenario A and Scenario B to measure the added effect of stronger\-peer trajectories\.

## Appendix F Additional Edge\-side Analysis

![Refer to caption](https://arxiv.org/html/2606.00756v1/x5.png)

Figure 5: Progress Rate vs\. Execution Steps in Scenario A\. Compared to the standard baseline, the edge agent under the CoMIC framework achieves significantly higher progress rates within the same or fewer execution steps across multiple environments, demonstrating enhanced action efficiency\.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x6.png)

Figure 6: Grounding Accuracy\. Assisted by cloud guidance, the local model avoids invalid actions\.

As shown in Figure [5](https://arxiv.org/html/2606.00756#A6.F5), the edge agent under the CoMIC framework achieves significantly higher progress rates within the same or fewer execution steps across multiple environments, demonstrating enhanced action efficiency\. Notably, it moves from zero to substantial progress in Blocksworld, Tyreworld, and Barman\. This indicates that edge agents relying solely on their local capabilities struggle to satisfy the requirements of such complex long\-horizon tasks\. Under the cloud\-edge collaborative system, however, the cloud critic extracts reusable experience from weak\-edge trajectories\. This demonstrates that even when the agent’s capability is weak, valuable empirical knowledge can still be mined and returned as advisory guidance\. The results in Figure [6](https://arxiv.org/html/2606.00756#A6.F6) also illustrate that the selected guidance helps the edge agent accomplish subgoal episodes, with greener regions in the heatmap denoting higher grounding accuracy\.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x7.png)

Figure 7: Progress Rate vs\. Execution Steps in Scenario B\. Compared to Scenario A, the weak edge in Scenario B shows selective progress gains while maintaining comparable execution length across environments\.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x8.png)

Figure 8: Grounding Accuracy\. Guidance from stronger edge agents yields task\-dependent gains\.

A similar phenomenon is observed in Figure [7](https://arxiv.org/html/2606.00756#A6.F7) and Figure [8](https://arxiv.org/html/2606.00756#A6.F8)\. As discussed in §[4\.3\.3](https://arxiv.org/html/2606.00756#S4.SS3.SSS3) and §[5](https://arxiv.org/html/2606.00756#S5), the weak edge agent in Scenario B is constrained by its base model’s limitations, preventing it from matching the planning capabilities of strong edge agents or fully comprehending high\-quality guidance\. Despite these bottlenecks, its overall performance remains greater than or equal to the baseline across all datasets\. Moreover, Figure [8](https://arxiv.org/html/2606.00756#A6.F8) reveals that within reasonably planned subgoal episodes, the weak edge agent in Scenario B receives more effective guidance from the heterogeneous cloud\. Consequently, its overall grounding accuracy \(executability\) is generally higher than that observed in Scenario A\.

## Appendix G Additional Cloud\-side Analysis

This section provides detailed coordination statistics, experience funnel metrics, and total system overhead analysis for the cloud layer\.

Table 5: Cloud\-side coordination statistics for Scenario A and Scenario B\.

According to the detailed numerical comparisons in Table [5](https://arxiv.org/html/2606.00756#A7.T5), Scenario A outperforms Scenario B across key cloud\-side coordination metrics\. This is not because the cloud generates guidance of intrinsically higher value in Scenario A\. Rather, because the edge agents in this scenario use the same base model, their generated trajectories are similar at the abstract level\. Consequently, the global guidance selected by the cloud critic is more applicable and easier for these base models to understand and execute\.

Conversely, in the mixed\-strength setting \(Scenario B\), the guidance selected by the cloud critic incorporates more abstract patterns from the trajectories of the strong agents\. Because weak edge agents struggle to plan sub\-tasks over a long horizon as their stronger counterparts do, they are less able to use this abstract, long\-term guidance from the cloud\. This weaker alignment with the selected guidance contributes to the decline in the Hit and ACK rates\. However, as noted in the Edge\-side Analysis \(§[4\.3\.1](https://arxiv.org/html/2606.00756#S4.SS3.SSS1)\), when the weak agents in Scenario B do select the correct sub\-tasks, their episode completion metrics improve significantly\. This explains why the weak LLM agents in Scenario B ultimately achieve better edge\-side performance than those in Scenario A, despite the apparent drop in cloud\-side coordination efficiency\.

## Appendix H Runtime Guidance Update Procedure

Algorithm [1](https://arxiv.org/html/2606.00756#alg1) summarizes the implementation\-level lifecycle that connects cloud\-side trajectory evaluation with edge\-side runtime guidance\. It is not a synchronous control loop executed at every decision step: trajectory admission runs after completed subgoal episodes are uploaded, global guidance is synthesized in the cloud background, and runtime retrieval is invoked by the edge only when the current subgoal remains active and recent execution indicates that guidance is needed\. The latest observation remains the authoritative environment state during action selection\.

Algorithm 1: Asynchronous lifecycle of trajectory admission and runtime guidance

Input: Completed subgoal episode $\hat{\tau}_{t}^{i}$, metadata $\tilde{m}_{t}^{i}$, current identifier $u_{t+1}$, observation $o_{t+1}^{i}$, history indices $\mathcal{I}_{t+1}$, and guidance store $\mathcal{M}=\mathcal{M}_{\mathrm{manual}}\cup\mathcal{M}_{\mathrm{global}}$\.

Output: Updated guidance store $\mathcal{M}$ and next prompt $P_{t+1}^{i}$ with at most one Global Guidance block\.

1: Cloud admission path

2: Serialize: $x_{t}^{i}\leftarrow\mathrm{Serialize}(\hat{\tau}_{t}^{i},\tilde{m}_{t}^{i})$\.

3: if $x_{t}^{i}\in\mathrm{cache}$ then

4: Reuse the cached trajectory evaluation record\.

5: else

6: Upload $x_{t}^{i}$ to $\mathcal{M}_{\mathrm{STM}}$\.

7: Evaluate $x_{t}^{i}$ with the Cloud Critic and compute $s_{\mathrm{adm}}(x_{t}^{i})$\.

8: if $s_{\mathrm{adm}}(x_{t}^{i})\geq\gamma_{\mathrm{kb}}$ and no hard block then

9: Admit reusable experience into $\mathcal{M}_{\mathrm{exp}}^{+}$\. $\triangleright$ following Eq\. \([7](https://arxiv.org/html/2606.00756#S3.E7)\)

10: end if

11: end if

12: Cloud aggregation path

13: for each eligible exact or coarse group $u$ do

14: Retrieve $\mathcal{G}(u)\leftarrow\{m\in\mathcal{M}_{\mathrm{exp}}^{+}\mid\kappa(m)=u\}$\.

15: Update $\mathcal{M}_{\mathrm{global}}$ with $G(u)\leftarrow f_{\mathrm{agg}}(\mathcal{G}(u))$\. $\triangleright$ following Eq\. \([10](https://arxiv.org/html/2606.00756#S3.E10)\)

16: end for

17: Edge runtime path

18: if runtime guidance is triggered then

19: Select $G_{\mathrm{sel}}(u_{t+1})$ from $\mathcal{M}$ using the SOP Selector\.

20: Ground any recommended step against $o_{t+1}^{i}$ and valid actions when provided\.

21: else

22: Set $G_{\mathrm{sel}}(u_{t+1})\leftarrow\varnothing$\.

23: end if

24: Assemble $P_{t+1}^{i}\leftarrow\mathcal{A}(g,g_{t+1}^{i},o_{t+1}^{i},H_{t+1}^{i}(\mathcal{I}_{t+1}),G_{\mathrm{sel}}(u_{t+1}))$\.

25: Execute with $o_{t+1}^{i}$ as the authoritative environment state\.

26: return $\mathcal{M}$ and $P_{t+1}^{i}$\.
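The three paths of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. The `GuidanceStore` layout, the string serialization, the default threshold `gamma_kb=0.5`, and the `critic`, `hard_block`, and `f_agg` callables are all hypothetical stand-ins. The sketch also assumes that evaluations are written to the cache after scoring. It omits the manual guidance store, the SOP Selector's ranking logic, and the grounding of recommended steps against valid actions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GuidanceStore:
    """In-memory stand-in for the cloud-side stores (hypothetical layout)."""
    cache: Dict[str, float] = field(default_factory=dict)            # serialized episode -> score
    stm: List[str] = field(default_factory=list)                     # upload buffer (M_STM)
    exp_plus: Dict[str, List[str]] = field(default_factory=dict)     # admitted experience (M_exp^+)
    global_guidance: Dict[str, str] = field(default_factory=dict)    # aggregated guidance (M_global)


def admit(store: GuidanceStore, episode: str, subgoal_id: str,
          critic: Callable[[str], float], gamma_kb: float = 0.5,
          hard_block: Callable[[str], bool] = lambda x: False) -> float:
    """Cloud admission path (steps 1-11)."""
    x = f"{subgoal_id}|{episode}"          # stand-in for Serialize(episode, metadata)
    if x in store.cache:
        return store.cache[x]              # reuse the cached evaluation record
    store.stm.append(x)                    # upload to the short-term buffer
    score = critic(x)                      # admission score s_adm(x)
    store.cache[x] = score                 # assumption: evaluations are cached
    if score >= gamma_kb and not hard_block(x):
        store.exp_plus.setdefault(subgoal_id, []).append(episode)
    return score


def aggregate(store: GuidanceStore, f_agg: Callable[[List[str]], str]) -> None:
    """Cloud aggregation path (steps 12-16): one guidance entry per subgoal id."""
    for u, group in store.exp_plus.items():
        store.global_guidance[u] = f_agg(group)


def next_prompt(store: GuidanceStore, goal: str, subgoal: str, subgoal_id: str,
                observation: str, history: List[str], triggered: bool) -> str:
    """Edge runtime path (steps 17-25): at most one Global Guidance block."""
    selected = store.global_guidance.get(subgoal_id) if triggered else None
    parts = [f"Goal: {goal}", f"Subgoal: {subgoal}"]
    parts += [f"History: {h}" for h in history]
    if selected is not None:
        parts.append(f"Global Guidance (advisory): {selected}")
    # The latest observation goes last and stays the authoritative state.
    parts.append(f"Observation: {observation}")
    return "\n".join(parts)
```

In this sketch `admit` and `aggregate` would run in a cloud background worker, while `next_prompt` runs on the edge and only reads from the store, which mirrors the asynchronous split described above.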

## Appendix I Additional Cloud\-Edge Collaboration Analysis

Figure [9](https://arxiv.org/html/2606.00756#A9.F9) and Figure [10](https://arxiv.org/html/2606.00756#A9.F10) present the detailed cloud critic metrics, tracking the lifecycle of experience upload, evaluation, and guidance generation for Scenario A and Scenario B, respectively\. Furthermore, Figure [11](https://arxiv.org/html/2606.00756#A9.F11) and Figure [12](https://arxiv.org/html/2606.00756#A9.F12) detail the corresponding resource consumption overheads of the cloud\-side memory mechanism across both scenarios\.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x9.png)

Figure 9: Cloud\-Edge Collaboration Metrics in Scenario A\. Cloud pipeline metrics include Upload, Response, and KB Admit; cloud guidance metrics include Insight, ACK, and Adopt across datasets\.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x10.png)

Figure 10: Cloud\-Edge Collaboration Metrics in Scenario B\. Cloud pipeline metrics include Upload, Response, and KB Admit; cloud guidance metrics include Insight, ACK, and Adopt across datasets\.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x11.png)

Figure 11: Cloud Overhead Metrics in Scenario A\. Resource consumption of the cloud\-side memory mechanism, consisting of \(a\) Cloud Pipeline Metrics \(Pipeline Total\) and \(b\) Cloud Guidance Metrics \(Guidance Total\)\. \(c\) Cloud\-side Total Overhead \(Cloud Total\) presents the sum and the respective proportions of \(a\) and \(b\)\.

![Refer to caption](https://arxiv.org/html/2606.00756v1/x12.png)

Figure 12: Cloud Overhead Metrics in Scenario B\. Resource consumption of the cloud\-side memory mechanism, consisting of \(a\) Cloud Pipeline Metrics \(Pipeline Total\) and \(b\) Cloud Guidance Metrics \(Guidance Total\)\. \(c\) Cloud\-side Total Overhead \(Cloud Total\) presents the sum and the respective proportions of \(a\) and \(b\)\.

## Appendix J Detailed Results for Ablation

Table [6](https://arxiv.org/html/2606.00756#A10.T6) reports the task\-level ablation results\. Deltas are computed relative to w/o Cloud, and Context is normalized by the w/o Cloud baseline for each task\.
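A small Python sketch of the two table conventions follows. It assumes that a delta is an absolute difference from the w/o Cloud value (a relative percentage would be undefined when the baseline SR is 0\.00) and that normalized Context is a ratio to the baseline's token count. Both readings are assumptions, since the text does not give the formulas. The PR values are the Blocksworld numbers reported in Section 4.3.3.

```python
from typing import Dict


def delta(value: float, baseline: float) -> float:
    """Absolute change relative to the w/o Cloud baseline (assumed form)."""
    return round(value - baseline, 2)


def normalized_context(tokens: float, baseline_tokens: float) -> float:
    """Context as a multiple of the w/o Cloud baseline (1.0 means equal)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline context must be positive")
    return tokens / baseline_tokens


# Blocksworld progress rates from the ablation discussion (Section 4.3.3).
pr: Dict[str, float] = {
    "w/o Cloud": 6.67,
    "w/ Cloud": 32.22,
    "w/ Hetero. Cloud": 33.33,
}
pr_delta = {name: delta(v, pr["w/o Cloud"]) for name, v in pr.items()}
```

Under these assumptions the PR deltas are \+25.55 for w/ Cloud and \+26.66 for w/ Hetero. Cloud, and a normalized Context slightly above 1.0 corresponds to the small increase in context consumption noted for Blocksworld.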

Table 6: Task\-level ablation comparison under the same weak Mistral\-7B executor\. w/o Cloud corresponds to Local, w/ Cloud to Scenario A, and w/ Hetero\. Cloud to Scenario B\.
