Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

arXiv cs.CL 06/05/26, 04:00 AM Papers
latent-communication multi-agent llm embeddings hidden-states kv-cache framework
Summary
This paper presents a unified framework for latent communication in LLM-based multi-agent systems, categorizing methods by what information is communicated, sender-receiver alignment, and fusion technique, and reviews eighteen representative methods from 2024-2026.
arXiv:2606.05711v1 Announce Type: new Abstract: Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks -- high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol -- latent communication -- in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges -- including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.
Cached at: 06/05/26, 08:07 AM
# A Unified Framework for Latent Communication in LLM-based Multi-Agent Systems
Source: [https://arxiv.org/html/2606.05711](https://arxiv.org/html/2606.05711)
###### Abstract

Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is *natural language*: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks: high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol, *latent communication*, in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a *unified framework* for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (*Embeddings*, *Hidden States*, *KV-Caches*, or other continuous state); (2) WHICH sender–receiver alignment is used (*latent-space alignment* and *layer alignment*); and (3) HOW the communicated information is fused into the receiver (*concatenation*, *prepending*, *mathematical operations*, *cross-attention*, or *cache restoration*). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges, including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between *latent communication* and *latent chain-of-thought*. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.

*Keywords*: Latent Communication · Multi-Agent LLMs · KV-Cache · Hidden States · Embeddings · Agent Communication · Survey

## 1\. Introduction

Multi-agent systems built on top of large language models (LLMs) have rapidly become a workhorse for complex reasoning, planning, code generation, scientific question answering, and tool orchestration (Wu et al., [2023](https://arxiv.org/html/2606.05711#bib.bib1); Hong et al., [2023](https://arxiv.org/html/2606.05711#bib.bib2); Li et al., [2023](https://arxiv.org/html/2606.05711#bib.bib3); Liu et al., [2026b](https://arxiv.org/html/2606.05711#bib.bib30)). In the canonical architecture, several specialised LLM agents, each typically instantiated as a separate model call with its own role prompt, collaborate by exchanging *natural language* messages. A planner proposes a strategy in text; a critic reads the proposal and replies in text; a coder edits the plan in text; and so on. The result is a visible, inspectable, human-readable communication trace that doubles as an audit log and a debugging surface. The way such a system partitions a complex task across agents (*which* subtask to assign to *which* agent) is itself a non-trivial design choice, and recent work has begun to study adaptive task-decomposition strategies empirically (Liu et al., [2025](https://arxiv.org/html/2606.05711#bib.bib29)).

Despite its success, the *text-only* communication protocol is being increasingly questioned. Three structural limitations stand out:

1. Inference cost. Every message forces the sender to *decode* its internal reasoning into a token sequence, and forces the receiver to *re-encode* that sequence back into a representation. For an $L$-layer model with hidden dimension $d$ and a message of $T$ tokens, the per-message overhead is $\mathcal{O}(L \cdot T \cdot d)$ extra FLOPs on top of the agent’s own reasoning.
2. Information loss during discretization. The sender’s hidden state, a high-dimensional vector that summarises its entire context, must be *compressed* into a single token drawn from a vocabulary of size $V$. The mutual information between the hidden state and the chosen token is bounded by $\log_2 V$ bits, roughly 15 to 17 bits for modern tokenisers, whereas the raw representational capacity of the hidden state is tens of thousands of bits. Alternative reasoning paths, calibrated confidences over alternatives, and fine-grained semantic distinctions are simply discarded.
3. Redundancy and ambiguity of natural language. Generated text is optimised for linguistic fluency rather than task-relevant information density. Idioms, hedging, and vague referents add overhead; disagreements about role assignment or background knowledge can render entire messages irrecoverable.

In response, a new line of work, collectively called *latent communication*, has emerged. The core idea is to let agents exchange their continuous internal representations directly: embeddings at the input layer, hidden states from intermediate layers, or key–value (KV) caches from the attention mechanism. By skipping the language bottleneck, latent communication can preserve more information, save inference time, and avoid the failure modes of natural language. The downside is interpretability: the channel is opaque to humans and harder to inspect, debug, or align.

The field has grown explosively. The accompanying repository *Awesome-Latent-Communication* already tracks more than fifteen distinct methods, and the diversity of design choices is striking: some methods transmit embeddings, others transmit hidden states, still others transmit KV-caches. Some methods align the last layer of the sender to the first layer of the receiver; others align all layers. Some fuse information by concatenation; others by prepending, addition, or learned cross-attention. Some are training-free; others require distillation. A new researcher entering the area is therefore confronted by a fragmented landscape with no shared vocabulary.

##### Contributions\.

This paper introduces a *unified framework* that organises the literature along three orthogonal axes and uses it to systematically categorise eighteen representative works. Specifically:

- We propose a 3-axis decomposition, WHAT (types of communicated information), WHICH (sender–receiver alignment), and HOW (information fusion strategy), that uniquely determines the design space of any latent communication protocol.
- Under this framework, we analyse eighteen methods published between 2024 and 2026, summarise their key innovations, strengths, and limitations, and slot each into a unified comparison table.
- We extract five generalisable *takeaways* about the design trade-offs (e.g., “KV-cache carries more information than hidden states but is more architecture-dependent”) that we believe will inform future method design.
- We identify six open problems, including cross-architecture alignment, security of latent channels, and the unification of *latent communication* with *latent chain-of-thought*, that we expect to shape the next generation of research.

##### Organisation\.

The remainder of the paper is organised as follows. [Section 2](https://arxiv.org/html/2606.05711#S2) introduces preliminary concepts. [Section 3](https://arxiv.org/html/2606.05711#S3) makes the *case for latent communication* by quantifying the limitations of natural language. [Section 4](https://arxiv.org/html/2606.05711#S4) presents the unified framework along the WHAT / WHICH / HOW axes. [Section 5](https://arxiv.org/html/2606.05711#S5) walks through the eighteen representative methods under the framework. [Section 6](https://arxiv.org/html/2606.05711#S6) discusses the dominant *training-free* implementation paradigm. [Section 7](https://arxiv.org/html/2606.05711#S7) surveys empirical results. [Section 8](https://arxiv.org/html/2606.05711#S8) lays out open problems. [Section 9](https://arxiv.org/html/2606.05711#S9) relates latent communication to adjacent research areas. [Section 10](https://arxiv.org/html/2606.05711#S10) concludes.

## 2\. Background and Preliminaries

This section fixes the notation and terminology used throughout the paper\.

### 2\.1 Multi\-Agent LLM Systems

A *multi-agent LLM system* (MAS) consists of $N$ LLM agents $\mathcal{A} = \{A_1, A_2, \ldots, A_N\}$, each equipped with a role-specific system prompt, optional tool access, and a communication channel. At each step, an agent $A_i$ (the *sender*) produces a message that is delivered to one or more peer agents (the *receivers*). A controller, explicit or implicit, decides the order of speakers. The communication channel is the focus of this paper: classical systems use a *natural language channel* ([Section 2.2](https://arxiv.org/html/2606.05711#S2.SS2)); the methods surveyed in this paper use a *latent channel* ([Section 2.2](https://arxiv.org/html/2606.05711#S2.SS2)).

### 2\.2 Natural Language vs\. Latent Communication

- Natural Language Communication (NL-Comm). The sender generates a discrete token sequence $y = (y_1, y_2, \ldots, y_T)$ by sampling from a vocabulary $\mathcal{V}$. The receiver *re-encodes* the token sequence into its own embedding space. This decode-then-re-encode pipeline (*sender decode $\rightarrow$ token transport $\rightarrow$ receiver encode*) is what we refer to as the *language bottleneck*.
- Latent Communication (Latent-Comm). The sender exposes one of its internal continuous representations (the input embedding, the hidden state of a particular layer/token, or the KV-cache) and the receiver injects this representation into its own computation *without* round-tripping through the vocabulary.
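
The contrast between the two channels can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the sizes are tiny, and the random matrices `W_out` and `E_recv` merely stand in for a sender's output head and a receiver's embedding table. The sketch shows two different sender states collapsing to the same message under NL-Comm while staying distinct under Latent-Comm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 5                        # toy sizes, not taken from the paper
W_out = rng.normal(size=(d, vocab_size))    # hypothetical sender output head
E_recv = rng.normal(size=(vocab_size, d))   # hypothetical receiver embedding table

def nl_comm(h):
    """Sender greedily decodes h to one token id; receiver re-encodes that id."""
    token_id = int(np.argmax(h @ W_out))    # pass through the vocabulary
    return E_recv[token_id]

def latent_comm(h):
    """Sender hands over h unchanged (same-backbone case, no alignment step)."""
    return h

h1 = rng.normal(size=d)
h2 = 2.0 * h1   # a different state whose logits have the same ranking as h1's

same_text_message = np.array_equal(nl_comm(h1), nl_comm(h2))
same_latent_message = np.array_equal(latent_comm(h1), latent_comm(h2))
```

Here `same_text_message` is true and `same_latent_message` is false: the difference in magnitude between `h1` and `h2` does not survive the round trip through a single token.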

![Refer to caption](https://arxiv.org/html/2606.05711v1/figs/preliminary.png)

Figure 1: Comparison of natural-language and latent communication pipelines, including (*left*) a Transformer block with its accessible intermediate representations, (*top-right*) a comparison of token-level vs. hidden-state reasoning information density, and (*bottom-right*) the prefill/decode phases that produce per-token KV-caches.

A high-level comparison of the two pipelines is shown in [Figure 1](https://arxiv.org/html/2606.05711#S2.F1).

### 2\.3 Prefill and Decode

LLM inference is split into two phases that we will repeatedly refer to:

- Prefill phase. Given a prompt $x = (x_1, \ldots, x_T)$, the model processes the entire sequence in parallel and produces the first output token. All key–value pairs computed during prefill are stored in the KV-cache.
- Decode phase. The model generates one token at a time. At each step $t > T$, it takes the previously generated token $y_{t-1}$ and the cached KV from earlier steps, and produces a new token $y_t$ (and a new KV entry).

The distinction matters for latent communication because the *kind* of internal state available differs between the two phases. During prefill, the sender has access to per-token hidden states and KV-caches for *every* input token. During decode, the sender has only the hidden state of the most recently generated token plus an ever-growing KV-cache.
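
As a minimal sketch of the two phases, the following NumPy code implements a single attention head with a KV-cache. The weights are random and the helper names (`prefill`, `decode_step`, `attention_last_from_scratch`) are hypothetical; a real model has many layers and heads, positional encodings, and an output head, all omitted here. The point is only that one decode step over the cached keys and values reproduces what a full recomputation would give for the newest token.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # toy projections

def softmax(scores):
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

def attention_last_from_scratch(x_all):
    """Recompute keys/values for every token, then attend from the last one."""
    q = x_all[-1] @ Wq
    K, V = x_all @ Wk, x_all @ Wv
    return softmax(K @ q / np.sqrt(d)) @ V

def prefill(x_prompt):
    """Process the whole prompt in parallel and return its KV-cache."""
    return x_prompt @ Wk, x_prompt @ Wv

def decode_step(x_new, cache):
    """Append one token's key/value to the cache and attend over the cache."""
    K, V = cache
    K = np.vstack([K, x_new @ Wk])
    V = np.vstack([V, x_new @ Wv])
    q = x_new @ Wq
    return softmax(K @ q / np.sqrt(d)) @ V, (K, V)

prompt = rng.normal(size=(5, d))      # five prompt-token inputs
new_token = rng.normal(size=d)        # input for the first decode step
out_cached, cache = decode_step(new_token, prefill(prompt))
out_full = attention_last_from_scratch(np.vstack([prompt, new_token]))
```

`out_cached` and `out_full` agree up to floating-point error, and the cache grows by one entry per decoded token, which is why the cache (rather than per-token hidden states) is the state that persists through decoding.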

### 2\.4 Embedding, Hidden State, KV\-Cache, Activation

We adopt the following precise definitions, which the rest of the paper relies on:

Embedding. A continuous vector $\mathbf{e}_i \in \mathbb{R}^d$ that represents a discrete input symbol $x_i$ in a dense semantic space. Embeddings are the *input* to the first Transformer block.

Hidden state. The output of a complete Transformer block, denoted $\mathbf{h}_i^{(\ell)} \in \mathbb{R}^d$ for token $i$ at layer $\ell$. Hidden states are the *stable, layer-wise semantic representations* passed between adjacent Transformer blocks. When the receiver consumes a hidden state, it typically receives one of the intermediate-layer outputs.

KV-Cache. The collection of per-token key and value tensors computed in each self-attention layer during prefill, denoted $\mathcal{KV} = \{(\mathbf{k}_i^{(\ell)}, \mathbf{v}_i^{(\ell)})_{i=1}^{T}\}_{\ell=1}^{L}$. The KV-cache is what the model reuses to make decode efficient.

Activation. A more general term: any intermediate output of a sub-module (attention projection, feed-forward transformation, etc.). *Hidden states are a subset of activations that serve as stable layer-wise representations*. Methods that transmit “activations” often transmit a more granular quantity (e.g., a single attention head’s output) than methods that transmit “hidden states.”

A schematic of these quantities in a Transformer block is included in the left panel of [Figure 1](https://arxiv.org/html/2606.05711#S2.F1).

### 2\.5 Why Now?

Latent communication has become practical only recently\. Three enabling trends converged around 2023–2024:

1. Open-weight LLMs at scale. Methods that pipe a sender’s hidden state into a receiver’s forward pass require *white-box* access to both models. The release of Llama, Qwen, Mistral, and similar families has made such access routine.
2. KV-cache engineering. The KV-cache has gone from an implementation detail to a first-class optimisation target, with rich infrastructure for compression, sharing, and off-loading. Methods that transmit KV-caches piggy-back on this infrastructure.
3. Multi-agent frameworks. Frameworks like LangGraph, AutoGen, CrewAI, and MetaGPT have lowered the cost of orchestrating multiple LLM agents, making the *latent channel* itself a meaningful object of study rather than a curiosity.

## 3\. The Case for Latent Communication

Before diving into the framework, we articulate the case *for* and *against* latent communication. We argue that the trade-off is context-dependent: latent communication is preferable when (a) the agents are tightly coupled, (b) the cost of natural language overhead dominates, and (c) the channel can be made interpretable enough for downstream debugging.

### 3\.1 Limitations of Natural Language Communication

#### 3\.1\.1 High Inference Cost

Consider a two-agent system where agent $A_1$ produces a $T$-token message to agent $A_2$. The total cost is:

- $A_1$’s decode of $T$ tokens: $\mathcal{O}(L \cdot T \cdot d)$ FLOPs, where $L$ is the number of layers and $d$ is the hidden dimension. The KV-cache read/write is the dominant term.
- $A_2$’s re-encoding of $T$ tokens: the same $\mathcal{O}(L \cdot T \cdot d)$ FLOPs in prefill.
- The token-by-token transport itself: negligible.

So the *overhead* of natural language communication is roughly $2\times$ the cost of generating the message, even before accounting for $A_2$’s own reasoning. Latent communication can reduce this to a single embedding/hidden-state/KV-cache transport that the receiver injects *without re-encoding*.
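
The accounting above can be written out as a small calculation. This is only a sketch of the text's own $\mathcal{O}(L \cdot T \cdot d)$ proxy with constants dropped; the function names and the example sizes (32 layers, 100 tokens, hidden dimension 4096) are illustrative assumptions, and the result is a relative cost unit rather than a FLOP count.

```python
def per_pass_cost(num_layers: int, num_tokens: int, hidden_dim: int) -> int:
    """Cost proxy L * T * d for pushing T tokens through L layers (constants dropped)."""
    return num_layers * num_tokens * hidden_dim

def nl_comm_overhead(num_layers: int, num_tokens: int, hidden_dim: int) -> int:
    """Sender decode plus receiver re-encode; token transport is treated as free."""
    sender_decode = per_pass_cost(num_layers, num_tokens, hidden_dim)
    receiver_reencode = per_pass_cost(num_layers, num_tokens, hidden_dim)
    return sender_decode + receiver_reencode

# Illustrative sizes only (assumed, not reported in the paper).
generation_cost = per_pass_cost(32, 100, 4096)
overhead = nl_comm_overhead(32, 100, 4096)
overhead_ratio = overhead / generation_cost
```

Under this proxy `overhead_ratio` is exactly 2, matching the "roughly $2\times$" statement; the proxy says nothing about wall-clock time, where decode is sequential and prefill is parallel.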

#### 3\.1\.2 Information Loss During Discretization

The pipeline is

$$\mathbf{h}_{\text{context}} \xrightarrow{\text{linear}} \mathbf{z} \in \mathbb{R}^{V} \xrightarrow{\text{sample}} y \in \mathcal{V}, \tag{1}$$

where $\mathbf{h}_{\text{context}}$ is the sender’s high-dimensional hidden state, $\mathbf{z}$ is the logit vector over the vocabulary, and $y$ is the sampled token. The mutual information $I(\mathbf{h}_{\text{context}}; y)$ is upper-bounded by $H(y) \leq \log_2 |\mathcal{V}| \approx 15\text{–}17$ bits. Meanwhile, $\mathbf{h}_{\text{context}}$ itself typically lives in $\mathbb{R}^d$ with $d \geq 4096$ and is parameterised by 32-bit floats, so its *raw* representational capacity exceeds $40{,}000$ bits. The compression factor is therefore on the order of $10^3$–$10^4$.

Concretely: a hidden state encodes not just *which* token to say next, but also the *alternatives* considered, their *relative probabilities*, the *salience* of different parts of the context, and *uncertainty*. All of this is lost the moment we sample a single token. [Figure 2](https://arxiv.org/html/2606.05711#S3.F2) compares these information densities in panel (a) and the resulting communication pipelines in panel (b).
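
The orders of magnitude quoted above can be checked with a few lines of arithmetic. The vocabulary sizes below (32,000 and 128,000) are assumed examples chosen to bracket common tokenisers, not values from the paper, and the hidden-state figure is a raw storage count (dimension times bits per float), which is an upper bound on capacity rather than a measurement of the information a state actually carries.

```python
import math

def token_bits(vocab_size: int) -> float:
    """Upper bound on information in one sampled token: log2 of the vocabulary size."""
    return math.log2(vocab_size)

def hidden_state_raw_bits(hidden_dim: int, bits_per_float: int = 32) -> int:
    """Raw storage size of one hidden state vector, in bits."""
    return hidden_dim * bits_per_float

small_vocab_bits = token_bits(32_000)     # assumed example vocabulary size
large_vocab_bits = token_bits(128_000)    # assumed example vocabulary size
state_bits = hidden_state_raw_bits(4096)  # d = 4096 with 32-bit floats
compression_factor = state_bits / large_vocab_bits
```

With these inputs the per-token bound falls just under 15 and 17 bits respectively, the raw state size is 131,072 bits, and `compression_factor` lands between $10^3$ and $10^4$.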

![Refer to caption](https://arxiv.org/html/2606.05711v1/figs/F-InfoDensity.png)

(a) Information density: $\approx 15$ bits per token.

![Refer to caption](https://arxiv.org/html/2606.05711v1/figs/F-Pipeline-Compare.png)

(b) Pipeline comparison: NL-Comm vs. Latent-Comm.

Figure 2: Why latent communication wins on information density. *Left (a):* Bar chart comparing the information content of a discrete token ($\approx 15$ bits) with that of a single hidden state of the last token ($\approx 40{,}000$ bits). The gap of three to four orders of magnitude motivates the move to latent communication. *Right (b):* Pipeline comparison. NL-Comm routes a sender’s hidden state through a vocabulary bottleneck; Latent-Comm exchanges a continuous vector directly, preserving orders of magnitude more information per communication step.

#### 3\.1\.3 Redundancy and Ambiguity of Natural Language

Generated text is optimised for linguistic coherence (a stylistic objective from pre-training) rather than for *task-relevant information density*. Sentences are padded with politeness markers, hedging, and reformulation. References to prior context are often under-specified (“the previous step”, “that approach”), forcing the receiver to reconstruct the referent.

When sender and receiver disagree on background knowledge, role assignment, or terminology, the natural language channel can become lossy in a *semantic* sense that goes beyond the numerical bits/token argument. In contrast, latent channels operate on the agents’ own representational manifolds and avoid this kind of semantic mismatch, at the cost of interpretability.

### 3\.2 Advantages of Natural Language Communication

Latent communication is not a universal replacement\. Natural language retains one decisive advantage:

- High interpretability. A natural language message is immediately readable by humans. This is essential for *debugging*, *alignment auditing*, *safety review*, and *human–AI interaction*. Latent messages, in contrast, are opaque vectors that require auxiliary tooling to interpret.

In practice, the field has converged on a hybrid view: natural language for tasks where human oversight is needed (e.g., final answers, justifications) and latent communication for intermediate, agent-to-agent signalling. A schematic of this hybrid view is shown in [Figure 2](https://arxiv.org/html/2606.05711#S3.F2)(b).

### 3\.3 When to Prefer Latent Communication

Synthesising the above, latent communication tends to win when *all* of the following hold:

1. The two agents are *tightly coupled* (e.g., a planner feeding directly into an executor).
2. The communication is *intermediate*: the user does not need to see the message.
3. The sender and receiver share (or can be aligned to) a *common latent space* (e.g., same backbone, or compatible architectures).
4. *Latency* is a binding constraint (e.g., real-time pipelines, edge deployment, or large agent counts).

Conversely, natural language wins when interpretability, cross\-organisation interoperability, or human oversight is required\.

## 4\. A Unified Framework for Latent Communication

We now present the central contribution of this paper: a *unified framework* that organises all existing latent communication methods along three orthogonal axes. We claim that every latent communication method can be uniquely described by a triple:

$$\text{Method} = (\underbrace{\text{WHAT}}_{\text{type of information}},\ \underbrace{\text{WHICH}}_{\text{alignment}},\ \underbrace{\text{HOW}}_{\text{fusion}}). \tag{2}$$

The framework is summarised schematically in [Figure 3](https://arxiv.org/html/2606.05711#S4.F3).

![Refer to caption](https://arxiv.org/html/2606.05711v1/figs/F-Framework-Overview.png)

Figure 3: The unified 3-axis framework. The three axes, WHAT (types of communicated information), WHICH (sender–receiver alignment), and HOW (information fusion strategy), together span the design space of latent communication methods.

At a glance. WHERE does the information come from? WHAT is its format? WHICH layer/head in the receiver does it target? HOW is it combined? These three questions (WHAT / WHICH / HOW) uniquely determine any latent communication protocol.
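
One way to read the triple in Equation (2) is as a small record type. The sketch below is an illustration, not part of the paper: the class and member names are hypothetical, the enum members follow the options named in Sections 4.1 to 4.3, and the WHICH axis is simplified to layer alignment only (the paper's WHICH axis also covers latent-space alignment). The two example entries use the placements stated in those sections for CIPHER and KVComm.

```python
from dataclasses import dataclass
from enum import Enum

class What(Enum):
    EMBEDDING = "embedding"
    HIDDEN_STATE = "hidden state"
    KV_CACHE = "kv-cache"
    OTHER = "other"

class Which(Enum):  # simplified: layer alignment only
    LAST_TO_FIRST = "last -> first"
    ALL_TO_CORRESPONDING = "all -> corresponding"
    SELECTED_TO_SELECTED = "selected -> selected"

class How(Enum):
    CONCATENATION = "concatenation"
    PREPENDING = "prepending"
    MATH_OPERATION = "mathematical operation"
    CROSS_ATTENTION = "cross-attention"
    CACHE_RESTORATION = "cache restoration"

@dataclass(frozen=True)
class LatentProtocol:
    """A latent communication method described as a (WHAT, WHICH, HOW) triple."""
    name: str
    what: What
    which: Which
    how: How

cipher = LatentProtocol("CIPHER", What.EMBEDDING, Which.LAST_TO_FIRST, How.CONCATENATION)
kvcomm = LatentProtocol("KVComm", What.KV_CACHE, Which.SELECTED_TO_SELECTED, How.PREPENDING)
```

A record like this makes comparison tables mechanical: two methods differ exactly where their fields differ.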

### 4\.1 Axis 1 — WHAT: Types of Communicated Information

The first axis asks: *what continuous quantity does the sender expose to the receiver?* The dominant choices in the literature are Embeddings, Hidden States, and KV-Caches, with several methods exploring *other* quantities (state deltas, persistent memory, attention-only signals).

#### 4\.1\.1 Embeddings

The sender transmits its input embedding $\mathbf{e}_i \in \mathbb{R}^d$ for one or more tokens. Embeddings are the lowest-level continuous representation; they are model-agnostic in the sense that *any* model with a compatible embedding dimension can in principle consume them. CIPHER (Liu et al., [2024](https://arxiv.org/html/2606.05711#bib.bib4)) is the canonical example: it computes a *weighted* embedding where the weights are derived from the sender’s output logits, so that the embedding encodes the sender’s *full* vocabulary distribution rather than a single sampled token.

Strengths. Architecture-light (only the embedding table needs to be shared). Simple to implement. Robust to backbone changes.

Limitations. Embeddings are the *least informative* of the three options. They do not encode the agent’s intermediate computations or its attended context.
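
The weighted-embedding idea described for CIPHER can be sketched as a probability-weighted average of embedding rows. This is a minimal illustration under our own assumptions, not CIPHER's implementation: softmax is assumed as the way weights are derived from logits, the `temperature` parameter is added here for illustration, and the embedding table is a toy identity matrix so that the result is easy to read.

```python
import numpy as np

def expected_embedding(logits, embedding_table, temperature=1.0):
    """Average the embedding rows, weighted by softmax(logits / temperature)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return probs @ embedding_table

embedding_table = np.eye(3)                 # toy table: token i maps to basis vector i
logits = np.array([2.0, 1.0, 0.0])

soft_message = expected_embedding(logits, embedding_table)
hard_message = embedding_table[np.argmax(logits)]   # what a sampled token would carry
near_hard = expected_embedding(logits, embedding_table, temperature=1e-3)
```

`soft_message` keeps nonzero weight on all three tokens, whereas `hard_message` keeps only the top one; as the temperature approaches zero the soft message collapses to the hard one.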

#### 4\.1\.2 Hidden States

The sender transmits the hidden state $\mathbf{h}_i^{(\ell)}$ of token $i$ at layer $\ell$. Hidden states are richer than embeddings: they encode the agent’s *intermediate* reasoning, including the effect of attention over its context. AC (Ye et al., [2025](https://arxiv.org/html/2606.05711#bib.bib5)), Interlat (Du and others, [2026](https://arxiv.org/html/2606.05711#bib.bib6)), SDE (Yang et al., [2025](https://arxiv.org/html/2606.05711#bib.bib7)), ThoughtComm (Li and others, [2025](https://arxiv.org/html/2606.05711#bib.bib8)), and Mixture of Thoughts (Fein-Ashley et al., [2025](https://arxiv.org/html/2606.05711#bib.bib9)) all use hidden states as the communicated quantity.

Strengths. Encodes intermediate computation. Often training-free. Easy to align to the receiver’s first layer ([Section 4.2](https://arxiv.org/html/2606.05711#S4.SS2)).

Limitations. Less informative than the full KV-cache (it does not include the keys needed to attend back to earlier tokens). Architecture-dependent: the receiver must share a similar backbone.

#### 4\.1\.3 KV\-Caches

The sender transmits its per-token, per-layer KV-cache. The receiver can then *resume* generation as if it had pre-filled the sender’s context. KVComm (Wang and others, [2025b](https://arxiv.org/html/2606.05711#bib.bib10)), Cache-to-Cache (Liu and others, [2025](https://arxiv.org/html/2606.05711#bib.bib11)), LatentMAS (Wang and others, [2025c](https://arxiv.org/html/2606.05711#bib.bib12)), Q-KVComm (Park and others, [2025](https://arxiv.org/html/2606.05711#bib.bib13)), LRAgent (Jeon et al., [2026](https://arxiv.org/html/2606.05711#bib.bib14)), RelayCaching (Geng et al., [2026](https://arxiv.org/html/2606.05711#bib.bib15)), Agent Memory (Shkolnikov, [2026](https://arxiv.org/html/2606.05711#bib.bib16)), Agent Primitives (Jin et al., [2026](https://arxiv.org/html/2606.05711#bib.bib17)), and Edge LLM Handover (Lee et al., [2026](https://arxiv.org/html/2606.05711#bib.bib18)) all use KV-caches.

Strengths. Maximally informative (it contains the keys, values, and token positions needed for the receiver to attend over the sender’s context). Compatible with the existing KV-cache compression infrastructure.

Limitations. Largest payload (proportional to sequence length $\times$ number of layers $\times$ number of heads $\times$ head dimension). Most architecture-dependent: a KV-cache from a 4096-d Llama cannot be directly consumed by a 5120-d Qwen. Requires careful alignment across architectures.
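
The payload scaling can be made concrete with a back-of-the-envelope calculation. The configuration below (32 layers, 32 KV heads, head dimension 128, 16-bit values, a 1024-token context) is a hypothetical example rather than a setting from any surveyed method; the leading factor of 2 accounts for storing both keys and values.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a full KV-cache: keys and values for every token, layer, and head."""
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value

def hidden_state_bytes(hidden_dim, bytes_per_value=2):
    """Size of a single token's hidden state at one layer."""
    return hidden_dim * bytes_per_value

# Hypothetical configuration, chosen only to illustrate the scaling.
cache_size = kv_cache_bytes(seq_len=1024, num_layers=32, num_kv_heads=32, head_dim=128)
state_size = hidden_state_bytes(hidden_dim=4096)

cache_mib = cache_size / 2**20
payload_ratio = cache_size // state_size
```

For this configuration the cache is 512 MiB against 8 KiB for one last-token hidden state, which is the gap that compression and layer-selection techniques aim to close.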

#### 4\.1\.4 Other Communicated Quantities

A small but growing set of methods transmits *non-standard* quantities:

- State delta trajectory (SDE (Yang et al., [2025](https://arxiv.org/html/2606.05711#bib.bib7))): the *change* in hidden state at each layer, rather than the state itself. This compresses the information into a direction in latent space and has been shown to be more robust when sender and receiver architectures differ slightly.
- Persistent KV-cache memory (Agent Memory (Shkolnikov, [2026](https://arxiv.org/html/2606.05711#bib.bib16))): a disk-persistent 4-bit-quantised KV-cache, used to offload the cache to edge devices.
- Visual-latent wormhole (Vision Wormhole (Liu et al., [2026a](https://arxiv.org/html/2606.05711#bib.bib19))): a sender’s hidden state is *rendered* into a VLM’s visual input space, exploiting the VLM’s visual pathway as a universal channel.
- Centralised workspace state (BIGMAS (Hao et al., [2026](https://arxiv.org/html/2606.05711#bib.bib20))): a shared workspace in which agents deposit and read structured latent messages, mediated by an orchestrator.

#### 4\.1\.5 Comparative Summary

[Table 1](https://arxiv.org/html/2606.05711#S4.T1) summarises which information type each method uses.

Table 1: Types of communicated information used by representative methods (✓ = yes).

Takeaway 1: Information–Cost–Dependence Trade-off. There is a clear ordering along three dimensions: information richness (KV-Cache > Hidden State > Embedding); transport cost (KV-Cache > Hidden State > Embedding); and architecture dependence (KV-Cache > Hidden State > Embedding). A method’s position in this space is largely determined by its chosen WHAT.

Takeaway 2: Prefill vs. Decode Choice. KV-Cache methods are concentrated in the *prefill* phase (because the cache is naturally produced there), while Embedding/Hidden State methods are concentrated in the *decode* phase (because the last-token hidden state is what the model uses to predict the next token). When the receiver only consumes prefill-phase information, the *sender’s decode phase can be skipped entirely*, yielding a major inference speed-up; this is the key insight behind all KV-cache methods.

### 4\.2 Axis 2 — WHICH: Sender–Receiver Alignment

The second axis asks: *which parts of the sender correspond to which parts of the receiver?* Alignment has two sub-dimensions: *latent information alignment* (does the sender’s semantic space match the receiver’s?) and *layer alignment* (which layer of the sender feeds into which layer of the receiver?).

#### 4\.2\.1 Latent Information Alignment

If the sender and receiver are *the same model* (e.g., two instances of Llama-3-8B), their latent spaces are *identical by construction* and no alignment is needed. If they are *different* but architecturally compatible (e.g., two Llama-3 fine-tunes), the spaces are *close* but not identical; methods such as Interlat (Du and others, [2026](https://arxiv.org/html/2606.05711#bib.bib6)) and Cache-to-Cache (Liu and others, [2025](https://arxiv.org/html/2606.05711#bib.bib11)) apply *learned* projection heads to bridge the gap. If they are *architecturally heterogeneous* (e.g., Llama-3 and Qwen-2), a *Universal Visual Codec* (Liu et al., [2026a](https://arxiv.org/html/2606.05711#bib.bib19)) or a *learned interaction layer* (Fein-Ashley et al., [2025](https://arxiv.org/html/2606.05711#bib.bib9)) is needed.
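
To give a feel for what a learned projection does, the sketch below fits a linear map between two latent spaces by ordinary least squares. This is a deliberately simple stand-in and should not be read as the method of Interlat or Cache-to-Cache: the paired states are synthetic, the dimensions are toy values, and the "receiver" states are generated by a known linear map so that the fit can be verified exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_sender, d_receiver, num_pairs = 6, 4, 200   # toy sizes (assumed)

# Synthetic paired data: pretend each sender state has a known receiver counterpart.
true_map = rng.normal(size=(d_sender, d_receiver))
sender_states = rng.normal(size=(num_pairs, d_sender))
receiver_states = sender_states @ true_map

# Fit the projection W that minimises ||sender_states @ W - receiver_states||^2.
projection, *_ = np.linalg.lstsq(sender_states, receiver_states, rcond=None)

def align(sender_h):
    """Map a sender hidden state into the receiver's latent space."""
    return sender_h @ projection

aligned = align(sender_states)
```

Because the synthetic data are exactly linear and noise-free, `aligned` matches `receiver_states`; with real models the relationship is only approximately linear, which is why trained projection heads are used in practice.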

[Table 2](https://arxiv.org/html/2606.05711#S4.T2) indicates which methods perform explicit alignment.

Table 2: Methods performing explicit latent information alignment.

#### 4\.2\.2 Layer Alignment

The second sub-axis specifies the *layer-to-layer* correspondence between sender and receiver. Two natural extremes appear repeatedly:

- Last $\rightarrow$ First. The sender exposes the hidden state of its *last* layer, and the receiver injects it at its *first* layer. Used by CIPHER, AC, Interlat. This is the simplest mapping and works well when the sender’s last layer is the most semantically rich.
- All $\rightarrow$ Corresponding. The sender exposes the hidden state of *every* layer, and the receiver injects each one into the *corresponding* layer (i.e., layer $\ell$ of the sender feeds layer $\ell$ of the receiver). Used by Cache-to-Cache, LatentMAS, SDE. This preserves the layer-wise structure of the sender’s computation and is the natural choice for KV-cache methods.

Intermediate variants include:

- Selected $\rightarrow$ Selected. The sender selects $n \geq 1$ layers via a heuristic or learned gate, and the receiver injects them at the same indices. Used by AC, KVComm, Q-KVComm.
- Sparse top-$k$ attention. A sub-variant of selected $\rightarrow$ selected in which the receiver attends over only the top-$k$ most relevant layers (used by KVComm).
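
The layer-alignment strategies above can be expressed as mappings from sender layer index to receiver layer index. The sketch below uses 0-based indices and hypothetical function names; how layers are selected in the "selected" case (heuristic or learned gate) is outside its scope, so the selection is simply passed in.

```python
def last_to_first(num_sender_layers: int) -> dict[int, int]:
    """Sender's last layer feeds the receiver's first layer."""
    return {num_sender_layers - 1: 0}

def all_to_corresponding(num_sender_layers: int, num_receiver_layers: int) -> dict[int, int]:
    """Layer l of the sender feeds layer l of the receiver (requires equal depth)."""
    if num_sender_layers != num_receiver_layers:
        raise ValueError("all -> corresponding needs models of equal depth")
    return {layer: layer for layer in range(num_sender_layers)}

def selected_to_selected(selected_layers, num_receiver_layers: int) -> dict[int, int]:
    """Only the chosen sender layers are sent, each injected at the same index."""
    mapping = {}
    for layer in sorted(set(selected_layers)):
        if not 0 <= layer < num_receiver_layers:
            raise ValueError(f"layer {layer} does not exist in the receiver")
        mapping[layer] = layer
    return mapping

simple_map = last_to_first(32)
full_map = all_to_corresponding(4, 4)
sparse_map = selected_to_selected([20, 8, 20], num_receiver_layers=32)
```

Viewed this way, "all → corresponding" is the special case of "selected → selected" in which every layer is selected.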

Takeaway 3: Layer Mapping Strategies\. For *homogeneous* agents \(same backbone\), both the “last $\rightarrow$ first” and “all $\rightarrow$ corresponding” strategies are simple, training\-free, and competitive\. The “selected $\rightarrow$ selected” strategy adds complexity but can yield accuracy or latency gains when the agent has many layers and the relevant information is concentrated in a few\. For *heterogeneous* agents, learned alignment \(projection, universal codec, or interaction layers\) becomes necessary\.

### 4\.3 Axis 3 — HOW: Information Fusion Strategy

The third axis asks: *how is the communicated information incorporated into the receiver’s computation?* The major options are:

#### 4\.3\.1 Concatenation

The sender’s latent is concatenated with the receiver’s prompt embedding \(or hidden state\) along the token axis\. This is the simplest fusion and is used by CIPHER, Interlat, and several early hidden\-state methods\.
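
As a minimal sketch of token-axis concatenation, assuming both agents already share the same hidden width: the function name `concat_fusion` is hypothetical, and placing the sender latents before the prompt is an assumption of this example rather than something every method does.

```python
import numpy as np


def concat_fusion(sender_latent: np.ndarray, prompt_embeds: np.ndarray) -> np.ndarray:
    """Concatenate sender latents with receiver prompt embeddings along the token axis.

    sender_latent: (m, d) latent vectors received from the sender.
    prompt_embeds: (t, d) embeddings of the receiver's own prompt.
    Putting the sender latents first is an assumption of this sketch.
    """
    if sender_latent.shape[1] != prompt_embeds.shape[1]:
        raise ValueError("widths differ; latent-space alignment is needed first")
    return np.concatenate([sender_latent, prompt_embeds], axis=0)


rng = np.random.default_rng(0)
fused = concat_fusion(rng.standard_normal((1, 8)), rng.standard_normal((5, 8)))
```

The receiver then runs its usual forward pass over the `(m + t, d)` sequence, which is why this fusion needs no new parameters.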

#### 4\.3\.2 Prepending \(Token\-axis Prepend\)

The sender’s latent is *prepended* to the receiver’s KV\-cache\. This is the natural fusion for KV\-cache methods: the receiver can attend over the sender’s context as if it were the first few tokens of its own prompt\. Used by KVComm, LatentMAS, and others\.
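
A simplified sketch of KV prepending follows. The cache layout (one `(K, V)` pair per layer, each of shape `(num_heads, num_tokens, head_dim)`) and the helper names are assumptions of this example, and positional-encoding adjustments that a real system would need are deliberately ignored.

```python
import numpy as np


def prepend_kv(sender_cache, receiver_cache):
    """Prepend the sender's K/V to the receiver's K/V, layer by layer.

    Each cache is a list with one (K, V) pair per layer; K and V have shape
    (num_heads, num_tokens, head_dim). Position handling is out of scope here.
    """
    if len(sender_cache) != len(receiver_cache):
        raise ValueError("this sketch assumes equal layer counts")
    fused = []
    for (s_k, s_v), (r_k, r_v) in zip(sender_cache, receiver_cache):
        fused.append((np.concatenate([s_k, r_k], axis=1),
                      np.concatenate([s_v, r_v], axis=1)))
    return fused


def toy_cache(layers, heads, tokens, dim, fill):
    """Build a constant-valued cache so the two sources are easy to tell apart."""
    return [(np.full((heads, tokens, dim), fill), np.full((heads, tokens, dim), fill))
            for _ in range(layers)]


merged = prepend_kv(toy_cache(2, 4, 3, 16, 1.0), toy_cache(2, 4, 5, 16, 0.0))
```

After the call, the first three token positions of every layer hold the sender's entries and the remaining five hold the receiver's own.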

#### 4\.3\.3 Mathematical Operation

The sender’s latent is *combined* with the receiver’s hidden state \(or KV\-cache\) by a simple arithmetic operation: element\-wise addition or subtraction, optionally after a small learned linear projection\. Used by AC \(addition of last\-token hidden states\), SDE \(addition of state deltas\), and others\.
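
The arithmetic fusion can be sketched in a few lines. The function `additive_fusion`, its `sign` argument, and the toy projection matrix are hypothetical; in a trained system the projection would be learned rather than hand-written.

```python
import numpy as np


def additive_fusion(receiver_h, sender_h, projection=None, sign=1.0):
    """Combine sender and receiver states by addition (sign=+1) or subtraction (sign=-1).

    projection: optional (d_sender, d_receiver) matrix applied to the sender
    state first; learned in practice, supplied by hand in this sketch.
    """
    if projection is not None:
        sender_h = sender_h @ projection
    return receiver_h + sign * sender_h


# Same-width agents: plain element-wise addition.
added = additive_fusion(np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.5, 0.5]))

# Different widths: project the 2-d sender state into the 3-d receiver space.
toy_projection = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])
projected = additive_fusion(np.zeros(3), np.array([2.0, 4.0]), projection=toy_projection)
```

Without the projection, the two states must have identical shapes, which is the architectural-compatibility requirement this fusion family carries.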

#### 4\.3\.4 Cross\-Attention

The receiver attends over a *set* of sender latents using a learned cross\-attention layer\. Used by Mixture of Thoughts \(Fein\-Ashley et al\., [2025](https://arxiv.org/html/2606.05711#bib.bib9)\), where a primary expert attends over a top\-$K$ set of peer experts’ projected hidden states\.
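
For intuition, a single-head scaled dot-product cross-attention over $K$ peer latents might look like the sketch below. This is a generic attention layer, not the MoT architecture; the weight matrices are random stand-ins for parameters that would be learned, and all names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-d array."""
    z = np.exp(x - x.max())
    return z / z.sum()


def cross_attend(query_h, peer_h, w_q, w_k, w_v):
    """Single-head cross-attention of one receiver state over K peer latents.

    query_h: (d,) receiver hidden state; peer_h: (K, d) peer latents.
    w_q, w_k: (d, d_k); w_v: (d, d_v). Learned in practice, random here.
    """
    q = query_h @ w_q                                    # (d_k,)
    keys = peer_h @ w_k                                  # (K, d_k)
    values = peer_h @ w_v                                # (K, d_v)
    weights = softmax(keys @ q / np.sqrt(q.shape[0]))    # (K,)
    return weights @ values, weights


rng = np.random.default_rng(0)
d, d_k, num_peers = 16, 8, 3
out, attn = cross_attend(
    rng.standard_normal(d),
    rng.standard_normal((num_peers, d)),
    rng.standard_normal((d, d_k)),
    rng.standard_normal((d, d_k)),
    rng.standard_normal((d, d)),
)
```

The attention weights form a distribution over peers, so the receiver can weight each sender's contribution per query instead of mixing them uniformly.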

#### 4\.3\.5 Cache Restoration / Direct Injection

The receiver *replaces* part of its own KV\-cache with the sender’s KV\-cache\. Used by RelayCaching \(Geng et al\., [2026](https://arxiv.org/html/2606.05711#bib.bib15)\) and Agent Memory \(Shkolnikov, [2026](https://arxiv.org/html/2606.05711#bib.bib16)\), where the goal is to *avoid* recomputation rather than to mix information\.
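
A toy version of cache restoration, assuming a single layer stored as a `(num_tokens, dim)` array: the function `restore_cache` is hypothetical and omits the selective recomputation that methods such as RelayCaching add on top.

```python
import numpy as np


def restore_cache(own_cache: np.ndarray, donor_cache: np.ndarray, start: int) -> np.ndarray:
    """Overwrite a contiguous token span of own_cache with donor_cache entries.

    Both arrays have shape (num_tokens, dim); the input cache is not mutated.
    """
    end = start + donor_cache.shape[0]
    if end > own_cache.shape[0] or donor_cache.shape[1] != own_cache.shape[1]:
        raise ValueError("donor span does not fit inside the receiver cache")
    restored = own_cache.copy()
    restored[start:end] = donor_cache
    return restored


own = np.zeros((6, 4))
donor = np.ones((3, 4))
restored = restore_cache(own, donor, start=2)
```

Nothing is blended here: positions 2 to 4 are taken verbatim from the donor, which is what lets the receiver skip recomputing them.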

#### 4\.3\.6 Comparative Table

[Table 3](https://arxiv.org/html/2606.05711#S4.T3) lists the fusion strategy for each method\.

Table 3: Fusion strategies by method\.

Takeaway 4: Fusion Strategy Spectrum\. *Concatenation* and *prepending* are the most common, simplest, and often training\-free\. *Mathematical operations* \(addition, learned linear\) are slightly more expressive but require architectural compatibility\. *Cross\-attention* is the most expressive but requires training\. *Cache restoration* is the most efficient \(avoids recomputation entirely\) but is the most restrictive in scope\.

### 4\.4 Combining the Axes

The three axes are *orthogonal*: a method’s WHAT, WHICH, and HOW can be chosen largely independently\. This means the design space has a multiplicative rather than additive structure\. With three options for WHAT, three for WHICH, and five for HOW, there are $3 \times 3 \times 5 = 45$ conceptually distinct positions; the 18 methods surveyed in this paper occupy about 17 of them, suggesting the design space is not yet saturated\.
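
The multiplicative count can be checked by enumerating the grid. The option labels below are illustrative; in particular, taking the three WHICH options to be the three layer mappings of Section 4.2.2 is an assumption of this sketch.

```python
from itertools import product

# Illustrative labels; the three WHICH entries are an assumed reading of the text.
WHAT = ["embedding", "hidden_state", "kv_cache"]
WHICH = ["last_to_first", "all_to_corresponding", "selected_to_selected"]
HOW = ["concatenation", "prepending", "math_operation",
       "cross_attention", "cache_restoration"]

design_space = list(product(WHAT, WHICH, HOW))
num_positions = len(design_space)
print(num_positions)  # 45
```

Adding one option to any axis multiplies the number of positions instead of adding one, which is the sense in which the space is multiplicative.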

A bird’s\-eye view of how all 18 methods fit into the framework is given in [Figure 4](https://arxiv.org/html/2606.05711#S4.F4)\.

![Refer to caption](https://arxiv.org/html/2606.05711v1/figs/F-Method-Tree.png)

Figure 4: Method categorisation tree\. Each leaf corresponds to one of the 18 methods surveyed in this paper\. Green leaves are training\-free; orange leaves require training\.

## 5\. Method Analysis under the Framework

This section provides a one\-paragraph analysis for each of the eighteen methods, structured as: \(a\) core idea, \(b\) framework placement \(WHAT / WHICH / HOW\), \(c\) strengths, \(d\) limitations, and \(e\) reported results and code\. Methods are grouped by the WHAT axis for narrative flow\.

### 5\.1 Embedding\-Based Methods

CIPHER \(Liu et al\., [2024](https://arxiv.org/html/2606.05711#bib.bib4)\)\. Core idea: *CIPHER* is the first method to communicate *embeddings* rather than tokens\. The sender’s output *logits* over the vocabulary are converted to weights, and a weighted sum of the embedding table entries is computed, yielding a *soft embedding* that encodes the entire vocabulary distribution in a single vector\. The receiver concatenates this soft embedding to its own prompt at every decode step\. Framework: WHAT = weighted Embedding\. WHICH = last layer of sender $\rightarrow$ first layer of receiver\. HOW = Concatenation\. Strengths: Training\-free\. Backbone\-light \(only the embedding table must be shared, not the rest of the model\)\. Robust to model mismatches\. Limitations: Embeddings carry the *least* information of the three options\. Performance gains are modest compared to KV\-cache methods\. Results & code: Improves over token\-level multi\-agent debate on several reasoning and QA benchmarks\. ICLR 2024\. [Code](https://github.com/chaudatascience/cipher_multiagent_debate)\.
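
The soft-embedding computation described for CIPHER can be sketched as a softmax over logits followed by a weighted sum of embedding rows. This is an illustration of the idea rather than the authors' implementation; the function name and the `temperature` parameter are assumptions of this example.

```python
import numpy as np


def soft_embedding(logits: np.ndarray, embedding_table: np.ndarray,
                   temperature: float = 1.0) -> np.ndarray:
    """Weighted average of embedding rows, weighted by softmax(logits / temperature).

    logits: (vocab,); embedding_table: (vocab, d). Returns a (d,) vector.
    """
    scaled = logits / temperature
    weights = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ embedding_table


toy_table = np.eye(4)  # toy vocabulary of 4 tokens with one-hot embeddings
uniform = soft_embedding(np.zeros(4), toy_table)
peaked = soft_embedding(np.array([50.0, 0.0, 0.0, 0.0]), toy_table)
```

With flat logits the result is the mean of all embeddings, and with one dominant logit it collapses to that token's embedding, so ordinary token communication appears as a limiting case.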

### 5\.2 Hidden\-State\-Based Methods

AC: Communicating Activations \(Ye et al\., [2025](https://arxiv.org/html/2606.05711#bib.bib5)\)\. Core idea: AC has the sender transmit the hidden state of the *last token* at a *selected layer* \(typically a middle layer such as 16 in a 32\-layer model\)\. The receiver combines this hidden state with its own last\-token hidden state via a simple mathematical operation \(e\.g\., addition\)\. Framework: WHAT = Hidden State \(last token, selected layer\)\. WHICH = same layer in sender and receiver\. HOW = Mathematical operation\. Strengths: Training\-free\. Encodes intermediate reasoning\. Up to 27% accuracy improvement over natural language communication on a representative benchmark suite\. Limitations: Hidden state is not as informative as KV\-cache\. The choice of *which* layer to transmit requires heuristic tuning\. Results & code: Up to \+27% on math/reasoning benchmarks over NL\-Comm\. ICML 2025\.

Interlat \(Du et al\., [2026](https://arxiv.org/html/2606.05711#bib.bib6)\)\. Core idea: Interlat transmits the *last\-layer, last\-token* hidden state from sender to receiver and concatenates it with the receiver’s prompt embeddings\. The receiver then proceeds with normal prefill\. The authors introduce a small learned projection to align the sender’s and receiver’s last\-layer spaces when the agents are *different* fine\-tunes of the same backbone\. Framework: WHAT = Hidden State\. WHICH = last layer of sender $\rightarrow$ first layer of receiver, with optional learned projection\. HOW = Concatenation\. Strengths: Simple, training\-free for same\-model agents\. Up to $24\times$ speedup over NL\-Comm on long\-context multi\-agent tasks\. Limitations: Last\-layer / last\-token state is less informative than the full cache\. Performance depends on alignment quality\. Results & code: Up to $24\times$ latency reduction on long\-context tasks; competitive accuracy with NL\-Comm\. ACL 2026\. [Code](https://github.com/XiaoDu-flying/Interlat)\.

SDE: State Delta Trajectory \(Yang et al\., [2025](https://arxiv.org/html/2606.05711#bib.bib7)\)\. Core idea: SDE transmits not the hidden state itself, but the *change* in hidden state at each layer \(the “state delta”\) during a reasoning step\. The deltas form a trajectory in latent space that, in the receiver, is *added* to the corresponding layer/token of the receiver’s hidden state\. Framework: WHAT = State\-delta trajectory \(a non\-standard continuous quantity\)\. WHICH = all layers\. HOW = Mathematical operation \(addition\)\. Strengths: More robust to small architectural mismatches than raw hidden states\. SOTA on several complex reasoning benchmarks\. Limitations: The state\-delta representation is unconventional and has not been widely adopted\. Results & code: SOTA on complex reasoning relative to NL\-Comm baselines\. [Code](https://github.com/LittleDinoC/StateDelta)\.

ThoughtComm \(Li et al\., [2025](https://arxiv.org/html/2606.05711#bib.bib8)\)\. Core idea: ThoughtComm treats each agent’s *thought* \(intermediate hidden state\) as a first\-class message\. The sender exposes its current hidden state; the receiver combines it with its own current hidden state through a learnable gating mechanism\. Framework: WHAT = Hidden State\. WHICH = corresponding layer\. HOW = Math operation \(gated combination\)\. Strengths: Naturally aligned with the agent’s *internal* monologue\. Works across homogeneous backbones\. Limitations: No public code at the time of writing\.

Mixture of Thoughts \(MoT\) \(Fein\-Ashley et al\., [2025](https://arxiv.org/html/2606.05711#bib.bib9)\)\. Core idea: MoT is a *heterogeneous* latent communication method\. A router selects a top\-$K$ set of frozen LLM “experts” per query, and a *primary* expert performs cross\-attention over the projected hidden states of the active peers\. Crucially, the projection is done by *uniformly placed interaction layers*, which map each expert’s hidden space to a shared latent space\. This is the first method to support *cross\-architecture* hidden\-state communication without pairwise translators\. Framework: WHAT = Hidden State \(projected\)\. WHICH = learned interaction layers\. HOW = Cross\-attention\. Strengths: First latent communication method to natively support *heterogeneous* experts\. Single\-pass inference \(no iterative aggregation\)\. Beats prior SOTA on both in\-distribution and out\-of\-distribution benchmarks\. Limitations: Requires training the router and interaction layers\. Results & code: \+0\.38% on 5 ID benchmarks and \+2\.92% on 3 OOD benchmarks over prior SOTA \(Avengers\)\. [Code](https://github.com/jacobfa/mot)\.

### 5\.3 KV\-Cache\-Based Methods

KVComm \(Wang et al\., [2025b](https://arxiv.org/html/2606.05711#bib.bib10)\)\. Core idea: KVComm transmits a *selected subset* of the sender’s KV\-cache \(a few selected layers\) to the receiver\. Within the same layer index, the sender’s KV is *prepended* to the receiver’s KV\. A Gaussian\-prior\-based selection mechanism picks the most informative layers\. Framework: WHAT = KV\-Cache \(selected layers\)\. WHICH = selected $\rightarrow$ corresponding\. HOW = Prepend\. Strengths: Training\-free\. Reduces transmission cost relative to full\-cache methods\. Achieves strong latency improvements\. Limitations: Performance depends on the layer\-selection heuristic\. Results & code: Strong latency reduction on multi\-agent QA pipelines\. ICLR 2026\.

Cache\-to\-Cache \(C2C\) \(Liu et al\., [2025](https://arxiv.org/html/2606.05711#bib.bib11)\)\. Core idea: Cache\-to\-Cache is the first method to transmit the *entire* KV\-cache from sender to receiver\. All layers of the sender feed into the corresponding layers of the receiver\. A small *learned fuser* blends the two caches at each layer\. Framework: WHAT = KV\-Cache \(all layers\)\. WHICH = all $\rightarrow$ corresponding\. HOW = Mathematical operation \(learned fuser\)\. Strengths: Maximally informative\. Strong empirical results on multi\-agent reasoning\. Limitations: Highest transport cost among the surveyed methods\. Fuser must be trained\. Results & code: Significant accuracy gains on multi\-agent benchmarks\. [Code](https://github.com/thu-nics/C2C)\.

LatentMAS \(Wang et al\., [2025c](https://arxiv.org/html/2606.05711#bib.bib12)\)\. Core idea: LatentMAS extends Cache\-to\-Cache by *interleaving* prefill and decode: the sender exposes *both* the KV\-cache built during its prefill phase *and* the entries accumulated during its decode phase\. The receiver gets the full prefill\+decode KV\-cache, prepended at every layer\. Framework: WHAT = KV\-Cache \(prefill \+ decode\)\. WHICH = all $\rightarrow$ corresponding\. HOW = Prepend\. Strengths: Maximally informative\. Training\-free\. Strong results on collaborative reasoning\. Limitations: Largest transport cost\. Results & code: SOTA on collaborative reasoning benchmarks\. [Code](https://github.com/Gen-Verse/LatentMAS)\.

Q\-KVComm \(Park et al\., [2025](https://arxiv.org/html/2606.05711#bib.bib13)\)\. Core idea: Q\-KVComm compresses the sender’s KV\-cache using an *adaptive* quantisation scheme that achieves 5–6$\times$ compression while preserving semantic fidelity\. The compressed cache is then transmitted to the receiver and prepended at the corresponding layers\. Framework: WHAT = KV\-Cache \(compressed\)\. WHICH = all $\rightarrow$ corresponding\. HOW = Prepend\. Strengths: Lowers transport cost by 5–6$\times$\. Compatible with existing KV\-cache infrastructure\. Limitations: Quantisation introduces small semantic drift\. Results & code: 5–6$\times$ compression with negligible accuracy loss\.
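
To make the compression trade-off concrete, the sketch below applies plain per-row min-max uniform quantisation to a toy cache. This is a stand-in technique chosen for illustration, not Q-KVComm's adaptive scheme, and the function names are hypothetical.

```python
import numpy as np


def quantize(x: np.ndarray, bits: int = 4):
    """Per-row uniform (min-max) quantisation to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # avoid zero scale on flat rows
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo


def dequantize(codes: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    """Map integer codes back to approximate floating-point values."""
    return codes.astype(scale.dtype) * scale + lo


rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 64))          # toy cache slice: 8 rows of width 64
codes, scale, lo = quantize(kv, bits=4)
kv_hat = dequantize(codes, scale, lo)
max_err = np.abs(kv - kv_hat).max()
```

The reconstruction error of each entry is bounded by half a quantisation step of its row, which is the kind of drift the Limitations entry refers to. Note that the codes are held in `uint8` here for simplicity; a real 4-bit format would pack two codes per byte.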

LRAgent: Multi\-LoRA KV Sharing \(Jeon et al\., [2026](https://arxiv.org/html/2606.05711#bib.bib14)\)\. Core idea: LRAgent addresses the *multi\-LoRA* setting: when several agents share the same backbone but use different LoRA adapters, the *base* component of the KV\-cache is identical across agents, while the *adapter* component is small and low\-rank\. LRAgent shares the base component and stores the adapter component in low\-rank form\. A custom *Flash\-LoRA\-Attention* kernel reconstructs adapter contributions without materialising the full cache\. Framework: WHAT = KV\-Cache \(base \+ low\-rank adapter\)\. WHICH = same backbone\. HOW = Additive fusion inside fused kernel\. Strengths: Drastically reduces memory for multi\-LoRA agents\. Training\-free at inference\. Approximates fully\-shared caching throughput\. Limitations: Specific to multi\-LoRA setting\. Results & code: Memory overhead close to fully\-shared caching; accuracy close to non\-shared baseline\. ICML 2026\.
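
The base-plus-adapter split rests on a simple identity: with a LoRA-adapted key projection $W + AB$, the keys satisfy $X(W + AB) = XW + (XA)B$. The sketch below checks that identity numerically for a single projection, assuming the same layer input $X$ for every agent. It illustrates the algebra only and is not the Flash-LoRA-Attention kernel; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_model, d_head, rank = 6, 32, 16, 4

x = rng.standard_normal((tokens, d_model))       # layer input (assumed shared)
w_base = rng.standard_normal((d_model, d_head))  # frozen backbone key projection
lora_a = rng.standard_normal((d_model, rank))    # per-agent LoRA factor A
lora_b = rng.standard_normal((rank, d_head))     # per-agent LoRA factor B

# Keys computed with the merged weight.
full_keys = x @ (w_base + lora_a @ lora_b)

# Same keys, split into a shareable base part and a low-rank per-agent part.
base_keys = x @ w_base            # (tokens, d_head), identical for every adapter
adapter_low_rank = x @ lora_a     # (tokens, rank), small because rank << d_head
reconstructed = base_keys + adapter_low_rank @ lora_b
```

Per agent, only the `(tokens, rank)` factor needs to be stored alongside the shared base, instead of a full `(tokens, d_head)` key tensor.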

RelayCaching \(Geng et al\., [2026](https://arxiv.org/html/2606.05711#bib.bib15)\)\. Core idea: RelayCaching observes that when an agent’s *decoded* output becomes part of a downstream agent’s *prompt*, the *decoding\-phase* KV\-cache of the upstream agent is highly consistent with the *prefill\-phase* KV\-cache that the downstream agent would have computed\. RelayCaching directly *transplants* the upstream decoding cache into the downstream prefill, with sparse selective recomputation at the few affected layers/positions\. Framework: WHAT = KV\-Cache \(decoding\-phase\)\. WHICH = same model\. HOW = Cache restoration \+ sparse recomputation\. Strengths: Training\-free\. $>80\%$ cache reuse\. Up to $4.7\times$ TTFT reduction\. Limitations: Same\-model assumption; deviations at the boundary require recomputation\. Results & code: 80%\+ cache reuse, $4.7\times$ TTFT speedup on math, code, and general knowledge tasks\.

Agent Memory \(Shkolnikov, [2026](https://arxiv.org/html/2606.05711#bib.bib16)\)\. Core idea: Agent Memory targets *edge devices* with limited RAM\. Each agent’s KV\-cache is persisted to disk in 4\-bit quantised form \(safetensors\) and reloaded into attention layers on demand, eliminating the $\sim$15\.7 s/agent re\-prefill cost at 4K context\. Framework: WHAT = KV\-Cache \(Q4 quantised, disk\-persistent\)\. WHICH = same agent across phases\. HOW = Cache restoration \+ cross\-phase context injection\. Strengths: Frees up RAM; enables multi\-agent inference on edge\. Up to $136\times$ TTFT speedup\. Limitations: Quantisation introduces perplexity drift \(Llama \+2\.8%, DeepSeek \+3\.0%, Gemma $-0.7\%$\)\. Results & code: TTFT speedups: Gemma 3 12B $22\times$–$136\times$; DeepSeek\-Coder\-V2\-Lite 16B $11\times$–$76\times$; Llama 3\.1 8B $24\times$–$111\times$\. [Code](https://github.com/yshk-mxim/agent-memory)\.

Agent Primitives \(Jin et al\., [2026](https://arxiv.org/html/2606.05711#bib.bib17)\)\. Core idea: Agent Primitives decomposes a multi\-agent system into a small library of *reusable latent primitives* \(e\.g\., Review; Voting and Selection; Planning and Execution\)\. Intra\-primitive messaging uses shared KV\-cache rather than natural language\. An *organiser agent* composes primitives per query, and a *knowledge pool* stores previously successful configurations\. Framework: WHAT = KV\-Cache \(intra\-primitive\)\. WHICH = same backbone within primitive\. HOW = Primitive chaining\. Strengths: Modular; reuses successful configurations\. \+12\.0–16\.5% average accuracy over single\-agent baselines\. 3–4$\times$ lower token usage and latency than text\-based MAS\. Only 1\.3–1\.6$\times$ overhead vs\. single\-agent inference\. Limitations: Specific to MAS architectures built around the proposed primitives\.

Edge LLM Handover \(Lee et al\., [2026](https://arxiv.org/html/2606.05711#bib.bib18)\)\. Core idea: Edge LLM Handover addresses the *mobility* setting: when a user equipment \(UE\) hands over between edge base stations during an LLM session, the system jointly optimises *how much context to re\-prefill from raw tokens* vs\. *how much KV\-cache to transfer over the backhaul*, minimising worst\-case handover delay\. Framework: WHAT = KV\-Cache \(transferred over backhaul\)\. WHICH = same edge LLM\. HOW = Hybrid: partial re\-prefill \+ partial KV transfer\. Strengths: Tractable, step\-wise solution; constructive multi\-UE rate\-scheduling policy\. Outperforms baselines across a wide range of backhaul capacities, prefill speeds, and context sizes\. Limitations: Simulation\-only evaluation\.
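
The re-prefill versus transfer trade-off can be illustrated with a toy single-user delay model. Everything in this sketch is an assumption made for illustration, not the paper's formulation: re-prefill and KV transfer are taken to run in parallel, delay is the slower of the two, and the function names and all numeric parameters are hypothetical.

```python
def handover_delay(n_reprefill: int, total_tokens: int, prefill_tokens_per_s: float,
                   kv_bytes_per_token: float, backhaul_bytes_per_s: float) -> float:
    """Toy delay: re-prefill n tokens while transferring KV for the rest in parallel."""
    prefill_time = n_reprefill / prefill_tokens_per_s
    transfer_time = (total_tokens - n_reprefill) * kv_bytes_per_token / backhaul_bytes_per_s
    return max(prefill_time, transfer_time)


def best_split(total_tokens: int, prefill_tokens_per_s: float,
               kv_bytes_per_token: float, backhaul_bytes_per_s: float) -> int:
    """Brute-force the number of re-prefilled tokens that minimises the toy delay."""
    return min(
        range(total_tokens + 1),
        key=lambda n: handover_delay(n, total_tokens, prefill_tokens_per_s,
                                     kv_bytes_per_token, backhaul_bytes_per_s),
    )


# Hypothetical numbers: 4000-token context, 2000 tok/s prefill,
# 100 kB of KV per token, 100 MB/s backhaul.
params = (4000, 2000.0, 100_000, 1e8)
best_n = best_split(*params)
best_delay = handover_delay(best_n, *params)
```

Under these assumed numbers, the best split is a mix of the two mechanisms and beats both pure re-prefill and pure transfer, which is the qualitative point of jointly optimising them.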

### 5\.4 Hybrid / Heterogeneous Methods

Vision Wormhole \(Liu et al\., [2026a](https://arxiv.org/html/2606.05711#bib.bib19)\)\. Core idea: Vision Wormhole reconceptualises the *visual interface* of a VLM as a *continuous communication channel*\. A sender’s reasoning trace is encoded into a shared continuous reference space \(the *Universal Visual Codec*, UVC\) and injected into the receiver’s *visual pathway*, bypassing tokenisation\. The hub\-and\-spoke topology reduces the alignment cost from $\mathcal{O}(N^{2})$ pairwise translators to $\mathcal{O}(N)$ encoders/decoders, enabling cross\-architecture latent transfer across disjoint model manifolds\. Framework: WHAT = Hidden State \(in UVC\)\. WHICH = hub\-and\-spoke via UVC\. HOW = Visual injection\. Strengths: First method to support *fully heterogeneous* cross\-architecture latent communication\. Tested on Qwen\-VL, Gemma, SmolVLM2, LFM2\.5\-VL across nine reasoning benchmarks\. Limitations: Requires label\-free distillation training\. Code in progress\. Results & code: Reduces end\-to\-end wall\-clock time in most settings; positive macro\-average $\Delta$\-accuracy\.
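
The scaling argument can be made concrete by counting modules. The sketch assumes one directed translator per ordered pair of models in the pairwise design, and one encoder plus one decoder per model in the hub-and-spoke design; both counting conventions are assumptions of this example.

```python
def pairwise_translators(num_models: int) -> int:
    """One directed translator for every ordered pair of distinct models."""
    return num_models * (num_models - 1)


def hub_and_spoke_modules(num_models: int) -> int:
    """One encoder into the shared space and one decoder out of it per model."""
    return 2 * num_models


counts = {n: (pairwise_translators(n), hub_and_spoke_modules(n)) for n in (4, 10, 50)}
```

The gap widens quickly: adding one model to a pool of $N$ requires $2N$ new pairwise translators but only two new hub modules.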

BIGMAS: Brain\-Inspired Graph MAS \(Hao et al\., [2026](https://arxiv.org/html/2606.05711#bib.bib20)\)\. Core idea: BIGMAS organises specialised LLM agents as nodes in a *dynamically constructed directed graph*\. A *GraphDesigner* builds the topology per problem, and an *Orchestrator* mediates access to a *centralised shared workspace*\. The architecture is inspired by the *global workspace theory* of human cognition\. Framework: WHAT = Shared workspace contents \(hybrid latent/text\)\. WHICH = common message\-space contract\. HOW = Global workspace fusion\. Strengths: Topology adapts to the problem\. Centralised workspace avoids the local\-view bottleneck of pairwise communication\. Limitations: Specific to graph\-structured MAS\. The exact storage format of the workspace contents is not explicitly specified\. Results & code: Outperforms ReAct and Tree of Thoughts on Game24, Six Fives, and Tower of London with six frontier LLMs \(standard \+ LRMs\)\. Gains are orthogonal to model\-level reasoning improvements\.

### 5\.5 Survey and Aggregation Works

The Five Ws of Multi\-Agent Communication \(Chen et al\., [2026](https://arxiv.org/html/2606.05711#bib.bib21)\)\. Core idea: *The Five Ws Survey* unifies MARL, Emergent Language \(EL\), and LLM\-based multi\-agent communication under a single “Five Ws” \(*Who*, *Whom*, *When*, *What*, *Why*\) taxonomy\. It surveys hand\-designed protocols, end\-to\-end learned communication, emergent symbolic communication, and natural\-language priors\. Value for this paper: A meta\-survey that contextualises our framework\. The WHAT axis in the Five Ws survey corresponds to our WHAT axis; the WHEN and WHY axes are unique to that framework, providing the broader *communication theory* context in which our framework sits\. TMLR 2026\.

## 6\. Implementation: The Training\-Free Paradigm

A striking observation across the 18 surveyed methods is that *most are training\-free*\. The training\-free property is a major advantage in practice: it means the methods can be deployed on top of any pre\-trained LLM with no additional data, no GPU hours, and no risk of catastrophic forgetting\.

[Table 4](https://arxiv.org/html/2606.05711#S6.T4) lists the training regime for each method\.

Table 4: Training regime by method\. \($\checkmark$ = training\-free; $\bullet$ = training required\.\)

##### Why training\-free dominates\.

Training\-free methods have three structural advantages:

1. Composability\. They can be applied on top of any pre\-trained LLM, including new releases, without re\-training\.
2. No data requirement\. They do not need parallel latent–text corpora, which are expensive to construct\.
3. Robustness\. They cannot suffer from distribution shift between training and deployment\.

##### When training helps\.

Training becomes necessary when:

- The sender and receiver are *architecturally heterogeneous* and no hand\-designed alignment works \(e\.g\., Vision Wormhole, MoT\)\.
- The fusion function is *non\-trivial* and cannot be expressed as concatenation or addition \(e\.g\., Cache\-to\-Cache’s learned fuser\)\.
- The system needs to *learn a routing policy* over a large set of agents \(e\.g\., MoT’s router\)\. In general, choosing *which* subtask to assign to *which* agent is an adaptive selection problem in its own right, and a growing body of work studies the empirical design space of such selection strategies for LLM\-based systems \(Liu et al\., [2025](https://arxiv.org/html/2606.05711#bib.bib29)\)\.

The field appears to be converging on a hybrid: *training\-free* WHAT/WHICH axes combined with *lightweight training* on a small adapter for HOW \(fusion\)\. We expect this pattern to continue\.

## 7\. Benchmark Analysis and Empirical Insights

This section synthesises reported results from the 18 methods\. Direct cross\-method comparison is challenging because the methods use different backbones, benchmarks, and reporting conventions; we therefore focus on *trends* and *order\-of\-magnitude* effects rather than head\-to\-head numbers\.

### 7\.1 Benchmark Suites

Methods are typically evaluated on a mix of:

- Math reasoning: GSM8K, MATH, AIME\.
- General knowledge: MMLU, ARC\.
- Code generation: HumanEval, MBPP, LiveCodeBench\.
- Multi\-modal reasoning: MathVista, MMMU, ChartQA\.
- Agentic QA: HotpotQA, 2WikiMultiHopQA, MuSiQue\.
- Game\-like reasoning: Game24, Six Fives, Tower of London\.
- Competitive tabletop games: Mahjong, *Uno*, *Honor of Kings*\. These are increasingly used as testbeds for evaluating inter\-agent coordination under partial observability, for which dedicated toolkits such as RainbowArena \(Liu et al\., [2026b](https://arxiv.org/html/2606.05711#bib.bib30)\) provide standardised APIs, opponent pools, and replay infrastructure\.

### 7\.2 Reported Quantitative Trends

- Latency reduction\. Latent communication methods consistently reduce latency by 2–24$\times$ relative to NL\-Comm\. The largest gains \($24\times$\) are reported by Interlat on long\-context multi\-agent tasks\.
- TTFT speedups\. KV\-cache\-based methods \(RelayCaching, Agent Memory\) report TTFT speedups of $4.7\times$–$136\times$ relative to full re\-prefill\.
- Token savings\. Latent methods typically reduce tokens generated by 3–4$\times$ \(Agent Primitives reports 3–4$\times$ lower token usage vs\. text\-based MAS\)\.
- Accuracy\. Most methods report accuracy competitive with or better than NL\-Comm baselines; SOTA gains are reported by SDE on complex reasoning, MoT on ID/OOD benchmarks, and LatentMAS on collaborative reasoning\.

A schematic comparison of representative methods on the trade\-off dimensions \(accuracy, latency, generality, engineering complexity\) is given in [Table 5](https://arxiv.org/html/2606.05711#S7.T5)\.

Table 5: Trade\-off profile of representative methods across four design dimensions\. Symbols: $\star\star\star$ = excellent, $\star\star$ = good, $\star$ = fair\.
### 7\.3 Insights

Three empirical patterns stand out:

1. Long context benefits most\. The latency advantage of latent communication grows with context length, because the receiver *avoids* re\-encoding a long prompt\.
2. Same\-model agents dominate\. Most methods assume sender and receiver share the same backbone; cross\-architecture methods \(Vision Wormhole, MoT\) are still rare and require training\.
3. KV\-cache is the emerging default\. Of the 18 methods, 9 use KV\-caches as the communicated quantity, and the share is growing\.

## 8\. Open Problems and Future Directions

Latent communication is a young field\. We identify six open problems that we expect to shape the next generation of research\. A mind\-map view is given in [Figure 5](https://arxiv.org/html/2606.05711#S8.F5)\.

![Refer to caption](https://arxiv.org/html/2606.05711v1/figs/F-Open-Problems-Map.png)

Figure 5: Mind map of six open problems in latent communication\. Each branch represents a research direction with significant open questions\.

### 8\.1 Cross\-Architecture Alignment

Most existing methods assume *homogeneous* agents \(same backbone\)\. The few methods that support *heterogeneous* agents \(Vision Wormhole, MoT\) require training a learned alignment module for each participating architecture\. A general, training\-free, $\mathcal{O}(N)$ cross\-architecture alignment method remains elusive\.

### 8\.2 Security and Robustness

A latent channel is *opaque*: there is no natural language to inspect for adversarial content\. An attacker who controls the sender could embed adversarial perturbations in the hidden state that, while not affecting the receiver’s output text, cause the receiver to *behave* maliciously\. Conversely, a compromised receiver could exfiltrate the sender’s hidden state\. We see almost no work on *security* of latent channels and consider this a critical gap\.

### 8\.3 Compression and Quantisation

KV\-cache methods are bottlenecked by the size of the cache they transmit\. Q\-KVComm and Agent Memory are the first methods to attack this problem with *quantisation* \(adaptive in Q\-KVComm, fixed 4\-bit in Agent Memory\), but the design space is large\. We expect *learned* compression, *token\-level* compression, and *layer\-level* compression to be active research areas\.

### 8\.4 Theoretical Understanding

The field is largely empirical\. We lack a *theoretical* account of when latent communication should outperform natural language, how much information is actually transmitted, and what the upper bound on speedup is\. Information\-theoretic and learning\-theoretic analyses are an open opportunity\.

### 8\.5 Latent Communication vs\. Latent CoT

*Latent CoT* is the practice of performing chain\-of\-thought reasoning *in latent space* within a *single* agent \(e\.g\., Coconut, LatentSeek\)\. *Latent Communication* is the practice of exchanging latent messages *between* agents\. The two share machinery \(KV\-cache reasoning, hidden\-state deltas\) but differ in *where* the latents flow: within one model or between two\. A unified framework that handles both directions of latent flow is a natural next step\. We discuss this in more detail in [Section 9](https://arxiv.org/html/2606.05711#S9)\.

### 8\.6 Real\-World Deployment

Edge devices, mobile phones, and embedded systems have very different constraints from data centres\. The KV\-cache sharing techniques surveyed here \(LRAgent, Agent Memory, Edge LLM Handover\) are early steps in this direction, but we expect the *battery, memory, and bandwidth* constraints of edge deployment to drive significant new research\.

## 9\. Related Work

Latent communication sits at the intersection of four research areas: latent chain\-of\-thought, multi\-agent reinforcement learning, emergent language, and KV\-cache compression\. We briefly situate our framework within each\.

### 9\.1 Latent Chain\-of\-Thought

*Latent CoT* is the practice of reasoning in continuous latent space within a single LLM\. Representative works include Coconut \(Hao et al\., [2024](https://arxiv.org/html/2606.05711#bib.bib22)\), LatentSeek \(Wang et al\., [2025a](https://arxiv.org/html/2606.05711#bib.bib23)\), and the awesome\-lists \(Awesome Latent Space Contributors, [2024](https://arxiv.org/html/2606.05711#bib.bib24); EIT\-NLP Contributors, [2025](https://arxiv.org/html/2606.05711#bib.bib25)\)\. The relationship to latent communication is summarised in [Figure 6](https://arxiv.org/html/2606.05711#S9.F6)\.

![Refer to caption](https://arxiv.org/html/2606.05711v1/figs/F-LC-vs-LCoT.png)

Figure 6: Venn diagram showing the overlap and distinction between Latent Communication \(multi\-agent\) and Latent CoT \(single\-agent\)\. The two share machinery but differ in the direction of latent flow\.

The two share machinery: both manipulate hidden states, KV\-caches, and state deltas\. They differ in *where* the latents flow: within one model \(Latent CoT\) or between two \(Latent Communication\)\. A unified framework that handles both directions of flow is a promising direction\.

### 9\.2 Multi\-Agent Reinforcement Learning \(MARL\)

MARL has a long tradition of *learned communication* between agents \(Foerster et al\., [2016](https://arxiv.org/html/2606.05711#bib.bib26)\)\. Methods such as CommNet, TarMAC, and IC3Net learn continuous message vectors that are exchanged between agents\. The modern LLM\-based latent communication methods surveyed in this paper can be seen as a *white\-box, inference\-time* counterpart to MARL’s *learned* communication\. The Five Ws survey \(Chen et al\., [2026](https://arxiv.org/html/2606.05711#bib.bib21)\) is a recent effort to bridge the two\. On the empirical side, toolkits such as RainbowArena \(Liu et al\., [2026b](https://arxiv.org/html/2606.05711#bib.bib30)\) provide standardised tabletop\-game environments where both RL and LLM\-based agents can be evaluated under controlled multi\-agent conditions, offering a natural experimental bridge between the MARL and LLM\-MAS communities\.

### 9\.3 Emergent Language

*Emergent language* studies the symbolic protocols that arise when agents are trained to communicate \(Lazaridou and Baroni, [2017](https://arxiv.org/html/2606.05711#bib.bib27)\)\. The protocols are typically *discrete* \(unlike our continuous latents\), but the underlying question, *what should agents communicate?*, is the same\.

### 9\.4 KV\-Cache Compression and Sharing

A large body of work optimises the KV\-cache for *single\-model* inference: H2O \(Zhang et al\., [2023](https://arxiv.org/html/2606.05711#bib.bib28)\), Scissorhands, KIVI, KVQuant, and others\. The methods surveyed in this paper \(KVComm, Cache\-to\-Cache, LatentMAS, Q\-KVComm, etc\.\) extend this line of work to the *multi\-agent* setting, where the cache is shared *between* agents\.

## 10\. Conclusion

We have presented a unified framework for latent communication in LLM\-based multi\-agent systems\. The framework organises 18 representative methods along three orthogonal axes: WHAT \(types of communicated information: Embeddings, Hidden States, KV\-Caches, and others\), WHICH \(sender–receiver alignment: latent information alignment and layer alignment\), and HOW \(information fusion strategy: concatenation, prepending, mathematical operations, cross\-attention, and cache restoration\)\. The framework exposes five generalisable takeaways, surfaces six open problems, and bridges latent communication to adjacent research areas including latent CoT, MARL, emergent language, and KV\-cache compression\.

We hope this framework provides a shared vocabulary for the rapidly growing latent communication community and lowers the barrier to entry for new researchers\. The field is moving fast — we expect the next 12–18 months to bring new methods, new benchmarks, and \(hopefully\) a theoretical understanding of when and why latent communication outperforms natural language\.

##### Reproducibility\.

All figures in this paper are derived from publicly available method illustrations \(referenced inline\)\. The companion repository at [https://github\.com/enochliu98/Awesome\-Latent\-Communication](https://github.com/enochliu98/Awesome-Latent-Communication) is continuously updated with new papers, code links, and reproducible figure prompts\.

##### Acknowledgements\.

We thank the authors of the 18 surveyed methods for making their code and data publicly available, and the open\-source community for curating the companion awesome\-list\. We also thank the anonymous reviewers for their constructive feedback\.

## References

- Awesome latent space\. Note: [https://github\.com/YU\-deep/Awesome\-Latent\-Space](https://github.com/YU-deep/Awesome-Latent-Space)\.
- J\. Chen, H\. Yang, Z\. Liu, and C\. Joe\-Wong \(2026\) The five Ws of multi\-agent communication: who talks to whom, when, what, and why — a survey from MARL to emergent language and LLMs\. arXiv preprint arXiv:2602\.11583\. Note: Accepted at TMLR 2026\.
- X\. Du et al\. \(2026\) Enabling agents to communicate entirely in latent space\. arXiv preprint arXiv:2511\.09149\. Note: Accepted at ACL 2026\.
- EIT\-NLP Contributors \(2025\) Awesome latent CoT\. Note: [https://github\.com/EIT\-NLP/Awesome\-Latent\-CoT](https://github.com/EIT-NLP/Awesome-Latent-CoT)\.
- J\. Fein\-Ashley, D\. Parikh, R\. Kannan, and V\. Prasanna \(2025\) Mixture of thoughts: learning to aggregate what experts think, not just what they say\. arXiv preprint arXiv:2509\.21164\.
- J\. Foerster, Y\. M\. Assael, N\. de Freitas, and S\. Whiteson \(2016\) Learning to communicate with deep multi\-agent reinforcement learning\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- Y\. Geng, Y\. Gao, W\. Wu, G\. Liu, and J\. Liu \(2026\) RelayCaching: accelerating LLM collaboration via decoding KV cache reuse\. arXiv preprint arXiv:2603\.13289\.
- G\. Hao, Y\. Dai, X\. Qin, and S\. Yu \(2026\) Brain\-inspired graph multi\-agent systems for LLM reasoning\. arXiv preprint arXiv:2603\.15371\.
- S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. Weston, and Y\. Tian \(2024\) Reasoning in latent space: an unconstrained chain\-of\-thought\. arXiv preprint arXiv:2412\.06769\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, C\. Zhang, J\. Wang, Z\. Wang, S\. K\. S\. Yau, et al\. \(2023\) MetaGPT: meta programming for a multi\-agent collaborative framework\. arXiv preprint arXiv:2308\.00352\.
- H\. Jeon, H\. Ha, and J\. Kim \(2026\) LRAgent: efficient KV cache sharing for multi\-LoRA LLM agents\. arXiv preprint arXiv:2602\.01053\. Note: ICML 2026\.
- H\. Jin, P\. Kuang, Y\. Yu, X\. Yuan, and H\. Wang \(2026\) Agent primitives: reusable latent building blocks for multi\-agent systems\. arXiv preprint arXiv:2602\.03695\.
- A\. Lazaridou and M\. Baroni \(2017\) Emergent multi\-agent communication in deep reinforcement learning\. arXiv preprint arXiv:1706\.02295\.
- S\. Lee, J\. Park, C\. Zheng, and H\. Park \(2026\) Low\-latency edge LLM handover via joint KV cache transfer and token prefill\. arXiv preprint arXiv:2603\.28018\.
- G\. Li, H\. A\. A\. K\. Hammoud, H\. Itani, D\. Khizbullin, and B\. Ghanem \(2023\) CAMEL: communicative agents for “mind” exploration of LLM society\. arXiv preprint arXiv:2303\.17760\.
- M\. Li et al\. \(2025\) Thought communication in multiagent collaboration\. arXiv preprint arXiv:2510\.20733\.
- C\. Liu, X\. Dou, L\. Wu, H\. Zhang, Y\. Zhao, Y\. Li, B\. Li, S\. Wang, D\. F\. Wong, et al\. \(2024\) Let models speak ciphers: multiagent debate through embeddings\. In International Conference on Learning Representations \(ICLR\)\. External Links: [Link](https://arxiv.org/abs/2310.06272)\.
- J\. Liu et al\. \(2025\) Cache\-to\-cache: direct semantic communication between large language models\. arXiv preprint arXiv:2510\.03215\.
- S\. Liu, Y\. Liu, Z\. Wang, Y\. Wang, H\. Wu, L\. Xiang, and Z\. He \(2025\) Select\-then\-decompose: from empirical analysis to adaptive selection strategy for task decomposition in large language models\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 5454–5477\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.278)\.
- X\. Liu, R\. Zhang, W\. Yu, S\. Xiong, L\. He, F\. Wu, H\. Jung, M\. Fredrikson, X\. Wang, and J\. Gao \(2026a\) The vision wormhole: latent\-space communication in heterogeneous multi\-agent systems\. arXiv preprint arXiv:2602\.15382\.
- Y\. Liu, S\. Liu, H\. Tang, Y\. Ma, Z\. Li, J\. Zhang, L\. Xiang, and Z\. He \(2026b\) RainbowArena: a multi\-agent toolkit for reinforcement learning and large language models in tabletop games\. Knowledge\-Based Systems 333, pp\. 115046\. External Links: [Document](https://dx.doi.org/10.1016/j.knosys.2025.115046)\.
- K\. Park et al\. \(2025\) Q\-KVComm: efficient multi\-agent communication via adaptive KV cache compression\. arXiv preprint arXiv:2512\.17914\.
- Y\. P\. Shkolnikov \(2026\) Agent memory below the prompt: persistent Q4 KV cache for multi\-agent LLM inference on edge devices\. arXiv preprint arXiv:2603\.04428\.
- T\. Wang et al\. \(2025a\) Seek in the dark: reasoning via test\-time instance\-level policy gradient in latent space\. arXiv preprint arXiv:2505\.13308\.
- Y\. Wang et al\. \(2025b\) KVComm: enabling efficient LLM communication through selective KV sharing\. arXiv preprint arXiv:2510\.03346\. Note: Accepted at ICLR 2026\.
- Z\. Wang et al\. \(2025c\) Latent collaboration in multi\-agent systems\. arXiv preprint arXiv:2511\.20639\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, S\. Zhang, S\. Khosla, et al\. \(2023\) AutoGen: enabling next\-gen LLM applications via multi\-agent conversation\. arXiv preprint arXiv:2308\.08155\.
- R\. Yang, J\. Cao, Z\. Zhang, et al\. \(2025\) Augmenting multi\-agent communication with state delta trajectory\. arXiv preprint arXiv:2506\.19209\.
- R\. Ye, X\. Zhang, Y\. Pang, P\. Qi, Z\. Wang, et al\. \(2025\) Communicating activations between language model agents\. In International Conference on Machine Learning \(ICML\)\. External Links: [Link](https://arxiv.org/abs/2501.14082)\.
- Z\. Zhang, Y\. Yang, Z\. Yao, Y\. Yan, J\. E\. Gonzalez, and M\. W\. Mahoney \(2023\) H2O: heavy\-hitter oracle for efficient generative inference of large language models\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.

## Appendix A Method Quick\-Reference Table

Table 6: Quick\-reference for all 18 methods\. WHAT / WHICH / HOW refer to the three axes of the unified framework \([Section 4](https://arxiv.org/html/2606.05711#S4)\)\.