@FinanceYF5: Paper:
Summary
This paper introduces LatentMAS, a training-free framework for multi-agent systems that enables large language model agents to collaborate directly in continuous latent space via shared latent working memory, achieving up to 14.6% higher accuracy and 4x faster inference while reducing token usage by over 70%.
Paper:https://t.co/sP6yk2SU64
Latent Collaboration in Multi-Agent Systems
Source: https://arxiv.org/html/2511.20639
Ruizhong Qiu, Gaotang Li, Xiyuan Yang, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang
Abstract
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings instead of text. Then, a shared latent working memory preserves and transfers each agent’s internal representations and latent thoughts, ensuring lossless information exchange without re-encoding. We provide detailed theoretical analyses showing that LatentMAS achieves higher expressiveness and lossless information preservation with lower overall complexity than standard text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS outperforms advanced single agents and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4×-4.3× faster end-to-end inference.
Machine Learning, ICML
![[Uncaptioned image]](https://arxiv.org/html/2511.20639v3/x1.png)
Figure 1: Performance of LatentMAS across (i) task accuracy (%), (ii) inference speed (time (s) / run), and (iii) token usage (per token) over 9 benchmarks and 3 LLM model scales under the Hierarchical MAS setting. LatentMAS consistently improves system-level reasoning accuracy while substantially reducing computational overhead compared with single model and text-based MAS.
1 Introduction
Model collaboration emerges as the foundation of system-level intelligence in the era of Agentic AI (Acharya et al., 2025). Recent advances in multi-agent systems (MAS) (Wu et al., 2024; Hu et al., 2025) have catalyzed a paradigm shift from solitary, model-centric reasoning into a collaborative endeavor among multiple interacting models. Among these, large language model (LLM)-based MAS has been adopted across various downstream applications, including cooperative math and science reasoning (Pezeshkpour et al., 2024; Zhou et al., 2025), automated code generation (Yang et al., 2024b; Ding et al., 2025), and open-domain question answering (Jin et al., 2025; Li et al., 2025c, 2026). Within LLM-based MAS, natural language or text generally serves as the lingua franca, the common medium that carries each agent’s internal thoughts and enables communication across different agents (Guo et al., 2024).
Beyond explicit text, several studies have explored the use of LLMs’ continuous latent space as a new form of “model language” (Chen et al., 2025b) by either (i) leveraging hidden representations within transformers to enable a single model’s internal latent chain-of-thought (CoT) reasoning (Hao et al., 2024; Zheng et al., 2025; Zhang et al., 2025), or (ii) employing KV caches or layer embeddings for information exchange across two models (Liu et al., 2024; Fu et al., 2025). However, a comprehensive model collaboration framework unifying both latent reasoning and latent communication remains unexplored. Moving one step forward, we aim to investigate:
Can MAS achieve pure latent collaboration?
To address this question, we introduce LatentMAS, an end-to-end collaborative framework that operates entirely within the continuous latent space. Our core design integrates both internal latent thoughts generation and cross-agent latent working memory transfer. Inside each agent, reasoning unfolds through auto-regressive generation of last-layer hidden representations, capturing the model’s ongoing internal thoughts without explicit decoding. Across agents, information is exchanged via shared latent working memory stored in layer-wise KV caches, capturing both the input context and newly generated latent thoughts. Overall, LatentMAS is completely training-free, enabling all agents to think and interact purely through their internal latent representations.
Building on this framework design, LatentMAS is grounded on three foundational principles, which are substantiated through comprehensive theoretical and empirical analyses:
• Reasoning Expressiveness: Hidden representations naturally encode the model’s continuous thoughts, making each latent step more expressive than discrete tokens and capable of conveying richer semantic information.
• Communication Fidelity: Latent working memory preserves input representations and latent thoughts, enabling lossless information transfer across agents.
• Collaboration Complexity: LatentMAS exhibits higher collaborative expressiveness than text-based MAS with substantially lower inference complexity.
The first two principles jointly underscore the advantage of LatentMAS by enabling richer latent reasoning and lossless latent communication. The last principle characterizes how latent collaboration reshapes the efficiency-expressiveness tradeoff, illustrating that LatentMAS enables scalable multi-agent coordination with reduced computational overhead.
To empirically assess the efficacy of LatentMAS, we conduct comprehensive evaluations on 9 benchmarks spanning math and science reasoning, commonsense understanding, and code generation (Figure 1). Across both sequential and hierarchical MAS settings, and over five different backbone models spanning the Qwen3 (Yang et al., 2025) and Llama3 (Grattafiori et al., 2024) families, LatentMAS consistently outperforms advanced single-model and text-based MAS baselines by (i) improving accuracy by up to 14.6%, (ii) reducing output token usage by 70.8%-83.7%, and (iii) delivering 4×-4.3× faster end-to-end inference. These results demonstrate that latent collaboration not only enhances system-level reasoning quality but also provides consistent efficiency gains without any additional training. Further detailed analyses of latent thought expressiveness, working-memory transfer, and input-output alignment confirm that LatentMAS enables semantically meaningful, lossless, and stable collaboration entirely in latent space.
2 Preliminary and Notations
Auto-regressive Generation in Transformer. Let $f_{\theta}(\cdot)$ denote the function computed by a standard Transformer model (Vaswani et al., 2017), parameterized by $\theta$. Given an input sequence $x=(x_{1},x_{2},\dots,x_{T})$, the transformer $f_{\theta}(\cdot)$ first encodes each token via its input embedding layer $W_{\text{in}}$ to obtain token embeddings up to step $t$, i.e., $E=[e_{1},e_{2},\dots,e_{t}]\in\mathbb{R}^{t\times d_{h}}$, where $d_{h}$ is the model’s hidden dimension. The input token embeddings $E$ are then processed successively by $L$ transformer layers in the forward pass through the model’s residual stream, yielding the final-layer hidden representations $H=[h_{1},h_{2},\dots,h_{t}]\in\mathbb{R}^{t\times d_{h}}$. For next token generation, the model computes:
$$f_{\theta}(x_{t+1}\mid x_{\leq t})=\mathrm{softmax}(h_{t}W_{\text{out}}), \tag{1}$$
where $W_{\text{out}}$ denotes the language model head that maps the hidden representation to the vocabulary space. Each token is generated in an auto-regressive manner and appended to the input sequence. For latent generation, the model performs auto-regression over hidden representations instead of tokens, bypassing explicit decoding (Hao et al., 2024).
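As a small illustration of Eq. (1), the sketch below computes the next-token distribution from a last-layer hidden state with plain Python lists. The sizes ($d_h = 2$, $|\mathcal{V}| = 3$), the weight values, and the function names are toy assumptions chosen for readability, not values from any real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(h_t, W_out):
    """Eq. (1): softmax(h_t W_out), with h_t of length d_h and W_out of shape d_h x |V|."""
    vocab_size = len(W_out[0])
    logits = [
        sum(h_t[i] * W_out[i][v] for i in range(len(h_t)))
        for v in range(vocab_size)
    ]
    return softmax(logits)

# Toy sizes (assumed): d_h = 2, |V| = 3.
h_t = [1.0, -1.0]
W_out = [[0.5, 0.0, -0.5],
         [0.0, 1.0, 0.5]]
probs = next_token_distribution(h_t, W_out)

# Greedy decoding picks the most likely token id.
next_token = max(range(len(probs)), key=probs.__getitem__)
```

In standard text generation the selected token is embedded and appended to the input; latent generation skips this decoding step and feeds the hidden state back instead.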
KV Cache as Working Memory. In decoder-only Transformers, the Key-Value (KV) cache functions as a dynamic working memory during auto-regressive generation, storing intermediate representations from previous decoding steps to avoid redundant computation. Specifically, given the input embeddings $E$, each transformer layer projects them through projection matrices $W_{Q},W_{K},W_{V}$ to obtain $Q,K,V$. When the next token at step $t+1$ is generated, the model appends its embedding to the input sequence and updates the cache ($K_{\mathrm{cache}},V_{\mathrm{cache}}$) as:
$$K_{\mathrm{cache}}\leftarrow[K_{\leq t};K_{t+1}],\quad V_{\mathrm{cache}}\leftarrow[V_{\leq t};V_{t+1}], \tag{2}$$
where $K_{\leq t}$, $V_{\leq t}$ are accumulated key/value matrices from all previous steps and $K_{t+1}$, $V_{t+1}$ are new key/value vectors computed from the current token’s hidden state. This accumulative property enables the KV cache to maintain a growing working memory of model internal representations.
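The cache update in Eq. (2) is an append along the position axis. A minimal sketch for a single layer follows; the class name `LayerKVCache` and the two-dimensional toy vectors are illustrative assumptions, and a real implementation would store tensors per layer and per attention head.

```python
class LayerKVCache:
    """Per-layer cache holding one key vector and one value vector per position."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k_new, v_new):
        # Eq. (2): K_cache <- [K_{<=t}; K_{t+1}],  V_cache <- [V_{<=t}; V_{t+1}]
        self.keys.append(list(k_new))
        self.values.append(list(v_new))

    def __len__(self):
        return len(self.keys)


cache = LayerKVCache()
for step in range(3):
    # Toy key/value vectors; in a model these come from the W_K and W_V projections.
    cache.append([float(step), 0.0], [0.0, float(step)])
```

Because earlier entries are never recomputed, each new step only pays for its own projections plus attention over the stored positions.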
Figure 2: Illustration of sequential and hierarchical MAS settings.
Figure 3: Overview of the LatentMAS framework. To enable end-to-end system-wide latent collaboration, each LLM agent in LatentMAS first performs reasoning in latent space by generating latent thoughts from last-layer hidden states (Section 3.1), and then communicates its internal reasoning to subsequent agents via layer-wise transfer through shared latent working memory (Section 3.2).
LLM-based MAS Setting. We consider a multi-agent system $\mathcal{S}$ composed of $N$ agents, denoted as $\mathcal{A}=\{A_{1},A_{2},\dots,A_{N}\}$, where each agent $A_{i}$ is an LLM corresponding to $f_{\theta_{i}}$ above. At inference time, an input question $q$ is provided to the system $\mathcal{S}$, which orchestrates interactions among agents to collaboratively produce a final answer $a$ corresponding to $q$. As MAS design paradigms are not definitive in general and often vary across downstream tasks (Tran et al., 2025; Cemri et al., 2025), we do not restrict our latent collaboration design to any particular architecture. Instead, we adopt the two most commonly used MAS settings (sequential and hierarchical) as the bases to experimentally evaluate our method. Figure 2 illustrates the two MAS architecture settings. In the sequential MAS, we adopt a chain-of-agents design (Zhang et al., 2024b; Zhao et al., 2026) comprising four LLM agents: planner, critic, refiner, and solver. These agents assume complementary reasoning roles and are organized in a sequential pipeline, where the CoT output of each agent, together with the question $q$, serves as the input to the next agent. In the hierarchical MAS, we adopt a domain-specialized design (Zhuge et al., 2024; Zhao et al., 2025). Multiple LLM agents, including code, math, and science agents, operate as different domain experts. Each agent independently reasons over the question $q$ from its disciplinary perspective. A summarizer agent then receives all intermediate responses along with the question $q$ and performs hierarchical aggregation to synthesize and refine the final answer.
3 LatentMAS
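The two text-based orchestration patterns described above can be sketched in a few lines. This shows control flow only: `make_agent` is a hypothetical stand-in that tags its input rather than calling an LLM, and the prompt layout (question followed by earlier outputs) is an assumption.

```python
from typing import Callable, List

Agent = Callable[[str], str]  # hypothetical stand-in: prompt in, text out


def make_agent(role: str) -> Agent:
    """Return a toy agent that only tags its input; a real agent would call an LLM."""
    def run(prompt: str) -> str:
        return f"[{role}] saw {len(prompt)} chars"
    return run


def sequential_mas(question: str, roles: List[str]) -> str:
    """Chain of agents: each agent gets the question plus the previous agent's output."""
    previous = ""
    for role in roles:
        prompt = question if not previous else f"{question}\n{previous}"
        previous = make_agent(role)(prompt)
    return previous


def hierarchical_mas(question: str, experts: List[str]) -> str:
    """Experts answer independently; a summarizer sees the question and all answers."""
    answers = [make_agent(role)(question) for role in experts]
    summary_prompt = "\n".join([question] + answers)
    return make_agent("summarizer")(summary_prompt)


seq_answer = sequential_mas("What is 2 + 2?", ["planner", "critic", "refiner", "solver"])
hier_answer = hierarchical_mas("What is 2 + 2?", ["code", "math", "science"])
```

In both patterns only the last agent's output is returned as the final answer; LatentMAS keeps this topology but replaces the text passed between agents with latent working memory.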
We introduce LatentMAS, an end-to-end latent collaboration framework in which all agents reason and communicate entirely within the latent space. As illustrated in Figure 3, our method enables LLM agents to (i) auto-regressively generate expressive latent thoughts (Section 3.1), (ii) transfer latent working memory across agents with lossless communication fidelity (Section 3.2), and (iii) achieve lower inference complexity than text-based MAS while supporting a higher degree of model coordination (Section 3.3).
Method Roadmap. In the following sub-sections, we present the complete pipeline of LatentMAS, detailing each component and interleaving theoretical analyses to justify the corresponding design principles.
3.1 Auto-regressive Latent Thoughts Generation
We first describe how each LLM agent performs latent reasoning through its layer-wise hidden states. Instead of decoding explicit tokens, reasoning unfolds directly within the agent by auto-regressively appending hidden representations from the final model layer.
Given the input embeddings $E=[e_{1},e_{2},\dots,e_{t}]$ containing the information from the question $q$ and each agent’s instruction prompt, each LLM agent $A_{i}\in\mathcal{A}$ passes $E$ through $L$ transformer layers to compute the last-layer hidden representation $h_{t}$ at the current step $t$. Then, we insert $h_{t}$ as the input embedding for the next step $t+1$, replacing the original decoding and next-token embedding processes used in standard token generation. We auto-regressively repeat the process for $m$ latent steps, yielding a sequence of newly generated last-layer hidden states $H=[h_{t+1},h_{t+2},\dots,h_{t+m}]$. We define the continuous output representations $H$ as the latent thoughts generated by $A_{i}$.
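The feedback loop described above can be sketched as follows. This is a minimal sketch of the control flow only: `forward` is a hypothetical stand-in for the $L$-layer transformer (here a toy averaging function), and the input-output alignment step is left out.

```python
from typing import Callable, List

Vector = List[float]


def latent_rollout(forward: Callable[[List[Vector]], Vector],
                   embeddings: List[Vector],
                   m: int) -> List[Vector]:
    """Generate m latent thoughts by feeding each last-layer state back as the next input.

    `forward` is a hypothetical stand-in for the L-layer transformer: it takes the
    current input sequence and returns the last-layer hidden state at the final position.
    """
    sequence = [list(e) for e in embeddings]
    h = forward(sequence)          # h_t, computed from the prompt embeddings
    thoughts = []
    for _ in range(m):
        sequence.append(h)         # insert the hidden state as the next input embedding
        h = forward(sequence)      # next last-layer hidden state, no token decoding
        thoughts.append(h)
    return thoughts                # [h_{t+1}, ..., h_{t+m}]


def toy_forward(sequence: List[Vector]) -> Vector:
    """Toy model (assumption): half of the mean of all inputs, per coordinate."""
    d = len(sequence[0])
    return [0.5 * sum(v[i] for v in sequence) / len(sequence) for i in range(d)]


H = latent_rollout(toy_forward, [[1.0, 2.0], [3.0, 4.0]], m=3)
```

A practical version would reuse the KV cache between steps rather than re-running the whole sequence, which is what makes each latent step cheap.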
Input-Output Distribution Alignment. Since the newly generated $H$ form a sequence of dense, high-level representations, directly inserting them into shallow layers as input embeddings may lead to out-of-distribution activations (Meegahapola et al., 2019; Zhou et al., 2019), as these hidden states differ from the statistical patterns of learned token embeddings. To mitigate this in a training-free manner, we propose a linear alignment operator that maps last-layer hidden states back to the valid input embeddings. Specifically, given $W_{\text{in}}$, $W_{\text{out}}$ as the input and output embedding layers of $A_{i}$, we seek a projection matrix $W_{a}\in\mathbb{R}^{d_{h}\times d_{h}}$ that maps each output vector $h\in H$ to a new input vector $e$ to align with the valid input space defined by $W_{\text{in}}$:
$$e=hW_{a},\quad\text{where }W_{a}\approx W_{\text{out}}^{\dagger}W_{\text{in}}. \tag{3}$$
Here, $W_{\text{out}}^{\dagger}$ is the pseudo-inverse (Hoerl & Kennard, 1970) of $W_{\text{out}}$ (in Appendix A, we provide a detailed theoretical justification for the effectiveness of $W_{a}$ in the input-output alignment process, followed by a formal derivation of $W_{a}$). We then append the aligned vector $e$ into the input sequence for auto-regressive latent generation. Note that $W_{a}$ is a small projection matrix of size $d_{h}\times d_{h}$ (e.g., $d_{h}=1024$ for Qwen3-0.6B) and is computed once and reused in all subsequent latent steps. This design makes the alignment computationally negligible while maintaining distributional consistency between latent and discrete representations.
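One way to read Eq. (3) is as a least-squares fit that maps each token's output-embedding row onto its input-embedding row. The NumPy sketch below assumes both matrices are stored with shape $(|\mathcal{V}|, d_h)$, one row per token, which is what makes the product $W_{\text{out}}^{\dagger}W_{\text{in}}$ come out as $d_h\times d_h$. The layout, the random toy weights, and the function name `alignment_matrix` are assumptions for illustration, not the paper's released code.

```python
import numpy as np


def alignment_matrix(W_in: np.ndarray, W_out: np.ndarray) -> np.ndarray:
    """Least-squares W_a with W_out @ W_a ~= W_in, i.e. W_a = pinv(W_out) @ W_in.

    Assumed layout: both matrices have shape (|V|, d_h), one row per token.
    """
    return np.linalg.pinv(W_out) @ W_in


rng = np.random.default_rng(0)
vocab_size, d_h = 50, 8                     # toy sizes, not a real model
W_in = rng.normal(size=(vocab_size, d_h))   # stand-in input embedding table
W_out = rng.normal(size=(vocab_size, d_h))  # stand-in LM head weights

W_a = alignment_matrix(W_in, W_out)         # computed once, shape (d_h, d_h)

h = rng.normal(size=(d_h,))                 # a last-layer hidden state
e = h @ W_a                                 # aligned input embedding, Eq. (3)
```

Since $W_a$ depends only on the two embedding tables, it can be cached per model; each latent step then costs one extra $d_h\times d_h$ matrix-vector product.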
Expressiveness on Continuous Latent Thoughts. With the mechanism of latent thought generation established within each agent, we next provide a theoretical analysis to quantify its representational advantage over conventional discrete token generation. The following theorem formalizes that latent thoughts, which inherently preserve richer semantic structures, achieve substantially higher expressive capacity than discrete text-based reasoning.
Theorem 3.1 (Expressiveness of Latent Thoughts).
Under the Linear Representation Hypothesis on $h$ (stated in Assumption B.1), if the sequence of all latent thoughts with length $m$ can be expressed losslessly through corresponding text-based reasoning, then the length of text (in tokens) needs to be at least $\Omega\big(d_{h}m/\log|\mathcal{V}|\big)$, where $|\mathcal{V}|>1$ denotes the vocabulary size.
As an illustration of Remark 3.2, for Qwen3-4B / 8B / 14B models (Yang et al., 2025), latent thoughts generation can be 235.7 / 377.1 / 471.4 times more efficient than text-based reasoning. The full proof of Theorem 3.1 is provided in Appendix B.1. Beyond reasoning within individual agents, collaboration in LatentMAS further relies on how these agents exchange latent information, which we detail next.
3.2 Thoughts Transfer via Latent Working Memory
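An informal counting argument gives some intuition for the $\log|\mathcal{V}|$ factor in Theorem 3.1. It is only a sketch, under the added assumption that each of the $d_h$ coordinates of a latent thought carries at least a constant number $c>0$ of bits; it is not the proof in Appendix B.1, which relies on Assumption B.1.

```latex
\begin{align*}
\text{bits carried by } n \text{ tokens} &\le n \log_2 |\mathcal{V}|, \\
\text{bits carried by } m \text{ latent thoughts} &\ge c\, d_h m, \\
\text{lossless text} \;\Rightarrow\; n \log_2 |\mathcal{V}| \ge c\, d_h m
  &\;\Rightarrow\; n \ge \frac{c\, d_h m}{\log_2 |\mathcal{V}|}
  = \Omega\!\left(\frac{d_h m}{\log |\mathcal{V}|}\right).
\end{align*}
```

The constant $c$ and the base of the logarithm are absorbed by the $\Omega(\cdot)$ notation, so this sketch does not by itself reproduce the specific ratios quoted for the Qwen3 models.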
In text-based MAS, after one LLM agent completes its generation, the natural language output is directly appended to the input sequence of the next agent. However, since each agent in LatentMAS performs hidden-state generation without explicit text outputs, we design a new latent working memory transfer mechanism to ensure lossless information preservation and exchange.
For clarity, we describe the memory transfer mechanism using the first two consecutive LLM agents $A_{1},A_{2}\in\mathcal{A}$. As shown in Figure 3, agent $A_{1}$ first performs $m$ latent steps of generation (Section 3.1). After completing these steps, we extract the KV caches from all $L$ transformer layers of $A_{1}$ once and collect the latent working memory of $A_{1}$ as:
$$\mathcal{M}_{A_{1}}=\left\{\left(K^{(l)}_{A_{1},\mathrm{cache}},V^{(l)}_{A_{1},\mathrm{cache}}\right)\,\middle|\,l=1,2,\dots,L\right\}, \tag{4}$$
$$\text{with }K^{(l)}_{A_{1},\mathrm{cache}}=[K^{(l)}_{A_{1},1},\dots,K^{(l)}_{A_{1},t+m}]\text{ and }V^{(l)}_{A_{1},\mathrm{cache}}=[V^{(l)}_{A_{1},1},\dots,V^{(l)}_{A_{1},t+m}].$$
Here $K^{(l)}_{A_{1},\mathrm{cache}}$ and $V^{(l)}_{A_{1},\mathrm{cache}}$ are the accumulated key and value matrices at the $l$-th layer. Unlike direct cache-sharing methods that exchange information mainly on prefilled input context across models, the collection of layer-wise caches in $\mathcal{M}_{A_{1}}$ encapsulates both the initial input context and the newly generated latent thoughts of agent $A_{1}$.
Next, the successive agent $A_{2}$ integrates the working memory $\mathcal{M}_{A_{1}}$ from agent $A_{1}$. Before $A_{2}$ generates latent thoughts (i.e., last-layer hidden states), we perform layer-wise concatenation to update its KV cache by prepending each $K^{(l)}_{A_{1},\mathrm{cache}}$ and $V^{(l)}_{A_{1},\mathrm{cache}}$ to the existing $K^{(l)}_{A_{2},\mathrm{cache}}$ and $V^{(l)}_{A_{2},\mathrm{cache}}$. By doing so, the new latent thoughts generation in $A_{2}$ is conditioned on both the working memory of $A_{1}$ and its own internal representations.
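The layer-wise prepending step can be sketched with nested lists. The type aliases and `transfer_memory` below are illustrative names, and the sketch assumes both agents have the same number of layers and compatible key/value shapes, for example because they share a backbone.

```python
from typing import List, Tuple

Matrix = List[List[float]]             # rows are sequence positions
LayerCache = Tuple[Matrix, Matrix]     # (K, V) for one transformer layer
WorkingMemory = List[LayerCache]       # one entry per layer, l = 1..L


def transfer_memory(memory_a1: WorkingMemory, cache_a2: WorkingMemory) -> WorkingMemory:
    """Prepend A1's per-layer K/V rows to A2's per-layer K/V rows."""
    if len(memory_a1) != len(cache_a2):
        raise ValueError("both agents must have the same number of layers")
    merged = []
    for (k1, v1), (k2, v2) in zip(memory_a1, cache_a2):
        # List concatenation along the position axis: A1's entries come first.
        merged.append((k1 + k2, v1 + v2))
    return merged


# Toy example: L = 2 layers; A1 holds t + m = 3 positions, A2 holds 1 position.
mem_a1 = [([[1.0], [2.0], [3.0]], [[1.5], [2.5], [3.5]]) for _ in range(2)]
cache_a2 = [([[9.0]], [[9.5]]) for _ in range(2)]
merged = transfer_memory(mem_a1, cache_a2)
```

After the merge, attention in $A_{2}$ runs over $A_{1}$'s positions as if they were part of its own prefix, without re-encoding them.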
Table 1: Main results of LatentMAS on 6 general tasks under the Sequential MAS setting. We report 3 metrics in total, including task accuracy (%, “Acc.”), total output token usage (“Token”), and end-to-end inference speed (time (s) / run, “Speed”). We compare LatentMAS with both TextMAS and single-model (“Single”) baselines. For each metric, we bold the better performance and visualize LatentMAS gains over TextMAS in the Improve columns. Additional results under the Hierarchical MAS setting are shown in Table 3.

Sequential MAS Setting

| Task | Metric | Qwen3-4B Single | Qwen3-4B TextMAS | Qwen3-4B LatentMAS | Improve | Qwen3-8B Single | Qwen3-8B TextMAS | Qwen3-8B LatentMAS | Improve | Qwen3-14B Single | Qwen3-14B TextMAS | Qwen3-14B LatentMAS | Improve |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARC-E | Acc. | 95.4 | 96.4 | 98.6 | ↑2.2 | 95.6 | 99.1 | 98.8 | ↓0.3 | 97.2 | 99.0 | 99.4 | ↑0.4 |
| ARC-E | Token | 724 | 2420 | 581 | ↓76.0% | 656 | 2085 | 490 | ↓76.5% | 608 | 1670 | 224 | ↓86.6% |
| ARC-E | Speed | 369 | 2874 | 512 | ×5.6 | 404 | 3702 | 1759 | ×2.1 | 551 | 9171 | 2124 | ×4.3 |
| ARC-C | Acc. | 89.2 | 90.0 | 92.3 | ↑2.3 | 91.0 | 94.6 | 94.4 | ↓0.2 | 92.6 | 95.9 | 95.6 | ↓0.3 |
| ARC-C | Token | 913 | 2678 | 718 | ↓73.2% | 846 | 2252 | 529 | ↓76.5% | 773 | 2985 | 426 | ↓85.7% |
| ARC-C | Speed | 97 | 1579 | 260 | ×6.1 | 266 | 2059 | 703 | ×2.9 | 338 | 5125 | 1136 | ×4.5 |
| GSM8K | Acc. | 82.4 | 89.8 | 88.2 | ↓1.6 | 81.1 | 92.3 | 93.8 | ↑1.5 | 83.7 | 93.8 | 95.2 | ↑1.4 |
| GSM8K | Token | 1136 | 3172 | 607 | ↓80.9% | 1280 | 2324 | 860 | ↓63.0% | 1118 | 3324 | 644 | ↓80.6% |
| GSM8K | Speed | 469 | 1970 | 375 | ×5.3 | 449 | 1739 | 543 | ×3.2 | 536 | 3729 | 1952 | ×1.9 |
| MedQA | Acc. | 47.7 | 65.3 | 66.3 | ↑1.0 | 53.0 | 75.0 | 75.3 | ↑0.3 | 64.7 | 80.3 | 80.7 | ↑0.4 |
| MedQA | Token | 2134 | 3962 | 1685 | ↓57.5% | 2098 | 4260 | 1555 | ↓63.5% | 1746 | 3444 | 1841 | ↓46.5% |
| MedQA | Speed | 236 | 1267 | 438 | ×2.9 | 476 | 1923 | 928 | ×2.1 | 1360 | 4142 | 1420 | ×2.9 |
| MBPP+ | Acc. | 63.5 | 69.8 | 73.5 | ↑3.7 | 64.8 | 69.5 | 74.6 | ↑5.1 | 68.5 | 72.8 | 75.7 | ↑2.9 |
| MBPP+ | Token | 1634 | 4420 | 1339 | ↓69.7% | 2053 | 3695 | 1164 | ↓68.5% | 1858 | 4971 | 1621 | ↓67.4% |
| MBPP+ | Speed | 523 | 2148 | 577 | ×3.7 | 1064 | 3628 | 1275 | ×2.8 | 2410 | 8728 | 2400 | ×3.6 |
| HumanEval+ | Acc. | 75.0 | 79.7 | 79.9 | ↑0.2 | 74.4 | 80.5 | 80.5 | ↑0.0 | 76.8 | 81.1 | 86.5 | ↑5.4 |
| HumanEval+ | Token | 2380 | 5987 | 1775 | ↓70.4% | 2507 | 4593 | 1866 | ↓59.4% | 2366 | 5934 | 2042 | ↓65.6% |
| HumanEval+ | Speed | 274 | 1044 | 350 | ×3.0 | 502 | 1619 | 497 | ×3.3 | 1084 | 4062 | 1285 | ×3.2 |
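The Improve columns of Table 1 are consistent with three simple formulas comparing LatentMAS against TextMAS: an absolute accuracy difference, a relative token reduction, and a time ratio. The caption does not spell these out, so the definitions below are an inference from the reported numbers, checked here on the ARC-E row for Qwen3-4B; the function names are illustrative.

```python
def accuracy_gain(text_acc: float, latent_acc: float) -> float:
    """Absolute accuracy difference in percentage points."""
    return latent_acc - text_acc


def token_reduction_pct(text_tokens: int, latent_tokens: int) -> float:
    """Relative reduction in output tokens, as a percentage of TextMAS usage."""
    return 100.0 * (1.0 - latent_tokens / text_tokens)


def speedup(text_seconds: float, latent_seconds: float) -> float:
    """How many times faster LatentMAS runs than TextMAS."""
    return text_seconds / latent_seconds


# ARC-E, Qwen3-4B row of Table 1: TextMAS vs. LatentMAS.
print(round(accuracy_gain(96.4, 98.6), 1))        # 2.2
print(round(token_reduction_pct(2420, 581), 1))   # 76.0
print(round(speedup(2874, 512), 1))               # 5.6
```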
Lossless Information Transfer. The latent working memory transfer mechanism ensures that each succeeding agent in LatentMAS seamlessly receives its predecessor’s complete output without re-encoding. In the following theorem, we further formalize this property to demonstrate that latent working memory transfer guarantees information fidelity equivalent to explicit input exchange.
Theorem 3.3 (Information Preservation via Latent Working Memory).
In both latent and text-based reasoning, the outputs of an agent when receiving latent working memory from preceding agents are equivalent to those obtained when directly inputting the preceding agents’ outputs.
The full proof of Theorem 3.3 is provided in Appendix B.2. With lossless information preservation, we transfer latent working memory as KV caches rather than directly transmitting hidden states, which avoids redundant recomputation for the successive agent.
3.3 End-to-End Pipeline with Complexity Analyses
For the remaining agents in LatentMAS, we follow the same latent thoughts generation and working memory transfer mechanism described above. Specifically, agent $A_{3}$ inherits the working memory $\mathcal{M}_{A_{2}}$ from the preceding agent $A_{2}$, performs auto-regressive last-layer hidden state generation, and subsequently transmits its updated latent working memory $\mathcal{M}_{A_{3}}$ to the next agent. This process continues across all agents in LatentMAS, with only the last agent decoding the final answer in text. In the following theorem, we theoretically analyze the overall complexity of LatentMAS.
Theorem 3.4 (LatentMAS Complexity).
The time complexity for each agent of LatentMAS is $O\big((d_{h}^{2}m+d_{h}m^{2}+d_{h}tm)L\big)$, where $t$ is the input length of this agent, and $m$ is the length of latent thoughts. In contrast, assuming Theorem 3.1, the time complexity for each agent of the vanilla text-based MAS needs to be
$$O\Big(\Big(d_{h}^{3}m\frac{1}{\log|\mathcal{V}|}+d_{h}^{3}m^{2}\frac{1}{\log^{2}|\mathcal{V}|}+d_{h}^{2}tm\frac{1}{\log|\mathcal{V}|}\Big)L+d_{h}^{2}|\mathcal{V}|m\frac{1}{\log|\mathcal{V}|}\Big)$$
to achieve the same expressiveness.
The full proof of Theorem 3.4 is provided in Appendix B.3. It is worth noting that LatentMAS is agnostic to specific model collaboration strategies and can be seamlessly applied to sequential, hierarchical, or other advanced MAS designs.
4 Empirical Evaluations
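To get a feel for the gap between the two bounds in Theorem 3.4, the expressions inside the $O(\cdot)$ can be evaluated directly. The sketch below drops all hidden constants and uses the natural logarithm, so the resulting ratio is only indicative; the parameter values are illustrative assumptions, not the paper's experimental configuration.

```python
import math


def latent_cost(d_h: int, m: int, t: int, L: int) -> float:
    """Per-agent LatentMAS term from Theorem 3.4, constants dropped."""
    return (d_h**2 * m + d_h * m**2 + d_h * t * m) * L


def text_cost(d_h: int, m: int, t: int, L: int, vocab: int) -> float:
    """Per-agent text-based MAS term from Theorem 3.4 at matched expressiveness."""
    lg = math.log(vocab)  # natural log; the base is an assumption and only shifts constants
    layers = (d_h**3 * m / lg + d_h**3 * m**2 / lg**2 + d_h**2 * t * m / lg) * L
    decoding = d_h**2 * vocab * m / lg  # vocabulary projection for every decoded token
    return layers + decoding


# Illustrative values (assumed): hidden size, latent steps, input length, layers, vocabulary.
d_h, m, t, L, vocab = 4096, 40, 512, 36, 151_936
ratio = text_cost(d_h, m, t, L, vocab) / latent_cost(d_h, m, t, L)
```

Every text-side term carries an extra factor of $d_h$ relative to its latent counterpart, reflecting the $\Omega(d_h m/\log|\mathcal{V}|)$ tokens needed to match $m$ latent steps.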
Figure 4: Efficiency gains of LatentMAS over single model and TextMAS under the Sequential MAS setting. Left: LatentMAS achieves consistently faster end-to-end inference. Right: LatentMAS requires lower system-wide token usage in total.
Tasks and Datasets. We conduct comprehensive evaluations of LatentMAS across 9 benchmarks spanning both general-purpose and reasoning-intensive tasks: (i) Math & Science Reasoning, including GSM8K (Cobbe et al., 2021), AIME24 (Maxwell-Jia, 2024), AIME25 (math ai, 2025), GPQA-Diamond (Rein et al., 2023), and MedQA (Yang et al., 2024a); (ii) Commonsense Reasoning, including ARC-Easy (Clark et al., 2018b) and ARC-Challenge (Clark et al., 2018a); and (iii) Code Generation, including MBPP-Plus (Liu et al., 2023) and HumanEval-Plus (Liu et al., 2023). Additional benchmark details are provided in Appendix C.1.
Models and Baselines. We adopt 5 off-the-shelf models from both the Qwen3 family (Yang et al., 2025) (4B, 8B, and 14B) and the Llama 3 family (Grattafiori et al., 2024) (3B, 8B) to construct LatentMAS at different scales. For baseline comparison, we evaluate LatentMAS against: (i) Single LLM agents (Single), where a single LLM directly performs standard auto-regressive generation with token-level decoding; (ii) Sequential text-based MAS (Sequential TextMAS), following the chain-of-agents design (Zhang et al., 2024b) with text-mediated reasoning and communication; and (iii) Hierarchical text-based MAS (Hierarchical TextMAS), where domain-specialized agents collaborate through a summarizer (Zhuge et al., 2024) using text-based reasoning and communication. Detailed model and baseline implementations are provided in Appendix C.2.
Implementation Details. For latent thoughts generation, we compute the realignment matrix $W_{a}$ once per run and reuse it across all inference steps. Each LLM agent performs $m\in\{0,10,20,40,80\}$ latent steps during reasoning. For working memory transfer, we directly concatenate the KV caches from the immediately preceding agent into the corresponding transformer layers through the past_key_values interface in HuggingFace Transformers (Face, 2025). Besides the HuggingFace implementation, we also integrate all baseline methods and LatentMAS with the vLLM backend (Kwon et al., 2023), enabling prefix caching and tensor-parallel inference for efficient deployment of larger LLM agents. We perform hyperparameter tuning and report the mean performance over three independent runs. Across both baselines and our method, we set all LLM agents with a temperature of 0.6 and a top-$p$ of 0.95. We adjust the maximum output length for each task according to its relative difficulty. We set the maximum length to 2,048 tokens for ARC-Easy, ARC-Challenge, and GSM8K; 4,096 tokens for MedQA, MBPP+, and HumanEval+; 8,192 tokens for GPQA; and 20,000 tokens for AIME24 and AIME25. All experiments are conducted on 8× NVIDIA A100-80G GPUs.
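Table 2: Main results of LatentMAS on 3 reasoning-intensive tasks under Sequential and Hierarchical MAS settings. We report accuracy, token usage, and end-to-end speed, and highlight the performance gains following the same evaluation protocol as in Table 1.

| Setting | Task | Metric | Qwen3-8B Single | Qwen3-8B TextMAS | Qwen3-8B LatentMAS | Improve | Qwen3-14B Single | Qwen3-14B TextMAS | Qwen3-14B LatentMAS | Improve |
|---|---|---|---|---|---|---|---|---|---|---|
| Sequential | AIME24 | Acc. | 50.0 | 53.3 | 56.7 | ↑3.4 | 63.3 | 63.3 | 66.7 | ↑3.4 |
| Sequential | AIME24 | Token | 12891 | 38596 | 8953 | ↓76.8% | 11263 | 32092 | 10593 | ↓67.0% |
| Sequential | AIME24 | Speed | 421 | 2808 | 688 | ×4.1 | 1018 | 4554 | 1149 | ×4.0 |
| Sequential | AIME25 | Acc. | 46.7 | 53.3 | 53.3 | ↑0.0 | 56.7 | 60.0 | 63.3 | ↑3.3 |
| Sequential | AIME25 | Token | 14692 | 45088 | 8699 | ↓80.7% | 11298 | 44618 | 11402 | ↓74.4% |
| Sequential | AIME25 | Speed | 450 | 3150 | 820 | ×3.8 | 1040 | 5184 | 1473 | ×3.5 |
| Sequential | GPQA-Diamond | Acc. | 39.9 | 43.4 | 45.5 | ↑2.1 | 48.5 | 51.5 | 52.0 | ↑0.5 |
| Sequential | GPQA-Diamond | Token | 6435 | 17986 | 4571 | ↓74.6% | 5547 | 12676 | 5454 | ↓57.0% |
| Sequential | GPQA-Diamond | Speed | 813 | 5771 | 854 | ×6.8 | 1043 | 9714 | 1475 | ×6.6 |
| Hierarchical | AIME24 | Acc. | 50.0 | 53.3 | 53.3 | ↑0.0 | 63.3 | 70.0 | 73.3 | ↑3.3 |
| Hierarchical | AIME24 | Token | 12891 | 42629 | 7526 | ↓82.3% | 11263 | 29025 | 10230 | ↓64.8% |
| Hierarchical | AIME24 | Speed | 421 | 3132 | 776 | ×4.0 | 1018 | 5718 | 1089 | ×5.3 |
| Hierarchical | AIME25 | Acc. | 46.7 | 50.0 | 50.0 | ↑0.0 | 56.7 | 66.7 | 66.7 | ↑0.0 |
| Hierarchical | AIME25 | Token | 14692 | 53929 | 13230 | ↓75.5% | 11298 | 50003 | 9527 | ↓80.9% |
| Hierarchical | AIME25 | Speed | 450 | 3488 | 616 | ×5.7 | 1040 | 6019 | 1056 | ×5.7 |
| Hierarchical | GPQA-Diamond | Acc. | 39.9 | 43.0 | 46.9 | ↑3.9 | 48.5 | 52.0 | 53.0 | ↑1.0 |
| Hierarchical | GPQA-Diamond | Token | 6435 | 22450 | 3395 | ↓84.9% | 5547 | 20931 | 3606 | ↓82.8% |
| Hierarchical | GPQA-Diamond | Speed | 813 | 6108 | 798 | ×7.7 | 1043 | 9119 | 1458 | ×6.3 |

4.1 Main Results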
Tables 1, 2, and 3 report the overall performance of LatentMAS across 9 general and reasoning-intensive benchmarks built from 5 different scales of LLM backbones. To thoroughly examine collaboration behaviors during inference, we evaluate each method from three complementary perspectives: (i) task accuracy, (ii) system throughput (total output tokens), and (iii) end-to-end inference speed. Across all tasks, LatentMAS consistently improves over the single-model baseline by an average of 14.6% and 13.3% under the sequential and hierarchical settings, respectively, and further yields gains of 2.8% and 4.6% over text-based MAS. Under identical MAS architectures, LatentMAS provides 4× and 4.3× faster inference speed on average compared with sequential and hierarchical text-based MAS. Additionally, as the agents’ collaboration occurs entirely in latent space, LatentMAS reduces token usage by 70.8% and 83.7% relative to sequential and hierarchical TextMAS. We leave additional results on Llama backbones in Appendix D.2.
4.2 Efficiency Analyses on Latent Collaboration
End-to-end Inference Speed. As established earlier in Theorem 3.1, LatentMAS can achieve orders-of-magnitude higher expressiveness, thereby enabling more efficient inference than text-based MAS. We further empirically validate this advantage through comparisons with TextMAS. As shown in Figures 4 and 10 (left), even after accelerating the TextMAS baselines using the vLLM service, LatentMAS still achieves a 2.6×-7× speedup over the vLLM-optimized TextMAS. This improvement stems from the substantially reduced number of latent steps required for latent thoughts generation compared with the much larger number of decoding steps needed for per-token text generation. With fewer than 50 latent steps, LatentMAS attains comparable or even higher performance on reasoning-intensive tasks such as AIME 24/25, whereas TextMAS typically requires more than 20K output tokens to complete full text-based CoT trajectories.
Token Throughput. In addition, as illustrated in Figures 4 and 10 (right), LatentMAS reduces token usage by 59.4%-87.9% compared with TextMAS, as agents in LatentMAS communicate by directly transferring latent working memory into another agent’s internal layers rather than relying on the text medium. LatentMAS also achieves 15.0%-60.3% lower token usage than single agents. Compared with single-model reasoning, LatentMAS distributes the input question across collaborating agents, greatly reducing the burden on the final agent, which primarily aggregates preceding latent thoughts and decodes the final answer using only a small number of tokens. As a result, the entire system generates fewer output tokens while still achieving higher accuracy.
Figure 5: Illustration of the semantic meaning encoded by LatentMAS. LatentMAS’s generated latent thoughts embeddings largely cover the embedding space of text-based generated tokens, indicating the semantic consistency and expressive capacity of LatentMAS.
Figure 6: Effectiveness of the input-output alignment $W_{a}$ on MedQA. Unaligned output embeddings ($h_{t}$) drift away from the original input embeddings ($e_{t}$), while the aligned vectors ($e_{t+1}$) realign with $e_{t}$, demonstrating that $W_{a}$ preserves embedding-space structure and prevents representation drift. Analyses on additional models are provided in Appendix E.1.
5 In-depth Analyses on LatentMAS
Do Latent Thoughts Reflect Text Reasoning? We first verify whether latent thoughts generation in LatentMAS produces meaningful and semantically expressive representations. To this end, we compare the distribution of newly generated last-layer embeddings in LatentMAS with the embeddings of token-by-token responses produced by TextMAS. Experiments are conducted on 300 MedQA questions, using 40 latent steps for LatentMAS and a 4,096 max-token budget for the TextMAS baseline.
As shown in Figure 5, we highlight two key observations: (i) The last-layer embeddings from LatentMAS share nearly the same region of the embedding space with the token embeddings from TextMAS, indicating that latent thoughts encode semantic representations similar to those of the correct text responses. (ii) The last-layer embeddings from LatentMAS largely cover the distribution of token embeddings from TextMAS, indicating that latent thoughts offer greater diversity and expressive capacity than discrete tokens. Together, these findings show that latent thoughts not only capture the valid semantics of their corresponding text responses but also encode richer and more expressive representations. We further include a case study in Appendix J analyzing how LLM agents in LatentMAS interpret their own latent thoughts to provide additional validation.
Effectiveness on Input-Output Alignment. We next empirically evaluate the effectiveness of the input-output alignment in our method design. First, we compare the input vector $e_{t}$ obtained from the standard token embedding layer with both the newly generated output vector $h_{t}$ before alignment and the aligned vector $e_{t+1}$. As shown in Figure 6, we visualize the three embedding vectors by comparing both density distributions and geometric relationships in the projected embedding space. We observe that the new $h_{t}$ deviates largely from the original input embedding $e_{t}$. After applying $W_{a}$, the aligned vector $e_{t+1}$ realigns with $e_{t}$, indicating that $W_{a}$ effectively restores the geometric and statistical structure of the input embedding space and mitigates representation drift across iterative latent steps. In Figure 7, across 3 different tasks, we further compare downstream performance before and after applying $W_{a}$ and observe consistent accuracy gains of 2.3%-5.3% from $W_{a}$.
Figure 7:Downstream accuracy performance (%) on 3 benchmarks before/after applying the input-output alignmentWaW_{a}.Optimal Latent Step Depth.To understand how many latent steps are needed for optimal performance in LatentMAS, we analyze the effect of increasing latent step depth across three downstream tasks. As shown in Figure8, increasing the number of latent steps generally improves downstream performance, indicating that additional latent thoughts enhance collaborative expressiveness. Across the three tasks on Qwen3-14B, we find that accuracy steadily rises and peaks around 40-80 steps. Beyond this range, performance plateaus or declines, suggesting that excessive latent thought generation may introduce redundant or less useful information. Based on this observation, we adopt a moderate latent step budget within this range in practice, as it consistently provides the best accuracy-efficiency trade-off without requiring any task-specific training procedures.
Figure 8:Effectiveness of different latent step depths.Intermediate Error Tracking within LatentMAS.To better audit and interpret the latent collaboration process, we extend our original framework with an additional“debug mode”to enable a flexible conversion between inner latent thoughts and outer human-readable texts within each agent. We leave implementation details on how to converse to the debug mode in LatentMAS, together with additional detailed analyses and case studies, to AppendixE,F, andJ.
6 Related Work
LLM-based Multi-agent Systems. Recent studies in Agentic AI have extended classical multi-agent systems (Hong et al., 2023; Park et al., 2023a; Yang et al., 2024c), grounded in traditional reinforcement learning and policy coordination, to modern LLM settings (Li et al., 2025b; Tan et al., 2025), enabling models to operate as autonomous agents that collaborate in reasoning, planning, and problem-solving (Tao et al., 2024; Wang et al., 2025c; Zhao et al., 2025). Early methods (Yao et al., 2022; Wu et al., 2024; Li et al., 2023) coordinate multiple LLMs through explicit dialogue and role assignment. Additional methods introduce structured communication protocols to enhance cooperation efficiency (Ye et al., 2025a) and emergent specialization (Mieczkowski et al., 2025; Huang et al., 2026) among agents. Building upon these advanced text-based MAS methods, our work aims to construct a multi-agent collaboration system entirely in latent space, treating agents as tightly integrated components to achieve more expressive and efficient coordination.
Model Collaboration in Latent Space. Recent studies on model collaboration (Tran et al., 2025) have gradually shifted from text-level coordination toward interaction directly in latent representations. Instead of exchanging information through text, these works (Coda-Forno et al., 2025; Ye et al., 2025a; Fu et al., 2025; Zheng et al., 2025) enable models to communicate through internal model states such as hidden representations (Coda-Forno et al., 2025) for efficiency (Kong et al., 2026). Most existing methods rely on auxiliary alignment mechanisms and typically require additional training to facilitate effective cross-model communication. In contrast, LatentMAS is a training-free latent multi-agent system in which each agent reasons and communicates via native latent representations.
We leave additional related works on latent reasoning and agent collaboration to Appendix H.
7 Conclusion
We introduce LatentMAS, a training-free framework that enables multi-agent systems to collaborate entirely within latent space. Our framework enables latent auto-regressive reasoning with a lossless latent working-memory transfer mechanism. Theoretical analyses and empirical evaluations across diverse reasoning and generation benchmarks demonstrate that LatentMAS consistently improves accuracy performance, token usage, and decoding speed over strong single-model and text-based MAS baselines. LatentMAS establishes a scalable and general paradigm for agentic collaboration beyond the constraints of natural language.
Acknowledgment
The authors thank members of the iDEA-iSAIL Lab and the Zou Group for helpful discussions and feedback on this work. This work is supported by the National Science Foundation under Award No. IIS-2433308 and Award No. IIS-2117902. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.
Impact Statement
This paper introduces LatentMAS, a framework for multi-agent collaboration in the latent space, aiming to improve reasoning performance while reducing inference cost. The methods and findings in this work are intended to advance research in efficient and scalable machine learning systems. We do not anticipate immediate negative societal impacts beyond those commonly associated with deploying more capable and efficient language model systems.
References
- Acharya et al. (2025) Acharya, D. B., Kuppan, K., and Divya, B. Agentic ai: Autonomous intelligence for complex goals–a comprehensive survey. IEEE Access, 2025.
- Ainsworth et al. (2022) Ainsworth, S. K., Hayase, J., and Srinivasa, S. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022.
- Cemri et al. (2025) Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., et al. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657, 2025.
- Chen et al. (2025a) Chen, W., Yuan, J., Qian, C., Yang, C., Liu, Z., and Sun, M. Optima: Optimizing effectiveness and efficiency for llm-based multi-agent system. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 11534–11557, 2025a.
- Chen et al. (2025b) Chen, X., Zhao, A., Xia, H., Lu, X., Wang, H., Chen, Y., Zhang, W., Wang, J., Li, W., and Shen, X. Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning. arXiv preprint arXiv:2505.16782, 2025b.
- Clark et al. (2018a) Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018a.
- Clark et al. (2018b) Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018b.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168.
- Coda-Forno et al. (2025) Coda-Forno, J., Zhao, Z., Zhang, Q., Tamboli, D., Li, W., Fan, X., Zhang, L., Schulz, E., and Tseng, H.-P. Exploring system 1 and 2 communication for latent reasoning in llms. arXiv preprint arXiv:2510.00494, 2025.
- Ding et al. (2025) Ding, H., Tao, S., Pang, L., Wei, Z., Gao, J., Ding, B., Shen, H., and Cheng, X. Toolcoder: A systematic code-empowered tool learning framework for large language models. arXiv preprint arXiv:2502.11404, 2025.
- Face (2025) Face, H. Transformers documentation. https://huggingface.co/docs/transformers/en/index, 2025.
- Fourney et al. (2024) Fourney, A., Bansal, G., Mozannar, H., Tan, C., Salinas, E., Niedtner, F., Proebsting, G., Bassman, G., Gerrits, J., Alber, J., et al. Magentic-one: A generalist multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468, 2024.
- Fu et al. (2025) Fu, T., Min, Z., Zhang, H., Yan, J., Dai, G., Ouyang, W., and Wang, Y. Cache-to-cache: Direct semantic communication between large language models. arXiv preprint arXiv:2510.03215, 2025.
- Fungwacharakorn et al. (2024) Fungwacharakorn, W., Thanh, N. H., Zin, M. M., and Satoh, K. Layer-of-thoughts prompting (lot): Leveraging llm-based retrieval with constraint hierarchies. arXiv preprint arXiv:2410.12153, 2024.
- Grattafiori et al. (2024) Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Guo et al. (2024) Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., Wiest, O., and Zhang, X. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024.
- Hao et al. (2024) Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.
- Hoerl & Kennard (1970) Hoerl, A. E. and Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
- Hong et al. (2023) Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S. K. S., Lin, Z., et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023.
- Hu et al. (2025) Hu, M., Zhou, Y., Fan, W., Nie, Y., Xia, B., Sun, T., Ye, Z., Jin, Z., Li, Y., Chen, Q., et al. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation. arXiv preprint arXiv:2505.23885, 2025.
- Huang et al. (2026) Huang, Q., Zhou, Z., Li, Y., Yang, K., Wang, B., and Wang, Y. Many minds, one goal: Time series forecasting via sub-task specialization and inter-agent cooperation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=Uon41HfqR3.
- Jin et al. (2025) Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., and Han, J. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.
- Kong et al. (2026) Kong, Z., Li, Y., Zeng, F., Xin, L., Messica, S., Lin, X., Zhao, P., Kellis, M., Tang, H., and Zitnik, M. Token reduction should go beyond efficiency in generative models – from vision, language to multimodality, 2026. URL https://arxiv.org/abs/2505.18227.
- Kwon et al. (2023) Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pp. 611–626, 2023.
- Li et al. (2025a) Li, B., Sun, X., Liu, J., Wang, Z., Wu, J., Yu, X., Chen, H., Barsoum, E., Chen, M., and Liu, Z. Latent visual reasoning. arXiv preprint arXiv:2509.24251, 2025a.
- Li et al. (2023) Li, G., Al Kader Hammoud, H. A., Itani, H., Khizbullin, D., and Ghanem, B. Camel: communicative agents for "mind" exploration of large language model society. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc.
- Li et al. (2025b) Li, Z., Ji, Q., Ling, X., and Liu, Q. A comprehensive review of multi-agent reinforcement learning in video games. IEEE Transactions on Games, 2025b.
- Li et al. (2025c) Li, Z., Zhang, H., Han, S., Liu, S., Xie, J., Zhang, Y., Choi, Y., Zou, J., and Lu, P. In-the-flow agentic system optimization for effective planning and tool use. arXiv preprint arXiv:2510.05592, 2025c.
- Li et al. (2026) Li, Z., Zou, J., Fang, F., Ning, X., Ai, M., Wei, T., Chen, S., Yang, X., and He, J. Heterogeneous scientific foundation model collaboration. CoRR, abs/2604.27351, 2026. doi: 10.48550/ARXIV.2604.27351. URL https://doi.org/10.48550/arXiv.2604.27351.
- Liu et al. (2023) Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572, 2023.
- Liu et al. (2024) Liu, L., Pfeiffer, J., Wu, J., Xie, J., and Szlam, A. Deliberation in latent space via differentiable cache augmentation. arXiv preprint arXiv:2412.17747, 2024.
- math ai (2025) math ai. AIME 2025 dataset. https://huggingface.co/datasets/math-ai/aime25, 2025.
- Maxwell-Jia (2024) Maxwell-Jia. AIME 2024 dataset. https://huggingface.co/datasets/Maxwell-Jia/AIME_2024, 2024.
- Meegahapola et al. (2019) Meegahapola, L., Subramaniam, V., Kaplan, L., and Misra, A. Prior activation distribution (pad): A versatile representation to utilize dnn hidden units. arXiv preprint arXiv:1907.02711, 2019.
- Mieczkowski et al. (2025) Mieczkowski, E., Mon-Williams, R., Bramley, N., Lucas, C. G., Velez, N., and Griffiths, T. L. Predicting multi-agent specialization via task parallelizability. arXiv preprint arXiv:2503.15703, 2025.
- Park et al. (2023a) Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp. 1–22, 2023a.
- Park et al. (2023b) Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023b.
- Pezeshkpour et al. (2024) Pezeshkpour, P., Kandogan, E., Bhutani, N., Rahman, S., Mitchell, T., and Hruschka, E. Reasoning capacity in multi-agent systems: Limitations, challenges and human-centered solutions. arXiv preprint arXiv:2402.01108, 2024.
- Rein et al. (2023) Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022.
- Sagi & Rokach (2018) Sagi, O. and Rokach, L. Ensemble learning: A survey. Wiley interdisciplinary reviews: data mining and knowledge discovery, 8(4):e1249, 2018.
- Shi et al. (2025) Shi, X., Chiesa, M., Maguire Jr, G. Q., and Kostic, D. Kvcomm: Enabling efficient llm communication through selective kv sharing. arXiv preprint arXiv:2510.03346, 2025.
- Tan et al. (2025) Tan, L., Wei, F., Ma, X., Peng, R., Xiao, H., and Yang, L. Systemic condition-based maintenance optimization under inspection uncertainties: A customized multiagent reinforcement learning approach. IEEE Transactions on Reliability, 2025.
- Tao et al. (2024) Tao, W., Zhou, Y., Wang, Y., Zhang, W., Zhang, H., and Cheng, Y. Magis: Llm-based multi-agent framework for github issue resolution. Advances in Neural Information Processing Systems, 37:51963–51993, 2024.
- Tran et al. (2025) Tran, K.-T., Dao, D., Nguyen, M.-D., Pham, Q.-V., O'Sullivan, B., and Nguyen, H. D. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322, 2025.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Wang et al. (2025a) Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., and Zou, J. Mixture-of-agents enhances large language model capabilities. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=h0ZfDIrj7T.
- Wang et al. (2025b) Wang, Q., Shi, Y., Wang, Y., Zhang, Y., Wan, P., Gai, K., Ying, X., and Wang, Y. Monet: Reasoning in latent visual space beyond images and language. arXiv preprint arXiv:2511.21395, 2025b.
- Wang et al. (2025c) Wang, Z., Moriyama, S., Wang, W.-Y., Gangopadhyay, B., and Takamatsu, S. Talk structurally, act hierarchically: A collaborative framework for llm multi-agent systems. arXiv preprint arXiv:2502.11098, 2025c.
- Wortsman et al. (2022) Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International conference on machine learning, pp. 23965–23998. PMLR, 2022.
- Wu et al. (2025) Wu, F., Li, Z., Wei, F., Li, Y., Ding, B., and Gao, J. Talk to right specialists: Routing and planning in multi-agent system for question answering. arXiv preprint arXiv:2501.07813, 2025.
- Wu et al. (2024) Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024.
- Yan et al. (2025) Yan, B., Zhou, Z., Zhang, L., Zhang, L., Zhou, Z., Miao, D., Li, Z., Li, C., and Zhang, X. Beyond self-talk: A communication-centric survey of llm-based multi-agent systems. arXiv preprint arXiv:2502.14321, 2025.
- Yang et al. (2025) Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- Yang et al. (2024a) Yang, H., Chen, H., Guo, H., Chen, Y., Lin, C.-S., Hu, S., Hu, J., Wu, X., and Wang, X. Llm-medqa: Enhancing medical question answering through case studies in large language models. arXiv preprint arXiv:2501.05464, 2024a.
- Yang et al. (2024b) Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K. R., and Press, O. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b. URL https://openreview.net/forum?id=mXpq6ut8J3.
- Yang et al. (2024c) Yang, Y., Peng, Q., Wang, J., Wen, Y., and Zhang, W. Llm-based multi-agent systems: Techniques and business perspectives. arXiv preprint arXiv:2411.14033, 2024c.
- Yao et al. (2022) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022.
- Ye et al. (2025a) Ye, H., Gao, Z., Ma, M., Wang, Q., Fu, Y., Chung, M.-Y., Lin, Y., Liu, Z., Zhang, J., Zhuo, D., et al. Kvcomm: Online cross-context kv-cache communication for efficient llm-based multi-agent systems. arXiv preprint arXiv:2510.12872, 2025a.
- Ye et al. (2025b) Ye, J., Zhang, X., Xu, H., Liu, H., Wang, J., Zhu, Z., Zheng, Z., Gao, F., Cao, J., Lu, Z., et al. Mobile-agent-v3: Fundamental agents for gui automation. arXiv preprint arXiv:2508.15144, 2025b.
- Yue et al. (2024) Yue, L., Xing, S., Chen, J., and Fu, T. Clinicalagent: Clinical trial multi-agent system with large language model-based reasoning. In Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1–10, 2024.
- Zhang et al. (2024a) Zhang, C., He, S., Qian, J., Li, B., Li, L., Qin, S., Kang, Y., Ma, M., Liu, G., Lin, Q., et al. Large language model-brained gui agents: A survey. arXiv preprint arXiv:2411.18279, 2024a.
- Zhang et al. (2024b) Zhang, Y., Sun, R., Chen, Y., Pfister, T., Zhang, R., and Arik, S. Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems, 37:132208–132237, 2024b.
- Zhang et al. (2025) Zhang, Z., He, X., Yan, W., Shen, A., Zhao, C., Wang, S., Shen, Y., and Wang, X. E. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space. arXiv preprint arXiv:2505.15778, 2025.
- Zhao et al. (2026) Zhao, J., Xie, H., Lei, Y., Song, X., Shi, Z., Li, L., Liu, S., Xie, L., and Zhang, H. Cochain: Balancing insufficient and excessive collaboration in llm agent workflows, 2026. URL https://arxiv.org/abs/2505.10936.
- Zhao et al. (2025) Zhao, W., Yuksekgonul, M., Wu, S., and Zou, J. Sirius: Self-improving multi-agent systems via bootstrapped reasoning. arXiv preprint arXiv:2502.04780, 2025.
- Zheng et al. (2025) Zheng, Y., Zhao, Z., Li, Z., Xie, Y., Gao, M., Zhang, L., and Zhang, K. Thought communication in multiagent collaboration. arXiv preprint arXiv:2510.20733, 2025.
- Zhou et al. (2025) Zhou, H., Geng, H., Xue, X., Kang, L., Qin, Y., Wang, Z., Yin, Z., and Bai, L. Reso: A reward-driven self-organizing llm-based multi-agent system for reasoning tasks. arXiv preprint arXiv:2503.02390, 2025.
- Zhou et al. (2019) Zhou, W., Du, J., and Ren, X. Improving bert fine-tuning with embedding normalization. arXiv preprint arXiv:1911.03918, 2019.
- Zhu et al. (2025) Zhu, H., Hao, S., Hu, Z., Jiao, J., Russell, S., and Tian, Y. Reasoning by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint arXiv:2505.12514, 2025.
- Zhuge et al. (2024) Zhuge, M., Wang, W., Kirsch, L., Faccio, F., Khizbullin, D., and Schmidhuber, J. Language agents as optimizable graphs. arXiv preprint arXiv:2402.16823, 2024.
- Zou et al. (2023) Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023.
Table of Contents
- 1 Introduction
- 2 Preliminary and Notations
- 3 LatentMAS
  - 3.1 Auto-regressive Latent Thoughts Generation
  - 3.2 Thoughts Transfer via Latent Working Memory
  - 3.3 End-to-End Pipeline with Complexity Analyses
- 4 Empirical Evaluations
  - 4.1 Main Results
  - 4.2 Efficiency Analyses on Latent Collaboration
- 5 In-depth Analyses on LatentMAS
- 6 Related Work
- 7 Conclusion
- References
- A Input-Output Alignment in LatentMAS
  - A.1 Theoretical Justification on $W_a$
  - A.2 Solving the Alignment Matrix $W_a$
- B Theoretical Analysis
  - B.1 Proof of Theorem 3.1
  - B.2 Proof of Theorem 3.3
  - B.3 Proof of Theorem 3.4
- C Experiment Setups
  - C.1 Evaluation Details
  - C.2 Implementation Details
- D Additional Experiments
  - D.1 Performance of LatentMAS under Hierarchical MAS Setting
  - D.2 LatentMAS on Llama Backbones
  - D.3 Additional Efficiency Analyses on LatentMAS
- E Detailed Analyses on LatentMAS
  - E.1 Additional Analyses on Input-Output Alignment
  - E.2 Quantifying Latent Thought Diversity
  - E.3 Ablations on Latent Reasoning and Communication in LatentMAS
- F How to Debug LatentMAS?
- G Additional Discussions on LatentMAS
- H Additional Related Works
- I Case Study on Intermediate Error Tracking of LatentMAS
- J Case Study on LatentMAS
- K Prompt Template for LatentMAS
Appendix
Appendix A Input-Output Alignment in LatentMAS
A.1 Theoretical Justification on $W_a$
In this section, we outline the theoretical justification for how $W_a$ minimizes the distributional gap between the distribution of token embeddings and the distribution of aligned embeddings.
Let $P_e$ and $P_h$ be the distributions of the token embeddings $e$ and the hidden embeddings $h$, respectively. We assume that $P_e$ and $P_h$ can be generated by $e=\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}$ and $h=\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}$, respectively, where each semantic $\sigma\in\mathbb{R}^{\mathcal{V}}$ is represented as a weight vector over tokens $\mathcal{V}$ with $\|\sigma\|_{2}\leq 1$ and follows an underlying discrete semantic distribution $\sigma\sim P_{\sigma}$, and $\beta>0$ characterizes the scale of the output embeddings $h$. In practice, we calculate the scale $\beta$ by $\frac{|\mathcal{V}|\,\|h\|}{\sum_{x\in\mathcal{V}}\|W_{\text{in},x}\|}$. For an alignment matrix $W_a$, the aligned embedding distribution $P_{\hat{e},W_{a}}$ is
$$P_{\hat{e},W_{a}}:\;\hat{e}=hW_{a},\quad h\sim P_{h}. \tag{5}$$
Our goal is to minimize the distance between the aligned embedding distribution $P_{\hat{e},W_{a}}$ and the token embedding distribution $P_e$, which we measure via the Wasserstein distance:
$$d_{\textnormal{Wasserstein}}(P_{\hat{e},W_{a}},P_{e}):=\inf_{\gamma\in\Gamma(P_{\hat{e},W_{a}},P_{e})}\sqrt{\mathop{\mathbb{E}}_{(\hat{e},e)\sim\gamma}\big[\|\hat{e}-e\|_{2}^{2}\big]}, \tag{6}$$
where $\Gamma(P_{\hat{e},W_{a}},P_{e})$ is the set of all couplings of $P_{\hat{e},W_{a}}$ and $P_e$.
Theorem A.1 (Upper Bound on Distribution Alignment).
For any alignment matrix $W_a$, the Wasserstein distance between $P_e$ and $P_{\hat{e},W_{a}}$ is upper bounded by
$$d_{\textnormal{Wasserstein}}(P_{\hat{e},W_{a}},P_{e})\leq\|\beta W_{\textnormal{out}}W_{a}-W_{\textnormal{in}}\|_{F}. \tag{7}$$
As we show in Appendix A.2, our choice of $W_a$ (Equation 3) minimizes this upper bound on $d_{\textnormal{Wasserstein}}(P_{\hat{e},W_{a}},P_{e})$.
Proof.
Consider the following joint distribution $\gamma^{*}(\hat{e},e)$:
$$\gamma^{*}(\hat{e},e):=\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}=\hat{e}]}\,1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}=e]}. \tag{8}$$
Note that for every $\hat{e}$,
\begin{align}
\sum_{e\in\operatorname{supp}(P_{e})}\gamma^{*}(\hat{e},e)
&=\sum_{e\in\operatorname{supp}(P_{e})}\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}=\hat{e}]}\,1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}=e]} \tag{9}\\
&=\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}=\hat{e}]}\sum_{e\in\operatorname{supp}(P_{e})}1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}=e]} \tag{10}\\
&=\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}=\hat{e}]}\,1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}\in\operatorname{supp}(P_{e})]} \tag{11}\\
&=\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}=\hat{e}]} \tag{12}\\
&=P_{\hat{e},W_{a}}(\hat{e}); \tag{13}
\end{align}
and that for every $e$,
\begin{align}
\sum_{\hat{e}\in\operatorname{supp}(P_{\hat{e},W_{a}})}\gamma^{*}(\hat{e},e)
&=\sum_{\hat{e}\in\operatorname{supp}(P_{\hat{e},W_{a}})}\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}=\hat{e}]}\,1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}=e]} \tag{14}\\
&=\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}=e]}\sum_{\hat{e}\in\operatorname{supp}(P_{\hat{e},W_{a}})}1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}=\hat{e}]} \tag{15}\\
&=\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}=e]}\,1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}\in\operatorname{supp}(P_{\hat{e},W_{a}})]} \tag{16}\\
&=\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}=e]} \tag{17}\\
&=P_{e}(e). \tag{18}
\end{align}
This implies $\gamma^{*}\in\Gamma(P_{\hat{e},W_{a}},P_{e})$. Finally, since $\|\sigma\|_{2}\leq 1$ for all semantics $\sigma\in\operatorname{supp}(P_{\sigma})$, we have
\begin{align}
d_{\textnormal{Wasserstein}}(P_{\hat{e},W_{a}},P_{e})
&=\inf_{\gamma\in\Gamma(P_{\hat{e},W_{a}},P_{e})}\sqrt{\mathop{\mathbb{E}}_{(\hat{e},e)\sim\gamma}\big[\|\hat{e}-e\|_{2}^{2}\big]} \tag{19}\\
&\leq\sqrt{\mathop{\mathbb{E}}_{(\hat{e},e)\sim\gamma^{*}}\big[\|\hat{e}-e\|_{2}^{2}\big]} \tag{20}\\
&=\sqrt{\sum_{\hat{e}\in\operatorname{supp}(P_{\hat{e},W_{a}})}\sum_{e\in\operatorname{supp}(P_{e})}\gamma^{*}(\hat{e},e)\,\|\hat{e}-e\|_{2}^{2}} \tag{21}\\
&=\sqrt{\sum_{\hat{e}\in\operatorname{supp}(P_{\hat{e},W_{a}})}\sum_{e\in\operatorname{supp}(P_{e})}\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}=\hat{e}]}\,1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}=e]}\,\|\hat{e}-e\|_{2}^{2}} \tag{22}\\
&=\sqrt{\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\sum_{\hat{e}\in\operatorname{supp}(P_{\hat{e},W_{a}})}\sum_{e\in\operatorname{supp}(P_{e})}1_{[\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}=\hat{e}]}\,1_{[\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}=e]}\,\|\hat{e}-e\|_{2}^{2}} \tag{23}\\
&=\sqrt{\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\bigg\|\beta\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{out},x}W_{a}-\sum_{x\in\mathcal{V}}\sigma_{x}W_{\text{in},x}\bigg\|_{2}^{2}} \tag{24}\\
&=\sqrt{\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\bigg\|\sum_{x\in\mathcal{V}}\sigma_{x}\big(\beta W_{\text{out},x}W_{a}-W_{\text{in},x}\big)\bigg\|_{2}^{2}} \tag{25}\\
&=\sqrt{\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,\|\sigma(\beta W_{\text{out}}W_{a}-W_{\text{in}})\|_{2}^{2}} \tag{26}\\
&\leq\sqrt{\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,\|\sigma\|_{2}^{2}\,\|\beta W_{\text{out}}W_{a}-W_{\text{in}}\|_{2}^{2}} \tag{27}\\
&\leq\sqrt{\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,\|\beta W_{\text{out}}W_{a}-W_{\text{in}}\|_{2}^{2}} \tag{28}\\
&\leq\sqrt{\sum_{\sigma\in\operatorname{supp}(P_{\sigma})}P_{\sigma}(\sigma)\,\|\beta W_{\text{out}}W_{a}-W_{\text{in}}\|_{F}^{2}} \tag{29}\\
&=\sqrt{\|\beta W_{\text{out}}W_{a}-W_{\text{in}}\|_{F}^{2}} \tag{30}\\
&=\|\beta W_{\text{out}}W_{a}-W_{\text{in}}\|_{F}. \tag{31}
\end{align}
∎
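The norm inequalities used in the last steps of the proof (a row vector with $\|\sigma\|_2 \leq 1$ times a matrix is bounded by the spectral norm, which in turn is bounded by the Frobenius norm) can be sanity-checked numerically. The sketch below is illustrative only: the matrix `M` is a random stand-in for $\beta W_{\text{out}}W_a - W_{\text{in}}$, and the sizes `V` and `d` are arbitrary assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8                      # hypothetical vocabulary size and embedding width
M = rng.standard_normal((V, d))   # stand-in for beta * W_out @ W_a - W_in

spec = np.linalg.norm(M, 2)       # spectral norm ||M||_2
fro = np.linalg.norm(M, "fro")    # Frobenius norm ||M||_F

worst = 0.0
for _ in range(1000):
    sigma = rng.standard_normal(V)
    sigma /= max(1.0, np.linalg.norm(sigma))   # enforce ||sigma||_2 <= 1
    worst = max(worst, np.linalg.norm(sigma @ M))

# ||sigma M||_2 <= ||sigma||_2 ||M||_2 <= ||M||_2 <= ||M||_F
print(worst <= spec <= fro)
```

Random sampling only probes the inequality; it does not replace the proof.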
A.2 Solving the Alignment Matrix $W_a$
In Section 3.1, we feed the last-layer hidden states $h$ back into the input sequence to enable the model's latent reasoning. However, since $h$ is not perfectly aligned with the input embedding space, directly feeding $h$ into shallow layers may lead to out-of-distribution activation patterns inside LLMs. To mitigate this in a training-free way, we seek a matrix $W_a$ which maps $h$ to a valid input space (i.e., $e=hW_a$). A straightforward way to calculate $W_a$ is to enforce that the aligned latent vector $e$ behaves similarly to a real input embedding when it enters the model. Motivated by Theorem A.1, this corresponds to the following minimization problem:
$$\min_{W_{a}}\|\beta W_{\text{out}}W_{a}-W_{\text{in}}\|_{F}^{2}. \tag{32}$$
This objective is quadratic in $W_a$, so we can derive a closed-form solution by setting its derivative to zero, which yields the normal equation:
$$\beta W_{\text{out}}^{\top}W_{\text{out}}W_{a}-W_{\text{out}}^{\top}W_{\text{in}}=0. \tag{33}$$
Solving for $W_a$ gives:
$$W_{a}=\frac{1}{\beta}\big(W_{\text{out}}^{\top}W_{\text{out}}\big)^{-1}W_{\text{out}}^{\top}W_{\text{in}}. \tag{34}$$
For numerical stability, we further add a small hyperparameter $\lambda>0$ to obtain a ridge regression solution (Hoerl & Kennard, 1970):
$$W_{a}=\frac{1}{\beta}\big(W_{\text{out}}^{\top}W_{\text{out}}+\lambda I\big)^{-1}W_{\text{out}}^{\top}W_{\text{in}}, \tag{35}$$
which we compute once and reuse for all latent reasoning steps. Here, the small $\lambda>0$ serves to approximate the pseudo-inverse $W_{\text{out}}^{\dagger}=\lim_{\lambda\searrow 0}\big(W_{\text{out}}^{\top}W_{\text{out}}+\lambda I\big)^{-1}W_{\text{out}}^{\top}$.
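The closed-form ridge solution above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `alignment_matrix`, the random matrices standing in for $W_{\text{in}}$ and $W_{\text{out}}$ (both of shape $|\mathcal{V}|\times d_h$), and the values of `beta` and `lam` are all hypothetical.

```python
import numpy as np

def alignment_matrix(W_in, W_out, beta=1.0, lam=1e-4):
    """Ridge solution W_a = (1/beta) (W_out^T W_out + lam I)^{-1} W_out^T W_in."""
    d = W_out.shape[1]
    gram = W_out.T @ W_out + lam * np.eye(d)   # (d, d), regularized Gram matrix
    rhs = W_out.T @ W_in                       # (d, d)
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(gram, rhs) / beta

# Toy stand-ins for the embedding matrices (assumed shapes: |V| x d_h).
rng = np.random.default_rng(0)
V, d = 200, 16
W_in = rng.standard_normal((V, d))
W_out = rng.standard_normal((V, d))
beta = 2.0

W_a = alignment_matrix(W_in, W_out, beta=beta, lam=1e-6)

# Objective value with the solved W_a versus no alignment (W_a = I).
residual = np.linalg.norm(beta * W_out @ W_a - W_in)
baseline = np.linalg.norm(beta * W_out - W_in)

# Applying the alignment to one hidden vector h: e = h W_a.
h = rng.standard_normal(d)
e_next = h @ W_a
```

Because $W_a$ depends only on the embedding matrices, it can be computed once and reused at every latent step, as the text notes; `np.linalg.solve` is used here purely as a numerically safer alternative to an explicit inverse.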
Appendix B Theoretical Analysis
B.1 Proof of Theorem 3.1
Assumption B.1 (Linear Representation Hypothesis; Park et al., 2023b).
We assume that the hidden embeddings $h$ are linear combinations $\sum_{i=1}^{d_h}c_i s_i$ of an underlying semantic basis $\{s_1,\dots,s_{d_h}\}\subset\mathbb{R}^{d_h}$ (linearly independent) with ternary coefficients $c_1,\dots,c_{d_h}\in\{0,\pm 1\}$, where $c_i=0$ represents that $h$ does not have semantic $i$, and $c_i=\pm 1$ represents that $h$ has semantic $i$ in a positive/negative way.
Theorem B.1 (Restatement of Theorem 3.1).
Under the Linear Representation Hypothesis on $h$, if the sequence of all latent thoughts with length $m$ can be expressed losslessly through corresponding text-based reasoning, then the length of the text (in tokens) needs to be at least $\Omega\big(d_h m/\log|\mathcal{V}|\big)$, where $|\mathcal{V}|>1$ denotes the vocabulary size.
Proof of Theorem 3.1.
Under Assumption B.1, the set $\mathcal{H}$ of hidden embeddings is
$$\mathcal{H}=\bigg\{\sum_{i=1}^{d_h}c_i s_i : c_1,\dots,c_{d_h}\in\{0,\pm 1\}\bigg\}, \tag{36}$$
where $\{s_1,\dots,s_{d_h}\}\subset\mathbb{R}^{d_h}$ is the underlying semantic basis. Then, the set of length-$m$ latent reasoning sequences is $\mathcal{H}^m$. Since the semantic basis is linearly independent, the size of the set $\mathcal{H}$ of hidden embeddings is
$$|\mathcal{H}| = |\{0,\pm 1\}|^{|\{s_1,\dots,s_{d_h}\}|} = 3^{d_h}. \tag{37}$$
Thus, the size of the set of length-$m$ latent reasoning sequences is
$$|\mathcal{H}^m| = |\mathcal{H}|^m = (3^{d_h})^m = 3^{d_h m}. \tag{38}$$
To represent the set $\mathcal{H}^m$ of length-$m$ latent reasoning sequences via the set $\mathcal{V}^{m'}$ of length-$m'$ text-based reasoning sequences losslessly, there needs to exist a surjective map from $\mathcal{V}^{m'}$ to $\mathcal{H}^m$, which implies that $|\mathcal{V}^{m'}|\geq|\mathcal{H}^m|$. Therefore,
$$
\begin{aligned}
m' &= \log_{|\mathcal{V}|}\big(|\mathcal{V}|^{m'}\big) = \log_{|\mathcal{V}|}|\mathcal{V}^{m'}| && (39)\\
&\geq \log_{|\mathcal{V}|}|\mathcal{H}^m| = \log_{|\mathcal{V}|}\big(3^{d_h m}\big) && (40)\\
&= \frac{d_h m\log 3}{\log|\mathcal{V}|} = \Omega\Big(\frac{d_h m}{\log|\mathcal{V}|}\Big). && (41)
\end{aligned}
$$
∎
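The counting argument can be checked numerically. The sketch below, under the ternary-coefficient assumption of Assumption B.1, finds the smallest text length $m'$ with $|\mathcal{V}|^{m'}\geq 3^{d_h m}$ using exact integer arithmetic; the function name `min_text_tokens` is hypothetical and introduced only for illustration.

```python
import math


def min_text_tokens(d_h: int, m: int, vocab_size: int) -> int:
    """Smallest m' such that vocab_size**m' >= 3**(d_h * m)."""
    if vocab_size <= 1:
        raise ValueError("vocab_size must exceed 1")
    target = 3 ** (d_h * m)  # number of distinct length-m latent sequences
    # Start just below the closed-form estimate d_h * m * log 3 / log |V|,
    # then confirm with exact integers to sidestep floating-point rounding.
    estimate = d_h * m * math.log(3) / math.log(vocab_size)
    m_prime = max(0, math.floor(estimate) - 1)
    while vocab_size ** m_prime < target:
        m_prime += 1
    return m_prime


# 3**10 == 9**5, so five tokens from a 9-symbol vocabulary suffice here.
tokens_needed = min_text_tokens(d_h=2, m=5, vocab_size=9)
```

The result grows linearly in both $d_h$ and $m$ and shrinks only logarithmically with the vocabulary size, matching the $\Omega\big(d_h m/\log|\mathcal{V}|\big)$ bound.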
B.2 Proof of Theorem 3.3
Theorem B.2 (Restatement of Theorem 3.3).
In both latent and text-based reasoning, the outputs of an agent when receiving latent working memory from preceding agents are equivalent to those obtained when directly inputting the preceding agents’ outputs.
Proof.
Let $h^{(l)}, K^{(l)}, V^{(l)}$ and $h'^{(l)}, K'^{(l)}, V'^{(l)}$ denote the output, keys, and values of the $l$-th transformer layer when receiving latent working memory from preceding agents and when directly inputting the preceding agents' outputs, respectively. In the following, we use induction to show that $h^{(l)}=h'^{(l)}$ for every layer $l=1,\dots,L$.
Induction step.
Suppose that $h^{(l-1)}=h'^{(l-1)}$; we will show that $h^{(l)}=h'^{(l)}$.
The KV cache contains $K^{(l)}_{\leq t+m}$ and $V^{(l)}_{\leq t+m}$. For each past token, at each attention layer, the transformer produces one column of $K^{(l)}_{\leq t+m}$ and a corresponding column of $V^{(l)}_{\leq t+m}$. At the next step, the model forms a query from the current input and then uses that query together with the stored $K^{(l)}_{\leq t+m}$ and $V^{(l)}_{\leq t+m}$ to form the attention result. That attention result is a deterministic function of the query and of the keys and values it attends to.
We are comparing two ways to make those same keys and values available to the current computation: (i) feeding the earlier tokens into the model again, in which case the model recomputes the same keys and values and then uses them in attention; (ii) reading $K^{(l)}_{\leq t+m}$ and $V^{(l)}_{\leq t+m}$ from the cache and using them directly. In both cases the keys and values presented to the attention computation are identical, because the cache was produced by the same model on the same inputs.
Given identical keys and values and the same current input, the attention output is the same in both scenarios. The remainder of the transformer computation that produces the last-layer hidden embedding is a deterministic function of that attention output (and the current input). Therefore the hidden embedding $h^{(l)}$ produced for the current step is the same whether the model recomputed keys/values from tokens or read $K^{(l)}_{\leq t+m}, V^{(l)}_{\leq t+m}$ from cache. Formally, since $h^{(l-1)}=h'^{(l-1)}$, $K^{(l)}_{\leq t+m}=K'^{(l)}_{\leq t+m}$, and $V^{(l)}_{\leq t+m}=V'^{(l)}_{\leq t+m}$, we have $h^{(l)}=h'^{(l)}$.
Induction base case.
For the first layer, similarly to the induction step, since the input is the same (for both latent-based and text-based reasoning), $K^{(1)}_{\leq t+m}=K'^{(1)}_{\leq t+m}$, and $V^{(1)}_{\leq t+m}=V'^{(1)}_{\leq t+m}$, we have $h^{(1)}=h'^{(1)}$.
Conclusion.
By induction, we have $h^{(l)}=h'^{(l)}$ for every layer $l=1,\dots,L$. In particular, since $h=h^{(L)}$ and $h'=h'^{(L)}$, it follows that $h=h^{(L)}=h'^{(L)}=h'$. ∎
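The core of the induction step can be illustrated with a toy single-head attention layer. This is only a sketch assuming NumPy, random projection matrices, and one layer; the names `attention`, `w_q`, `w_k`, and `w_v` are illustrative and not the paper's implementation. It compares recomputing keys and values from all earlier inputs against reusing cached keys and values.

```python
import numpy as np


def attention(q, k, v):
    """Single-head scaled dot-product attention for one query vector."""
    scores = k @ q / np.sqrt(q.shape[0])   # one score per attended position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v


rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
past = rng.standard_normal((5, d))  # layer inputs at preceding positions
current = rng.standard_normal(d)    # layer input at the current position

# (i) Recompute keys/values by feeding all earlier inputs again.
full = np.vstack([past, current])
out_recomputed = attention(current @ w_q, full @ w_k, full @ w_v)

# (ii) Reuse cached keys/values and project only the current input.
k_cache, v_cache = past @ w_k, past @ w_v
k_all = np.vstack([k_cache, current @ w_k])
v_all = np.vstack([v_cache, current @ w_v])
out_cached = attention(current @ w_q, k_all, v_all)
```

Both paths feed the same keys, values, and query into the same deterministic function, so the outputs agree up to floating-point rounding.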
B.3 Proof of Theorem 3.4
Theorem B.3 (Restatement of Theorem 3.4).
The time complexity for each agent of LatentMAS is $O\big((d_h^2 m + d_h m^2 + d_h t m)L\big)$, where $t$ is the input length of this agent, and $m$ is the length of latent thoughts. In contrast, assuming Theorem 3.1, the time complexity for each agent of the vanilla text-based MAS needs to be $O\big(\big(d_h^3 m\frac{1}{\log|\mathcal{V}|} + d_h^3 m^2\frac{1}{\log^2|\mathcal{V}|} + d_h^2 t m\frac{1}{\log|\mathcal{V}|}\big)L + d_h^2|\mathcal{V}|m\frac{1}{\log|\mathcal{V}|}\big)$ to achieve the same expressiveness.
Proof.
We analyze the time complexity of our LatentMAS and the vanilla text-based MAS separately.
Time complexity of our method.
Recall that a transformer layer consists of two main components: self-attention and feed-forward networks. For a length-$(t+m)$ sequence, the time complexity to compute self-attention for $m$ latent reasoning steps is $O(d_h(t+m)m)=O(d_h(m^2+tm))$ due to the attention computation between $O((t+m)m)$ query-key pairs of tokens, and the time complexity to compute feed-forward networks for $m$ latent reasoning steps is $O(d_h^2 m)$ due to matrix–vector multiplication. Since there are $L$ layers, the overall time complexity of our method is
$$O\big((d_h(m^2+tm)+d_h^2 m)L\big). \tag{42}$$
Time complexity of the vanilla text-based MAS.
Let $m'$ denote the number of text-based reasoning steps. Similarly to the complexity analysis of our method, the time complexity to compute the hidden embeddings is
$$O\big((d_h(m'^2+tm')+d_h^2 m')L\big). \tag{43}$$
Besides that, due to matrix–vector multiplication and softmax computation, the time complexity to decode hidden embeddings into tokens is
$$O\big(d_h|\mathcal{V}|m'\big). \tag{44}$$
Hence, the overall time complexity of the vanilla MAS is
$$O\big((d_h(m'^2+tm')+d_h^2 m')L+d_h|\mathcal{V}|m'\big). \tag{45}$$
Assuming Theorem 3.1, the number of text-based reasoning steps is
$$m'=O\Big(\frac{d_h m}{\log|\mathcal{V}|}\Big). \tag{46}$$
It follows that the overall time complexity is
$$
\begin{aligned}
&O\big((d_h(m'^2+tm')+d_h^2 m')L+d_h|\mathcal{V}|m'\big) && (47)\\
={}&O\Big(\Big(d_h\Big(\Big(\frac{d_h m}{\log|\mathcal{V}|}\Big)^2+t\Big(\frac{d_h m}{\log|\mathcal{V}|}\Big)\Big)+d_h^2\Big(\frac{d_h m}{\log|\mathcal{V}|}\Big)\Big)L+d_h|\mathcal{V}|\Big(\frac{d_h m}{\log|\mathcal{V}|}\Big)\Big) && (48)\\
={}&O\Big(\Big(\frac{d_h^3 m^2}{\log^2|\mathcal{V}|}+\frac{d_h^3 m}{\log|\mathcal{V}|}+\frac{d_h^2 tm}{\log|\mathcal{V}|}\Big)L+\frac{d_h^2|\mathcal{V}|m}{\log|\mathcal{V}|}\Big). && (49)
\end{aligned}
$$
∎
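To get a feel for the gap between Eqs. (42) and (45), the two expressions can be evaluated directly. The sketch below drops all big-O constants, so the resulting ratio is only indicative; the function names and the example sizes are hypothetical values chosen for illustration.

```python
import math


def latent_mas_cost(d_h, m, t, num_layers):
    """Operation count from Eq. (42): (d_h (m^2 + t m) + d_h^2 m) L."""
    return (d_h * (m * m + t * m) + d_h * d_h * m) * num_layers


def text_mas_cost(d_h, m, t, num_layers, vocab_size):
    """Operation count from Eq. (45) with m' = d_h m / log|V| as in Eq. (46)."""
    m_prime = d_h * m / math.log(vocab_size)
    layers = (d_h * (m_prime ** 2 + t * m_prime) + d_h * d_h * m_prime) * num_layers
    decoding = d_h * vocab_size * m_prime  # projecting onto the vocabulary
    return layers + decoding


# Hypothetical sizes, not taken from any specific model in the paper.
d_h, m, t, num_layers, vocab_size = 4096, 40, 512, 36, 150_000
ratio = text_mas_cost(d_h, m, t, num_layers, vocab_size) / latent_mas_cost(d_h, m, t, num_layers)
```

The ratio reflects the theoretical cost of matching the same expressiveness through text, not the measured speedups reported in the experiments.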
Appendix C Experiment Setups
C.1 Evaluation Details
We introduce all datasets used in our experiments as follows:
Math & Science Reasoning.
- GSM8K (Cobbe et al., 2021) is a widely used benchmark of 8.5K grade-school math word problems designed to evaluate multi-step numerical reasoning. Each problem requires decomposing a natural-language description into structured arithmetic steps, making it a standard testbed for assessing chain-of-thought reasoning ability.
- AIME24 (Maxwell-Jia, 2024) consists of 30 competition-level problems from the 2024 American Invitational Mathematics Examination. These questions span algebra, geometry, number theory, and combinatorics, and require precise numeric answers with typically 1–3 digits, making the benchmark a compact but challenging evaluation of high-school Olympiad-style reasoning.
- AIME25 (math ai, 2025) provides 30 additional problems from the 2025 AIME exam, maintaining the same answer format and difficulty profile. Compared with AIME24, this benchmark includes more multi-phase derivations and intricate combinatorial constructions, offering a complementary stress test for mathematical robustness.
- GPQA-Diamond (Rein et al., 2023) is the most difficult split of the GPQA benchmark with 198 questions, featuring graduate-level multiple-choice questions written by domain experts in physics, biology, and chemistry. The dataset emphasizes conceptual depth, cross-disciplinary reasoning, and the ability to synthesize multi-step scientific arguments under rigorous distractor settings.
- MedQA (Yang et al., 2024a) contains real medical licensing exam questions that assess biomedical knowledge, clinical reasoning, and diagnostic decision-making. Problems require integrating textual context with domain-specific medical understanding, making the benchmark a representative testbed for professional-level scientific reasoning.
Commonsense Reasoning.
- ARC-Easy (Clark et al., 2018b) consists of grade-school science questions from the AI2 Reasoning Challenge that test foundational factual knowledge and straightforward commonsense reasoning. As a simplified subset of ARC, it serves as a baseline measure of basic scientific understanding without requiring complex multi-step inference.
- ARC-Challenge (Clark et al., 2018a) includes the most difficult items from the AI2 Reasoning Challenge. These questions are intentionally adversarial, requiring multi-hop reasoning, causal and counterfactual inference, and systematic elimination of distractor choices. Performance on ARC-Challenge is widely regarded as a strong indicator of robust commonsense reasoning capabilities.
Code Generation.
- MBPP-Plus (Liu et al., 2023) extends the original MBPP benchmark with broader input coverage, additional hidden test cases, and stricter execution-based evaluation. Each problem requires generating a self-contained Python function that satisfies a comprehensive unit-test suite, making the benchmark a robust measure of code synthesis reliability and correctness.
- HumanEval-Plus (Liu et al., 2023) augments HumanEval with denser, more challenging test suites, significantly increasing the rigor of functional correctness evaluation. The benchmark emphasizes generalization beyond prompt examples and tests a model's ability to produce semantically precise, executable Python code under more demanding verification settings.
C.2 Implementation Details
In addition to the experimental setups described in the main paper, we provide additional implementation and evaluation details below.
Software Backend
All methods are implemented in Python using PyTorch and HuggingFace Transformers, with an optional vLLM backend for fast decoding and tensor-parallel inference. We use the official chat templates and special tokens such as <|im_start|> and <|im_end|>.
Evaluation protocol.
For all non-coding benchmarks, we report accuracy based on answer matching of the final answer after text normalization (lowercasing, trimming whitespace, and removing extraneous punctuation).
For multiple-choice datasets (GPQA-Diamond, MedQA, ARC-Easy, ARC-Challenge), we first extract the model’s final answer string and then compare it via exact match to the answer letter.
For numeric problems (GSM8K, AIME24, AIME25), we evaluate correctness based on numeric equality: we extract the final predicted answer, parse both prediction and answer into numbers, and mark the sample as correct only if the two values match. Predictions that fail numeric parsing are counted as incorrect.
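The matching rules above can be sketched as follows. This is a hedged illustration rather than the authors' evaluation script: the helper names, the exact punctuation stripped, and the handling of thousands separators and currency symbols are assumptions, and answer extraction from the raw model output is not shown.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, trim whitespace, and strip surrounding punctuation."""
    return answer.lower().strip().strip(string.punctuation).strip()


def to_number(text: str):
    """Parse a numeric answer; return None when parsing fails."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def numeric_match(prediction: str, reference: str) -> bool:
    """Numeric equality; unparseable predictions count as incorrect."""
    pred, ref = to_number(prediction), to_number(reference)
    return pred is not None and ref is not None and pred == ref


def choice_match(prediction: str, answer_letter: str) -> bool:
    """Exact match on the normalized answer letter."""
    return normalize(prediction) == normalize(answer_letter)
```

For example, under these assumptions `numeric_match("1,234.0", "1234")` holds, while a prediction such as `"forty"` fails parsing and is scored as incorrect.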
For code generation tasks (MBPP-Plus and HumanEval-Plus), we evaluate the code by executing unit tests. Specifically, we extract the predicted code from the model's output, append the ground-truth tests provided by the benchmark, and execute the combined script in a sandboxed environment with a 10-second timeout. A sample is counted as correct if and only if all tests pass without runtime errors.
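A rough sketch of the execution-based check is shown below. It runs the combined script in a separate Python process with a timeout; note that a plain subprocess is not a real sandbox, so the isolation used in the paper is not reproduced here. The function name `passes_tests` and the temporary-file layout are assumptions for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True iff solution plus tests finish with exit code 0 in time."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # runs exceeding the limit count as failures
    # A failed assertion or any uncaught exception yields a non-zero exit code.
    return result.returncode == 0
```

A production harness would additionally restrict file-system, network, and memory access before running untrusted model output.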
Appendix D Additional Experiments
Figure 9: Illustration of the accuracy gains via LatentMAS across the Sequential and Hierarchical MAS settings. LatentMAS consistently improves system-level reasoning accuracy compared with single model and text-based MAS.
D.1 Performance of LatentMAS under Hierarchical MAS Setting
Table 3: Main results of LatentMAS on 6 general tasks under the Hierarchical MAS setting. We report accuracy, token usage, and end-to-end speed, and highlight the performance gains following the same evaluation protocol as in Table 1.

| Task | Metric | Qwen3-4B Single | Qwen3-4B TextMAS | Qwen3-4B LatentMAS | Qwen3-4B Improve | Qwen3-8B Single | Qwen3-8B TextMAS | Qwen3-8B LatentMAS | Qwen3-8B Improve | Qwen3-14B Single | Qwen3-14B TextMAS | Qwen3-14B LatentMAS | Qwen3-14B Improve |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARC-E | Acc. | 95.4 | 97.1 | 96.8 | ↓0.3 | 95.6 | 98.2 | 98.3 | ↑0.1 | 97.2 | 98.3 | 98.7 | ↑0.4 |
| ARC-E | Token | 724 | 2054 | 363 | ↓82.3% | 656 | 2237 | 308 | ↓86.2% | 608 | 2752 | 619 | ↓77.5% |
| ARC-E | Speed | 369 | 2239 | 591 | ×3.8 | 404 | 3619 | 1779 | ×2.0 | 551 | 7102 | 1884 | ×3.8 |
| ARC-C | Acc. | 89.2 | 92.5 | 91.7 | ↓0.8 | 91.0 | 93.3 | 93.9 | ↑0.6 | 92.6 | 95.3 | 95.5 | ↑0.2 |
| ARC-C | Token | 913 | 2674 | 447 | ↓83.3% | 846 | 2854 | 344 | ↓87.9% | 773 | 2167 | 295 | ↓86.4% |
| ARC-C | Speed | 97 | 1275 | 299 | ×4.3 | 266 | 2034 | 714 | ×2.8 | 338 | 4283 | 1090 | ×3.9 |
| GSM8K | Acc. | 82.4 | 89.4 | 88.4 | ↓1.0 | 81.1 | 90.4 | 89.5 | ↓0.9 | 83.7 | 90.8 | 91.6 | ↑0.8 |
| GSM8K | Token | 1136 | 3098 | 555 | ↓82.1% | 1280 | 2370 | 353 | ↓85.1% | 1118 | 3021 | 495 | ↓83.6% |
| GSM8K | Speed | 469 | 1878 | 360 | ×5.2 | 449 | 1365 | 702 | ×1.9 | 536 | 3675 | 1631 | ×2.3 |
| MedQA | Acc. | 47.7 | 65.0 | 67.3 | ↑2.3 | 53.0 | 76.3 | 77.0 | ↑0.7 | 64.7 | 78.0 | 78.3 | ↑0.3 |
| MedQA | Token | 2134 | 6702 | 1015 | ↓84.9% | 2098 | 6893 | 1007 | ↓85.4% | 1746 | 5473 | 899 | ↓83.6% |
| MedQA | Speed | 236 | 1495 | 557 | ×2.7 | 476 | 3387 | 964 | ×3.5 | 1360 | 7591 | 1250 | ×6.1 |
| MBPP+ | Acc. | 63.5 | 69.3 | 70.6 | ↑1.3 | 64.8 | 71.9 | 72.2 | ↑0.3 | 68.5 | 73.0 | 73.8 | ↑0.8 |
| MBPP+ | Token | 1634 | 6782 | 1339 | ↓80.3% | 2053 | 7703 | 1264 | ↓83.6% | 1858 | 7458 | 1187 | ↓84.1% |
| MBPP+ | Speed | 523 | 1766 | 489 | ×3.6 | 1064 | 3898 | 1387 | ×2.8 | 2410 | 9162 | 2507 | ×3.7 |
| HumanEval+ | Acc. | 75.0 | 76.2 | 79.3 | ↑3.1 | 74.4 | 76.8 | 78.0 | ↑1.2 | 76.8 | 84.1 | 86.6 | ↑2.5 |
| HumanEval+ | Token | 2380 | 8127 | 1373 | ↓83.1% | 2507 | 8768 | 1274 | ↓85.5% | 2366 | 8114 | 1512 | ↓81.4% |
| HumanEval+ | Speed | 274 | 931 | 333 | ×2.8 | 502 | 1809 | 439 | ×4.1 | 1084 | 3988 | 1188 | ×3.4 |
We report additional results on three Qwen3 backbones under the hierarchical MAS setting to complement the main experiments. As shown in Table3, LatentMAS consistently achieves higher or comparable system-level accuracy relative to TextMAS while substantially reducing both end-to-end inference time and total system token usage. These results mirror the trends observed in the sequential setting and confirm that latent working-memory transfer remains effective in hierarchical collaboration, where multiple expert agents contribute to the final prediction.
D.2 LatentMAS on Llama Backbones
Table 4: Main results of LatentMAS on Llama3 families under Sequential and Hierarchical MAS settings. We report accuracy, token usage, and end-to-end speed, and highlight the performance gains following the same evaluation protocol as in Table 1.

| Setting | Task | Metric | Llama-3.2-3B-Instruct Single | Llama-3.2-3B-Instruct TextMAS | Llama-3.2-3B-Instruct LatentMAS | Llama-3.2-3B-Instruct Improve | Llama-3.1-8B-Instruct Single | Llama-3.1-8B-Instruct TextMAS | Llama-3.1-8B-Instruct LatentMAS | Llama-3.1-8B-Instruct Improve |
|---|---|---|---|---|---|---|---|---|---|---|
| Sequential | ARC-E | Acc. | 79.2 | 80.1 | 80.4 | ↑0.3 | 85.7 | 89.6 | 91.2 | ↑1.6 |
| Sequential | ARC-E | Token | 236 | 1889 | 229 | ↓87.9% | 219 | 1981 | 228 | ↓88.5% |
| Sequential | ARC-E | Speed | 155 | 1670 | 392 | ×4.3 | 352 | 2708 | 516 | ×5.2 |
| Sequential | MedQA | Acc. | 46.7 | 48.0 | 48.0 | ↑0.0 | 59.0 | 65.0 | 65.3 | ↑0.3 |
| Sequential | MedQA | Token | 405 | 1595 | 403 | ↓74.7% | 382 | 2354 | 316 | ↓86.6% |
| Sequential | MedQA | Speed | 51 | 236 | 85 | ×2.8 | 112 | 589 | 182 | ×3.2 |
| Sequential | HumanEval+ | Acc. | 39.6 | 43.2 | 44.5 | ↑1.3 | 51.2 | 55.4 | 56.7 | ↑1.3 |
| Sequential | HumanEval+ | Token | 227 | 1042 | 338 | ↓67.6% | 161 | 740 | 218 | ↓70.5% |
| Sequential | HumanEval+ | Speed | 64 | 199 | 77 | ×2.6 | 80.6 | 333 | 101 | ×3.3 |
| Hierarchical | ARC-E | Acc. | 79.2 | 81.6 | 82.3 | ↑0.7 | 85.7 | 89.3 | 90.4 | ↑1.1 |
| Hierarchical | ARC-E | Token | 236 | 956 | 177 | ↓81.5% | 219 | 1118 | 206 | ↓81.6% |
| Hierarchical | ARC-E | Speed | 155 | 615 | 192 | ×3.2 | 352 | 1832 | 669 | ×2.7 |
| Hierarchical | MedQA | Acc. | 46.7 | 50.3 | 50.0 | ↓0.3 | 59.0 | 64.3 | 67.7 | ↑3.4 |
| Hierarchical | MedQA | Token | 405 | 1729 | 359 | ↓79.2% | 382 | 1847 | 191 | ↓89.7% |
| Hierarchical | MedQA | Speed | 51 | 282 | 73 | ×3.9 | 112 | 464 | 204 | ×2.3 |
| Hierarchical | HumanEval+ | Acc. | 39.6 | 43.9 | 45.7 | ↑1.8 | 51.2 | 54.2 | 55.4 | ↑1.2 |
| Hierarchical | HumanEval+ | Token | 227 | 719 | 226 | ↓68.6% | 161 | 414 | 187 | ↓54.8% |
| Hierarchical | HumanEval+ | Speed | 64 | 133 | 72 | ×1.8 | 80.6 | 156 | 88 | ×1.8 |

We report additional sequential and hierarchical MAS results on Llama-3 backbones to complement the main evaluations. Consistent with Tables 1 and 2 in the main paper, LatentMAS maintains higher or comparable accuracy relative to TextMAS while consistently reducing system-wise token usage and inference time. Across tasks, LatentMAS preserves the same qualitative trend observed for Qwen models, achieving faster end-to-end inference with markedly fewer decoded tokens, indicating that LatentMAS exhibits strong generalizability across different model families.
D.3 Additional Efficiency Analyses on LatentMAS
Figure 10: Efficiency gains of LatentMAS over single model and TextMAS under the Hierarchical MAS setting. Left: LatentMAS achieves consistently faster end-to-end inference speed. Right: LatentMAS requires lower system-wise token usage in total.
Final-Agent Token Usage in LatentMAS. To further disentangle the source of token savings, we additionally measure the output length of the final answering agent only. This analysis excludes intermediate agents and therefore directly compares the final decoding stage of LatentMAS and TextMAS. As shown in Table 5, the final agent in LatentMAS consistently generates fewer text tokens than the final agent in TextMAS across different model scales and tasks, with an average reduction of 29.1%. This indicates that the token savings in LatentMAS come not only from eliminating intermediate textual communication among agents, but also from shortening the final answer decoding process. This reduction occurs because the final agent in TextMAS receives long textual reasoning traces from preceding agents and often continues the text-based reasoning process before producing the answer. In contrast, the final agent in LatentMAS receives preceding agents' information through compact latent working memory and aggregates these latent thoughts before decoding the final response. As a result, LatentMAS reduces both inter-agent communication tokens and the final-stage decoding length.
Table 5: Token usage of the final agent in MAS. LatentMAS consistently reduces the final decoding length compared with TextMAS.

| Method | Size | ARC-C | GSM8K | MedQA | HumanEval+ | GPQA-D |
|---|---|---|---|---|---|---|
| TextMAS | 4B | 614 | 607 | 1539 | 2037 | – |
| LatentMAS | 4B | 447 | 555 | 1015 | 1373 | – |
| TextMAS | 8B | 685 | 412 | 1664 | 2167 | 4758 |
| LatentMAS | 8B | 344 | 353 | 1007 | 1274 | 3395 |
| TextMAS | 14B | 492 | 573 | 1303 | 2008 | 4517 |
| LatentMAS | 14B | 295 | 495 | 899 | 1512 | 3606 |
Appendix E Detailed Analyses on LatentMAS
E.1 Additional Analyses on Input-Output Alignment
Figure 11: Effectiveness of the input-output alignment $W_a$ on 8B models. Unaligned output embeddings ($h_t$) drift away from the original input embeddings ($e_t$), while the aligned vectors ($e_{t+1}$) realign with $e_t$, demonstrating that $W_a$ preserves embedding-space structure and prevents representation drift.
In Figure 11, we extend the input-output alignment analysis to the 8B model scale, beyond the Qwen3-4B and Qwen3-14B examples shown in Figure 6. For Qwen3-8B models, the same pattern holds: unaligned latent outputs drift away from the input embedding distribution, while applying the alignment matrix restores geometric and density-level consistency with the original embedding space.
E.2 Quantifying Latent Thought Diversity
While Figure 5 provides a qualitative visualization of the embedding distributions, we further quantify the representational diversity of latent thoughts using a cosine-similarity-based metric. Specifically, for each method, we compute the average pairwise cosine similarity among the generated embeddings. A lower value indicates that the representations are less collapsed and more dispersed in the embedding space, corresponding to higher representational diversity.
Table 6: Diversity analysis of generated representations. We report the average pairwise cosine similarity among generated embeddings, where lower values indicate less collapsed and more diverse representations.

| Method | Qwen3-4B | Qwen3-8B | Qwen3-14B |
|---|---|---|---|
| TextMAS | 0.126 | 0.142 | 0.155 |
| LatentMAS | 0.104 | 0.093 | 0.108 |

As shown in Table 6, LatentMAS consistently achieves lower average cosine similarity than TextMAS across all three Qwen3 backbones. This confirms that the latent thoughts are not merely aligned with the semantic region of text-based reasoning but also span a broader and less concentrated representation space. Together with Figure 5, these results provide both qualitative and quantitative evidence that latent thoughts preserve semantic consistency with text reasoning while offering richer representational diversity.
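The diversity metric described in this subsection can be sketched in a few lines. The version below averages over all unordered pairs of distinct embeddings; whether the paper uses exactly this pairing is an assumption, and the toy vectors are made up for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_pairwise_cosine(embeddings):
    """Mean cosine similarity over all unordered pairs of distinct embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


collapsed = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]  # all point the same way
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # directions differ
collapsed_score = average_pairwise_cosine(collapsed)
spread_score = average_pairwise_cosine(spread)
```

The collapsed set scores higher than the spread set, which is the sense in which a lower value indicates more diverse representations.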
E.3 Ablations on Latent Reasoning and Communication in LatentMAS
To isolate the contribution of each component in LatentMAS, we further conduct hybrid ablations by selectively replacing either latent reasoning or latent communication with its text-based counterpart. Specifically, we compare LatentMAS with the following two variants:
- Latent Reasoning + Text Communication, where each agent performs latent reasoning but communicates to the next agent through decoded text, using only the last 128 decoded tokens;
- Text Reasoning + Latent Communication, where each agent performs standard text-based reasoning while transferring intermediate information to the next agent through latent working memory.
As shown in Table 7, both hybrid variants underperform full LatentMAS across all evaluated tasks. This demonstrates that the gains of LatentMAS cannot be attributed to only one component. Latent reasoning improves the expressiveness of each agent's internal computation, while latent communication preserves and transfers intermediate information without forcing it through a compressed text interface. Combining both components enables end-to-end latent collaboration and yields the strongest performance.
Table 7: Hybrid ablations on latent reasoning and latent communication using Qwen3-8B. Replacing either component with its text-based counterpart leads to consistent performance drops, indicating that both components contribute to the overall gains of LatentMAS.

| Method | GSM8K | MBPP+ | MedQA |
|---|---|---|---|
| Latent Reasoning + Text Communication | 85.5 | 66.4 | 65.9 |
| Text Reasoning + Latent Communication | 90.1 | 68.0 | 71.2 |
| LatentMAS | 93.8 | 74.6 | 75.3 |
Appendix F How to Debug LatentMAS?
In LatentMAS "debug mode", we introduce a switch that enables us to inspect the content of the latent thoughts of agents without altering the overall LatentMAS pipeline. Specifically, when "debug mode" is switched on, each agent simultaneously generates its latent thoughts and a separate text response from the same contextual information (e.g., agent prompt and latent working memory). The latent thoughts are passed to the next agent, and the parallel text response serves as a probe for the latent thoughts.
By checking the text responses of intermediate agents, we can identify the root causes of errors in the final answer. To illustrate this, we present two distinct case studies generated in "debug mode" using the Qwen3-14B model on the GSM8K dataset, corresponding to a failure case and a success case below.
- Case 1: Incorrect Intermediate Text Leads to Error in LatentMAS Answer. In the first case of Section I, we examine a problem where LatentMAS produces an incorrect final answer. According to the intermediate agent output, we find that the Refiner agent gives an incorrect numerical relationship (as highlighted in red), leading the final Solver agent to use this erroneous dependency in its calculation.
- Case 2: Correct Intermediate Text Leads to Correct LatentMAS Answer. Conversely, in the second case of Section I, all intermediate agents produce logically sound intermediate outputs, which successfully guide the final agent to derive the correct answer.
Finally, to quantitatively validate that the text in "debug mode" serves as a faithful probe for LatentMAS, we conduct a correlation study on 100 pairs of debug text and final answer using the Qwen3-14B model on the GSM8K dataset, consisting of 80 instances with correct final answers and 20 instances with incorrect final answers. We manually annotated the correctness of each debug text and analyzed its correlation with the final correctness of LatentMAS.
Table 8: Correlation analysis between the correctness of intermediate text from "debug mode" and the final answers of LatentMAS. The reported statistics demonstrate a high consistency between intermediate text correctness and final answer accuracy.

| Intermediate Debug Text | LatentMAS Final Answer Correct ($N=80$) | LatentMAS Final Answer Incorrect ($N=20$) |
|---|---|---|
| Correct | 96.2% (77/80) | 10.0% (2/20) |
| Incorrect | 3.8% (3/80) | 90.0% (18/20) |

As shown in Table 8, we observe a high consistency between the correctness of the intermediate text and the final answer. We find that 96.2% of the intermediate texts are valid when the final answer is correct, while 90.0% contain errors when the final answer is incorrect. This result shows that the correctness of the intermediate text output aligns well with LatentMAS, validating "debug mode" as a reliable probe for our framework.
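The counts in Table 8 can be turned into summary statistics directly. The sketch below recomputes the overall agreement rate and, as an additional statistic that the paper does not report, the phi coefficient of the 2×2 table; the variable names are illustrative.

```python
import math

# Counts from Table 8, keyed by (debug text correct?, final answer correct?).
counts = {
    (True, True): 77,
    (False, True): 3,
    (True, False): 2,
    (False, False): 18,
}

total = sum(counts.values())
# Fraction of cases where debug-text correctness matches final-answer correctness.
agreement = (counts[(True, True)] + counts[(False, False)]) / total

# Phi coefficient of the 2x2 table (extra statistic, not from the paper).
a, b = counts[(True, True)], counts[(True, False)]
c, d = counts[(False, True)], counts[(False, False)]
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
```

On these counts the agreement is 95% and the phi coefficient is roughly 0.85, consistent with the high consistency described above.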
Appendix G Additional Discussions on LatentMAS
Extension to Heterogeneous Agents. For simplicity and training-free purposes, we assume that all agents in LatentMAS share the same shape of transformer layers. To relax this assumption and support heterogeneous agents in practice, one can directly leverage prior studies on layer mapping and ensemble learning (Ainsworth et al., 2022; Wortsman et al., 2022) by introducing a trainable adapter to align and share latent representations across different models.
In addition, a potential future direction is to adapt advanced post-training paradigms from text-based MAS to optimize LatentMAS's latent collaboration protocols and unlock more effective multi-agent reasoning strategies.
Appendix H Additional Related Works
Latent Reasoning in LLMs. Beyond explicit chain-of-thought (CoT) reasoning, recent work has explored the continuous latent space of LLMs as an alternative reasoning medium (Hao et al., 2024; Chen et al., 2025b; Zhu et al., 2025), revealing that hidden states encode richer semantic structures than what discrete token generation can express (Zhang et al., 2025; Liu et al., 2024). Latent reasoning methods such as CoCoNut (Hao et al., 2024) and latent-space editing approaches (e.g., RepE (Zou et al., 2023), LoT (Fungwacharakorn et al., 2024)) demonstrate that manipulating internal representations can guide models to reason more coherently and improve controllability without explicit token-level rationales. Other works (Li et al., 2025a; Wang et al., 2025b; Shi et al., 2025) have also extended latent reasoning paradigms to vision-language models. These methods leverage the structure of hidden states to perform interventions, such as steering, editing, or optimizing latent trajectories, that shape downstream reasoning behavior while remaining agnostic to surface-level text. By operating directly in the continuous space, they can induce reasoning steps that would be difficult or inefficient to express (Zhang et al., 2025; Liu et al., 2024; Coda-Forno et al., 2025). Despite these benefits, existing techniques are confined to a single model's internal computations and do not consider interaction or coordination across multiple reasoning entities (Hao et al., 2024). In contrast, LatentMAS extends latent reasoning to a multi-agent setting, enabling each agent to generate latent thoughts and propagate latent information to others. Our new framework shifts latent reasoning from an isolated capability of individual models to a system-level collaborative mechanism.
Agentic Model Collaboration. With increasing algorithmic advances, LLM-based MAS have been applied across diverse domains (Sagi & Rokach, 2018; Chen et al., 2025a; Yan et al., 2025), such as math and science reasoning (Pezeshkpour et al., 2024; Yue et al., 2024; Wang et al., 2025a), open-domain question answering (Fourney et al., 2024; Wu et al., 2025), and multi-modal GUI interaction (Zhang et al., 2024a; Ye et al., 2025b), demonstrating their versatility in complex real-world settings. A large body of prior work follows sequential planner-solver pipelines or hierarchical expert-summarizer structures, which correspond to the two MAS settings we adopt for evaluating LatentMAS.
Appendix I Case Study on Intermediate Error Tracking of LatentMAS
Incorrect Case with LatentMAS (Debug Mode)
Question: In a room, there are various toys: 5 red cars, 3 action figures, and a doll. The doll costs as much as 3 action figures and one red car costs 4 dollars. How much are all toys worth, of an action figure costs 5 dollars?
Planner: Okay, I need to make a detailed plan to solve the math problem.
1. Determine the cost of one action figure (given as \$5).
2. Calculate the total cost of the doll; calculate the total cost of all red cars using the given cost of one red car (\$4); calculate the total cost of all action figures by multiplying the number of action figures by the cost of one action figure.
3. Add the costs of all toys (red cars, action figures, and the doll) to find the total value of all toys.
Critic: The original plan from the Planner Agent was:
1. Identify the cost of action figure.
2. Calculate the total cost for each type by multiplying the quantity by the individual cost.
3. Sum up all the total costs to find the total value of all toys.
But the problem is that the Planner Agent's plan didn't account for the doll's cost being dependent on the action figures and the red car. So, the original plan might have missed that step. Let me check the original plan again. But the doll's cost isn't given directly. It's given in terms of other toys. So, the first step should be to calculate the doll's cost based on the given relationship. The original plan might have missed that step, assuming that the cost of each type is given directly, which is not the case for the doll. Therefore, the plan needs to be adjusted to first calculate the doll's cost using the given relationship before proceeding to calculate the total cost for each type. So, the feedback should point out that the plan didn't account for the doll's cost calculation based on the given relationship. The user needs to first determine the doll's cost by using the values of the action figures and the red car, then proceed with the rest of the steps.
Refiner: The Original Plan is:
1. Identify the cost of each type of toy.
2. Calculate the total cost for each type by multiplying the quantity by the individual cost.
3. Sum up all the total costs to find the total value of all toys.
Feedback: The original plan fails to explicitly address the doll's cost, which is not directly given but derived from the cost of 3 action figures and 1 red car. To refine the plan:
- First, calculate the doll's cost using the given relationship $(3\times\text{action figure cost}+1\times\text{red car cost})$.
- Ensure all quantities (5 red cars, 3 action figures, 1 doll) are accounted for.
- Clarify that the total value includes all toys with their respective quantities and derived costs.
The Refined Plan is:
1. Determine individual costs.
2. Calculate total cost for each toy type.
3. Sum all totals.
Solver: To determine the total value of all the toys, the doll costs as much as $3$ action figures and $1$ red car, so
$$\text{Doll}=3\cdot\$5+1\cdot\$4=\$19.$$
Then
$$\text{Red cars}=5\cdot\$4=\$20,\quad\text{Action figures}=3\cdot\$5=\$15,\quad{\color{red}\text{Doll}=1\cdot\$19=\$19}.$$
Thus the total value is
$$\$20+\$15+{\color{red}\$19}={\color{red}\$54}.\ {\color{red}\text{The final answer is }\boxed{54}}.$$
Correct Case with LatentMAS (Debug Mode)
Question: Jean is two years older than Mark. Two years ago Mark was 5 years older than half Jan’s age. If Jan is 30 how old is Jean?
Planner: 1. Determine Mark’s current age by using the information about his age two years ago relative to Jan’s age. 2. Calculate half of Jan’s age two years ago and then determine Mark’s age at that time based on the given relationship. 3. Use the relationship between Jean and Mark’s ages to find Jean’s current age.
Critic: The user wants to know Jean’s age based on the given information. The original plan provided by the Planner Agent was a bit incomplete, so I need to figure out where it went wrong. The original plan is a good start but lacks specific details on how to calculate Mark’s age two years ago and how to derive Jean’s age from Mark’s current age. To improve, the plan should explicitly outline the mathematical operations needed, such as calculating half of Jan’s current age, adding 5 years, adjusting for the two-year time difference, and then applying the age difference between Jean and Mark. Additionally, the plan should emphasize verifying each step for consistency with the given information.
Refiner: First, I need to break down the information step by step. Let’s start by noting down all the given information clearly. Jan’s current age is 30. That’s a concrete number, so that’s a good starting point. Next, the problem mentions something about two years ago. So, I should probably calculate Jan’s age two years ago. Since Jan is 30 now, two years ago she was $30 - 2 = 28$ years old. That seems straightforward. Now, the problem says that two years ago, Mark was 5 years older than half of Jan’s age at that time. So, half of Jan’s age two years ago is $28 / 2 = 14$. Then adding 5 years to that gives $14 + 5 = 19$. So, Mark was 19 years old two years ago. Therefore, Mark’s current age would be $19 + 2 = 21$ years old. Wait, that seems a bit young, but maybe that’s correct based on the given info. Then, the problem states that Jean is two years older than Mark. If Mark is currently 21, then Jean would be $21 + 2 = 23$ years old. But wait, that seems like a possible answer, but I need to check if I did all the steps correctly. Let me verify each step again. Jan is 30 now. Two years ago, she was 28. Half of 28 is 14. Adding 5 gives 19, which is Mark’s age two years ago. Adding 2 years brings Mark’s current age to 21. Jean is two years older than Mark, so $21 + 2 = 23$. That seems consistent.
Feedback: The original plan is a good start but lacks specific details on how to calculate Mark’s age two years ago and how to derive Jean’s age from Mark’s current age. To improve, the plan should explicitly outline the mathematical operations needed, such as calculating half of Jan’s current age, adding 5 years, adjusting for the two-year time difference, and then applying the age difference between Jean and Mark. Additionally, the plan should emphasize verifying each step for consistency with the given information.
The Refined Plan is: 1. Identify known values. 2. Calculate Jan’s age two years ago. 3. Determine Mark’s age two years ago. 4. Calculate Mark’s current age. 5. Find Jean’s current age.
Solver: According to the provided plan, to solve the problem step-by-step: Jan is currently 30, so two years ago Jan was $30 - 2 = 28$. Half of Jan’s age two years ago is $28 / 2 = 14$, and Mark was 5 years older than that at the time, so Mark was $14 + 5 = 19$ two years ago and is therefore $19 + 2 = 21$ now. Since Jean is 2 years older than Mark, Jean is $21 + 2 = 23$ years old. Hence, the final answer is $\boxed{23}$.
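The five steps of the refined plan map directly onto a short computation. The sketch below is only a check of the arithmetic in the transcript; the function name `jean_age` is hypothetical and not part of the paper.

```python
def jean_age(jan_now: int) -> int:
    """Follow the refined plan step by step."""
    jan_two_years_ago = jan_now - 2                 # step 2: Jan's age two years ago
    mark_two_years_ago = jan_two_years_ago // 2 + 5  # step 3: 5 older than half of that
    mark_now = mark_two_years_ago + 2               # step 4: move forward two years
    return mark_now + 2                             # step 5: Jean is 2 years older than Mark

print(jean_age(30))
```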
Appendix J Case Study on LatentMAS
Case Study (TextMAS on GSM8K)
Input Question: Debra is monitoring a beehive to see how many bees come and go in a day. She sees 30 bees leave the hive in the first 6 hours, and then she sees 1/2 that many bees return in the next 6 hours. She sees two times as many bees as she saw first leave the hive fly from the hive and leave in the next 6 hours. Then every bee that left before that hadn’t already returned returns to the hive in the next 6 hours. How many bees did Debra see return to the hive in the last 6 hours of the day?
TextMAS Output
Planner Agent: Okay, let’s see. I need to figure out how many bees Debra saw return to the hive in the last 6 hours. Let me break down the problem step by step. First, the problem mentions different time intervals: first 6 hours, next 6 hours, then the next 6 hours, and finally the last 6 hours. So there are four 6-hour periods in a day, making up 24 hours total. Let me note each period and what happens in each. Starting with the first 6 hours: Debra sees 30 bees leave the hive. Let me write that down as the first data point. So, Period 1 (0-6 hours): 30 bees leave. Then, the next 6 hours (Period 2: 6-12 hours), she sees half that number return. Half of 30 is 15. So, 15 bees return in Period 2. Next, the third period (Period 3: 12-18 hours), she sees two times as many bees as she saw first leave the hive fly from the hive and leave. The first time she saw 30 bees leave, so two times that is 60. So, 60 bees leave in Period 3. Then, the fourth period (Period 4: 18-24 hours), every bee that left before that hadn’t already returned returns to the hive.
Critic Agent: The reasoning plan correctly identifies all four 6-hour periods and reaches the right final conclusion, but the intermediate analysis contains several issues that reduce clarity and efficiency. First, the interpretation of the phrase “before that” is revisited multiple times, even though it simply refers to the event in the previous sentence (the 60 bees leaving), meaning only bees that left in Period 1 are relevant. Second, the plan temporarily entertains the possibility that bees may have left in Period 2, despite the problem statement explicitly describing only returns during this interval. Third, the plan repeatedly questions whether the 15 returning bees in Period 2 originate from the initial 30, even though the text directly states they are “half that many” and therefore unambiguously tied to Period 1. These detours introduce redundancy and confusion. Overall, while the conclusion of 15 returning bees is correct, the path to it is unnecessarily convoluted and would benefit from a more concise timeline-based analysis.
Refiner Agent: I will provide a refined plan based on previous agents’ information.
1. Identify the Time Periods and Events. We divide the day into four 6-hour intervals and record the events in each:
• Period 1 (0–6 hours): 30 bees leave the hive.
• Period 2 (6–12 hours): Half of those bees return, i.e., 15 bees.
• Period 3 (12–18 hours): Twice the number of bees that left in Period 1 leave the hive: $2 \times 30 = 60$ bees.
• Period 4 (18–24 hours): Every bee that left before Period 3 and has not yet returned comes back to the hive.
2. Track Bees Leaving and Returning.
• Period 1: 30 bees leave.
• Period 2: 15 of these 30 bees return, leaving 15 still outside the hive.
• Period 3: 60 new bees leave (these are not relevant for the final return count).
• Period 4: Only bees that left before Period 3 and have not returned will come back.
3. Determine Which Bees Return in Period 4. The only bees that left before Period 3 are the 30 from Period 1. Of these, 15 returned in Period 2. Thus, the remaining 15 bees from Period 1 return in Period 4.
4. Return the Final Answer.
Solver Agent: Okay, let’s see. I need to figure out how many bees Debra saw return to the hive in the last 6 hours. Let me break down the problem step by step. First, the problem mentions different time intervals: first 6 hours, next 6 hours, then the next 6 hours, and finally the last 6 hours. So there are four 6-hour periods in a day, making up 24 hours total. Let me note each period and what happens in each. Starting with the first 6 hours: Debra sees 30 bees leave the hive. Let me write that down as the first data point. So, Period 1 (0-6 hours): 30 bees leave. Then, the next 6 hours (Period 2: 6-12 hours), she sees half that number return. Half of 30 is 15. So, 15 bees return in Period 2. Next, the third period (Period 3: 12-18 hours), she sees two times as many bees as she saw first leave the hive fly from the hive and leave. The first time she saw 30 bees leave, so two times that is 60. So, 60 bees leave in Period 3. Then, the fourth period (Period 4: 18-24 hours), every bee that left before that hadn’t already returned returns to the hive. So, bees that left in Period 1 and Period 2. But some of those bees may have already returned in Period 2.
### Final Answer
Debra saw $\boxed{15}$ bees return to the hive in the last 6 hours of the day. (✗ Wrong Final Answer)
Case Study (LatentMAS on GSM8K)
Input Question: Debra is monitoring a beehive to see how many bees come and go in a day. She sees 30 bees leave the hive in the first 6 hours, and then she sees 1/2 that many bees return in the next 6 hours. She sees two times as many bees as she saw first leave the hive fly from the hive and leave in the next 6 hours. Then every bee that left before that hadn’t already returned returns to the hive in the next 6 hours. How many bees did Debra see return to the hive in the last 6 hours of the day?
LatentMAS Output
We are given a sequence of events involving bees leaving and returning to the hive over the course of a day, broken into 6-hour intervals. According to the refiner’s plan, we need to track how many bees leave and return during each 6-hour interval, noting that 30 leave first, 15 return next, and 60 more leave afterward. In the final interval, all bees that had left and not yet returned came back, giving a total of 15 remaining from the first group, plus 60 from the third group, for a total of 75 returning bees.
Let’s break it down step by step based on the plan:
Step 1: Bees that leave in the first 6 hours. Debra sees 30 bees leave the hive during the first 6-hour period.
Step 2: Bees that return in the next 6 hours. In the second 6-hour period, she sees half as many bees return: $\frac{1}{2} \times 30 = 15$ bees return.
Step 3: Bees that leave in the third 6-hour period. In the next 6 hours, twice the number from Step 1 leave: $2 \times 30 = 60$ bees leave.
Step 4: Bees that return in the final 6 hours. During the final 6 hours, every bee that had left earlier and had not yet returned comes back to the hive. We track the bees that left and determine which have already returned:
• From the first 6 hours: 30 bees left; 15 returned in Step 2, leaving 15 bees still out.
• From the third 6 hours: 60 bees left; none have returned yet.
Thus, the total number of bees returning in the final 6 hours is $15 + 60 = 75$ bees.
### Final Answer: $\boxed{75}$ (✓ Correct Final Answer)
To understand how latent collaboration changes multi-agent reasoning dynamics, we conduct a detailed case study on GSM8K using the Qwen3-14B backbone under the Sequential MAS setting. As shown in the example, TextMAS agents rely on lengthy textual exchanges that often amplify early reasoning errors: misinterpretations by the planner propagate through the critic and refiner, ultimately constraining the solver’s search space. In contrast, LatentMAS operates entirely through latent working-memory transfer: each agent receives rich, continuous representations of prior reasoning rather than brittle text, enabling later agents to reinterpret, refine, and correct upstream reasoning without inheriting surface-level mistakes. This latent collaboration leads to more coherent intermediate steps and more stable numerical reasoning, and ultimately yields the correct final answer where TextMAS fails. The case study illustrates how LatentMAS mitigates error compounding in multi-agent pipelines and demonstrates the qualitative advantage of latent over text-based communication.
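The bookkeeping behind the two answers in this case study can be made explicit with a small running count of bees outside the hive. This is an illustrative sketch, not code from the paper; the function name and the `count_third_period` flag are hypothetical. The flag switches between the two readings of "every bee that left before that": counting the bees that left in the third period (the LatentMAS reading, marked correct above) or excluding them (the TextMAS refiner's reading).

```python
def bees_returning_last_period(count_third_period: bool = True) -> int:
    """Track bees outside the hive across the four 6-hour periods."""
    first_departure = 30

    outside = first_departure          # Period 1: 30 bees leave
    outside -= first_departure // 2    # Period 2: half that many return
    third_departure = 2 * first_departure  # Period 3: twice the first group leaves

    if count_third_period:
        outside += third_departure     # these bees are also still outside

    # Period 4: every counted bee that has not yet returned comes back.
    return outside

print(bees_returning_last_period(True))   # LatentMAS reading
print(bees_returning_last_period(False))  # TextMAS refiner's reading
```

The only difference between the two outputs is whether the 60 bees from the third period are counted, which is exactly the point on which the TextMAS critic and refiner committed early.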
Appendix K Prompt Template for LatentMAS
Sequential LatentMAS Prompts on Numeric Tasks (GSM8K / AIME2024 / AIME2025)
System Prompt for All Agents:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Prompt for Planner Agent:
You are a Planner Agent. Given an input question, design a clear, step-by-step plan for how to solve the question. Question: {question}
Your outlined plan should be concise with a few bulletpoints for each step. Do not produce the final answer. Now output your plan to solve the question below:
Prompt for Critic Agent:
You are a Critic Agent to evaluate the correctness of the input plan for the given question and provide helpful feedback for improving the plan. Question: {question}
The plan information is provided in latent KV representation format. Review the plan and question and output: (1) original plan contents (2) constructive feedback on the original plan. Format your response as follows: Original Plan: [Copy the provided Planner Agent’s plan here] Feedback: [Your detailed feedback to improve the plan here] Now, output your response below:
Prompt for Refiner Agent:
You are a Refiner Agent to provide a refined step-by-step plan for solving the given question. Question: {question}
You are provided with: (1) latent-format information: a previous plan with feedback (2) text-format information: the input question you need to solve. Based on the input, write a refined and improved plan to solve the question. Make sure your output plan is correct and concise. Now, output your refined plan below:
Prompt for Solver Agent:
You are a helpful assistant. You are provided with latent information for reference and a target question to solve. Target Question: {question}
The latent information might contain irrelevant contents. Ignore it if it is not helpful for solving the target question. Now, reason step by step and output the final answer inside \boxed{YOUR_FINAL_ANSWER}:

Sequential LatentMAS prompts for multiple-choice tasks (ARC-E, ARC-C, GPQA, MedQA)
System Prompt for All Agents:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Prompt for Planner Agent:
You are a Planner Agent. Given an input question, design a clear, step-by-step plan for how to solve the question. Question: {question}
Your outlined plan should be concise with a few bulletpoints for each step. Do not produce the final answer. Now output your plan to solve the question below:
Prompt for Critic Agent:
You are a Critic Agent to evaluate the correctness of the input plan for the given question and provide helpful feedback for improving the plan. Question: {question}
The plan information is provided in latent KV representation format. Review the plan and question and output: (1) original plan contents (2) constructive feedback on the original plan. Format your response as follows: Original Plan: [Copy the provided Planner Agent’s plan here] Feedback: [Your detailed feedback to improve the plan here] Now, output your response below:
Prompt for Refiner Agent:
You are a Refiner Agent to provide a refined step-by-step plan for solving the given question. Question: {question}
You are provided with: (1) latent-format information: a previous plan with feedback (2) text-format information: the input question you need to solve. Based on the input, write a refined and improved plan to solve the question. Make sure your output plan is correct and concise. Now, output your refined plan below:
Prompt for Solver Agent:
You are a helpful assistant. You are provided with latent information for reference and a target question to solve. Target Question: {question}
The latent information might contain irrelevant contents. Ignore it if it is not helpful for solving the target question. Your final answer must be selected from A,B,C,D. For example \boxed{A}. Do not add any other contents inside the box. Now, reason step by step and output the final answer inside \boxed{YOUR_FINAL_ANSWER}:

Sequential LatentMAS prompts for python coding tasks (MBPP-Plus, HumanEval-Plus)
System Prompt for All Agents:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Prompt for Planner Agent:
You are a Planner Agent. Given an input question, design a clear, step-by-step plan for how to solve the question. Question: {question}
Your outlined plan should be concise with a few bulletpoints for each step. Do not produce the final answer. Now output your plan to solve the question below:
Prompt for Critic Agent:
You are a Critic Agent to evaluate the correctness of the input plan for the given question and provide helpful feedback for improving the plan. Question: {question}
The plan information is provided in latent KV representation format. Review the plan and question and output: (1) original plan contents (2) constructive feedback on the original plan. Format your response as follows: Original Plan: [Copy the provided Planner Agent’s plan here] Feedback: [Your detailed feedback to improve the plan here] Now, output your response below:
Prompt for Refiner Agent:
You are a Refiner Agent to provide a refined step-by-step plan for solving the given question. Question: {question}
You are provided with: (1) latent-format information: a previous plan with feedback (2) text-format information: the input question you need to solve. Based on the input, write a refined and improved plan to solve the question. Make sure your output plan is correct and concise. Now, output your refined plan below:
Prompt for Solver Agent:
You are a helpful assistant. You are provided with latent information for reference and a target question to solve. Target Question: {question}
The latent information might contain irrelevant contents. Ignore it if it is not helpful for solving the target question. Your must reason step-by-step to solve the **provided Target Question** witout outputing other irrelevant inforamtion. You must put all python code as self-contained Python function in markdown code blocks. For example
```python
import math
def add(a, b):
    return a + b
```
Do not add any other contents inside the markdown code block. Now, reason step by step and output the final answer inside ```python YOUR_PYTHON_CODE```:

Hierarchical LatentMAS prompts for numeric-answer tasks (GSM8K, AIME2024, AIME2025)
System prompt for All Agents:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Prompt for Math Agent:
You are a math agent. Given the input question, reason step-by-step and put the final answer inside \boxed{YOUR_FINAL_ANSWER}. Question: {question}
Your response:
Prompt for Science Agent:
You are a science agent. Given the input question, reason step-by-step and put the final answer inside \boxed{YOUR_FINAL_ANSWER}. Question: {question}
Your response:
Prompt for Code Agent:
You are a code agent. Given the input question, reason step-by-step and put the final answer inside \boxed{YOUR_FINAL_ANSWER}. Question: {question}
Your response:
Prompt for Task Summarizer Agent:
You are a task summarizer. Given the input question and responses from previous agents as reference, reason step-by-step and put the final answer inside \boxed{YOUR_FINAL_ANSWER}. Question: {question}
Your response:

Hierarchical LatentMAS prompts for multiple-choice tasks (ARC-E, ARC-C, GPQA, MedQA)
System Prompt for All Agents:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Prompt for Math Agent:
You are a math agent. Given the input question, reason step-by-step and put the final answer inside \boxed{YOUR_FINAL_ANSWER}. Your final answer must be selected from A,B,C,D. For example \boxed{A}. Do not add any other contents inside the box. Question: {question}
Your response:
Prompt for Science Agent:
You are a science agent. Given the input question, reason step-by-step and put the final answer inside \boxed{YOUR_FINAL_ANSWER}. Your final answer must be selected from A,B,C,D. For example \boxed{A}. Do not add any other contents inside the box. Question: {question}
Your response:
Prompt for Code Agent:
You are a code agent. Given the input question, reason step-by-step and put the final answer inside \boxed{YOUR_FINAL_ANSWER}. Your final answer must be selected from A,B,C,D. For example \boxed{A}. Do not add any other contents inside the box. Question: {question}
Your response:
Prompt for Task Summarizer Agent:
You are a task summarizer. Given the input question and responses from previous agents as reference, reason step-by-step and put the final answer inside \boxed{YOUR_FINAL_ANSWER}. Your final answer must be selected from A,B,C,D. For example \boxed{A}. Do not add any other contents inside the box. Question: {question}
Your response:

Hierarchical LatentMAS prompts for python coding tasks (MBPP-Plus, HumanEval-Plus)
System Prompt for All Agents:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Prompt for Math Agent:
You are a math agent. Given the input question, reason step by step and provide an efficient and self-contained Python function that solves the following problem in a markdown code block. You must put all python code as self-contained Python function in markdown code blocks. For example
```python
import math
def add(a, b):
    return a + b
```
Do not add any other contents inside the markdown code block. Question: {question}
Your response:
Prompt for Science Agent:
You are a science agent. Given the input question, reason step by step and provide an efficient and self-contained Python function that solves the following problem in a markdown code block.
You must put all python code as self-contained Python function in markdown code blocks. For example
```python
import math
def add(a, b):
    return a + b
```
Do not add any other contents inside the markdown code block. Question: {question}
Your response:
Prompt for Code Agent:
You are a code agent. Given the input question, reason step by step and provide an efficient and self-contained Python function that solves the following problem in a markdown code block.
You must put all python code as self-contained Python function in markdown code blocks. For example
```python
import math
def add(a, b):
    return a + b
```
Do not add any other contents inside the markdown code block. Question: {question}
Your response:
Prompt for Task Summarizer Agent:
You are a task summarizer. Given the input question and responses from previous agents as reference, reason step by step and provide an efficient and self-contained Python function that solves the following problem in a markdown code block.
You must put all python code as self-contained Python function in markdown code blocks. For example
```python
import needed_library
def FUNC_NAME(a, b):
    return a + b
```
Do not add any other contents inside the markdown code block. Input Question: {question}
Your response:
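The templates above share two conventions: a `{question}` placeholder, and a final answer wrapped either in `\boxed{...}` or in a fenced Python block. The following sketch shows one way such templates could be filled and their outputs parsed. It is an assumption-laden illustration and not the authors' evaluation code: the helper names `fill_prompt`, `extract_boxed`, and `extract_python_block` are hypothetical.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime to keep this example self-contained

def fill_prompt(template: str, question: str) -> str:
    """Substitute the question into a template.

    str.replace is used instead of str.format because the templates also
    contain literal braces such as \\boxed{YOUR_FINAL_ANSWER}.
    """
    return template.replace("{question}", question)

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...}, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces

def extract_python_block(text: str):
    """Return the body of the last fenced python block, if any."""
    pattern = FENCE + r"python\s*\n(.*?)" + FENCE
    blocks = re.findall(pattern, text, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else None

prompt = fill_prompt(
    "Target Question: {question} Output the final answer inside \\boxed{YOUR_FINAL_ANSWER}:",
    "If Jan is 30 how old is Jean?",
)
print(prompt)
print(extract_boxed("Hence, the final answer is \\boxed{23}."))
```

Taking the last match in both extractors is a design choice for replies that restate intermediate results before the final answer; a real harness may need stricter handling.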
Similar Articles
@FinanceYF5: Multi-Agent Collaboration Without Speaking: LatentMAS Accepted as ICML 2026 Spotlight — Agents Directly Transfer Reasoning State in Latent Space, Skipping Text Encoding/Decoding. Accuracy +13.3%, Speed 4.3x, Token Usage Reduced 83.7%. No Extra Training, ...
LatentMAS is a new multi-agent collaboration method where agents directly transfer reasoning states in latent space without text encoding/decoding, achieving a 13.3% accuracy improvement, 4.3x speed, and 83.7% reduction in token usage. It requires no extra training and can be plugged into existing LLMs. It has been accepted as an ICML 2026 Spotlight.