Mixture of Debaters: Learn to Debate at Architectural Level in Multi-Agent Reasoning

arXiv cs.AI 06/30/26, 04:00 AM Papers
mixture-of-experts multi-agent-reasoning debate large-language-models multimodal-reasoning self-debate arxiv
Summary
Proposes Mixture of Debaters (MoD), a framework using Mixture-of-Experts to enable dynamic self-debate within a single LLM, achieving superior accuracy with drastically lower latency and token consumption.
arXiv:2606.29425v1 Announce Type: new Abstract: Existing multi-agent debate frameworks suffer from two critical limitations: they rely on static architectures where agent roles and coordination patterns are fixed at design time, and they require instantiating multiple model copies, incurring substantial computational overhead. We propose Mixture of Debaters (MoD), a unified framework that enables dynamic self-debate within a single model by leveraging the Mixture-of-Experts paradigm. We address three key challenges in adapting MoE for dialectical reasoning: (1) dual-routing that decouples role allocation from process flow, dynamically determining when to debate versus when to synthesize; (2) momentum switching that smooths token-level routing with local context, reducing expert-switch jitter; and (3) unified self-debate that encapsulates diverse debating personas into lightweight expert modules, eliminating inter-agent communication while preserving behavioral diversity. Extensive experiments on multimodal benchmarks demonstrate that MoD outperforms both single-model baselines and conventional multi-agent systems, achieving superior accuracy with 3.7x lower latency and 87% reduction in token consumption. The source code can be accessed at https://github.com/YongLD/MoD.
Cached at: 06/30/26, 05:33 AM
# Mixture of Debaters: Learn to Debate at Architectural Level in Multi-Agent Reasoning
Source: [https://arxiv.org/html/2606.29425](https://arxiv.org/html/2606.29425)
###### Abstract

Existing multi-agent debate frameworks suffer from two critical limitations: they rely on static architectures where agent roles and coordination patterns are fixed at design time, and they require instantiating multiple model copies, incurring substantial computational overhead. We propose Mixture of Debaters (MoD), a unified framework that enables dynamic self-debate within a single model by leveraging the Mixture-of-Experts paradigm. We address three key challenges in adapting MoE for dialectical reasoning: (1) *dual-routing* that decouples role allocation from process flow, dynamically determining when to debate versus when to synthesize; (2) *momentum switching* that smooths token-level routing with local context, reducing expert-switch jitter; and (3) *unified self-debate* that encapsulates diverse debating personas into lightweight expert modules, eliminating inter-agent communication while preserving behavioral diversity. Extensive experiments on multimodal benchmarks demonstrate that MoD outperforms both single-model baselines and conventional multi-agent systems, achieving superior accuracy with 3.7× lower latency and an 87% reduction in token consumption. The source code can be accessed at [https://github.com/YongLD/MoD](https://github.com/YongLD/MoD).

multi\-agent, self\-debate, mixture\-of\-expert, multimodal reasoning

CCS Concepts: Computing methodologies (Natural language processing; Neural networks; Computer vision; Knowledge representation and reasoning)

## 1. Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, ranging from mathematical problem solving to logical deduction and code generation (Didolkar et al., [2024](https://arxiv.org/html/2606.29425#bib.bib40); Seals and Shalin, [2024](https://arxiv.org/html/2606.29425#bib.bib41); Wei et al., [2022](https://arxiv.org/html/2606.29425#bib.bib42); Peng et al., [2025](https://arxiv.org/html/2606.29425#bib.bib5)). Despite the success of prompting strategies like Chain-of-Thought (CoT), reasoning traces generated by a single monolithic model remain prone to degeneration and hallucination, particularly when navigating intricate logical landscapes (Zheng and Li, [2026](https://arxiv.org/html/2606.29425#bib.bib3)). To address these limitations, recent research has pivoted toward multi-agent debate frameworks, where diverse agents improve reasoning fidelity through adversarial critique and cross-examination (Liang et al., [2024](https://arxiv.org/html/2606.29425#bib.bib11); Li et al., [2025b](https://arxiv.org/html/2606.29425#bib.bib45); Zheng et al., [2024](https://arxiv.org/html/2606.29425#bib.bib2)).

![Refer to caption](https://arxiv.org/html/2606.29425v1/x1.png)

Figure 1. Internal vs. external debate architectures. (A) Conventional multi-agent debate requires instantiating independent model copies ($A_{1}$–$A_{4}$), incurring substantial runtime complexity. (B) Our Mixture-of-Debaters internalizes diverse agents into a shared expert pool with dynamic routing, improving framework flexibility while eliminating inter-agent communication overhead.

However, we argue that current debate paradigms are structurally constrained. Existing systems operate with static architectures where agent roles and communication patterns are fixed at design time. For instance, a standard setup might rigidly enforce a cycle of “propose → critique → refine.” Real-world multimodal problems, however, are fluid and multifaceted: an advanced physics problem might initially appear as a simple kinematics question, evolve into a complex thermodynamics puzzle upon closer inspection of a diagram, and ultimately require mathematical synthesis. In such dynamic contexts, a fixed topology fails to adapt to the shifting epistemic needs of the problem.

To bridge the gap between static architectures and fluid reasoning requirements, the Mixture-of-Experts (MoE) framework offers a promising structural foundation (Zhou et al., [2022](https://arxiv.org/html/2606.29425#bib.bib57); Liu et al., [2023b](https://arxiv.org/html/2606.29425#bib.bib59); Li et al., [2024b](https://arxiv.org/html/2606.29425#bib.bib61)). By dynamically activating different subsets of parameters (experts) based on the input, MoE has the potential to realize robust and generalized reasoning. Nevertheless, directly applying generic MoE architectures to the dialectical process of debate is non-trivial and exposes distinct architectural gaps.

In this work, we propose Mixture-of-Debaters (MoD), a pilot framework that unifies dynamic self-debate within a single architecture. We articulate our contributions by identifying three critical challenges in adapting MoE for debate and presenting our corresponding architectural advantages:

∙ Dual-Routing for Dialectical Flow Control. Standard MoE architectures typically rely on a single router to dispatch tokens to experts. This design is insufficient for debate, which requires orchestration across different dimensions: assigning specific *roles* (e.g., proposer vs. critic) and managing the *process flow* (e.g., debating vs. concluding). A single router cannot simultaneously optimize for these orthogonal objectives. We propose a dual-routing mechanism that decouples role allocation from stage management. By introducing specialized gating functions, our model dynamically regulates when to debate (invoking dialectical confrontation) and when to summarize (invoking synthesis). This allows the system to switch strategies adaptively, skipping unnecessary argumentation for simple sub-problems or invoking early synthesis to resolve conflicts, thereby optimizing the reasoning trajectory.

∙ Momentum Switching for Consistency. Conventional MoE gating operates at the token level and is sensitive to local noise and minor representation fluctuations. In a debate context, such jittery routing may swap experts too frequently (e.g., flipping between a “pro” and a “critic” stance within a short span), fragmenting the reasoning trace and hurting coherence. We introduce *Momentum Switching*, which smooths routing decisions via local contextual aggregation so expert usage evolves more gradually over time, improving short-range persistency while remaining responsive to genuine context shifts. This stabilization is lightweight, autoregressive-friendly, and complements our dual-routing design by reducing unnecessary role flips during multi-turn reasoning.

∙ Unified Efficient Self-Debate. Prior multi-agent debate frameworks rely on instantiating multiple distinct models or maintaining several heavy copies of LLMs to simulate interaction. This inter-model communication is computationally expensive, memory-intensive, and introduces significant latency due to network overhead or context switching. We propose a unified self-debate mode based on the MoE architecture. By encapsulating diverse debating personas into lightweight, specialized expert modules within a *single* backbone, we eliminate the need for external multi-model orchestration. This design reduces inter-agent communication cost while retaining the behavioral diversity required for effective self-debate, allowing for efficient, high-quality reasoning within a compact computational budget.

![Refer to caption](https://arxiv.org/html/2606.29425v1/x2.png)

Figure 2. Overview of Mixture-of-Debaters (MoD). Left: Viewpoint-Shift Data Synthesis constructs debate trajectories from correct and incorrect samples for belief revision. Right: MoD architecture with dual-routing, momentum switching, and decoupled A-side/B-side expert pools for diverse reasoning pathways.

## 2. Related Work

Multi-Agent Reasoning and Debate. Multi-agent debate has emerged as an effective paradigm for improving reasoning by exposing latent inconsistencies through adversarial critique and cross-examination (Zhang et al., [2025b](https://arxiv.org/html/2606.29425#bib.bib36); Liang et al., [2024](https://arxiv.org/html/2606.29425#bib.bib11); Bo et al., [2024](https://arxiv.org/html/2606.29425#bib.bib37)). Recent studies extend this idea to multimodal settings, showing that interaction among multiple vision-language agents can improve reasoning reliability and reduce hallucination (Yu et al., [2024](https://arxiv.org/html/2606.29425#bib.bib38); Liang et al., [2025a](https://arxiv.org/html/2606.29425#bib.bib39), [2026](https://arxiv.org/html/2606.29425#bib.bib15); Hu et al., [2025](https://arxiv.org/html/2606.29425#bib.bib12); Zheng, [2025](https://arxiv.org/html/2606.29425#bib.bib9)). This line is further generalized by Mixture-of-Agents (MoA), which aggregates responses from multiple external models through layered coordination (Wang et al., [2024a](https://arxiv.org/html/2606.29425#bib.bib13)). Despite their effectiveness, these approaches typically require multiple model instances or external APIs as independent agents, leading to substantial memory, communication, and latency overhead. Moreover, their performance often depends on the zero-shot debate capability of the underlying models, which may limit gains when base models are weak or prone to shallow contradiction and paraphrastic disagreement (Subramaniam et al., [2025](https://arxiv.org/html/2606.29425#bib.bib49); Choi et al., [2025](https://arxiv.org/html/2606.29425#bib.bib50)).

Internalized Deliberation and Self-Correction. To reduce the cost of explicit multi-agent orchestration, a parallel line of work seeks to internalize deliberative behavior within a single model. Early efforts such as Debate, Reflect, and Distill (DRD) distill multi-agent debate trajectories into a student model (Zhou et al., [2025](https://arxiv.org/html/2606.29425#bib.bib23)), while SMoA emulates MoA-style aggregation using a single LLM that generates multiple candidate responses (Li et al., [2025a](https://arxiv.org/html/2606.29425#bib.bib20)). More recent work has increasingly shifted toward strengthening self-critique and self-correction directly. For example, S2R teaches models to self-verify and self-correct during inference through lightweight supervision and reinforcement learning (Ma et al., [2025](https://arxiv.org/html/2606.29425#bib.bib22)), SCRIT improves critique ability using self-generated training data (Tang et al., [2025b](https://arxiv.org/html/2606.29425#bib.bib16)), and SPOC enables interleaved solution generation and verification within a single inference pass (Zhao et al., [2025](https://arxiv.org/html/2606.29425#bib.bib17)). In multimodal reasoning, Sherlock further extends this line by introducing trajectory-level self-correction for vision-language models (Ding and Zhang, [2025](https://arxiv.org/html/2606.29425#bib.bib21)). At the same time, recent evaluations such as RealCritic and Self-Correction Bench suggest that self-critique remains unreliable for many current models, especially in self-critique and iterative correction settings (Tang et al., [2025a](https://arxiv.org/html/2606.29425#bib.bib19); Tsui, [2025](https://arxiv.org/html/2606.29425#bib.bib18)). Overall, these methods internalize deliberation, but still rely on explicit critique loops, multi-pass generation, or specialized self-improvement procedures. This motivates architectures that support deliberation more directly and efficiently within a single forward reasoning framework.

Parameter-Efficient Mixture-of-Experts. Mixture-of-Experts (MoE) provides a natural mechanism for conditional computation within a single backbone, making it a promising foundation for scalable reasoning models (Lepikhin et al., [2021](https://arxiv.org/html/2606.29425#bib.bib56); Zhou et al., [2022](https://arxiv.org/html/2606.29425#bib.bib57)). Recent parameter-efficient variants combine MoE with low-rank adaptation by replacing dense adaptation with sparsely activated LoRA experts. Early approaches such as MoE-LoRA, AdaMV-MoE, MixLoRA, and Uni-MoE show that routing among lightweight adapters can improve the capacity–efficiency trade-off in both language and multimodal settings (Liu et al., [2023b](https://arxiv.org/html/2606.29425#bib.bib59); Li et al., [2024a](https://arxiv.org/html/2606.29425#bib.bib60); Chen et al., [2023](https://arxiv.org/html/2606.29425#bib.bib48); Li et al., [2024b](https://arxiv.org/html/2606.29425#bib.bib61)). More recent work further advances this line through improved optimization, layer-wise expert allocation, expert specialization, and adaptive expert selection (Sun et al., [2025](https://arxiv.org/html/2606.29425#bib.bib78); Gao et al., [2025](https://arxiv.org/html/2606.29425#bib.bib82); Feng et al., [2025](https://arxiv.org/html/2606.29425#bib.bib79); Kunwar et al., [2025](https://arxiv.org/html/2606.29425#bib.bib81)). However, most existing PE-MoE methods still inherit two assumptions that are not ideal for debate-style reasoning. First, each expert is typically instantiated as a monolithic adapter unit, coupling the down- and up-projection under a single routing decision. This limits the asymmetric routing needed for dialectical reasoning. Second, prior PE-MoE methods predominantly adopt token-level routing (Li et al., [2024a](https://arxiv.org/html/2606.29425#bib.bib60); Zhang et al., [2025a](https://arxiv.org/html/2606.29425#bib.bib80)). While effective for general sequence modeling, such fine-grained routing may cause unstable expert assignments across neighboring tokens, making it less suitable for dialectical generation requiring local coherence.

In contrast to the above literature, our Mixture\-of\-Debaters \(MoD\) treats debate as an internal routing problem within a single model rather than an interaction among external agents\. MoD combines explicit viewpoint\-shift supervision with a parameter\-efficient MoE architecture, decouples interpretation and synthesis expert pools to enable asymmetric expert composition, and introduces dual routing together with Momentum Switching to stabilize local reasoning trajectories\. This design preserves the behavioral diversity of debate while avoiding the overhead of conventional multi\-agent systems\.

## 3. Mixture-of-Debaters

We present Mixture-of-Debaters (MoD), a unified framework that enables dynamic self-debate within a single model. As illustrated in Figure [2](https://arxiv.org/html/2606.29425#S1.F2), MoD builds upon a frozen vision-language backbone and introduces lightweight MoD adapters that replace standard LoRA modules. We first describe the model architecture in §[3.1](https://arxiv.org/html/2606.29425#S3.SS1). We then present how we construct the MoD instruction-tuning dataset in §[3.2](https://arxiv.org/html/2606.29425#S3.SS2), which provides explicit supervision for viewpoint shifts. Finally, we detail the tuning procedure in §[3.3](https://arxiv.org/html/2606.29425#S3.SS3).

### 3.1. Model Architecture

MoD supports both text-only and multimodal inputs, so we write the input as $(V,Q)$, with the text-only case as $V=\varnothing$. We focus on the multimodal setting where modality interactions make routing stability and perspective diversity especially valuable (Liang et al., [2025b](https://arxiv.org/html/2606.29425#bib.bib4)).

Given an image $V$ and question $Q$, the vision encoder extracts features, projects them into the language space, and concatenates them with text embeddings to form the input sequence, which is processed by Transformer blocks with MSA (Vaswani et al., [2017](https://arxiv.org/html/2606.29425#bib.bib29)) and FFN. We freeze the backbone projections (q/k/v/o) and inject MoD adapters in parallel for parameter-efficient adaptation. While standard LoRA uses a single low-rank bypass $h=W_{0}x+\frac{\alpha}{r}BAx$ (Hu et al., [2022](https://arxiv.org/html/2606.29425#bib.bib28)), MoD replaces $(A,B)$ with decoupled expert pools and dynamic routing, enabling diverse perspectives within one forward pass. For each token $x(t)$, the MoD adapter computes its output through three stages:

#### 3.1.1. Stage 1: Momentum Switching

Conventional MoE gating operates at the token level, potentially selecting different experts for every generated token\. In a debate context, this high\-frequency switching is detrimental: it leads to disjointed arguments where a model’s stance fluctuates mid\-sentence, destroying the logical consistency required for coherent argumentation\. We introduce a momentum switching strategy that smooths routing decisions and improves short\-range expert persistency\.

Token-level switching ($x_{\text{route}}=x$) is flexible but prone to noisy expert flips across adjacent tokens, fragmenting locally coherent reasoning. Region-level switching partitions the sequence into fixed-size blocks, but rigid boundaries can cause abrupt transitions.

To maintain argumentative consistency while preserving adaptability, we adopt sliding window switching. Rather than per-token switching, we compute a causal moving average over the most recent $\mathcal{W}$ tokens:

$$x_{\text{route}}(t)=\frac{\sum_{k=\max(0,\,t-\mathcal{W}+1)}^{t}x(k)}{\min(t+1,\mathcal{W})}, \tag{1}$$

where $t$ denotes the index of the current token within the batch. This mechanism smooths routing inputs over local context, reducing noisy expert switching and stabilizing expert usage across neighboring tokens.

To ensure that routing behavior during inference matches the training phase exactly, we augment the conventional KV cache by adding to each MoD layer a dedicated token cache with a capacity of $\mathcal{W}-1$, which holds the most recent tokens.
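The per-layer token cache can be pictured as a small rolling buffer. The following is a minimal sketch, not the authors' implementation: the class name `RoutingTokenCache` and its `step` method are hypothetical, and token representations are plain Python lists rather than tensors. At each decoding step it averages the cached tokens together with the new one, which reproduces Eq. (1) one token at a time.

```python
from collections import deque


class RoutingTokenCache:
    """Hypothetical per-layer cache holding the last W-1 token vectors."""

    def __init__(self, window: int):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        # Capacity W-1: together with the current token this spans W tokens.
        self.buffer = deque(maxlen=window - 1)

    def step(self, x):
        """Return the routing input for the new token x, then cache x."""
        tokens = list(self.buffer) + [x]
        dim = len(x)
        # Causal mean over min(t+1, W) tokens, as in Eq. (1).
        x_route = [sum(tok[j] for tok in tokens) / len(tokens) for j in range(dim)]
        self.buffer.append(x)  # the oldest entry is evicted automatically
        return x_route


cache = RoutingTokenCache(window=3)
for token in ([3.0], [6.0], [9.0], [12.0]):
    print(cache.step(token))
```

With a window of 3, the fourth call averages only the last three tokens, so the first token no longer influences routing.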

The sliding window routing and buffer design ensure compatibility with autoregressive generation, and we implement sliding windows via cumulative sums. We set $\mathcal{W}=16$ by default, providing sufficient context for stable routing while remaining responsive to topic shifts.
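For a full training sequence, the causal moving average of Eq. (1) can be computed from prefix sums. The sketch below is illustrative only: the function name `sliding_window_route` is hypothetical, and it uses Python lists where a real implementation would presumably use batched tensor operations.

```python
from itertools import accumulate


def sliding_window_route(xs, window):
    """Causal moving average over the last `window` tokens via prefix sums.

    xs is a list of T token vectors (lists of floats); the result has the
    same shape and holds x_route(t) for every position t.
    """
    if not xs:
        return []
    dim = len(xs[0])
    # prefix[j][t + 1] is the sum of feature j over tokens 0..t.
    prefix = [[0.0] + list(accumulate(tok[j] for tok in xs)) for j in range(dim)]
    routed = []
    for t in range(len(xs)):
        start = max(0, t - window + 1)   # first token inside the window
        count = min(t + 1, window)       # denominator of Eq. (1)
        routed.append([(prefix[j][t + 1] - prefix[j][start]) / count
                       for j in range(dim)])
    return routed


print(sliding_window_route([[1.0], [2.0], [3.0], [4.0]], window=2))
```

Each position costs a single subtraction per feature regardless of the window size, which is the usual reason to prefer cumulative sums over re-summing the window.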

#### 3.1.2. Stage 2: Dual-Routing Mechanism

Standard MoE architectures rely on a single router to dispatch tokens to experts (Zhou et al., [2022](https://arxiv.org/html/2606.29425#bib.bib57)). However, effective debate requires orchestration across two orthogonal dimensions: assigning *roles* (e.g., proposer vs. critic) and managing *process flow* (e.g., debating vs. concluding). A single router cannot simultaneously optimize for these objectives. We introduce a dual-routing mechanism that decouples role allocation from stage management.

Specifically, we employ two separate routers, A and B, each parameterized by a learnable projection $W^{A}_{g},W^{B}_{g}\in\mathbb{R}^{E\times K}$, where $E$ denotes the number of experts. For an input token representation $x_{\text{route}}$, the routing logits $\ell$ and selected expert indices $\mathcal{I}$ are computed as:

$$\ell^{A}=W^{A}_{g}x_{\text{route}},\quad \ell^{B}=W^{B}_{g}x_{\text{route}}, \tag{2}$$

$$\mathcal{I}^{A}=\text{TopK}(\ell^{A},r),\quad \mathcal{I}^{B}=\text{TopK}(\ell^{B},r). \tag{3}$$

During training, optional Gaussian noise can be added to the logits to encourage exploration. The normalized gating scores $g^{A}\in\mathbb{R}^{r}$ and $g^{B}\in\mathbb{R}^{r}$ are obtained by applying softmax over $\ell^{A}$ and $\ell^{B}$ and then renormalizing over only the top-$r$ selected experts:

$$g^{A}=\left[\frac{s^{A}_{i}}{\sum_{j\in\mathcal{I}^{A}}s^{A}_{j}}\right]_{i\in\mathcal{I}^{A}}, \tag{4}$$

where $s^{A}=\text{Softmax}(\ell^{A})$, and analogously for $g^{B}$. Since the A-side and B-side experts are independently selected, we combine their gating scores to obtain the final weight $g$:

$$g=\sqrt{g^{A}\odot g^{B}}. \tag{5}$$

By default, we adopt a sqrt-product combination strategy, which captures the joint agreement of both routers while mitigating the score suppression that arises from direct multiplication.
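Eqs. (2) to (5) can be traced with a small pure-Python sketch. It is written under stated assumptions rather than taken from the released code: the helper names (`softmax`, `top_r_gate`, `dual_route`) are hypothetical, the optional training noise is omitted, and the selected experts on each side are ordered by descending logit so that the element-wise product in Eq. (5) pairs the i-th ranked A-side expert with the i-th ranked B-side expert.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_r_gate(weights, x_route, r):
    """Eqs. (2)-(4): logits, top-r indices, and renormalized gate scores."""
    logits = [sum(w * x for w, x in zip(row, x_route)) for row in weights]
    scores = softmax(logits)
    # Assumption: selected experts are ordered by descending logit.
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:r]
    norm = sum(scores[i] for i in idx)
    return idx, [scores[i] / norm for i in idx]


def dual_route(w_a, w_b, x_route, r):
    """Eq. (5): sqrt-product combination of the two routers' gate scores."""
    idx_a, g_a = top_r_gate(w_a, x_route, r)
    idx_b, g_b = top_r_gate(w_b, x_route, r)
    g = [math.sqrt(a * b) for a, b in zip(g_a, g_b)]
    return idx_a, idx_b, g


w_a = [[0.0], [math.log(3.0)], [math.log(0.5)]]   # three A-side experts
w_b = [[math.log(4.0)], [math.log(0.1)], [0.0]]   # three B-side experts
print(dual_route(w_a, w_b, [1.0], r=2))
```

In this toy case the A-side gates come out as 0.75 and 0.25 and the B-side gates as 0.8 and 0.2. Note that the combined weights are geometric means and so need not sum to one.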

Enabled by the decoupled design of the two routers, the model can dynamically decide when to debate and when to synthesize\. For simple sub\-problems, it can skip unnecessary argumentation; conversely, when conflicts emerge, it selectively allocates capacity to focus on correction\.

#### 3.1.3. Stage 3: Dialectical Expert Pools

We maintain two decoupled pools of rank-1 experts: interpretation experts $\mathcal{E}^{A}=\{A_{1},\dots,A_{E}\}$ with $A_{i}\in\mathbb{R}^{1\times k}$, and synthesis experts $\mathcal{E}^{B}=\{B_{1},\dots,B_{E}\}$ with $B_{i}\in\mathbb{R}^{1\times d}$. Given selected indices $\mathcal{I}^{A},\mathcal{I}^{B}$ and gating weight $g$, the adapter output is:

$$h^{\text{MoD}}(t)=\frac{\alpha}{r}\cdot(\mathbf{B}^{\mathcal{I}^{B}})^{\top}\left(g\odot(\mathbf{A}^{\mathcal{I}^{A}}\,x(t))\right), \tag{6}$$

where $\mathbf{A}^{\mathcal{I}^{A}}\in\mathbb{R}^{r\times k}$ and $\mathbf{B}^{\mathcal{I}^{B}}\in\mathbb{R}^{r\times d}$ are stacked selected experts. The final hidden state is $h(t)=W_{0}x(t)+h^{\text{MoD}}(t)$.

Unlike MoE-LoRA, which couples $(A_{i},B_{i})$ as atomic units (Liu et al., [2023b](https://arxiv.org/html/2606.29425#bib.bib59)), our decoupled design enables $N\times N$ combinatorial pathways. This allows the model to pair different interpretation and synthesis experts, effectively simulating diverse debating perspectives within a single forward pass.
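A small numeric sketch may make the shapes in Eq. (6) concrete. It assumes the selected experts have already been stacked row-wise into `a_sel` ($r\times k$) and `b_sel` ($r\times d$); the function name `mod_adapter_output` and the toy values are hypothetical, and only the adapter term is computed, not the frozen $W_{0}x(t)$ path.

```python
def mod_adapter_output(x, a_sel, b_sel, g, alpha):
    """Adapter term of Eq. (6) for one token.

    x:     input vector of length k
    a_sel: r selected interpretation experts, each a row of length k
    b_sel: r selected synthesis experts, each a row of length d
    g:     combined gate weights of length r
    """
    r = len(g)
    # g ⊙ (A x): one gated scalar per selected A-side expert.
    gated = [g[i] * sum(a * xj for a, xj in zip(a_sel[i], x)) for i in range(r)]
    d = len(b_sel[0])
    # B^T (...): mix the r gated scalars back into a length-d output.
    return [(alpha / r) * sum(b_sel[i][j] * gated[i] for i in range(r))
            for j in range(d)]


x = [1.0, 2.0]                              # k = 2
a_sel = [[1.0, 0.0], [0.0, 1.0]]            # r = 2 interpretation experts
b_sel = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # r = 2 synthesis experts, d = 3
print(mod_adapter_output(x, a_sel, b_sel, g=[0.5, 0.5], alpha=2.0))
# [0.5, 1.0, 1.5]
```

Because the rows of `a_sel` and `b_sel` are chosen by separate routers, the i-th A-side expert can be paired with any B-side expert, which is where the combinatorial pathways come from.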

Table 1. Performance comparison on multimodal reasoning benchmarks. MoD-Single denotes single-round inference, while MoD-Debate employs multi-turn dialectical reasoning. Best results are in bold, and second-best are underlined. MMLU belongs to the Text category; all other columns belong to the Multimodal category.

| Method | MMLU | SQA/val | SQA/test | MMMU/val | MMStar | POPE | MME |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InstructBLIP-7B (Dai et al., [2023](https://arxiv.org/html/2606.29425#bib.bib31)) | - | 54.7 | 54.1 | 30.6 | 32.7 | 86.1 | 1137/254 |
| Qwen-VL-Chat (Bai et al., [2023](https://arxiv.org/html/2606.29425#bib.bib30)) | 50.7 | 65.5 | 68.8 | 37.0 | 37.5 | 61.8 | 1487/360 |
| ShareGPT4V-13B (Chen et al., [2024a](https://arxiv.org/html/2606.29425#bib.bib74)) | - | 70.7 | 72.6 | 36.6 | 38.3 | 87.5 | 1569/284 |
| LLaVA-Next-Mistral-7B (Liu et al., [2024](https://arxiv.org/html/2606.29425#bib.bib62)) | - | 69.5 | 73.0 | 37.0 | 38.4 | 87.3 | 1512/308 |
| Gemma3-4B (Team et al., [2025](https://arxiv.org/html/2606.29425#bib.bib75)) | - | 76.0 | 77.1 | 47.3 | 47.9 | 84.6 | 1353/391 |
| MoE-LLaVA-1.8B×4 (Lin et al., [2024](https://arxiv.org/html/2606.29425#bib.bib27)) | - | - | 63.1 | - | - | 87.0 | 1291 |
| MoE-LLaVA-2.7B×4 (Lin et al., [2024](https://arxiv.org/html/2606.29425#bib.bib27)) | - | - | 68.5 | - | - | 86.3 | 1423 |
| LLaVA-v1.6-13b (Liu et al., [2024](https://arxiv.org/html/2606.29425#bib.bib62)) | 55.42 | 70.65 | 73.62 | 35.00 | 41.13 | 86.20 | 1567/321 |
| w/ Self-Correction (He et al., [2025](https://arxiv.org/html/2606.29425#bib.bib1)) | 56.06 | 70.98 | 74.16 | 35.66 | 41.76 | 86.91 | 1564/326 |
| w/ Multi-Agent Debate (Liang et al., [2024](https://arxiv.org/html/2606.29425#bib.bib11)) | 55.97 | 71.23 | 74.31 | 36.22 | 42.78 | 87.27 | 1567/334 |
| w/ MoE-LoRA (Li et al., [2024a](https://arxiv.org/html/2606.29425#bib.bib60)) | 55.62 | 71.56 | 74.51 | 37.29 | 43.35 | 87.42 | 1564/332 |
| w/ MoD-Single Round (ours) | 55.95 | 71.67 | 74.51 | 37.88 | 42.80 | 87.51 | 1559/347 |
| w/ MoD-Multi Round (ours) | 56.35 | 72.10 | 75.21 | 38.44 | 44.40 | 87.65 | 1560/341 |
| Qwen2.5VL-3b-Instruct (Wang et al., [2024b](https://arxiv.org/html/2606.29425#bib.bib71)) | 62.84 | 79.30 | 81.40 | 53.10 | 56.30 | 85.90 | 1592/607 |
| w/ Self-Correction (He et al., [2025](https://arxiv.org/html/2606.29425#bib.bib1)) | 63.70 | 79.73 | 81.57 | 48.46 | 56.57 | 86.91 | 1464/591 |
| w/ Multi-Agent Debate (Liang et al., [2024](https://arxiv.org/html/2606.29425#bib.bib11)) | 63.21 | 79.45 | 81.62 | 48.28 | 56.67 | 86.14 | 1464/582 |
| w/ MoE-LoRA (Li et al., [2024a](https://arxiv.org/html/2606.29425#bib.bib60)) | 63.59 | 79.64 | 80.45 | 47.97 | 56.85 | 85.49 | 1592/608 |
| w/ MoD-Single Round (ours) | 63.73 | 79.87 | 80.66 | 46.44 | 57.84 | 85.88 | 1598/623 |
| w/ MoD-Multi Round (ours) | 63.69 | 79.92 | 81.71 | 49.47 | 57.47 | 84.16 | 1595/608 |

### 3.2. Viewpoint-Shift Data Synthesis

The architecture provides the capacity for self-debate, but realizing this potential requires supervision capturing stance-taking, counter-argumentation, and belief revision. We synthesize training data targeting viewpoint-shift episodes.

Given a reasoning instance $(V,Q,A^{*})$, we perform $K$ sampling rounds from a base model, partitioning responses into a correct pool $\mathcal{P}^{+}$ and an incorrect pool $\mathcal{P}^{-}$. We construct a scalable dataset of single-turn interactions (specifically, Human-GPT pairs). To ensure the accuracy of the final output, the target response $a'$ (assigned to the GPT role) is consistently sampled from $\mathcal{P}^{+}$. For the input prompt (assigned to the Human role), we aggregate the original question $Q$ and images $V$ with $R$ responses $(a^{1},a^{2},\dots,a^{R})$ randomly sampled from $\mathcal{P}^{+}\cup\mathcal{P}^{-}$, supplemented by a specific debate prompt $\mathcal{D}$.

This process is designed to simulate the model’s ability to derive the correct conclusion after evaluating diverse viewpoints from other agents. Subsequently, we divide the constructed data into three categories: $\mathcal{T}_{pos}$ (consistent correct chains), $\mathcal{T}_{rev}$ (error identification and viewpoint shifts), and $\mathcal{T}_{rob}$ (maintaining correct stances despite misleading information). Finally, the fine-tuning dataset is constructed by combining data from the three categories according to different strategies.

For a fine-tuning sample with $R$ intermediate responses $(a^{1},a^{2},\dots,a^{R})$, the model receives the question $Q$, the images $V$, and all intermediate responses concatenated with the debate prompt $\mathcal{D}$ as its input prompt, and is supervised to generate the target response $a'$. This teaches the model to evaluate prior reasoning, identify potential errors, and produce a grounded answer that either confirms or revises earlier viewpoints.
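The construction of one Human-GPT pair can be sketched as follows. This is a text-only illustration under several assumptions that the paper does not specify: the function names, the `Response i:` prompt layout, the correctness check passed in as a callable, and sampling the $R$ context responses with replacement are all hypothetical choices, and the image input $V$ is omitted.

```python
import random


def partition_responses(responses, is_correct):
    """Split sampled responses into a correct pool and an incorrect pool."""
    correct = [r for r in responses if is_correct(r)]
    incorrect = [r for r in responses if not is_correct(r)]
    return correct, incorrect


def build_debate_sample(question, correct_pool, incorrect_pool,
                        num_context, debate_prompt, rng):
    """Assemble one single-turn training pair (hypothetical prompt layout)."""
    if not correct_pool:
        raise ValueError("the target must be drawn from a non-empty correct pool")
    candidates = correct_pool + incorrect_pool
    # Assumption: the R context responses are drawn with replacement.
    context = [rng.choice(candidates) for _ in range(num_context)]
    target = rng.choice(correct_pool)   # the target always comes from P+
    numbered = [f"Response {i + 1}: {resp}" for i, resp in enumerate(context)]
    human = "\n".join([question, *numbered, debate_prompt])
    return {"human": human, "gpt": target, "context": context}


rng = random.Random(0)
pool_pos, pool_neg = partition_responses(
    ["The answer is 4.", "The answer is 5.", "It equals 4."],
    is_correct=lambda r: "4" in r,
)
sample = build_debate_sample("What is 2 + 2?", pool_pos, pool_neg,
                             num_context=2, debate_prompt="Debate prompt.", rng=rng)
print(sample["human"])
print(sample["gpt"])
```

Because the context may contain only correct, only incorrect, or mixed responses while the target is always correct, samples built this way cover confirmation as well as revision of earlier viewpoints.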

![Refer to caption](https://arxiv.org/html/2606.29425v1/x3.png)

Figure 3. Comparison of layer-wise expert activation distributions between Router A and B. Each stacked bar represents the percentage of tokens routed to different experts at a given attention layer. The left panel shows Router-A (Interpretation) and the right panel shows Router-B (Synthesis), enabling a direct comparison of expert utilization patterns across layers.

### 3.3. Training Objective

The training objective combines an autoregressive language modeling loss with an auxiliary load-balancing loss:

$$\mathcal{L}=\mathcal{L}_{\text{reg}}+\lambda\mathcal{L}_{\text{aux}}. \tag{7}$$

The autoregressive loss $\mathcal{L}_{\text{reg}}$ is computed over the target response $a'$:

$$\mathcal{L}_{\text{reg}}=-\sum_{t=1}^{|a'|}\log p(a'_{t}\mid V,Q,\mathcal{D},a^{1},a^{2},\dots,a^{R},a'_{<t}), \tag{8}$$

where the model is conditioned on the visual input $V$, question $Q$, and concatenated preceding responses $(\mathcal{D},a^{1},a^{2},\dots,a^{R})$ as context. The auxiliary loss $\mathcal{L}_{\text{aux}}$ encourages balanced expert utilization:

$$\mathcal{L}_{\text{aux}}=\frac{E}{2L}\sum_{l=1}^{L}\sum_{i=1}^{E}\left(f^{A}_{i}(l)\cdot P^{A}_{i}(l)+f^{B}_{i}(l)\cdot P^{B}_{i}(l)\right), \tag{9}$$

where $E$ represents the number of experts, and $L$ is the total number of MoD layers. The variable $f^{A}_{i}$ measures the expert usage frequency in LoRA-A (the fraction of tokens assigned to expert $i$ via TopK selection), while $P^{A}_{i}$ represents the average routing probability for expert $i$ in LoRA-A. The definitions of $f^{B}_{i}$ and $P^{B}_{i}$ for LoRA-B follow analogously:

$$f^{A}_{i}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{1}\left(i\in\mathcal{I}^{A}(t)\right),\quad P^{A}_{i}=\frac{1}{T}\sum_{t=1}^{T}\ell^{A}_{i}(t), \tag{10}$$

where $T$ denotes the total number of tokens in the batch. The total auxiliary loss $\mathcal{L}_{\text{aux}}$ is computed by averaging the load-balancing objective across all layers, independently discouraging expert collapse for both the LoRA-A and LoRA-B routers. During training, the model employs self-generated content as ground truth, yielding a relatively low loss magnitude. To prevent the auxiliary loss from overshadowing the language modeling loss, we modify the conventional multi-layer auxiliary loss calculation by normalizing by the number of layers $L$, and set $\lambda=0.01$ by default.
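The load-balancing term of Eqs. (9) and (10) can be checked numerically with a short sketch. The function names and the per-layer dictionary layout are hypothetical. One further assumption: the per-token `scores` passed in are treated as routing probabilities, following the textual description of $P_{i}$, whereas Eq. (10) as printed averages $\ell_{i}$; the code simply averages whatever the caller supplies.

```python
def balance_term(selected, scores, num_experts):
    """Sum over experts of f_i * P_i for one router in one layer (Eq. 10)."""
    num_tokens = len(selected)
    # f_i: fraction of tokens whose top-r selection contains expert i.
    f = [sum(1 for idx in selected if i in idx) / num_tokens
         for i in range(num_experts)]
    # P_i: mean routing score of expert i over the tokens.
    p = [sum(tok[i] for tok in scores) / num_tokens for i in range(num_experts)]
    return sum(fi * pi for fi, pi in zip(f, p))


def aux_loss(layers, num_experts):
    """Eq. (9): average of the A-side and B-side terms over L layers."""
    total = sum(
        balance_term(layer["sel_a"], layer["scores_a"], num_experts)
        + balance_term(layer["sel_b"], layer["scores_b"], num_experts)
        for layer in layers
    )
    return num_experts / (2 * len(layers)) * total


balanced = {"sel_a": [[0], [1]], "scores_a": [[0.5, 0.5], [0.5, 0.5]],
            "sel_b": [[0], [1]], "scores_b": [[0.5, 0.5], [0.5, 0.5]]}
collapsed = {"sel_a": [[0], [0]], "scores_a": [[1.0, 0.0], [1.0, 0.0]],
             "sel_b": [[0], [0]], "scores_b": [[1.0, 0.0], [1.0, 0.0]]}
print(aux_loss([balanced], num_experts=2))   # 1.0
print(aux_loss([collapsed], num_experts=2))  # 2.0
```

In this two-expert toy case, uniform usage gives 1.0 and routing every token to one expert gives 2.0, so minimizing the term pushes both routers toward balanced usage.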

## 4\.Experiments

We conduct extensive experiments to evaluate the proposed Mixture\-of\-Debaters \(MoD\) framework\. Our evaluation focuses on whether MoD improves reasoning performance over strong baselines, how each proposed component contributes to the overall gains, and whether it achieves a better accuracy\-efficiency trade\-off across diverse reasoning scenarios\.

### 4\.1\.Experimental Setup

We build MoD upon LLaVA-v1.6-Vicuna-13B (Liu et al., [2024](https://arxiv.org/html/2606.29425#bib.bib62)) and Qwen2.5VL-3B-Instruct (Wang et al., [2024b](https://arxiv.org/html/2606.29425#bib.bib71)), keeping the base model weights frozen while training only the expert modules and routers. We evaluate on six benchmarks: MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2606.29425#bib.bib25)) for multitask language understanding, ScienceQA (Lu et al., [2022](https://arxiv.org/html/2606.29425#bib.bib65)) for multimodal science reasoning, MMMU (Yue et al., [2024](https://arxiv.org/html/2606.29425#bib.bib66)) for college-level reasoning, MMStar (Chen et al., [2024b](https://arxiv.org/html/2606.29425#bib.bib69)) for vision-indispensable multimodal evaluation, POPE (Li et al., [2023](https://arxiv.org/html/2606.29425#bib.bib43)) for hallucination evaluation, and MME (Fu et al., [2023](https://arxiv.org/html/2606.29425#bib.bib44)) for perception and cognition abilities. For MoD, we evaluate two modes: MoD-Single (single-round inference) and MoD-Multi (two-round self-debate). Implementation details, configurations, and data construction are provided in the Appendix.

![Refer to caption](https://arxiv.org/html/2606.29425v1/x4.png)

Figure 4. Efficiency comparison on ScienceQA-TEST. X-axis shows inference latency relative to LoRA (1.0×). Y-axis shows accuracy. Bubble size indicates token consumption.
### 4\.2\.Main Results

As shown in Table [1](https://arxiv.org/html/2606.29425#S3.T1), MoD achieves the strongest overall performance, with especially clear gains over Standard MoE-LoRA on reasoning-intensive benchmarks. On LLaVA-v1.6-13B, MoD-Multi improves ScienceQA-TEST from 74.51 to 75.21, MMMU-VAL from 37.29 to 38.44, and MMStar from 43.35 to 44.40. These results show that the proposed decoupled routing design yields consistent gains over existing routed LoRA baselines, rather than merely matching them with a different formulation.

MoD also generalizes across backbones\. On Qwen2\.5VL\-3B\-Instruct, MoD\-Multi improves ScienceQA\-TEST from 80\.45 to 81\.71 and MMMU\-VAL from 47\.97 to 49\.47 over Standard MoE\-LoRA, while MoD\-Single achieves the best MMStar score of 57\.84\. Overall, these results suggest that internalizing debate through dynamic expert routing provides a stronger reasoning\-oriented alternative to standard MoE\-LoRA\-style routing\.

![Refer to caption](https://arxiv.org/html/2606.29425v1/x5.png)

Figure 5. Ablation analysis on (a) switching granularity, (b) gating combination, and (c) router independence, validating our proposed momentum switching, sqrt-product combination, and dual-routing designs.
### 4\.3\.Ablation Studies

We conduct systematic ablations to isolate the contribution of each proposed component\. All experiments use LLaVA\-v1\.6\-Vicuna\-13B as the base model and report accuracy on ScienceQA\-TEST, POPE, MMMU\-VAL and MMStar\.

#### 4\.3\.1\.Effect of Data Synthesis Strategy

Table [2](https://arxiv.org/html/2606.29425#S4.T2) examines the contribution of each trajectory topology. We compare: (1) $\mathcal{T}_{pos}$ only, training solely on consistent correct chains; (2) $\mathcal{T}_{pos}$ + $\mathcal{T}_{rev}$, adding correction trajectories; and (3) Full (ours), combining all three topologies including robustness trajectories $\mathcal{T}_{rob}$.

Training on $\mathcal{T}_{pos}$ alone reduces to standard supervised fine-tuning and lacks the epistemic conflict needed to teach error recognition. Adding $\mathcal{T}_{rev}$ provides viewpoint-shift supervision, yielding substantial gains by forcing the model to identify flawed reasoning and revise its stance. The full mixture further incorporates $\mathcal{T}_{rob}$, training the model to maintain correct beliefs despite intermediate misleading information, a critical capability when debating against adversarial counterparts.

#### 4\.3\.2\.Effect of Architecture Design

Figure [4](https://arxiv.org/html/2606.29425#S4.F4) compares four representative paradigms: (1) Standard LoRA, a single low-rank adapter without expert decomposition; (2) MoE, which replaces the feed-forward network with full-size FFN experts at substantial parameter cost; (3) MoE-LoRA, a parameter-efficient routed adapter baseline that couples each low-rank expert under a single routing decision; and (4) MoD (ours), which decouples interpretation and synthesis expert pools with independent routing. We additionally include Multi-Agent Debate (MAD) as an external-debate reference. This comparison addresses two questions: whether MoD improves over existing routed LoRA-style architectures, and whether internalized debate is more efficient than external multi-agent debate.

Table 2. Ablation on trajectory topology composition. $\mathcal{T}_{pos}$: correct samples only. $\mathcal{T}_{rev}$: correction trajectories. $\mathcal{T}_{rob}$: robustness trajectories with interleaved samples.

Decoupling interpretation (A-side) and synthesis (B-side) experts enables $N \times N$ combinatorial pathways, compared with the $N$ pathways of coupled MoE-LoRA, providing richer compositional capacity at the same parameter scale.
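The pathway count can be made concrete with a small enumeration. The sketch below is a simplification that considers a single selected expert per side (top-1) and ignores Top-$r$ mixing; the helper names are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def coupled_pathways(num_experts: int) -> List[Tuple[int, int]]:
    """Coupled MoE-LoRA: one routing decision picks the same index on both sides."""
    return [(i, i) for i in range(num_experts)]


def decoupled_pathways(num_experts: int) -> List[Tuple[int, int]]:
    """Decoupled routing: the A-side and B-side experts are chosen independently."""
    return list(product(range(num_experts), repeat=2))


n = 8  # expert pool size used by default in the paper
print(len(coupled_pathways(n)))    # 8
print(len(decoupled_pathways(n)))  # 64
```

Both configurations hold the same $2N$ low-rank expert matrices, so the additional pathways come from routing freedom rather than from extra parameters.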

Figure [4](https://arxiv.org/html/2606.29425#S4.F4) reports the accuracy–efficiency trade-off on ScienceQA-TEST. Compared with MoE-LoRA, MoD operates at the same learnable parameter scale (12M) and a similar token budget, yet achieves higher accuracy, showing that the gain comes from the proposed decoupled routing design rather than increased model size or token cost. Compared with MAD, MoD achieves better accuracy with much lower latency and token consumption by internalizing deliberation within a single backbone. Overall, these results show that MoD improves over routed PEFT baselines such as MoE-LoRA, while also offering a substantially better accuracy–efficiency trade-off than external multi-agent debate.

![Refer to caption](https://arxiv.org/html/2606.29425v1/x6.png)

Figure 6. Routing stability analysis. Top: Comparison of different switching strategies in terms of switch rate and accuracy. Bottom: Effect of window size in Momentum Switching.
#### 4\.3\.3\.Effect of Routing Mechanism

We ablate three key design choices in our routing mechanism on MMStar and MMMU\-VAL, corresponding to challenges identified in adapting MoE to dialectical reasoning\.

Dual-Routing vs. Single-Routing. Our first contribution argues that effective debate requires decoupling role allocation from process flow, which a single router cannot simultaneously optimize. Figure [5](https://arxiv.org/html/2606.29425#S4.F5)(c) validates this by comparing independent dual routers against a shared router (forcing $\mathcal{I}^A = \mathcal{I}^B$). Dual routing achieves 42.80% on MMStar and 37.88% on MMMU-VAL, outperforming shared routing by 0.93 and 2.65 points, respectively. This supports the claim that decoupling role allocation from process control enables the asymmetric expert combinations essential for dialectical reasoning.

Momentum Switching. Our second contribution addresses the instability of token-level gating, where noisy per-token decisions can trigger frequent expert switches and cause stance jitter mid-argument. Figure [5](https://arxiv.org/html/2606.29425#S4.F5)(a) compares three granularities: token-level routing reaches only 35.22% on MMMU-VAL because of erratic expert switching, and region-level routing drops to 34.0% because rigid boundaries disrupt autoregressive generation. Our sliding window achieves 38.44% by stabilizing expert usage over short spans while preserving autoregressive compatibility. On MMStar, the sliding window (42.80%) outperforms token-level routing (41.24%) by 1.56 points, indicating that stable expert usage is critical for complex multimodal reasoning tasks.

![Refer to caption](https://arxiv.org/html/2606.29425v1/x7.png)

Figure 7. Category-wise expert activation heatmaps for Router A and B on the ScienceQA-TEST benchmark. Each row corresponds to a topic category and each column to an expert. Color intensity indicates the relative activation frequency of an expert within a category.

Expert Pool Size ($E$). We ablate the total number of experts in each pool (i.e., $|\mathcal{E}^A| = |\mathcal{E}^B| = E$) while fixing Top-$r$ with $r = 4$ for both routers. As shown in Fig. [5](https://arxiv.org/html/2606.29425#S4.F5)(b), $E = 8$ performs best (MMMU 37.9%, MMStar 42.8%). Relative to $E = 8$, increasing to $E = 12$ reduces MMMU by 0.5 points (to 37.4%) and MMStar by 4.9 points (to 37.9%), and $E = 16$ reduces MMMU by 1.6 points (to 36.3%) and MMStar by 5.0 points (to 37.8%). This suggests that enlarging the expert pool under fixed sparsity makes routing and training less effective, weakening specialization. We use $E = 8$ by default. To further verify the proposed routing design, we analyze its conflict sensitivity and routing stability.
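To illustrate what fixing Top-$r$ at $r = 4$ means for a pool of $E$ experts, here is a minimal sketch of sparse top-$r$ selection for one token. The softmax scoring and the renormalization over the selected experts are common MoE conventions assumed here for illustration; the paper's gating combination may differ, and `top_r_route` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple


def top_r_route(logits: Sequence[float], r: int) -> List[Tuple[int, float]]:
    """Pick the r highest-probability experts and renormalize their weights to sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    norm = sum(exps)
    probs = [e / norm for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:r]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


# One token, E = 8 experts, r = 4 active experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5]
routed = top_r_route(logits, r=4)
print([i for i, _ in routed])  # [6, 1, 4, 3]
```

With $r$ held fixed, each token still activates four experts as $E$ grows, so every individual expert sees a smaller share of the tokens. This is consistent with the weaker specialization reported above for $E = 12$ and $E = 16$, although the sketch itself does not demonstrate that effect.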

#### 4\.3\.4\.Routing Stability Analysis

Our second design claim is that Momentum Switching improves short-range routing consistency by reducing expert-switch jitter during autoregressive generation. We quantify this effect on the MMMU-VAL benchmark using Switch Rate, defined as the fraction of adjacent token pairs with different top-1 experts, where lower values indicate more stable routing. We compare three routing granularities, namely token-level routing, region-level routing, and our sliding-window Momentum Switching, and further vary the window size $\mathcal{W}$ to examine the trade-off between routing stability and task performance.
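The Switch Rate definition above translates directly into code. The following sketch takes the sequence of top-1 expert indices for one generated response; returning 0.0 for sequences with fewer than two tokens is a convention chosen here, not something the paper specifies.

```python
from typing import Sequence


def switch_rate(top1_experts: Sequence[int]) -> float:
    """Fraction of adjacent token pairs whose top-1 experts differ."""
    if len(top1_experts) < 2:
        return 0.0  # no adjacent pairs; assumed convention
    switches = sum(1 for prev, cur in zip(top1_experts, top1_experts[1:]) if prev != cur)
    return switches / (len(top1_experts) - 1)


print(switch_rate([0, 0, 1, 1, 2]))  # 0.5
print(switch_rate([3, 3, 3]))        # 0.0
```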

![Refer to caption](https://arxiv.org/html/2606.29425v1/x8.png)

Figure 8. Case Study. Left: MoD produces more grounded responses than LLaVA-Next by attending to fine-grained visual details (top), and momentum switching prevents mid-response stance shifts that occur with token-level routing (bottom). Right: Expert activation analysis on the same image with different questions.

As shown in Figure [6](https://arxiv.org/html/2606.29425#S4.F6), the three routing strategies exhibit different stability–performance trade-offs. Token-level routing has the highest switch rate (0.71) and low accuracy (35.2%), while region-level routing greatly reduces switching (0.18) but lowers accuracy further to 34.0%. In contrast, sliding-window Momentum Switching reaches an intermediate switch rate (0.39) and the best accuracy (37.88%), indicating improved routing stability without sacrificing reasoning flexibility.

The bottom panel shows the effect of window size $\mathcal{W}$. As $\mathcal{W}$ increases from 1 to 32, the switch rate decreases from 0.71 to 0.29, confirming that larger windows smooth routing decisions. Accuracy, however, peaks at $\mathcal{W} = 16$ (37.88%) and drops slightly at $\mathcal{W} = 32$ (37.3%), indicating that overly large windows may oversmooth routing and weaken adaptability. Overall, $\mathcal{W} = 16$ provides the best trade-off between routing stability and reasoning performance.
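The effect of the window size can be illustrated with a toy smoother. The sketch below applies a uniform causal moving average over the last `window` tokens of router scores before taking the top-1 expert. This is only one plausible form of sliding-window smoothing, assumed here for illustration; the paper's Momentum Switching is defined in its method section and may weight the window differently.

```python
from typing import List, Sequence


def smooth_scores(scores: Sequence[Sequence[float]], window: int) -> List[List[float]]:
    """Causal moving average of per-token router scores over the last `window` tokens."""
    smoothed = []
    for t in range(len(scores)):
        span = scores[max(0, t - window + 1): t + 1]  # only past and current tokens
        smoothed.append([sum(col) / len(span) for col in zip(*span)])
    return smoothed


def top1(scores: Sequence[Sequence[float]]) -> List[int]:
    """Index of the highest-scoring expert at each token."""
    return [max(range(len(s)), key=lambda i: s[i]) for s in scores]


def switch_rate(ids: Sequence[int]) -> float:
    """Fraction of adjacent token pairs whose top-1 experts differ."""
    if len(ids) < 2:
        return 0.0
    return sum(a != b for a, b in zip(ids, ids[1:])) / (len(ids) - 1)


# Noisy scores for 2 experts over 5 tokens: expert 0 dominates on average.
scores = [[0.7, 0.3], [0.4, 0.6], [0.8, 0.2], [0.45, 0.55], [0.9, 0.1]]

print(switch_rate(top1(smooth_scores(scores, window=1))))  # 1.0
print(switch_rate(top1(smooth_scores(scores, window=3))))  # 0.0
```

A window of 1 reproduces token-level routing and flips experts at every step, while a window of 3 suppresses the isolated flips. Because the average only looks backward, it remains compatible with autoregressive decoding.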

#### 4\.3\.5\.Expert Utilization Analysis

To validate that our architectural innovations lead to meaningful expert specialization, we analyze routing patterns across two dimensions: layer\-wise activation and topic\-wise distribution\.

Expert Activation Analysis. Figure [3](https://arxiv.org/html/2606.29425#S3.F3) compares expert activation for Router-A (role allocation) and Router-B (process control). Both routers maintain balanced utilization across all eight experts throughout the layers, indicating that our load-balancing objective prevents expert collapse. Notably, the two routers exhibit distinct activation preferences: Router-A shows higher activation on E0, while Router-B distributes more uniformly across experts. This divergence suggests that our dual-routing mechanism learns differentiated strategies for role allocation versus process control. Detailed activation heatmaps are provided in the Appendix.

Topic-wise Expert Specialization. Figure [7](https://arxiv.org/html/2606.29425#S4.F7) visualizes expert activation frequencies stratified by question topic for both routers. While overall activation remains balanced, subtle topic-dependent patterns emerge. For Router-A, experts E0 and E3 show elevated activation for science-related topics (biology, chemistry, physics), while E5 and E7 are more active for humanities topics (us-history, world-history). Router-B exhibits complementary patterns, with E1 and E4 favored for science and E2 and E6 for humanities. This cross-router complementarity enables the $N \times N$ combinatorial pathways central to our design: by pairing topic-specialized interpretation experts with process-appropriate synthesis experts, the model can construct diverse reasoning strategies tailored to each input.
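A category-wise heatmap of this kind can be built by counting expert selections per topic and normalizing each row, which matches the "relative activation frequency of an expert within a category" in the Figure 7 caption. The sketch below assumes routing logs are available as `(category, expert_id)` pairs; that log format and the function name are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def activation_heatmap(records: Iterable[Tuple[str, int]],
                       num_experts: int) -> Dict[str, List[float]]:
    """Row-normalized activation frequency of each expert within each category."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for category, expert in records:
        counts[category][expert] += 1
    heatmap = {}
    for category, per_expert in counts.items():
        total = sum(per_expert.values())
        heatmap[category] = [per_expert[e] / total for e in range(num_experts)]
    return heatmap


# Toy routing log for one router (made-up values, not results from the paper).
log = [("biology", 0), ("biology", 0), ("biology", 3), ("us-history", 5)]
rows = activation_heatmap(log, num_experts=8)
print(rows["us-history"][5])  # 1.0
```

Running the same function once on Router-A logs and once on Router-B logs gives the two panels that can then be compared for complementary patterns.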

### 4\.4\.Case Study

Figure [8](https://arxiv.org/html/2606.29425#S4.F8) presents qualitative comparisons. First, compared to LLaVA-Next, MoD attends to finer-grained visual details: it correctly identifies a desktop setup by recognizing the separate monitor and keyboard, rather than misclassifying the scene as a laptop. Second, the comparison between token-level routing and Momentum Switching reveals not only improved consistency, but also evidence of internal debate. Token-level routing exposes competing hypotheses, such as “sketch or photograph” or “land a clean trick,” whereas MoD-Sliding resolves them into a more coherent final judgment. This suggests that MoD internally evaluates alternative hypotheses and stabilizes them through smoother routing. Finally, we analyze expert activations on the same image under different questions. Both routers adapt their selections according to the query, while Router-B exhibits more pronounced variation across questions, indicating stronger question-dependent synthesis behavior. In particular, the right panel shows that even for the same visual input, the routing trajectories differ substantially when the model answers “What football club is the player representing?” versus “What is happening in the image?” This supports our claim that MoD internalizes debate as dynamic coordination between interpretation and synthesis pathways, rather than a static protocol.

## 5\.Conclusion

We present Mixture-of-Debaters (MoD), a framework that enables dynamic self-debate within a single model. MoD addresses three key challenges: dual-routing decouples role allocation from process control, momentum switching maintains argumentative consistency, and unified self-debate eliminates multi-agent overhead. Experiments show MoD outperforms both single-model baselines and multi-agent debate systems, achieving 3.7× lower latency and 87% token reduction with only 12M additional parameters. We hope MoD provides a practical and efficient alternative for enhancing reasoning capabilities in large language models.

## References

- J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou (2023) Qwen-VL: a frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966. Cited by: [1st item](https://arxiv.org/html/2606.29425#A2.I2.i1.p1.1), [Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.4.4.1).
- X\. Bo, Z\. Zhang, Q\. Dai, X\. Feng, L\. Wang, R\. Li, X\. Chen, and J\. Wen \(2024\)Reflective multi\-agent collaboration based on large language models\.Advances in Neural Information Processing Systems37,pp\. 138595–138631\.Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p1.1)\.
- L\. Chen, J\. Li, X\. Dong, P\. Zhang, C\. He, J\. Wang, F\. Zhao, and D\. Lin \(2024a\)ShareGPT4V: improving large multi\-modal models with better captions\.InEuropean Conference on Computer Vision,Cham, Switzerland,pp\. 370–387\.Cited by:[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.5.5.1)\.
- L\. Chen, J\. Li, X\. Dong, P\. Zhang, Y\. Zang, Z\. Chen, H\. Duan, J\. Wang, Y\. Qiao, D\. Lin,et al\.\(2024b\)Are we on the right way for evaluating large vision\-language models?\.Advances in Neural Information Processing Systems37,pp\. 27056–27087\.Cited by:[3rd item](https://arxiv.org/html/2606.29425#A2.I1.i3.p1.1),[§4\.1](https://arxiv.org/html/2606.29425#S4.SS1.p1.1)\.
- T\. Chen, X\. Chen, X\. Du, A\. Rashwan, F\. Yang, H\. Chen, Z\. Wang, and Y\. Li \(2023\)AdamV\-MoE: adaptive multi\-task vision mixture\-of\-experts\.InProceedings of the IEEE/CVF International Conference on Computer Vision,Paris, France,pp\. 17346–17357\.Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p3.1)\.
- H\. K\. Choi, X\. Zhu, and Y\. Li \(2025\)Debate or vote: which yields better decisions in multi\-agent large language models?\.External Links:2508\.17536Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p1.1)\.
- W\. Dai, J\. Li, D\. Li, A\. Tiong, J\. Zhao, W\. Wang, B\. Li, P\. N\. Fung, and S\. Hoi \(2023\)Instructblip: towards general\-purpose vision\-language models with instruction tuning\.Advances in neural information processing systems36,pp\. 49250–49267\.Cited by:[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.3.3.1)\.
- A\. Didolkar, A\. Goyal, N\. R\. Ke, S\. Guo, M\. Valko, T\. Lillicrap, D\. Jimenez Rezende, Y\. Bengio, M\. C\. Mozer, and S\. Arora \(2024\)Metacognitive capabilities of llms: an exploration in mathematical problem solving\.Advances in Neural Information Processing Systems37,pp\. 19783–19812\.Cited by:[§1](https://arxiv.org/html/2606.29425#S1.p1.1)\.
- Y\. Ding and R\. Zhang \(2025\)Sherlock: self\-correcting reasoning in vision\-language models\.External Links:2505\.22651Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p2.1)\.
- J\. Feng, C\. Wei, T\. Qiu, T\. Hu, and Z\. Pu \(2025\)CoMoE: contrastive representation for mixture\-of\-experts in parameter\-efficient fine\-tuning\.External Links:2505\.17553Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p3.1)\.
- C\. Fu, P\. Chen, Y\. Shen, Y\. Qin, M\. Zhang, X\. Lin, J\. Yang, X\. Zheng, K\. Li, X\. Sun, Y\. Wu, and R\. Ji \(2023\)MME: a comprehensive evaluation benchmark for multimodal large language models\.External Links:2306\.13394Cited by:[5th item](https://arxiv.org/html/2606.29425#A2.I1.i5.p1.1),[§4\.1](https://arxiv.org/html/2606.29425#S4.SS1.p1.1)\.
- C\. Gao, K\. Chen, J\. Rao, R\. Liu, B\. Sun, Y\. Zhang, D\. Peng, X\. Guo, and V\. S\. Subrahmanian \(2025\)MoLA: MoE LoRA with layer\-wise expert allocation\.InFindings of the Association for Computational Linguistics: NAACL 2025,Albuquerque, New Mexico, USA,pp\. 5097–5112\.Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p3.1)\.
- J\. He, H\. Lin, Q\. Wang, Y\. R\. Fung, and H\. Ji \(2025\)Self\-correction is more than refinement: a learning framework for visual and language reasoning tasks\.InFindings of the Association for Computational Linguistics: ACL 2025,Vienna, Austria,pp\. 6405–6421\.Cited by:[3rd item](https://arxiv.org/html/2606.29425#A2.I2.i3.p1.1),[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.11.11.1),[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.17.17.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.Note:Published as a conference paper at ICLR 2021External Links:2009\.03300Cited by:[6th item](https://arxiv.org/html/2606.29425#A2.I1.i6.p1.1),[§4\.1](https://arxiv.org/html/2606.29425#S4.SS1.p1.1)\.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) LoRA: low-rank adaptation of large language models. ICLR 1(2), pp. 3. Cited by: [§3.1](https://arxiv.org/html/2606.29425#S3.SS1.p2.5).
- W\. Hu, W\. Zhang, Y\. Jiang, C\. J\. Zhang, X\. Wei, and Q\. Li \(2025\)Removal of hallucination on hallucination: debate\-augmented rag\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 15839–15853\.Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p1.1)\.
- P\. Kunwar, M\. N\. Vu, M\. Gupta, M\. Abdelsalam, and M\. Bhattarai \(2025\)TT\-LoRA MoE: using parameter\-efficient fine\-tuning and sparse mixture\-of\-experts\.InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis,New York, NY, USA,pp\. 1332–1350\.Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p3.1)\.
- D\. Lepikhin, H\. Lee, Y\. Xu, D\. Chen, O\. Firat, Y\. Huang, M\. Krikun, N\. Shazeer, and Z\. Chen \(2021\)GShard: scaling giant models with conditional computation and automatic sharding\.Note:Published as a conference paper at ICLR 2021External Links:2006\.16668Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p3.1)\.
- D\. Li, Z\. Tan, P\. Qian, Y\. Li, K\. S\. Chaudhary, L\. Hu, and J\. Shen \(2025a\)SMoA: improving multi\-agent large language models with sparse mixture\-of\-agents\.InPacific\-Asia Conference on Knowledge Discovery and Data Mining,Singapore,pp\. 54–65\.Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p2.1)\.
- D\. Li, Y\. Ma, N\. Wang, Z\. Ye, Z\. Cheng, Y\. Tang, Y\. Zhang, L\. Duan, J\. Zuo, C\. Yang, and M\. Tang \(2024a\)MixLoRA: enhancing large language models fine\-tuning with lora\-based mixture of experts\.External Links:2404\.15159Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p3.1),[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.13.13.1),[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.19.19.1)\.
- H\. Li, Z\. Su, Y\. Xue, Z\. Tian, Y\. Song, and M\. Huang \(2025b\)Advancing collaborative debates with role differentiation through multi\-agent reinforcement learning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 22655–22666\.Cited by:[§1](https://arxiv.org/html/2606.29425#S1.p1.1)\.
- Y\. Li, Y\. Du, K\. Zhou, J\. Wang, X\. Zhao, and J\. Wen \(2023\)Evaluating object hallucination in large vision\-language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 292–305\.Cited by:[4th item](https://arxiv.org/html/2606.29425#A2.I1.i4.p1.1),[§4\.1](https://arxiv.org/html/2606.29425#S4.SS1.p1.1)\.
- Y\. Li, S\. Jiang, B\. Hu, L\. Wang, W\. Zhong, W\. Luo, L\. Ma, and M\. Zhang \(2024b\)Uni\-moe: scaling unified multimodal llms with mixture of experts\.External Links:2405\.11273Cited by:[§1](https://arxiv.org/html/2606.29425#S1.p3.1),[§2](https://arxiv.org/html/2606.29425#S2.p3.1)\.
- D\. Liang, X\. Wei, and C\. Zheng \(2025a\)Multi\-agent undercover gaming: hallucination removal via counterfactual test for multimodal reasoning\.External Links:2511\.11182Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p1.1)\.
- D\. Liang, X\. Wei, and C\. Zheng \(2026\)Multi\-agent undercover gaming: hallucination removal through counterfactual test for multimodal reasoning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,Palo Alto, CA, USA,pp\. 6807–6815\.Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p1.1)\.
- D\. Liang, C\. Zheng, Z\. Wen, Y\. Cai, X\. Wei, and Q\. Li \(2025b\)Seeing beyond the scene: enhancing vision\-language models with interactional reasoning\.External Links:2505\.09118Cited by:[§3\.1](https://arxiv.org/html/2606.29425#S3.SS1.p1.2)\.
- T\. Liang, Z\. He, W\. Jiao, X\. Wang, Y\. Wang, R\. Wang, Y\. Yang, S\. Shi, and Z\. Tu \(2024\)Encouraging divergent thinking in large language models through multi\-agent debate\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 17889–17904\.Cited by:[2nd item](https://arxiv.org/html/2606.29425#A2.I2.i2.p1.1),[§1](https://arxiv.org/html/2606.29425#S1.p1.1),[§2](https://arxiv.org/html/2606.29425#S2.p1.1),[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.12.12.1),[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.18.18.1)\.
- B\. Lin, Z\. Tang, Y\. Ye, J\. Huang, J\. Zhang, Y\. Pang, P\. Jin, M\. Ning, J\. Luo, and L\. Yuan \(2024\)MoE\-LLaVA: mixture of experts for large vision\-language models\.External Links:2401\.15947Cited by:[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.8.8.1),[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.9.9.1)\.
- H\. Liu, C\. Li, Y\. Li, B\. Li, Y\. Zhang, S\. Shen, and Y\. J\. Lee \(2024\)LLaVA\-next: improved reasoning, ocr, and world knowledge\.External Links:[Link](https://llava-vl.github.io/blog/2024-01-30-llava-next/)Cited by:[§B\.1](https://arxiv.org/html/2606.29425#A2.SS1.p1.1),[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.10.10.1),[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.6.6.1),[§4\.1](https://arxiv.org/html/2606.29425#S4.SS1.p1.1)\.
- H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023a\)Visual instruction tuning\.Note:NeurIPS 2023External Links:2304\.08485Cited by:[1st item](https://arxiv.org/html/2606.29425#A2.I2.i1.p1.1)\.
- Q\. Liu, X\. Wu, X\. Zhao, Y\. Zhu, D\. Xu, F\. Tian, and Y\. Zheng \(2023b\)When MOE meets LLMs: parameter efficient fine\-tuning for multi\-task medical applications\.External Links:2310\.18339Cited by:[§1](https://arxiv.org/html/2606.29425#S1.p3.1),[§2](https://arxiv.org/html/2606.29425#S2.p3.1),[§3\.1\.3](https://arxiv.org/html/2606.29425#S3.SS1.SSS3.p2.2)\.
- P\. Lu, S\. Mishra, T\. Xia, L\. Qiu, K\. Chang, S\. Zhu, O\. Tafjord, P\. Clark, and A\. Kalyan \(2022\)Learn to explain: multimodal reasoning via thought chains for science question answering\.Note:NeurIPS 2022External Links:2209\.09513Cited by:[1st item](https://arxiv.org/html/2606.29425#A2.I1.i1.p1.1),[§4\.1](https://arxiv.org/html/2606.29425#S4.SS1.p1.1)\.
- R\. Ma, P\. Wang, C\. Liu, X\. Liu, J\. Chen, B\. Zhang, X\. Zhou, N\. Du, and J\. Li \(2025\)S2R: teaching llms to self\-verify and self\-correct via reinforcement learning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 22632–22654\.Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p2.1)\.
- Q\. Peng, J\. Li, S\. Huang, Y\. Jiang, K\. Gong, R\. Ding, S\. Ye, X\. Wei, C\. Zheng, and Q\. Li \(2025\)Aligning clinical needs and ai capabilities: a survey on llms for medical reasoning\.Note:TechRxiv preprintCited by:[§1](https://arxiv.org/html/2606.29425#S1.p1.1)\.
- S\. M\. Seals and V\. L\. Shalin \(2024\)Evaluating the deductive competence of large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),Mexico City, Mexico,pp\. 8614–8630\.Cited by:[§1](https://arxiv.org/html/2606.29425#S1.p1.1)\.
- V\. Subramaniam, Y\. Du, J\. B\. Tenenbaum, A\. Torralba, S\. Li, and I\. Mordatch \(2025\)Multiagent finetuning: self improvement with diverse reasoning chains\.External Links:2501\.05707Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p1.1)\.
- M\. Sun, Y\. Wang, T\. Feng, D\. Zhang, Y\. Zhu, and J\. Tang \(2025\)A stronger mixture of low\-rank experts for fine\-tuning foundation models\.External Links:2502\.15828Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p3.1)\.
- Z\. Tang, Z\. Li, Z\. Xiao, T\. Ding, R\. Sun, B\. Wang, D\. Liu, F\. Huang, T\. Liu, B\. Yu, and J\. Lin \(2025a\)RealCritic: towards effectiveness\-driven evaluation of language model critiques\.External Links:2501\.14492Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p2.1)\.
- Z\. Tang, Z\. Li, Z\. Xiao, T\. Ding, R\. Sun, B\. Wang, D\. Liu, F\. Huang, T\. Liu, B\. Yu,et al\.\(2025b\)Self\-evolving critique abilities in large language models\.External Links:2501\.05727Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p2.1)\.
- G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière,et al\.\(2025\)Gemma 3 technical report\.arXiv preprint arXiv:2503\.19786\.Cited by:[Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.7.7.1)\.
- K\. Tsui \(2025\)Self\-correction bench: uncovering and addressing the self\-correction blind spot in large language models\.arXiv preprint arXiv:2507\.02778\.Cited by:[§2](https://arxiv.org/html/2606.29425#S2.p2.1)\.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: [§3.1](https://arxiv.org/html/2606.29425#S3.SS1.p2.5).
- J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou (2024a) Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692. Cited by: [§2](https://arxiv.org/html/2606.29425#S2.p1.1).
- P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. (2024b) Qwen2-vl: enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191. Cited by: [§B.1](https://arxiv.org/html/2606.29425#A2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.29425#S3.T1.5.16.16.1), [§4.1](https://arxiv.org/html/2606.29425#S4.SS1.p1.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837. Cited by: [§1](https://arxiv.org/html/2606.29425#S1.p1.1).
- C. J. Yu, B. Jalaian, and N. D. Bastian (2024) Mitigating large vision-language model hallucination at post-hoc via multi-agent system. In Proceedings of the AAAI Symposium Series, Vol. 4, pp. 110–113. Cited by: [§2](https://arxiv.org/html/2606.29425#S2.p1.1).
- X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. (2024) Mmmu: a massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9556–9567. Cited by: [2nd item](https://arxiv.org/html/2606.29425#A2.I1.i2.p1.1), [§4.1](https://arxiv.org/html/2606.29425#S4.SS1.p1.1).
- D. Zhang, K. Zhang, S. Chu, L. Wu, X. Li, and S. Wei (2025a) More: a mixture of low-rank experts for adaptive multi-task learning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 1311–1324. Cited by: [§2](https://arxiv.org/html/2606.29425#S2.p3.1).
- K. Zhang, Q. Liu, L. Zhang, C. Zheng, S. Li, B. Xu, M. Yang, X. Qiao, and W. Lu (2025b) MADAWSD: multi-agent debate framework for adversarial word sense disambiguation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 22294–22313. Cited by: [§2](https://arxiv.org/html/2606.29425#S2.p1.1).
- X. Zhao, T. Xu, X. Wang, Z. Chen, D. Jin, L. Tan, Z. Yu, Z. Zhao, Y. He, S. Wang, et al. (2025) Boosting llm reasoning via spontaneous self-correction. arXiv preprint arXiv:2506.06923. Cited by: [§2](https://arxiv.org/html/2606.29425#S2.p2.1).
- C. Zheng and Q. Li (2026) Multimodal knowledge systems: construction and reasoning. Springer Nature. Cited by: [§1](https://arxiv.org/html/2606.29425#S1.p1.1).
- C. Zheng, D. Liang, W. Zhang, X. Wei, T. Chua, and Q. Li (2024) A picture is worth a graph: a blueprint debate paradigm for multimodal reasoning. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 419–428. Cited by: [§1](https://arxiv.org/html/2606.29425#S1.p1.1).
- C. Zheng (2025) Learning versatile multimodal representation for knowledge extraction and reasoning. Ph.D. Thesis, The Hong Kong Polytechnic University. Cited by: [§2](https://arxiv.org/html/2606.29425#S2.p1.1).
- X. Zhou, H. Huang, and L. Liao (2025) Debate, reflect, and distill: multi-agent feedback with tree-structured preference optimization for efficient language model enhancement. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 9122–9137. Cited by: [§2](https://arxiv.org/html/2606.29425#S2.p2.1).
- Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. Dai, Q. V. Le, J. Laudon, et al. (2022) Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35, pp. 7103–7114. Cited by: [§1](https://arxiv.org/html/2606.29425#S1.p3.1), [§2](https://arxiv.org/html/2606.29425#S2.p3.1), [§3.1.2](https://arxiv.org/html/2606.29425#S3.SS1.SSS2.p1.1).

## Appendix A Algorithm

Algorithm [1](https://arxiv.org/html/2606.29425#alg1) presents the complete forward pass of a specific MoD layer, including dual-routing, momentum switching via a sliding window, and gating score combination.

Algorithm 1 Mixture-of-Debaters Layer Forward Pass

Input: Input batch sequence $X=\{x(1),x(2),\dots,x(T)\}$, window size $\mathcal{W}$, top-k $r$

Input: Expert pools $\mathcal{E}^{A}=\{A_{1},\dots,A_{E}\}$, $\mathcal{E}^{B}=\{B_{1},\dots,B_{E}\}$

Input: Router projections $W^{A}_{g}$, $W^{B}_{g}$

Output: Output sequence $H=\{h(1),h(2),\dots,h(T)\}$, $\mathcal{L}_{\text{aux\_layer}}$

1: for $t=1$ to $T$ do

2: // Momentum Switching via Sliding Window

3: $x_{route}(t)=\frac{\sum_{k=\max(0,\,t-\mathcal{W}+1)}^{t}x(k)}{\min(t+1,\mathcal{W})}$

4: // Dual-Routing: Independent Expert Selection

5: $\ell^{A}\leftarrow W^{A}_{g}\cdot x_{route}(t)$

6: $\ell^{B}\leftarrow W^{B}_{g}\cdot x_{route}(t)$

7: $\mathcal{I}^{A}\leftarrow\text{TopK}(\ell^{A},r)$

8: $\mathcal{I}^{B}\leftarrow\text{TopK}(\ell^{B},r)$

9: $s^{A}\leftarrow\text{Softmax}(\ell^{A})$

10: $s^{B}\leftarrow\text{Softmax}(\ell^{B})$

11: $g^{A}\leftarrow\text{Normalize}(s^{A}[\mathcal{I}^{A}])$

12: $g^{B}\leftarrow\text{Normalize}(s^{B}[\mathcal{I}^{B}])$

13: // Gating Score Combination

14: $g\leftarrow\sqrt{g^{A}\odot g^{B}}$

15: // MoD Adapter Forward

16: $\mathbf{A}^{\mathcal{I}^{A}}\leftarrow\text{Stack}(\{A_{i}\}_{i\in\mathcal{I}^{A}})$ {$\mathbb{R}^{r\times k}$}

17: $\mathbf{B}^{\mathcal{I}^{B}}\leftarrow\text{Stack}(\{B_{i}\}_{i\in\mathcal{I}^{B}})$ {$\mathbb{R}^{r\times d}$}

18: $h^{\text{MoD}}(t)\leftarrow\frac{\alpha}{r}\cdot{\mathbf{B}^{\mathcal{I}^{B}}}^{\top}\left(g\odot\left(\mathbf{A}^{\mathcal{I}^{A}}\cdot x(t)\right)\right)$

19: $h(t)\leftarrow W_{0}x(t)+h^{\text{MoD}}(t)$

20: end for

21: // Load Balancing Loss for current layer (Training Only)

22: $f^{A}_{i}=\frac{1}{T}\sum_{t=1}^{T}1\left(i\in\mathcal{I}^{A}(t)\right)$

23: $P^{A}_{i}=\frac{1}{T}\sum_{t=1}^{T}\ell^{A}_{i}(t)$

24: $f^{B}_{i}=\frac{1}{T}\sum_{t=1}^{T}1\left(i\in\mathcal{I}^{B}(t)\right)$

25: $P^{B}_{i}=\frac{1}{T}\sum_{t=1}^{T}\ell^{B}_{i}(t)$

26: $\mathcal{L}_{\text{aux\_layer}}=\frac{E}{2}\sum_{i=1}^{E}\left(f^{A}_{i}\cdot P^{A}_{i}+f^{B}_{i}\cdot P^{B}_{i}\right)$

27: return $H$, $\mathcal{L}_{\text{aux\_layer}}$
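The forward-pass portion of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (which the paper reports is written in PyTorch): the function name `mod_layer_forward` and all argument names are hypothetical, each expert $A_i$ and $B_i$ is taken to be a single row vector (consistent with the stacked shapes $r\times k$ and $r\times d$), `Normalize` is assumed to rescale the selected scores so they sum to one, and tokens are 0-indexed.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mod_layer_forward(X, W0, A, B, WgA, WgB, window, r, alpha):
    """X: (T, k) tokens; W0: (d, k) frozen base weight;
    A: (E, k) and B: (E, d) expert rows; WgA, WgB: (E, k) router projections.
    Returns H: (T, d) plus the top-r expert indices chosen by each router."""
    T = X.shape[0]
    H = np.zeros((T, W0.shape[0]))
    idx_A, idx_B = [], []
    for t in range(T):
        # Momentum switching: route on the mean of the last `window` tokens.
        start = max(0, t - window + 1)
        x_route = X[start:t + 1].mean(axis=0)
        # Dual routing: two independent routers score the same routing input.
        logit_A, logit_B = WgA @ x_route, WgB @ x_route
        I_A = np.argsort(-logit_A)[:r]
        I_B = np.argsort(-logit_B)[:r]
        s_A, s_B = softmax(logit_A), softmax(logit_B)
        # Assumed meaning of Normalize: rescale selected scores to sum to one.
        g_A = s_A[I_A] / s_A[I_A].sum()
        g_B = s_B[I_B] / s_B[I_B].sum()
        g = np.sqrt(g_A * g_B)                 # gating score combination
        A_sel, B_sel = A[I_A], B[I_B]          # shapes (r, k) and (r, d)
        h_mod = (alpha / r) * (B_sel.T @ (g * (A_sel @ X[t])))
        H[t] = W0 @ X[t] + h_mod               # frozen path plus adapter path
        idx_A.append(I_A)
        idx_B.append(I_B)
    return H, np.array(idx_A), np.array(idx_B)
```

Setting every row of `B` to zero makes the adapter term vanish, so the layer reduces to the frozen projection `W0`, which is a convenient sanity check.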

Algorithm [2](https://arxiv.org/html/2606.29425#alg2) details the concrete implementation of self-debate in MoD during inference. At the $K$-th reasoning round, the model revisits its own responses from the previous $K-1$ rounds, which are explicitly incorporated into the prompt and concatenated with a dedicated self-debate instruction: “Review the following responses from other assistants and determine your final answer.” This formulation lets a single model simulate a multi-agent debate process.

Algorithm 2 Iterative Self-Debate Pipeline

Input: Image $I$, User Question $Q$

Input: Debate Prompt $\mathcal{D}$

Input: Multimodal LLM $\mathcal{M}$

Input: Max Rounds $K$

Output: Response Sequence $\mathcal{R}=\{R_{0},R_{1},\dots,R_{K}\}$

1: // Phase 1: Generate Initial Response

2: $C_{init}\leftarrow\{I,Q\}$

3: $R_{0}\leftarrow\mathcal{M}(C_{init})$

4: Initialize history accumulator: $\mathcal{H}\leftarrow R_{0}$

5: Add $R_{0}$ to result set $\mathcal{R}$

6: // Phase 2: Iterative Self-Debate

7: for $k=1$ to $K$ do

8: // Construct independent context with full history

9: $C_{k}\leftarrow\{I,Q,\mathcal{D},\mathcal{H}\}$

10: // Generate debate response

11: $R_{k}\leftarrow\mathcal{M}(C_{k})$

12: // Accumulate history

13: $\mathcal{H}\leftarrow\mathcal{H}\oplus R_{k}$

14: Add $R_{k}$ to result set $\mathcal{R}$

15: end for

16: return $\mathcal{R}$
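The control flow of Algorithm 2 is simple enough to sketch directly. The snippet below is an illustration only: `self_debate` and the dictionary keys are hypothetical names, and `model` stands in for the multimodal LLM as any callable that maps a context dictionary to a response string. Only the debate instruction text is taken from the paper.

```python
from typing import Any, Callable, List

# Self-debate instruction quoted in the paper.
DEBATE_PROMPT = ("Review the following responses from other assistants "
                 "and determine your final answer.")

def self_debate(model: Callable[[dict], str], image: Any, question: str,
                max_rounds: int, debate_prompt: str = DEBATE_PROMPT) -> List[str]:
    # Phase 1: initial response from the image and question alone.
    r0 = model({"image": image, "question": question})
    history = [r0]       # accumulated prior responses
    responses = [r0]     # result set R_0, ..., R_K
    # Phase 2: each round rebuilds a fresh context that carries the full history.
    for _ in range(max_rounds):
        context = {"image": image, "question": question,
                   "debate_prompt": debate_prompt, "history": list(history)}
        rk = model(context)
        history.append(rk)
        responses.append(rk)
    return responses
```

Because every round calls the same `model`, no second model instance or inter-agent message passing is needed; the only growing cost is the history carried in the prompt.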

## Appendix B Experimental Details

### B.1. Implementation Details

We build MoD upon LLaVA-v1.6-Vicuna-13B (Liu et al., [2024](https://arxiv.org/html/2606.29425#bib.bib62)) and Qwen2.5VL-3B-Instruct (Wang et al., [2024b](https://arxiv.org/html/2606.29425#bib.bib71)), two large vision-language models with strong zero-shot reasoning capabilities. The base model weights are kept frozen during fine-tuning, and only the dialectical expert modules and routing networks are optimized. We implement our framework using PyTorch and conduct all experiments on NVIDIA A100 GPUs. Table [3](https://arxiv.org/html/2606.29425#A2.T3) and Table [4](https://arxiv.org/html/2606.29425#A2.T4) list the fine-tuning hyperparameters.

Table 3. Fine-tuning hyperparameters for LLaVA.

Table 4. Fine-tuning hyperparameters for Qwen.

![Refer to caption](https://arxiv.org/html/2606.29425v1/x9.png)

Figure 9. Examples of viewpoint-shift training data. Left: $\mathcal{T}_{rev}$ trajectory where the model reviews conflicting responses and corrects errors. Right: $\mathcal{T}_{pos}$ trajectory where the model confirms correct prior reasoning. Green highlights indicate correct reasoning; red highlights indicate errors to be revised.

### B.2. Tuning Data Format

Figure [9](https://arxiv.org/html/2606.29425#A2.F9) illustrates our viewpoint-shift training data format. Each sample contains a visual input, a question, prior responses from previous debate rounds, and an instruction to produce a structured JSON output.

##### Correction and Revision ($\mathcal{T}_{rev}$).

The left example shows a table-based reasoning task. The model receives two conflicting prior responses: response (1) correctly identifies “Reno, Nevada” from the Mascot column, while response (2) incorrectly answers “North Valleys, Nevada” based on a different row. The model must evaluate both responses and produce the correct final answer by grounding it in the visual evidence.

##### Consistent Reasoning ($\mathcal{T}_{pos}$).

The right example shows a food web diagram. The prior response correctly reasons that if all plants die, herbivores like deer and insects will lack food, leading to “B deer” as the answer. The model confirms this reasoning and maintains the correct conclusion.

##### Format Structure\.

Each training sample follows the template: (1) an `<image>` token referencing the visual input; (2) the original question; (3) prior assistant responses as context for review; (4) an instruction to output in JSON format with `Reasoning` and `Answer` fields. This format trains the model to critically evaluate prior reasoning, identify errors when present, and produce grounded final answers.
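To make the four-part template concrete, the following sketch assembles one sample as a prompt string plus a JSON target. Only the `<image>` token and the `Reasoning` and `Answer` field names come from the description above; the helper name `build_sample`, the "Assistant i:" labels, the instruction wording, and the `prompt`/`target` keys are assumptions made for illustration.

```python
import json

def build_sample(question, prior_responses, reasoning, answer):
    """Assemble one viewpoint-shift sample (hypothetical layout)."""
    lines = ["<image>", question, ""]            # (1) image token, (2) question
    for i, resp in enumerate(prior_responses, start=1):
        lines.append(f"Assistant {i}: {resp}")   # (3) prior responses for review
    lines.append("")
    # (4) instruction asking for structured JSON output (wording is assumed).
    lines.append('Review the responses above and reply in JSON with '
                 '"Reasoning" and "Answer" fields.')
    target = json.dumps({"Reasoning": reasoning, "Answer": answer})
    return {"prompt": "\n".join(lines), "target": target}

sample = build_sample(
    "Which city is listed for this mascot?",      # placeholder question
    ["Reno, Nevada", "North Valleys, Nevada"],    # conflicting prior answers
    "placeholder reasoning grounded in the table",
    "Reno, Nevada",
)
```

The same builder covers both trajectory types: a revision sample passes conflicting prior responses, while a consistency sample passes a single correct one.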

### B.3. Benchmarks

We evaluate on six benchmarks, five spanning diverse multimodal reasoning challenges and one text-only:

- ScienceQA (Lu et al., [2022](https://arxiv.org/html/2606.29425#bib.bib65)): Multimodal science question answering across natural, social, and language sciences. We report accuracy on both VAL and TEST splits.
- MMMU (Yue et al., [2024](https://arxiv.org/html/2606.29425#bib.bib66)): College-level reasoning involving spatial relationships, abstract concepts, and expert-level knowledge. DEV contains 150 samples and VAL contains 900 samples.
- MMStar (Chen et al., [2024b](https://arxiv.org/html/2606.29425#bib.bib69)): Vision-indispensable multimodal benchmark designed to reduce language bias and test whether models truly use visual evidence. It covers diverse perception-and-reasoning scenarios; we report overall accuracy.
- POPE (Li et al., [2023](https://arxiv.org/html/2606.29425#bib.bib43)): Evaluates object hallucination through binary yes/no questions about object existence in images.
- MME (Fu et al., [2023](https://arxiv.org/html/2606.29425#bib.bib44)): Comprehensive evaluation covering perception abilities and cognition abilities. We report perception and cognition scores separately.
- MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2606.29425#bib.bib25)): A text-only benchmark spanning 57 subjects across STEM, humanities, social sciences, and professional domains, used to verify that multimodal adaptation does not degrade language reasoning capabilities.

### B.4. Baselines

We compare against three reasoning paradigms:

- Single Model Reasoning: Standard zero-shot inference using LLaVA-v1.6-Vicuna-13B (Liu et al., [2023a](https://arxiv.org/html/2606.29425#bib.bib58)) or Qwen2.5VL-3B-Instruct (Bai et al., [2023](https://arxiv.org/html/2606.29425#bib.bib30)).
- Multi-Agent Debate (MAD) (Liang et al., [2024](https://arxiv.org/html/2606.29425#bib.bib11)): Instantiates three independent model instances (two debaters and one judge) for structured argumentation over multiple rounds.
- Self-Correction (SC) (He et al., [2025](https://arxiv.org/html/2606.29425#bib.bib1)): A self-refinement approach where the model critiques and revises its initial response.

For LLaVA\-v1\.6, we evaluate all baselines and both MoD variants \(MoD\-Single and MoD\-Debate\)\. For Qwen2\.5VL\-3B\-Instruct, we apply MoD to demonstrate generalizability across different architectures and scales\. All comparisons within each base model use identical configurations for fair evaluation\.

### B.5. Evaluation Protocol

We evaluate MoD in two inference modes:

- MoD-Single: Single-round inference with MoD.
- MoD-Multi: Two-round self-debate within a single model instance, achieving effective self-correction with significantly lower token cost than MAD.

For multiple\-choice questions \(MMMU, ScienceQA\), we compute exact match accuracy\. For MME, we report perception and cognition scores separately\. All results are averaged over three random seeds\.

## Appendix C Additional Expert Analysis

### C.1. Topic-wise Expert Activation

Figure [13](https://arxiv.org/html/2606.29425#A3.F13) presents detailed expert activation distributions for Router-A across all 15 topics in ScienceQA-TEST. While overall activation remains balanced due to our load-balancing objective, subtle topic-dependent patterns emerge. For example, E0 shows consistently high activation across most topics, while E4 exhibits relatively lower activation. Science-related topics (biology, chemistry, physics) show similar activation profiles, as do humanities topics (us-history, world-history), suggesting that the router learns to group semantically related categories.

### C.2. Router Comparison Heatmap

Figure [10](https://arxiv.org/html/2606.29425#A3.F10) provides a direct comparison of expert activation frequencies between Router-A (role allocation) and Router-B (process control) on ScienceQA-TEST. Router-A shows higher activation concentration on E0 (approximately 13%), while Router-B distributes more uniformly across experts (all within 12.2% to 12.7%). This divergence validates that our dual-routing mechanism learns differentiated strategies: Router-A develops stronger expert preferences for role-specific processing, while Router-B maintains broader coverage for flexible synthesis.
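For reference, with the 8 experts shown in Figure 13 a perfectly uniform router would give each expert a 1/8 = 12.5% share, which is why a 12.2% to 12.7% range reads as near-uniform. One plausible way to compute such shares from logged top-k selections is sketched below; the function name `activation_share` and the choice to normalize by the total number of selections are assumptions, since the paper does not spell out how the frequencies are computed.

```python
import numpy as np

def activation_share(topk_indices, num_experts):
    """Fraction of all expert selections that went to each expert.

    topk_indices: array-like of shape (num_tokens, r) with integer expert ids.
    """
    flat = np.asarray(topk_indices).ravel()
    counts = np.bincount(flat, minlength=num_experts)
    return counts / counts.sum()

# Toy log: four tokens, top-2 routing over 8 experts.
share = activation_share([[0, 1], [0, 2], [0, 3], [4, 5]], num_experts=8)
```

Comparing the resulting vector for Router-A against the one for Router-B gives the kind of side-by-side view the heatmap summarizes.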

### C.3. Router Activation Patterns

Figure [11](https://arxiv.org/html/2606.29425#A3.F11) visualizes layer-wise routing activations for Router-A (orange) and Router-B (blue) across six benchmarks. Both routers exhibit smooth, continuous activation patterns rather than erratic switching, validating that momentum switching prevents token-level fluctuations. Router-A maintains relatively stable patterns across benchmarks, while Router-B shows more pronounced variations in deeper layers, adapting synthesis strategies to task requirements. ScienceQA-TEST and ScienceQA-VAL show highly similar patterns within the same task domain. MMMU-VAL exhibits sharper Router-B transitions for complex reasoning. POPE displays smoother patterns consistent with binary classification. MMStar and MME show more dynamic Router-B activations for diverse question types.

### C.4. Effect of Auxiliary Loss

Figure [12](https://arxiv.org/html/2606.29425#A3.F12) compares expert activation patterns with and without the auxiliary loss. Without the loss (left), expert selection collapses to a narrow subset, with E2 and E5 dominating across all topics while other experts remain underutilized. With the loss (right), activation distributes more evenly across all experts, preserving the full $N\times N$ combinatorial capacity. This validates that the auxiliary loss is essential for preventing expert collapse and maintaining diverse reasoning pathways.
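The per-layer auxiliary loss from Algorithm 1, $\frac{E}{2}\sum_{i}(f^{A}_{i}P^{A}_{i}+f^{B}_{i}P^{B}_{i})$, can be checked numerically with a small NumPy sketch. The function name `aux_layer_loss` and its argument layout are hypothetical. Algorithm 1 writes $P_i$ as a mean over the router outputs $\ell_i(t)$; the sketch simply averages whatever per-token score matrix it is given, and the toy example passes probability-like scores so the values are easy to interpret.

```python
import numpy as np

def aux_layer_loss(topk_A, scores_A, topk_B, scores_B):
    """topk_*: (T, r) integer expert indices; scores_*: (T, E) router scores."""
    T, E = scores_A.shape

    def term(topk, scores):
        f = np.zeros(E)
        for row in topk:
            f[np.unique(row)] += 1.0   # indicator: expert i selected at token t
        f /= T                         # f_i: fraction of tokens selecting expert i
        P = scores.mean(axis=0)        # P_i: mean router score for expert i
        return float(f @ P)

    return (E / 2.0) * (term(topk_A, scores_A) + term(topk_B, scores_B))

# Balanced routing: each of 4 tokens picks a different expert, uniform scores.
balanced_idx = np.array([[0], [1], [2], [3]])
uniform = np.full((4, 4), 0.25)
# Collapsed routing: every token picks expert 0 with all score mass on it.
collapsed_idx = np.zeros((4, 1), dtype=int)
one_hot = np.zeros((4, 4))
one_hot[:, 0] = 1.0

balanced = aux_layer_loss(balanced_idx, uniform, balanced_idx, uniform)
collapsed = aux_layer_loss(collapsed_idx, one_hot, collapsed_idx, one_hot)
```

In this toy setting the collapsed configuration yields a larger loss than the balanced one, so minimizing the term pushes the routers away from the concentrated pattern seen in the left panel.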

![Refer to caption](https://arxiv.org/html/2606.29425v1/x10.png)

Figure 10. Expert activation frequency comparison between Router-A (A-side) and Router-B (B-side) on ScienceQA-TEST. Router-A exhibits higher concentration on specific experts, while Router-B shows more uniform distribution.

![Refer to caption](https://arxiv.org/html/2606.29425v1/x11.png)

Figure 11. Layer-wise router activations across benchmarks. Orange: Router-A; Blue: Router-B. Smooth curves validate momentum switching; divergent A/B patterns confirm learned specialization.

![Refer to caption](https://arxiv.org/html/2606.29425v1/x12.png)

Figure 12. Expert activation heatmaps without (left) and with (right) auxiliary loss in ScienceQA-TEST. Without the auxiliary loss, activation concentrates on few experts; with the auxiliary loss, utilization is balanced across all experts.

![Refer to caption](https://arxiv.org/html/2606.29425v1/x13.png)

![Refer to caption](https://arxiv.org/html/2606.29425v1/x14.png)

Figure 13. Router-A expert activation by topic on ScienceQA-TEST. Each subplot shows the activation percentage for all 8 experts within a specific topic. Activation remains balanced across experts while exhibiting subtle topic-dependent variations.