TriAdReview: Triangular Adversarial Review Architecture for Multi-Model Technical Document Generation
Summary
This paper proposes TriAdReview, a triangular adversarial review architecture that uses two independent reviewer models (engineering and boundary perspectives) and a judging mechanism to iteratively improve a generator model's output for technical document generation. Experiments show a 10.1% overall improvement over single-model baselines, with strong gains in security audit, code generation, and architecture design, but a degradation on requirements analysis indicating task-dependent effectiveness.
# Triangular Adversarial Review Architecture for Multi-Model Technical Document Generation
Source: [https://arxiv.org/html/2606.15074](https://arxiv.org/html/2606.15074)
Zhiqiang Zhou, Junliang Dai, Xu Ling
Hunan Chemical Industry Vocational and Technical College, Hunan, China
willenchow@126\.com
###### Abstract
Large language models \(LLMs\) are increasingly used for technical document generation, yet single\-model outputs often suffer from over\-engineering, security blind spots, and incomplete coverage\. We propose TriAdReview, a triangular adversarial review architecture that employs two independent reviewer models \(engineering and boundary perspectives\) and a triangular judging mechanism to iteratively improve a generator model’s output\. We evaluate TriAdReview across five benchmark tasks \(architecture design, code generation, proposal review, security audit, and requirements analysis\) using three configurations: single model \(baseline\), dual model \(single review\), and triple model \(full system\)\. Results across 75 experiments \($n=5$ per cell\) show that the triple model configuration achieves a 10\.1% overall improvement over the single model baseline \(26\.2 vs\. 23\.8 out of 50; $p<0.05$, paired $t$\-test\), with particularly strong gains on security audit \(\+27\.6%\), code generation \(\+20\.8%\), and architecture design \(\+15\.6%\)\. A second scorer \(mimo\-v2\.5\-pro\) confirms the direction with a smaller effect \(\+2\.7%\), suggesting moderate inter\-rater agreement\. However, the system shows a \-7\.5% degradation on requirements analysis, revealing that adversarial review architectures have a structural bias toward simplification that is counterproductive for completeness\-oriented tasks\. We analyze this boundary condition through a task\-type framework and demonstrate that reviewer prompt adaptation partially mitigates the issue\. Our findings provide the first empirical characterization of when multi\-model adversarial review helps versus harms, with implications for the design of collaborative AI systems\.
## 1 Introduction
The deployment of large language models \(LLMs\) for technical document generation—including system architecture proposals, code implementations, and security audits—has become widespread\. However, single\-model outputs frequently exhibit characteristic deficiencies: over\-engineering of non\-critical components, security blind spots, and technology selection driven by training data popularity rather than engineering fit\.
A natural approach to mitigate these deficiencies is *multi\-model review*: having one or more auxiliary models review and critique the primary model’s output before finalization\. This draws inspiration from human peer review processes, where independent reviewers catch errors that the original author overlooks\.
However, the design space for multi\-model review systems is large and poorly understood\. Key questions include:
- How should reviewer models be prompted: as adversarial critics, balanced assessors, or constructive collaborators?
- How should disagreements between the generator and reviewers be resolved?
- Does adding more reviewers always improve output quality, or are there diminishing returns?
- Are there task types where review architectures systematically fail?

In this paper, we propose TriAdReview \(Triangular Adversarial Review\), a system that addresses these questions through three design decisions: \(1\) dual\-perspective reviewers covering engineering robustness and security boundaries; \(2\) a triangular judging mechanism where disputes are resolved by a third\-party model rather than majority vote; and \(3\) iterative refinement with memory of rejected suggestions\.
We evaluate TriAdReview on five diverse technical writing tasks and report three primary findings:
1. Adversarial review is effective for “de\-fatting” tasks\. On architecture design, code generation, and security audit, the triple model achieves a mean improvement of \+21\.3%, primarily by eliminating over\-engineering and technology fragmentation\.
2. Adversarial review harms completeness\-oriented tasks\. On requirements analysis, the system degrades by \-7\.5% because reviewer models trained to challenge and simplify remove necessary components\.
3. Task\-type awareness is essential\. A simple prompt adaptation for the requirements analysis task partially mitigates the degradation, suggesting that review architectures must be tuned to task type\.
## 2 Related Work
### 2\.1 Multi\-Agent LLM Systems
Recent work has explored multi\-agent architectures for improving LLM output quality\. Wu et al\. \[[9](https://arxiv.org/html/2606.15074#bib.bib9)\] proposed AutoGen, a framework for multi\-agent conversation that enables flexible agent topologies\. Liang et al\. \[[5](https://arxiv.org/html/2606.15074#bib.bib5)\] introduced a multi\-agent collaboration network for reasoning tasks\. Our work differs in focusing specifically on *adversarial* review for technical document generation, where the goal is not reasoning accuracy but engineering quality\.
### 2\.2 Adversarial and Debate\-Based Approaches
The idea of using LLMs to debate and critique each other has been explored in several contexts\. Du et al\. \[[3](https://arxiv.org/html/2606.15074#bib.bib3)\] showed that multi\-agent debate improves factual accuracy\. Liang et al\. \[[4](https://arxiv.org/html/2606.15074#bib.bib4)\] demonstrated that encouraging divergent thinking in LLM debates improves reasoning\. Chan et al\. \[[1](https://arxiv.org/html/2606.15074#bib.bib1)\] proposed ChatEval, a multi\-agent framework for evaluation using debate\. Our triangular judging mechanism extends these ideas by introducing a formal dispute resolution process with explicit verdict categories\.
### 2\.3 Iterative Refinement
Self\-refinement approaches \[[6](https://arxiv.org/html/2606.15074#bib.bib6),[7](https://arxiv.org/html/2606.15074#bib.bib7)\] have shown that LLMs can improve their own outputs through iterative feedback\. However, these approaches typically use a single model for both generation and review, creating a “self\-review” problem where the reviewer shares the generator’s blind spots\. Our architecture addresses this by using *independent* reviewer models with different training distributions and prompt configurations\.
### 2\.4 Code and Document Quality Assessment
Automated assessment of code and document quality has been studied extensively\. Chen et al\. \[[2](https://arxiv.org/html/2606.15074#bib.bib2)\] evaluated LLM code generation capabilities\. Wang et al\. \[[8](https://arxiv.org/html/2606.15074#bib.bib8)\] introduced self\-consistency for improving reasoning outputs\. Our work focuses on the *process* of improving outputs through structured review rather than post\-hoc evaluation\.
## 3 Method
### 3\.1 System Overview
TriAdReview operates as a three\-stage pipeline: generation, review, and iteration\. The system comprises three models with distinct roles:
- Generator \(DeepSeek v4 Pro\): Produces the initial technical proposal and iterates based on feedback\.
- Reviewer A \(agnes\-2\.0\-flash\): Provides an engineering\-focused review, challenging design decisions and identifying over\-engineering\.
- Reviewer B \(mimo\-v2\.5\-pro\): Provides a boundary\-focused review, identifying failure scenarios, security gaps, and reliability issues\.

Figure [1](https://arxiv.org/html/2606.15074#S3.F1) illustrates the system architecture\.

Figure 1: TriAdReview system architecture\. The generator model produces an initial proposal, which is reviewed in parallel by two independent reviewers \(engineering and boundary perspectives\)\. Disputes between the generator and reviewers are resolved through triangular judging, where a third\-party model \(neither the generator nor the disputed reviewer\) adjudicates\.
### 3\.2 Review Protocol
Each reviewer receives the full proposal text \(truncated to 3,000 characters for efficiency\) and outputs a structured JSON array of improvement suggestions\. Each suggestion contains:
- id: Unique identifier \(e\.g\., S1, B1\)
- severity: One of critical, major, or minor
- suggestion: Specific, actionable improvement
- reasoning: Why the current design will fail without this change
Reviewer A \(agnes\) is prompted as an “engineering review expert and devil’s advocate” who must challenge design decisions and suggest both additions and removals\. Reviewer B \(mimo\) is prompted as a “proposal destroyer” who must identify scenarios causing system crashes or data loss\.
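The suggestion format described above can be illustrated with a short Python sketch. It assumes the four fields listed in this section arrive as keys of each JSON object; the names `Suggestion`, `parse_review`, and `truncate_proposal` are hypothetical and are not taken from the paper's implementation.

```python
import json
from dataclasses import dataclass

ALLOWED_SEVERITIES = {"critical", "major", "minor"}
MAX_PROPOSAL_CHARS = 3000  # truncation limit stated in the text


@dataclass(frozen=True)
class Suggestion:
    """One reviewer suggestion with the four fields listed above."""
    id: str
    severity: str
    suggestion: str
    reasoning: str


def truncate_proposal(text: str) -> str:
    """Cut the proposal to the character budget before it is sent to a reviewer."""
    return text[:MAX_PROPOSAL_CHARS]


def parse_review(raw_json: str) -> list[Suggestion]:
    """Parse a reviewer's JSON array and reject malformed entries."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("reviewer output must be a JSON array")
    parsed = []
    for item in items:
        entry = Suggestion(**item)  # TypeError on missing or unexpected keys
        if entry.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {entry.severity!r}")
        parsed.append(entry)
    return parsed
```

Validating the severity field at parse time is a design choice of this sketch; it keeps a malformed reviewer response from silently entering the dispute stage.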
### 3\.3 Triangular Judging Mechanism
When the generator rejects a reviewer’s suggestion, the dispute enters the triangular judging process\. The key design principle is that *the judge must be a different model than the reviewer whose suggestion was rejected*\. Specifically:
- agnes’s rejected suggestions are judged by mimo
- mimo’s rejected suggestions are judged by DeepSeek
- DeepSeek’s rejected suggestions are judged by agnes
This circular judging topology prevents self\-adjudication and ensures that each dispute is evaluated by a model with a different perspective\. The judge outputs one of three verdicts:
- suggestion\_wins: The reviewer’s suggestion is adopted regardless of the generator’s objection\.
- main\_wins: The generator’s rejection is upheld\.
- compromise: A middle ground is identified \(e\.g\., “keep the feature but add security audit”\)\.
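A minimal Python sketch of the circular judge assignment and the three verdicts follows. The mapping reproduces the rotation exactly as listed in this section; the identifiers `JUDGE_FOR`, `pick_judge`, and `apply_verdict` are hypothetical. Note that under this rotation disputes over mimo's suggestions go to DeepSeek, which is also the generator; the sketch follows the list as written and does not try to resolve that point.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    SUGGESTION_WINS = "suggestion_wins"
    MAIN_WINS = "main_wins"
    COMPROMISE = "compromise"


# Author of the rejected suggestion -> model that judges the dispute.
JUDGE_FOR = {
    "agnes": "mimo",
    "mimo": "deepseek",
    "deepseek": "agnes",
}


def pick_judge(suggestion_author: str) -> str:
    """Return the judge for a dispute; never the suggestion's own author."""
    judge = JUDGE_FOR[suggestion_author]
    if judge == suggestion_author:
        raise ValueError("self-adjudication is not allowed")
    return judge


def apply_verdict(verdict: Verdict, suggestion: str,
                  compromise_text: str = "") -> Optional[str]:
    """Return the change the generator must implement, or None if its rejection stands."""
    if verdict is Verdict.SUGGESTION_WINS:
        return suggestion
    if verdict is Verdict.COMPROMISE:
        return compromise_text
    return None  # main_wins: the generator's rejection is upheld
```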
### 3\.4 Iterative Refinement
The system runs for a configurable number of rounds \(default: 2\)\. In each round, the generator receives:
1. Accepted suggestions \(must be implemented\)
2. Resolved disputes \(must be executed per verdict\)
3. A memory of previously rejected suggestions \(to prevent persistent disagreements\)
The generator then produces a revised proposal that incorporates all mandated changes while maintaining its engineering judgment on non\-disputed aspects\.
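The round structure described in this subsection could be organized roughly as in the following Python sketch. It is an illustration under stated assumptions, not the authors' code: the four callables (`generate`, `review`, `accepts`, `judge`) stand in for model calls, and their signatures are hypothetical. The `judge` stub is assumed to return the verdict name together with the text to implement (the original suggestion for `suggestion_wins`, the compromise for `compromise`).

```python
from typing import Callable, List, Tuple


def refine(
    task: str,
    generate: Callable[[str, List[str], List[str]], str],
    review: Callable[[str], List[str]],
    accepts: Callable[[str, str], bool],
    judge: Callable[[str, str], Tuple[str, str]],
    rounds: int = 2,  # default number of rounds stated in the text
) -> Tuple[str, List[str]]:
    """Run the generate / review / judge loop and return the final proposal and memory."""
    rejected_memory: List[str] = []
    proposal = generate(task, [], rejected_memory)
    for _ in range(rounds):
        mandated: List[str] = []
        for suggestion in review(proposal):
            if accepts(proposal, suggestion):
                mandated.append(suggestion)  # accepted: must be implemented
                continue
            verdict, resolution = judge(proposal, suggestion)
            if verdict == "main_wins":
                # Rejection upheld: remember it so it is not re-litigated.
                if suggestion not in rejected_memory:
                    rejected_memory.append(suggestion)
            else:
                mandated.append(resolution)  # suggestion_wins or compromise
        proposal = generate(task, mandated, rejected_memory)
    return proposal, rejected_memory
```

How the generator uses the memory of rejected suggestions is not detailed in the text; here it is simply passed along on every call.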
### 3\.5 Experimental Configurations
We evaluate three configurations to isolate the contribution of each component:

Table 1: Experimental configurations\.
## 4 Experimental Setup
### 4\.1 Benchmark Tasks
We designed five benchmark tasks spanning different technical writing categories \(Table [2](https://arxiv.org/html/2606.15074#S4.T2)\)\. Each task requires the model to produce a complete technical document from a structured requirements prompt\.

Table 2: Benchmark tasks\.
### 4\.2 Evaluation Metrics
Each output is scored by two independent LLM judges—DeepSeek v4 Pro \(GPT scorer\) and mimo\-v2\.5\-pro \(MIMO scorer\), both at temperature 0\.3—on five dimensions, each on a 1–10 scale\. We report GPT scorer results as the primary analysis and use MIMO scorer results for inter\-rater reliability:
1. Completeness: Coverage of necessary aspects
2. Technical Depth: Sufficiency of technical details
3. Feasibility: Practical implementability
4. Novelty: Unique insights or innovative solutions
5. Clarity: Expression clarity and structural coherence
The total score is the sum of all five dimensions \(range: 5–50\)\.
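As a small illustration of the scoring rule, the following Python sketch sums the five dimension scores and enforces the 1 to 10 range per dimension. The dictionary keys and the function name `total_score` are assumptions made for this example.

```python
DIMENSIONS = ("completeness", "technical_depth", "feasibility", "novelty", "clarity")


def total_score(scores: dict) -> int:
    """Sum the five dimension scores; the result lies between 5 and 50."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 1 <= scores[name] <= 10:
            raise ValueError(f"{name} must be in 1..10, got {scores[name]}")
    return sum(scores[name] for name in DIMENSIONS)
```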
### 4\.3 Experimental Protocol
Each configuration\-task pair is repeated 5 times \($n=5$\) to account for stochastic variation, yielding $5 \times 3 \times 5 = 75$ total experiments\. Outputs are scored by two independent LLM judges, DeepSeek v4 Pro \(GPT scorer\) and mimo\-v2\.5\-pro \(MIMO scorer\), both at temperature 0\.3, to provide inter\-rater reliability\. All experiments use the same API endpoints and model versions\. Experiments were conducted on a server with RTX 3090 GPU \(used for local model inference\) and cloud APIs for DeepSeek and agnes\.
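The experiment count can be reproduced by enumerating the grid of tasks, configurations, and repetitions, as in this short Python sketch. The task labels T1 to T5 and the configuration labels A and C appear in the paper; using B for the dual model configuration is an assumption, as is the function name.

```python
from itertools import product

TASKS = ("T1", "T2", "T3", "T4", "T5")
CONFIGS = ("A", "B", "C")  # A = single model, B = dual model (assumed label), C = triple model
REPETITIONS = 5


def experiment_grid():
    """List every (task, config, repetition) cell of the protocol."""
    return list(product(TASKS, CONFIGS, range(1, REPETITIONS + 1)))
```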
## 5 Results
### 5\.1 Overall Performance
Table [3](https://arxiv.org/html/2606.15074#S5.T3) presents the mean scores across all tasks for each configuration\.

Table 3: Overall mean scores \(GPT scorer, across all 5 tasks, $n=5$ per cell\)\.

The triple model configuration achieves the highest overall score \(26\.2\), representing a 10\.1% improvement over the single model baseline \($p<0.05$, paired $t$\-test, Cohen’s $d=0.43$\)\. A second scorer \(mimo\-v2\.5\-pro\) confirms the direction but with a smaller effect \(C=30\.2 vs\. A=29\.4, \+2\.7%, not significant\), suggesting moderate inter\-rater agreement; the true effect likely lies between 2\.7% and 10\.1%\. The largest dimension\-level improvement is in novelty \(\+1\.1 points, \+31%\), suggesting that adversarial review pushes the generator toward more creative solutions\.
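The relative improvements quoted here follow from the reported means, and the test statistics could be computed along the lines of this Python sketch, which uses only the standard library. It assumes the paired variant of Cohen's d (mean difference divided by the standard deviation of the differences); the text does not say which variant was used, and the function names are hypothetical. No p-value is computed here.

```python
import math
import statistics


def relative_improvement(baseline: float, treated: float) -> float:
    """Percentage change of `treated` relative to `baseline`."""
    return (treated - baseline) / baseline * 100.0


def paired_t_and_d(before, after):
    """Return the paired t statistic and the paired Cohen's d for matched samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, mean_diff / sd_diff


# Reported means: GPT scorer 23.8 -> 26.2, MIMO scorer 29.4 -> 30.2.
gpt_gain = relative_improvement(23.8, 26.2)
mimo_gain = relative_improvement(29.4, 30.2)
```

With the reported means, `gpt_gain` rounds to 10.1 and `mimo_gain` rounds to 2.7, matching the percentages in the text.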
### 5\.2 Task\-Level Analysis
Figure [2](https://arxiv.org/html/2606.15074#S5.F2) shows the per\-task scores for each configuration\.

Figure 2: Task\-level quality scores by configuration\. Error bars represent standard deviation across 5 repetitions\.

The improvement varies dramatically by task type \(Figure [3](https://arxiv.org/html/2606.15074#S5.F3)\):

Figure 3: Per\-task improvement of triple model over single model\. Green indicates improvement; red indicates degradation\.

- T4 \(Security Audit\): \+27\.6%\. The largest improvement\. Reviewer B’s boundary\-focused perspective identified security gaps that the single model missed, with gains across all five dimensions \(completeness \+1\.6, technical depth \+1\.6, novelty \+2\.0\)\.
- T2 \(Code Generation\): \+20\.8%\. Significant improvement\. The adversarial review enhanced novelty \(2\.8 → 4\.6\) and clarity \(5\.2 → 6\.6\)\.
- T1 \(Architecture Design\): \+15\.6%\. Reviewers successfully identified over\-engineering and suggested simplification\.
- T3 \(Proposal Review\): \-0\.8%\. No meaningful change\. Review\-type tasks are inherently adversarial, making additional review redundant\.
- T5 \(Requirements Analysis\): \-7\.5%\. Degradation\. The adversarial review removed necessary components \(completeness: 4\.4 → 2\.8\) from a task that required completeness\.
### 5\.3 Dimension\-Level Analysis
Figure [4](https://arxiv.org/html/2606.15074#S5.F4) presents the dimension\-level comparison across configurations\.

Figure 4: Left: Overall dimension scores by configuration\. Right: Per\-dimension delta of triple model vs single model\.

The triple model improves on four of five dimensions, with the largest gains in novelty \(\+1\.1\) and technical depth \(\+0\.4\)\. Completeness shows only a marginal overall increase \(\+0\.1\); we attribute the small size of this gain to the adversarial review’s structural bias toward simplification\.
### 5\.4 Process Metrics
Across all Config C experiments, the system processed 212 dispute resolutions\. Figure [5](https://arxiv.org/html/2606.15074#S5.F5) shows the verdict distribution and per\-task acceptance rates\.

Figure 5: Left: Verdict distribution across all Config C disputes\. Right: Suggestion acceptance rate by task\.

Key observations:
- main\_wins dominates \(58\.0%\): The generator’s independent judgment is upheld in the majority of disputes\.
- compromise is frequent \(26\.9%\): About one\-quarter of disputes result in nuanced middle\-ground solutions that neither party proposed initially\.
- suggestion\_wins is rare but impactful \(15\.1%\): When reviewers are right, their suggestions address critical issues \(typically security or over\-engineering\)\.
### 5\.5 Cost\-Quality Tradeoff
Figure [6](https://arxiv.org/html/2606.15074#S5.F6) presents the cost\-quality relationship across configurations\.

Figure 6: Cost\-quality tradeoff\. The triple model costs 7\.0× more per run but achieves 10\.1% higher quality\.

The triple model costs \$0\.117 per run compared to \$0\.017 for the single model \(7\.0×\)\. However, the cost per quality point is \$0\.0045 for the triple model vs\. \$0\.0007 for the single model, suggesting that the quality improvement is not purely a function of spending more tokens\.
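The cost figures in this subsection are simple ratios of the reported per-run costs and mean scores. The Python sketch below recomputes them; the function name `cost_metrics` is hypothetical. Because the published costs are rounded to three decimals, the recomputed cost ratio comes out near 6.9 rather than exactly 7.0, which presumably reflects unrounded inputs in the original calculation.

```python
def cost_metrics(cost_per_run: float, mean_score: float) -> float:
    """Cost in dollars per quality point for one configuration."""
    return cost_per_run / mean_score


# Values reported in the text (GPT scorer means, dollars per run).
SINGLE_COST, SINGLE_SCORE = 0.017, 23.8
TRIPLE_COST, TRIPLE_SCORE = 0.117, 26.2

cost_ratio = TRIPLE_COST / SINGLE_COST
single_cost_per_point = cost_metrics(SINGLE_COST, SINGLE_SCORE)
triple_cost_per_point = cost_metrics(TRIPLE_COST, TRIPLE_SCORE)
```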
## 6 Analysis
### 6\.1 Why Adversarial Review Helps “De\-Fatting” Tasks
Tasks T1 \(Architecture Design\), T2 \(Code Generation\), and T4 \(Security Audit\) share a common characteristic: the generator model tends to *over\-produce*, including unnecessary components, overly complex architectures, and popular but unnecessary technology choices\. In these cases, adversarial review serves as a quality filter that:
1. Identifies technology fragmentation \(e\.g\., suggesting removal of redundant databases\)
2. Challenges over\-engineering \(e\.g\., questioning the need for service mesh in small\-scale deployments\)
3. Exposes security blind spots \(e\.g\., missing encryption or authentication\)
The adversarial framing is critical: a reviewer prompted to “improve” might add more features, but a reviewer prompted to “challenge” naturally identifies what can be removed or simplified\. The mean de\-fatting improvement across T1, T2, and T4 is \+21\.3%\.
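As a check, the reported mean follows from the three per\-task gains in Section 5\.2, assuming an unweighted average over T1, T2, and T4:

$$
\frac{15.6\% + 20.8\% + 27.6\%}{3} = \frac{64.0\%}{3} \approx 21.3\%
$$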
### 6\.2 Why Adversarial Review Harms Completeness Tasks
Task T5 \(Requirements Analysis\) requires the generator to produce a comprehensive system design covering all specified requirements\. The adversarial review mechanism fails here because:
1. Reviewer A’s “devil’s advocate” prompt leads to suggesting removal of necessary components \(e\.g\., “cut the real\-time sentiment analysis module”\)
2. The generator’s content shrinks through iterations \(mean \-17% in length for Config C\)
3. The completeness dimension suffers most \(4\.4 → 2\.8 in Config C\)

This reveals a fundamental tension: adversarial review is optimized for precision \(removing bad content\) but harms recall \(covering necessary content\)\.
### 6\.3 T5 Prompt Adaptation
We attempted to mitigate the T5 degradation by adapting Reviewer A’s prompt from adversarial to completeness\-focused\. The adapted prompt instructs the reviewer to identify *missing* requirements rather than challenging existing ones\.

Figure [7](https://arxiv.org/html/2606.15074#S6.F7) shows the results\.

Figure 7: T5 scores before and after prompt adaptation\.

The adaptation partially mitigates the issue: the C vs A gap narrowed\. However, the system still underperforms the single model baseline \(\-7\.5% with $n=5$\), suggesting that prompt adaptation alone is insufficient\.
### 6\.4 Task\-Type Framework
Based on our findings, we propose a task\-type framework for predicting when adversarial review is beneficial:
Table 4: Task\-type framework for adversarial review effectiveness\.
## 7 Discussion
### 7\.1 Positioning within the Literature
Unlike prior work on multi\-agent debate \[[3](https://arxiv.org/html/2606.15074#bib.bib3),[4](https://arxiv.org/html/2606.15074#bib.bib4)\] that focuses on factual accuracy, our work addresses *engineering quality* of technical documents\. The triangular judging mechanism is, to our knowledge, the first formal dispute resolution protocol for multi\-model review that prevents self\-adjudication\.
### 7\.2 Limitations
1. Moderate sample size: $n=5$ per cell provides adequate power for the primary analysis \(GPT scorer: $p<0.05$\), though the MIMO scorer effect remains not significant\. Cross\-scorer agreement suggests the true effect lies between 2\.7% and 10\.1%\.
2. Scorer bias: GPT\-based scoring may favor the triple model due to its preference for structured, revised outputs\. The MIMO scorer’s smaller effect \(\+2\.7%\) partially addresses this but introduces its own biases\.
3. Limited task diversity: Five tasks may not represent the full range of technical writing\.
4. LLM\-as\-Judge limitations: Automated scoring may not capture all dimensions of document quality\.
5. Cost: The 7\.0× cost multiplier may be prohibitive for some use cases\.
### 7\.3 Future Work
1. Task\-adaptive review: Automatically detect task type and adjust reviewer prompts accordingly\.
2. Human evaluation: Validate LLM\-as\-Judge scores against human expert ratings\.
3. Scaling: Test with larger models \(70B\+\) to determine if review effectiveness scales with model capability\.
4. Cost optimization: Explore selective review \(only trigger adversarial review when initial output quality is below threshold\)\.
5. Multi\-round convergence: Study the convergence properties of iterative adversarial review over more rounds\.
## 8 Conclusion
We presented TriAdReview, a triangular adversarial review architecture for multi\-model technical document generation\. Through 75 experiments \($n=5$\) across five benchmark tasks, we demonstrated that adversarial review achieves a statistically significant 10\.1% overall quality improvement \($p<0.05$, Cohen’s $d=0.43$\), with particularly strong gains on security audit \(\+27\.6%\) and code generation \(\+20\.8%\)\. A second scorer \(mimo\-v2\.5\-pro\) confirms the direction \(\+2\.7%\), providing inter\-rater reliability\. However, we identified a critical boundary condition: adversarial review degrades performance on completeness\-oriented tasks \(\-7\.5%\) due to its structural bias toward simplification\.
Our key contribution is the empirical characterization of *when* multi\-model review helps versus harms\. We propose a task\-type framework \(de\-fatting, neutral, completeness\) that predicts review effectiveness and can guide the design of future multi\-agent systems\. The triangular judging mechanism provides a principled approach to dispute resolution that prevents self\-adjudication and produces nuanced compromise verdicts in 26\.9% of cases\.
These findings suggest that the next generation of multi\-model review systems should be *task\-aware*: not applying a one\-size\-fits\-all adversarial approach, but dynamically adapting review strategy based on the nature of the task and the characteristics of the initial output\.
## References
- Chan et al\. \[2023\] Chan, C\., Chen, W\., Su, Y\., et al\. \(2023\)\. ChatEval: Towards better LLM\-based evaluators through multi\-agent debate\. *arXiv preprint arXiv:2308\.07201*\.
- Chen et al\. \[2021\] Chen, M\., Tworek, J\., Jun, H\., et al\. \(2021\)\. Evaluating large language models trained on code\. *arXiv preprint arXiv:2107\.03374*\.
- Du et al\. \[2023\] Du, Y\., Li, S\., Torralba, A\., et al\. \(2023\)\. Improving factuality and reasoning in language models through multiagent debate\. *arXiv preprint arXiv:2305\.14325*\.
- Liang et al\. \[2023a\] Liang, T\., He, Z\., Jiao, W\., et al\. \(2023a\)\. Encouraging divergent thinking in large language models through multi\-agent debate\. *arXiv preprint arXiv:2305\.19118*\.
- Liang et al\. \[2023b\] Liang, T\., He, Z\., Jiao, W\., et al\. \(2023b\)\. MACNET: Multi\-agent collaboration network for reasoning\. *arXiv preprint arXiv:2312\.05693*\.
- Madaan et al\. \[2023\] Madaan, A\., Pryor, A\., et al\. \(2023\)\. Self\-refine: Iterative refinement with self\-feedback\. *arXiv preprint arXiv:2303\.17651*\.
- Shinn et al\. \[2023\] Shinn, N\., Cassano, F\., et al\. \(2023\)\. Reflexion: Language agents with verbal reinforcement learning\. *arXiv preprint arXiv:2303\.11366*\.
- Wang et al\. \[2023\] Wang, X\., Wei, J\., et al\. \(2023\)\. Self\-consistency improves chain of thought reasoning in language models\. *ICLR 2023*\.
- Wu et al\. \[2023\] Wu, Q\., Bansal, G\., et al\. \(2023\)\. AutoGen: Enabling next\-gen LLM applications via multi\-agent conversation\. *arXiv preprint arXiv:2308\.08155*\.