Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework
Summary
This paper presents a systematic analysis of three agent interaction paradigms (Generator-Evaluator, ReAct Loop, and Adversarial Evaluation) implemented in the buddyMe framework, with empirical case studies from real-world deployments. It formalizes a five-stage pipeline and a six-dimensional evaluation schema, offering practical design guidelines for multi-paradigm agent systems.
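The five-stage pipeline and the weighted six-dimensional score can be pictured with a minimal sketch. This is an illustration under stated assumptions, not buddyMe's actual code: the stage names follow the paper's abstract, while `run_pipeline`, `weighted_score`, the handler callables, and the dimension names (`dim_1` to `dim_6`) are hypothetical placeholders, since this page does not name the six dimensions or their weights.

```python
from enum import Enum


class Stage(Enum):
    """The five pipeline stages, in the order the abstract lists them."""
    REQUIREMENT_PRE_REVIEW = 1
    TASK_DECOMPOSITION = 2
    REACT_EXECUTION = 3
    REAL_EXECUTION_VERIFICATION = 4
    ADVERSARIAL_EVALUATION_DISCUSSION = 5


def run_pipeline(task, handlers):
    """Pass `task` through every stage in order (hypothetical helper).

    `handlers` maps each Stage to a callable that takes and returns
    a state dict; the real framework's interfaces are not shown here.
    """
    state = {"task": task, "trace": []}
    for stage in Stage:  # Enum iteration follows definition order
        state = handlers[stage](state)
        state["trace"].append(stage.name)
    return state


def weighted_score(scores, weights):
    """Weighted average of per-dimension scores (hypothetical helper).

    Weights are normalized by their sum, so they need not add up to 1.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[d] * weights[d] for d in scores) / total


if __name__ == "__main__":
    # Identity handlers stand in for the real stage implementations.
    handlers = {stage: (lambda state: state) for stage in Stage}
    print(run_pipeline("demo task", handlers)["trace"])

    # Placeholder dimensions with equal weights (an assumption).
    weights = {f"dim_{i}": 1.0 for i in range(1, 7)}
    scores = {name: 4.0 for name in weights}
    print(weighted_score(scores, weights))
```

Normalizing by the weight sum keeps the score on the same scale as the per-dimension scores, whatever weights are chosen.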
# Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework

Source: [https://arxiv.org/abs/2605.16821](https://arxiv.org/abs/2605.16821) [View PDF](https://arxiv.org/pdf/2605.16821)

> Abstract: The rapid evolution of Large Language Model (LLM) agents has produced diverse interaction paradigms, yet few production systems integrate multiple paradigms within a unified architecture. This paper presents a systematic analysis of three principal agent interaction paradigms, including Multi-Agent Orchestration (Generator-Evaluator), ReAct Tool-Use Loops, and Memory-Augmented Interaction, as implemented in buddyMe, an open-source multi-model agent programming framework. We formalize a five-stage processing pipeline: Requirement Pre-Review -> Task Decomposition -> ReAct Execution -> Real-Execution Verification -> Adversarial Evaluation Discussion, and establish a six-dimensional evaluation schema with weighted scoring. Through four empirical case studies drawn from real-world deployment logs covering museum guide generation, scheduled weather tasks, and comprehensive tour planning, we draw three key conclusions. First, Generator-Evaluator pre-review detects requirement omissions in 20 percent of complex tasks, with 80 percent tasks passing initial inspection. Second, the ReAct loop ensures stable subtask execution but leads to around 30 percent redundant tool invocations. Third, adversarial Evaluator-Defender discussions reach consensus within 2-3 rounds for nearly 70 percent of scenarios, functioning mainly for content refinement rather than logical reversal. We additionally provide three Mermaid-based architectural diagrams and conduct cross-paradigm comparisons with CrewAI, AutoGen, LangGraph, MemGPT and A-Mem across six system dimensions. The research outcomes offer practical design guidelines for constructing stable and reliable multi-paradigm agent systems.

## Submission history

From: Xiaohua Wang \[[view email](https://arxiv.org/show-email/4a968fa8/2605.16821)\]

**[v1]** Sat, 16 May 2026 05:35:50 UTC (158 KB)
Similar Articles
Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents
Proposes Online Agent-as-a-Judge, an evaluation framework that uses an in-world evaluator agent to actively generate situations for testing interactive social agents, improving coverage and reliability over passive methods.
An Empirical Study of Automating Agent Evaluation
This paper introduces EvalAgent, a system that automates the evaluation of AI agents by encoding domain-specific expertise, addressing the limitations of standard coding assistants in this task. It also presents AgentEvalBench, a benchmark for testing evaluation pipelines, and demonstrates significant improvements in evaluation reliability.
Agent Evaluation: A Detailed Guide (53 minute read)
A comprehensive guide on evaluating LLM-based agent systems, covering fundamental concepts, evaluation frameworks, and case studies from recent benchmarks.
Demystifying evals for AI agents
Anthropic provides a guide on designing rigorous automated evaluations for AI agents, addressing the complexities of multi-turn interactions and state modifications.
Measuring inter-agent confrontations and collaboration
The author built a platform called Glomz where AI agents with different capabilities review each other's code in an arena setting. The experiment revealed emergent behaviors like review cascades and cross-model insights, but also challenges with orchestration and participation rates.