Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce


Summary

This paper from eBay presents a modular two-agent simulation framework for evaluating conversational shopping assistant architectures, enabling controlled comparisons of responder designs. Key findings include that rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query, and that targeted fixes derived from a systematic failure analysis reduced near-failures by 62%.

arXiv:2606.12924v1 Announce Type: new Abstract: We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e-commerce search API. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios. Using 2,011 conversations across 14 persona buckets, we establish four empirical findings. First, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. Second, illustrating rapid evidence-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Third, swapping the responder LLM backbone from Gemini 2.5 to Llama 3.3 70B costs 0.16–0.45 points despite identical architecture. Finally, we document systematic philosophical disagreement between frontier LLM judges: Gemini rewards process correctness while Claude demands concrete outcomes, despite using the same evaluation prompt.

# A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce
Source: [https://arxiv.org/html/2606.12924](https://arxiv.org/html/2606.12924)
Jetlir Duraj∗, Jayanth Yetukuri, Shuang Zhou, Dhruv Varma, Rui Kong, Ishita Khan, Qunzhi Zhou eBay Inc\. \{jduraj, jyetukuri, shuazhou, dvarma, rukong, ishikhan, qunzhou\} @ebay\.com ∗Corresponding author

###### Abstract

We present a modular two\-agent simulation framework for evaluating conversational shopping assistant architectures\. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e\-commerce search API\. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios\. Using 2,011 conversations across 14 persona buckets, we establish four empirical findings\. First, rolling\-window memory outperforms intent\-extraction memory on all quality metrics while being 35% faster per query\. Second, illustrating rapid evidence\-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near\-failure rates by 62% across the full dataset\. Third, swapping the responder LLM backbone from Gemini 2\.5 to Llama 3\.3 70B costs 0\.16–0\.45 points despite identical architecture\. Finally, we document that judge selection is itself a consequential architectural decision: SOTA Gemini and Claude models disagree on 30% of conversations by two or more points despite identical prompts\.


## 1 Introduction

Building conversational shopping assistants that reliably serve diverse buyer needs requires controlled evaluation of architectural choices \(memory strategies, intent extraction, response generation\) before committing to production deployment\. In an ideal world, one would conduct A/B tests at scale covering all relevant buyer types with enough statistical power to draw conclusions\. In practice, high\-powered A/B tests covering a full buyer typology are rarely achievable at low cost\. Moreover, when introducing a new technology such as conversational shopping to a customer base, it is difficult to anticipate all the ways buyer behavior will evolve once the technology is deployed\. Cheap, reproducible offline evaluation tools that cover diverse buyer types therefore become essential for system iteration and improvement\. Once the framework is in place, the marginal cost per iteration is near zero, given LLM access\.

Two standard approaches fall short\. Beta user testing provides realistic signals but is slow, cannot replay identical scenarios across candidate architectures, and raises privacy constraints\. Single\-agent generation, where one model produces both buyer queries and assistant responses, is faster but produces conversations that differ systematically from real user behavior: queries tend to be more formal, action commands \(clicking or putting in a cart\) are rarer, and the generated buyer “knows” what the assistant will return before asking\. Critically, single\-agent generation cannot test responder\-specific architectural features, because these components do not participate in the generation at all\.

We present a modular two\-agent simulation system that addresses both limitations\. A buyer agent generates queries independently, reacting only to what the responder actually returns; a responder agent processes those queries using a real e\-commerce search API and returns search result pages \(SRP\) and/or conversational guidance \(CHAT\)\. Because either agent can be replaced without modifying the other, architectures can be compared on identical buyer scenarios with the buyer held as a controlled variable\. The buyer’s configurability also allows systematic testing of how buyer behavior may evolve after launch\.

Using this framework with 2,011 buyer conversations across 14 persona buckets, we conduct four experiments illustrating the iterative cycle of responder architecture design and evaluation\. Our contributions are:

- A modular two\-agent simulation system enabling controlled, reproducible comparison of responder architectures \(§[3](https://arxiv.org/html/2606.12924#S3)\)\.
- An empirical demonstration that simpler rolling\-window memory may outperform explicit intent extraction while reducing per\-query latency \(§[5\.1](https://arxiv.org/html/2606.12924#S5.SS1)\)\.
- A systematic study of failures from 2,011 conversations that directly enables targeted fixes of a responder architecture, recovering \+1\.3–1\.8 points on the worst\-performing scenarios and reducing near\-failures by 62% across the full corpus, validated in just 2 days \(§[5\.3](https://arxiv.org/html/2606.12924#S5.SS3)\)\.
- Evidence that the underlying LLM contributes independently of architecture: Llama 3\.3 70B vs\. Gemini 2\.5 costs 0\.16–0\.45 points \(§[5\.4](https://arxiv.org/html/2606.12924#S5.SS4)\)\.
- Evidence that frontier LLM judges embed different evaluation philosophies from an identical prompt: exact agreement on CHAT Helpfulness is only 13%, 30% of conversations show ≥2\-point divergence, and the evaluator gap \(≈0\.5 points\) exceeds the architecture gap, making judge selection a consequential design decision \(§[5\.5](https://arxiv.org/html/2606.12924#S5.SS5)\)\.

## 2 Related Work

Behavioral fidelity evaluation\. Wang et al\. \([2025](https://arxiv.org/html/2606.12924#bib.bib26)\) and Lu et al\. \([2025](https://arxiv.org/html/2606.12924#bib.bib24)\) evaluate shopping agents by measuring how closely predicted next actions match historical user logs, treating real human sessions as the ground truth\. This optimises for mimicry rather than system effectiveness: a responder that predicts the next click correctly need not be one that actually helps users accomplish their goals\. Our framework moves the metric towards mission success, and buyer behavior is generated fresh each run rather than replayed from a log\.

Production monitoring\. Zhao et al\. \([2025a](https://arxiv.org/html/2606.12924#bib.bib28)\) embed human feedback into a live retrieval flywheel to prevent knowledge decay; Warne et al\. \([2026](https://arxiv.org/html/2606.12924#bib.bib27)\) use simulation to stress\-test prompt changes against a structured “case state” reasoning architecture before deployment\. Both treat an architecture as fixed and optimise its operation\. Our framework is an R&D laboratory: it tests architectures before commitment, mapping failure boundaries that production flywheels may not be able to surface\.

Personalization and satisfaction evaluation\. Zhao et al\. \([2025b](https://arxiv.org/html/2606.12924#bib.bib29)\) and Sun et al\. \([2025](https://arxiv.org/html/2606.12924#bib.bib25)\) use LLM agents to score personalization or satisfaction as aggregate metrics\. We instead evaluate architectural design choices and expose that the evaluator is itself a design choice \(§[5\.5](https://arxiv.org/html/2606.12924#S5.SS5)\)\.

Memory architectures for conversational agents\. Shinn et al\. \([2023](https://arxiv.org/html/2606.12924#bib.bib14)\) introduce Reflexion, showing episodic memory with self\-reflection improves agent performance\. Packer et al\. \([2023](https://arxiv.org/html/2606.12924#bib.bib15)\) demonstrate memory compression can maintain performance while reducing context length\. We empirically validate that simpler rolling\-window memory can outperform intent\-extraction pipelines in conversational commerce\.

Agent architectures and tool use\. Schick et al\. \([2023](https://arxiv.org/html/2606.12924#bib.bib20)\) show models can learn to use external APIs through demonstration; Yao et al\. \([2023](https://arxiv.org/html/2606.12924#bib.bib21)\) propose ReAct, interleaving reasoning and action for improved tool use\. Our responder architectures integrate tool use with real search APIs, and our empirical comparison confirms that simpler designs can outperform more complex extraction pipelines\.

LLM\-as\-judge\. Using LLMs to evaluate model outputs is now standard practice \(Zheng et al\., [2023](https://arxiv.org/html/2606.12924#bib.bib9)\)\. Dubois et al\. \([2024](https://arxiv.org/html/2606.12924#bib.bib16)\) show that LLM judges can match human agreement on many tasks, while Chiang and Lee \([2023](https://arxiv.org/html/2606.12924#bib.bib17)\) document systematic biases in model\-based evaluation\. We extend this paradigm with a systematic cross\-judge comparison on identical shopping conversations, showing that judge selection is equivalent to selecting a quality criterion and that the resulting gap may be large enough to affect architectural conclusions\.

## 3 System Architecture

### 3\.1 Two\-Agent Design

Figure [1](https://arxiv.org/html/2606.12924#S3.F1) shows the system we build\. The orchestrator runs a conversation loop: it forwards buyer queries to the responder, returns the response to the buyer, logs the turns and various dynamic conversation statistics, and repeats until the buyer sends \[TERMINATE\_SESSION\] or exhausts its turn budget\. The critical design property is agent independence: the buyer observes only the responder’s output \(SRP and/or CHAT\); the responder observes only the buyer’s raw text query; and either component can be replaced without modifying the other\. This enables controlled experiments where only the responder architecture changes across runs\.
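The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `run_conversation`, `scripted_buyer`, `echo_responder`, and the response dictionary shape are hypothetical names introduced here, and the sketch collapses the per-mission turn budget into a single `turn_budget` argument.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

TERMINATE = "[TERMINATE_SESSION]"  # termination token named in the text

Response = Dict[str, object]  # hypothetical shape, e.g. {"SRP": [...], "CHAT": "..."}


@dataclass
class Turn:
    query: str
    response: Response


def run_conversation(
    buyer_step: Callable[[Optional[Response]], str],
    responder_step: Callable[[str], Response],
    turn_budget: int,
) -> List[Turn]:
    """Route messages between two independent agents and log every turn."""
    log: List[Turn] = []
    last_response: Optional[Response] = None
    for _ in range(turn_budget):
        # The buyer sees only the responder's previous output.
        query = buyer_step(last_response)
        if query == TERMINATE:
            break
        # The responder sees only the buyer's raw text query.
        last_response = responder_step(query)
        log.append(Turn(query, last_response))
    return log


def scripted_buyer(queries: List[str]) -> Callable[[Optional[Response]], str]:
    """Toy stand-in for the LLM buyer: replays fixed queries, then terminates."""
    remaining = iter(queries)
    return lambda _last: next(remaining, TERMINATE)


def echo_responder(query: str) -> Response:
    """Toy stand-in for a responder architecture."""
    return {"SRP": [f"listing matching '{query}'"], "CHAT": None}


log = run_conversation(scripted_buyer(["red sneakers", "size 10"]), echo_responder, turn_budget=10)
print(len(log), "turns logged")
```

Because the two agents meet only through the `buyer_step` and `responder_step` callables, swapping `echo_responder` for another responder leaves the buyer side untouched, which is the property the controlled comparisons rely on.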

Figure 1: Two\-agent simulation\. The Buyer Agent \(configurable: mission, persona, tone, patience\) sends each query to the Orchestrator \(routes messages and logs all turns and conversation stats\), which forwards it to the Responder Agent \(memory, classifiers, search API, responses DB\); the responder’s \{SRP, CHAT\} output returns to the buyer along the same path\. Either agent is replaceable, enabling controlled architecture comparisons\.

### 3\.2 Buyer Agent

The buyer is a Gemini 2\.5 Pro reasoning model operating in a mission\-based loop \(Figure [2](https://arxiv.org/html/2606.12924#S3.F2)\)\. Each session contains 1–3 independent missions; each mission specifies a shopping goal, buyer persona, communication tone, and patience level encoded as max\_turns \(4 turns for impatient buyers; 10 turns for patient buyers\)\. The buyer persona, patience, and communication tone do not vary across missions of the same buyer session\. The buyer LLM and the set of buyer configurations are held constant across all experiments\. Hence, performance differences are due to the responder architecture alone\.

Footnote 1: We also experimented with OpenAI GPT\-5\.2 and GPT\-5\.4 as buyer LLMs but found that the Gemini 2\.5 series generates more realistic buyer speech patterns\.

#### Memory architecture\.

The buyer maintains three parallel streams: query memory \(recent queries, auto\-compacted when \>10\); SRP memory \(keeps most recent 3 when \>5\); CHAT memory \(keeps most recent 3 when \>5\)\. On mission transition, the last 3 turns are injected into the next mission’s context, enabling natural handoffs \(e\.g\., “ok changing topic”\) rather than abrupt resets\.
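A rough sketch of the three streams and their thresholds follows. The thresholds come from the paragraph above; everything else is an assumption made for illustration: the class and function names are hypothetical, and the text does not say how query compaction is performed, so it is represented by a caller-supplied `summarize` function.

```python
from typing import Callable, List


def trim_stream(items: List[object], limit: int = 5, keep: int = 3) -> List[object]:
    """Once a stream exceeds `limit` entries, keep only the most recent `keep`."""
    return items[-keep:] if len(items) > limit else items


class BuyerMemory:
    """Three parallel streams: queries, SRPs, and CHAT messages."""

    def __init__(self, summarize: Callable[[List[str]], str], query_limit: int = 10):
        self.queries: List[str] = []
        self.srps: List[object] = []
        self.chats: List[object] = []
        self._summarize = summarize      # hypothetical compaction step
        self._query_limit = query_limit

    def add_turn(self, query: str, srp: object = None, chat: object = None) -> None:
        self.queries.append(query)
        if len(self.queries) > self._query_limit:
            # Assumption: compaction collapses the stream into one summary entry.
            self.queries = [self._summarize(self.queries)]
        if srp is not None:
            self.srps = trim_stream(self.srps + [srp])
        if chat is not None:
            self.chats = trim_stream(self.chats + [chat])


def handoff_context(turns: List[object], n: int = 3) -> List[object]:
    """Turns carried into the next mission's context on a mission transition."""
    return turns[-n:]


memory = BuyerMemory(summarize=lambda qs: " | ".join(qs))
for i in range(6):
    memory.add_turn(f"query {i}", srp=[f"item-{i}"], chat=f"reply {i}")
print(len(memory.srps), len(memory.chats), len(memory.queries))
```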

#### Query generation pipeline\.

Each turn, the buyer \(1\) checks whether the current mission has terminated or exhausted its turn budget; \(2\) prepares context from the current mission goal, persona/tone, and all three memory streams; \(3\) constructs an LLM prompt with available actions and cart command syntax; \(4\) calls Gemini 2\.5 Pro \(temperature 0\.2\); \(5\) parses and validates the response; and \(6\) updates memory, triggering compaction if any threshold is exceeded\.

#### Action types and cart logic\.

The buyer produces three action types: search query \(natural language or keyword\-style\), item click \(requests full listing details\), and cart addition \(primary purchase\-intent signal\)\. Cart additions are validated against current SRP results only, preventing hallucinated actions\. A mission terminates on satisfaction, turn\-budget exhaustion, or explicit abandonment\.

Figure 2: Buyer Agent query generation loop\. Each turn: check budget, prepare context from three memory streams \(queries/SRPs/CHATs\), call LLM, produce one action type\. Cart additions are validated against the current SRP\. Flowchart nodes, in order: new turn; check termination / turn budget \(limit: TERMINATE; ok: continue\); prepare context \(mission \+ memory\); build LLM prompt; LLM call \(Gemini 2\.5 Pro\); action type? \(search query, click item, cart add\); for cart adds, itemId in last SRP? \(no: discard; yes: keep\); update memory and compact if needed; return query string\. The responder’s \{SRP, CHAT\} output feeds the next turn\.
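The cart-validation step in the loop can be illustrated as follows. This is a sketch under stated assumptions: the `Action` type, its field names, and the item-id strings are hypothetical, and only cart additions are checked because that is the only validation the text describes.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Action:
    kind: str     # "search", "click", or "cart"
    payload: str  # query text for searches, item id for clicks and cart adds


def validate_action(action: Action, current_srp_ids: Set[str]) -> Optional[Action]:
    """Discard cart additions whose item id is not in the current SRP."""
    if action.kind == "cart" and action.payload not in current_srp_ids:
        return None  # the item was never shown, so the action is dropped
    return action


srp_ids = {"item-101", "item-102"}
kept = validate_action(Action("cart", "item-101"), srp_ids)
dropped = validate_action(Action("cart", "item-999"), srp_ids)
print(kept is not None, dropped is None)
```

Checking against the current SRP rather than the whole conversation history keeps the buyer from carting an item it only remembers, or invents, from earlier turns.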

### 3\.3 Responder Architectures

All responders share the same pipeline structure: a memory module, an orchestration layer via query understanding classifiers and query re\-writers, real search API integration returning live eBay listings, and a response generator\. They differ in memory design and underlying LLM\.

#### Sys\-A: Intent Tracking\.

Memory stores raw queries and LLM\-extracted intent statements\. After each buyer query, a dedicated LLM call extracts structured intents and generates a condensed search digest, adding one extra LLM call per query\.

#### Sys\-B: Rolling Window\.

Memory accumulates raw buyer queries; once the count exceeds 6, all accumulated queries are compacted into a single keyword digest via one LLM call and accumulation restarts\. There is no per\-query intent\-extraction call \(unlike Sys\-A\); the digest LLM call fires only at compaction, reducing per\-query latency by 35% compared to Sys\-A\.

Footnote 2: A systematic ablation over the compaction threshold \(rerunning the same 2,011 conversations with thresholds of 4, 6, 8, and 10\) is left for future work\.
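A minimal sketch of the rolling-window behaviour, assuming a pluggable digest function in place of the LLM call. The class name, `toy_keyword_digest`, and the choice to carry the previous digest into the next compaction are assumptions made here; the text specifies only the threshold of 6 and that accumulation restarts after compaction.

```python
from typing import Callable, List


class RollingWindowMemory:
    """Accumulate raw queries; compact them into one digest past a threshold."""

    def __init__(self, digest_fn: Callable[[List[str]], str], threshold: int = 6):
        self.digest_fn = digest_fn  # stands in for the single digest LLM call
        self.threshold = threshold
        self.digest = ""
        self.buffer: List[str] = []
        self.llm_calls = 0

    def add(self, query: str) -> None:
        self.buffer.append(query)
        if len(self.buffer) > self.threshold:
            # Assumption: the earlier digest is folded into the new one.
            carried = [self.digest] if self.digest else []
            self.digest = self.digest_fn(carried + self.buffer)
            self.llm_calls += 1
            self.buffer = []  # accumulation restarts

    def context(self) -> List[str]:
        """Digest (if any) followed by the raw queries seen since compaction."""
        return ([self.digest] if self.digest else []) + self.buffer


def toy_keyword_digest(texts: List[str]) -> str:
    """Toy replacement for the LLM digest: unique lowercase words in order."""
    seen: List[str] = []
    for word in " ".join(texts).lower().split():
        if word not in seen:
            seen.append(word)
    return " ".join(seen)


memory = RollingWindowMemory(toy_keyword_digest)
for q in ["watch band", "apple watch band", "sport loop", "blue", "42mm", "new only"]:
    memory.add(q)
calls_before = memory.llm_calls   # no compaction yet at exactly 6 queries
memory.add("under 30 dollars")    # the seventh query triggers compaction
print(calls_before, memory.llm_calls, memory.context())
```

The counter makes the latency argument visible: the digest function runs once per seven queries here, whereas a per-query extraction design such as Sys-A would invoke an extra call on every turn.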

#### Sys\-B\+: Targeted Fixes of Sys\-B\.

Three targeted changes addressing the failure patterns identified in §[5\.2](https://arxiv.org/html/2606.12924#S5.SS2)\.

#### Sys\-C: Llama Backbone\.

Same rolling\-window architecture and orchestration as Sys\-B; the generative backbone is changed from Gemini 2\.5 Pro/Flash to Llama 3\.3 70B Instruct\.

## 4 Experimental Setup

### 4\.1 Dataset

We use a dataset of 2,011 buyer conversation configurations spanning 14 buckets defined by three axes: shopping style \(information\-seeking, broad\-to\-narrow, precise\-flexible, precise\-strict\), patience \(patient: 10 turns per mission; impatient: 4 turns\), and mission count \(1–3 missions per session\)\. Table [1](https://arxiv.org/html/2606.12924#S4.T1) shows the full distribution\. Seed keyword queries are drawn from eBay consumer search logs spanning all verticals, supplemented by thematic and occasion\-based queries; an LLM then expands each seed into a full buyer mission specifying goal, context, constraints, and persona\. The 14\-bucket structure was designed to cover buyer archetypes relevant to agentic search evaluation\. The dataset is intended to illustrate the framework; teams evaluating specific architectures will construct datasets targeting their own verticals, time windows, and query patterns, which the modular framework readily supports\.

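| Style | Patience | Missions | N |
|---|---|---|---|
| Information seeking \(no purchase intent\) | Patient | 1 | 200 |
| Broad to narrow \(explores, narrows by results\) | Patient | 1 | 160 |
| Broad to narrow \(explores, narrows by results\) | Patient | 2 | 200 |
| Broad to narrow \(explores, narrows by results\) | Patient | 3 | 100 |
| Precise but flexible \(open to alternatives\) | Patient | 1 | 160 |
| Precise but flexible \(open to alternatives\) | Patient | 2 | 200 |
| Precise but flexible \(open to alternatives\) | Patient | 3 | 100 |
| Precise but flexible \(open to alternatives\) | Impatient | 1 | 133 |
| Precise but flexible \(open to alternatives\) | Impatient | 2 | 200 |
| Precise but flexible \(open to alternatives\) | Impatient | 3 | 100 |
| Precise and strict \(exact match, no alternatives\) | Patient | 1 | 25 |
| Precise and strict \(exact match, no alternatives\) | Impatient | 1 | 133 |
| Precise and strict \(exact match, no alternatives\) | Impatient | 2 | 200 |
| Precise and strict \(exact match, no alternatives\) | Impatient | 3 | 100 |
| Total | | | 2,011 |

Table 1: Dataset distribution across 14 buckets\. Each bucket is defined by shopping/browsing style, patience level, and mission count\. Five combinations are excluded as less relevant \(e\.g\., multi\-mission patient strict\)\. Patient buyers receive 10 turns/mission; impatient buyers only 4\.

The following is an example of a buyer configuration from the precise\_flexible\_patient\_1mission bucket\.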

> Mission: “Need original Apple Watch sports loop band, but willing to consider alternatives if they offer better value or are in excellent condition\.”
>
> Patience: max\_turns = 10 \(patient\)
>
> Persona: Experienced buyer who knows what they want but recognises good alternatives when presented; professional but open\-minded\.
>
> Tone: Casual conversational; asks follow\-up questions, appreciates suggestions\.
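For readers who want to build a comparable dataset, the example configuration and the bucket counts can be written down as plain data. The sketch below is illustrative only: `BuyerConfig` and its field names are hypothetical, the bucket keys are tuples chosen here rather than the authors' identifiers, and the counts are copied from Table 1.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class BuyerConfig:
    mission: str
    persona: str
    tone: str
    max_turns: int  # 10 for patient buyers, 4 for impatient buyers


example = BuyerConfig(
    mission=("Need original Apple Watch sports loop band, but willing to consider "
             "alternatives if they offer better value or are in excellent condition."),
    persona="Experienced buyer who knows what they want but recognises good alternatives.",
    tone="Casual conversational; asks follow-up questions, appreciates suggestions.",
    max_turns=10,
)

# (style, patience, missions) -> number of conversations, from Table 1.
BUCKET_SIZES: Dict[Tuple[str, str, int], int] = {
    ("information_seeking", "patient", 1): 200,
    ("broad_to_narrow", "patient", 1): 160,
    ("broad_to_narrow", "patient", 2): 200,
    ("broad_to_narrow", "patient", 3): 100,
    ("precise_flexible", "patient", 1): 160,
    ("precise_flexible", "patient", 2): 200,
    ("precise_flexible", "patient", 3): 100,
    ("precise_flexible", "impatient", 1): 133,
    ("precise_flexible", "impatient", 2): 200,
    ("precise_flexible", "impatient", 3): 100,
    ("precise_strict", "patient", 1): 25,
    ("precise_strict", "impatient", 1): 133,
    ("precise_strict", "impatient", 2): 200,
    ("precise_strict", "impatient", 3): 100,
}

print(len(BUCKET_SIZES), "buckets,", sum(BUCKET_SIZES.values()), "conversations")
```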

### 4\.2 Evaluation

We use two different frontier LLM judges to evaluate conversations along four dimensions \(1–5 scale\), with chain\-of\-thought reasoning required before scoring\. This dual\-judge design provides robustness and, as we show in §[5\.5](https://arxiv.org/html/2606.12924#S5.SS5), exposes systematic differences in evaluation philosophy\.

Footnote 3: Traditional IR metrics such as NDCG or precision@k are not applicable here for two structural reasons\. First, the buyer reacts to each responder’s output, so the same buyer configuration generates different queries against different systems, breaking the fixed\-query assumption IR metrics depend on\. Second, even at temperature 0, LLM inference is not fully deterministic due to floating\-point non\-associativity in GPU kernels \(Atil et al\., [2025](https://arxiv.org/html/2606.12924#bib.bib8)\), making relevance annotations from one run non\-transferable to the next\.

#### Evaluation metrics\.

Each conversation is scored on four metrics:

- Mission Success: Did the buyer accomplish their stated goal? Measures outcome quality based on buyer actions and satisfaction signals \(5: buyer clicks or carts an item, or expresses strong purchase intent; 3: buyer makes progress but does not find fully satisfactory results; 1: no meaningful progress\)\.
- SRP Relevance: How relevant were search results to buyer needs? Evaluates query\-to\-search translation and appropriateness of SHOW decisions \(5: all results match query and accumulated constraints, SHOW decisions always appropriate; 3: 50%\+ results match or some inappropriate SHOW decisions; 1: \<20% relevance or critical failure\)\.
- CHAT Helpfulness: How helpful was conversational guidance? Did responses leverage search data to provide specific, actionable information and were CHAT decisions appropriate? \(5: grounded guidance with specific product or policy information, CHAT decisions always appropriate; 3: somewhat helpful but generic, or occasional inappropriate decisions; 1: unhelpful or consistently inappropriate CHAT usage\)\.
- Query Intent Understanding: Did the responder correctly interpret each query and activate the right pipeline components across all turns? \(5: correct routing and constraint maintenance throughout; 3: main intent correct but important context lost from memory or some classification errors; 1: systematic mis\-classification or complete failure to maintain conversational context\)\.

#### LLM judges\.

We employ two frontier models, both receiving identical evaluation prompts: i\) Gemini 3\.1 Pro \(Reid et al\., [2026](https://arxiv.org/html/2606.12924#bib.bib33)\) and ii\) Claude Opus 4\.6 \(Anthropic, [2026](https://arxiv.org/html/2606.12924#bib.bib34)\)\. We aggregate by taking the mean across metrics\. When both judges agree directionally \(e\.g\., Sys\-B \> Sys\-A\), findings are robust; substantial disagreement indicates architectural trade\-offs \(§[5\.5](https://arxiv.org/html/2606.12924#S5.SS5)\) rather than noise\. The evaluation prompt and a sample of over 200 conversations spanning the score range were manually reviewed by the authors to confirm that judge scores align with human assessment of interaction quality; low\-scoring conversations were verified to represent failed interactions and high\-scoring ones to represent successful journeys\.
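The aggregation and the directional-agreement check described above can be sketched as follows. The metric keys, function names, and the example scores are made up for illustration; treating a tie as a lack of directional agreement is also an assumption of this sketch.

```python
from statistics import mean
from typing import Dict

METRICS = ("mission_success", "srp_relevance", "chat_helpfulness", "intent_understanding")


def overall_score(scores: Dict[str, int]) -> float:
    """Mean of the four 1-5 metric scores for one conversation and one judge."""
    for metric in METRICS:
        if not 1 <= scores[metric] <= 5:
            raise ValueError(f"{metric} must be on the 1-5 scale")
    return mean(scores[metric] for metric in METRICS)


def same_direction(system_a: Dict[str, float], system_b: Dict[str, float]) -> bool:
    """True when every judge prefers the same system (ties count as disagreement)."""
    signs = {(system_b[j] > system_a[j]) - (system_b[j] < system_a[j]) for j in system_a}
    return len(signs) == 1 and 0 not in signs


# Illustrative numbers only, not results from the paper.
one_conversation = {"mission_success": 5, "srp_relevance": 4,
                    "chat_helpfulness": 3, "intent_understanding": 4}
print(overall_score(one_conversation))
print(same_direction({"gemini": 4.10, "claude": 3.60}, {"gemini": 4.20, "claude": 3.65}))
```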

#### Reproducibility\.

The buyer agent workflow \(Figure [2](https://arxiv.org/html/2606.12924#S3.F2)\) and evaluation metrics are described in this paper\. The responder architectures are described at the level of design decisions \(§[3\.3](https://arxiv.org/html/2606.12924#S3.SS3)\)\. Implementation code for responder agents is proprietary and cannot be released\. The buyer configurations and conversation transcripts are similarly not available for external release\. The eBay search API is an internal service and is not publicly accessible\. We make these constraints explicit in the interest of transparency; limited reproducibility is a standard constraint for many industry papers\.

## 5 Results

### 5\.1 Architecture Comparison: Sys\-A vs\. Sys\-B

We run buyer configurations from all 14 buckets, each through both Sys\-A and Sys\-B\.

#### Outcome: simpler memory wins\.

Contrary to the initial hypothesis, Sys\-B outperforms Sys\-A on every metric under both judges \(Table [2](https://arxiv.org/html/2606.12924#S5.T2)\)\. The advantage is modest \(0\.01–0\.10 points\) but consistent across all metrics and judge conditions\. In head\-to\-head per\-conversation comparisons, Sys\-B wins more often than Sys\-A under both judges: Gemini \(Sys\-B: 23\.2%, Sys\-A: 15\.2%, tie: 61\.6%\) and Claude \(Sys\-B: 27\.3%, Sys\-A: 26\.3%, tie: 46\.5%\)\. Sys\-B is also 35% faster per query\.
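Head-to-head rates of this kind can be computed from paired per-conversation scores. The sketch below assumes each conversation has one overall score per system from a given judge and that a tie means exactly equal scores; the function name and the sample scores are illustrative, not taken from the paper.

```python
from typing import Dict, Sequence


def head_to_head(scores_a: Sequence[float], scores_b: Sequence[float]) -> Dict[str, float]:
    """Fraction of paired conversations won by each system, plus ties."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    n = len(scores_a)
    a_wins = sum(a > b for a, b in zip(scores_a, scores_b))
    b_wins = sum(b > a for a, b in zip(scores_a, scores_b))
    return {"a_wins": a_wins / n, "b_wins": b_wins / n, "ties": (n - a_wins - b_wins) / n}


# Illustrative paired scores for four conversations under one judge.
result = head_to_head([4.0, 3.5, 5.0, 2.0], [4.0, 4.0, 5.0, 3.0])
print(result)
```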

Table 2: Sys\-A \(intent tracking\) vs\. Sys\-B \(rolling window\)\. Sys\-B wins all metrics under both judges while being 35% faster per query\. All Claude scores align with Gemini direction; mission success is representative\.

#### Why simpler memory wins\.

Intent extraction introduces failure modes absent in Sys\-B: the dedicated LLM call can miscategorise intents, injecting noise downstream, and the compressed intent representation can drop information present in the raw query\. Modern long\-context reasoning models appear capable of working directly with raw query sequences, making intermediate extraction a liability in most conversation lengths\. Sys\-A occasionally wins on very long conversations \(≥3 missions, patient buyers\), suggesting explicit intent tracking may help when queries span 15\+ turns and important information is missed by Sys\-B’s short 6\-query window, but these cases are rare \(\<5% of dataset\)\.

### 5\.2 Evaluation of Sys\-B

#### Overall performance\.

Sys\-B exhibits strong aggregate performance \(Table [3](https://arxiv.org/html/2606.12924#S5.T3)\)\. Gemini scores consistently exceed Claude across all metrics; the gap between judges \(0\.39–0\.97 points\) reflects different evaluation philosophies rather than disagreement about architecture quality \(§[5\.5](https://arxiv.org/html/2606.12924#S5.SS5)\)\.

Table 3: Sys\-B evaluation\. Mean scores as percentage of maximum \(5/5\), by judge, and the absolute Gemini–Claude gap\. The CHAT Helpfulness gap \(0\.97 pts\) is the largest, reflecting Sys\-B’s tendency to generate informative explanations that Gemini rewards but Claude deems insufficiently action\-oriented\.

#### Persona analysis\.

Sys\-B excels with patient, flexible buyers: information\-seeking and broad\-to\-narrow exploratory buckets rank highest, with multi\-mission patient configurations among the strongest performers\. The rolling\-window architecture is well\-matched to buyers who allow time for discovery\. The worst\-performing combination is precise\_strict\_impatient: all three of its mission\-count variants rank in the bottom three buckets\.

#### Failure analysis\.

We define catastrophic failure as all four metrics scored 1/5\. Of 2,011 conversations, 14 \(0\.7%\) meet this criterion, but failures are highly concentrated\. Seven occur in the single precise\_strict\_impatient\_1mission bucket \(N=133\), a 5\.3% rate that is 7\.5× the overall average\. All patient buckets with N ≥ 100 achieve 0\.0% catastrophic failure, confirming that patient buyers give Sys\-B time to recover from missteps\. Examining the broader set of 65 low\-scoring conversations \(score ≤ 2/5 on all metrics\), a few recurring failure patterns emerge\. These provide actionable targets for the next iteration \(§[5\.3](https://arxiv.org/html/2606.12924#S5.SS3)\)\.
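The two failure definitions and the concentration figures above reduce to a few lines of arithmetic. In this sketch the predicate names are hypothetical; the counts \(14 of 2,011 overall, 7 of 133 in the bucket\) are the ones quoted in the paragraph.

```python
from typing import Dict


def is_catastrophic(scores: Dict[str, int]) -> bool:
    """All four metrics scored 1/5."""
    return all(value == 1 for value in scores.values())


def is_low_scoring(scores: Dict[str, int]) -> bool:
    """All four metrics scored 2/5 or lower (the broader near-failure set)."""
    return all(value <= 2 for value in scores.values())


overall_rate = 14 / 2011   # catastrophic failures across the full dataset
bucket_rate = 7 / 133      # precise_strict_impatient_1mission bucket
concentration = bucket_rate / overall_rate

print(f"overall {overall_rate:.1%}, bucket {bucket_rate:.1%}, ratio {concentration:.2f}x")
```

The unrounded ratio comes out slightly above 7.5, in line with the figure reported in the text.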

#### Perfect success\.

838 conversations \(41\.7%\) score 5/5 on all metrics under Gemini\. Only 96 \(4\.8%\) achieve this under both judges simultaneously\. The 36\.9\-percentage\-point gap quantifies Sys\-B’s architectural trade\-off: strong at process quality and conversational engagement \(Gemini’s criterion\); inconsistent at converting engagement into concrete buyer outcomes \(Claude’s criterion\)\.

### 5\.3 Rapid Iteration: Sys\-B → Sys\-B\+

The failure analysis in §[5\.2](https://arxiv.org/html/2606.12924#S5.SS2) directly motivated targeted changes in Sys\-B\+, identified and implemented in one working day\. Recurring patterns in Sys\-B’s degraded responses informed three targeted architectural changes addressing the most frequent failure modes\.

#### Evaluation of Sys\-B\+

We first validate on Sys\-B’s 29 worst\-performing configurations: the precise\_strict\_impatient bucket conversations scoring ≤2\.0 \(Claude average\)\. Table [4](https://arxiv.org/html/2606.12924#S5.T4) shows consistent improvement across all three bucket variants and both judges; with the targeted fixes, buyers adapt their search strategy and in many cases complete their missions\. Running Sys\-B\+ across all 2,011 conversations \(Table [5](https://arxiv.org/html/2606.12924#S5.T5)\), both judges agree on direction: Gemini overall average rises by \+1\.6%; Claude by \+1\.0%\. CHAT Helpfulness shows the largest per\-metric gain \(\+2\.2% for both judges\)\. Failure rates improve sharply: catastrophic failures fall from 14 to 9 \(−36%\); near\-failures from 65 to 25 \(−62%\); Gemini\-perfect conversations rise from 41\.7% to 46\.1%\. Gains concentrate in the precise\_strict\_impatient \(psi\) buckets, where exact constraints most often produce 0\-result responses and impatient buyers have fewest turns to recover\. One psi bucket \(precise\_strict\_impatient\_2missions\) reaches significance under Gemini \(p = 0\.004, Welch’s t\-test\), and a sign test across all 14 bucket deltas gives p = 0\.029\.
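The failure-rate reductions quoted above are plain relative changes, and a sign-test p-value is an exact binomial tail. The sketch below reproduces the first two figures from the stated counts. For the sign test, the text does not report how many of the 14 bucket deltas were positive or whether the test was one-sided; the 11-of-14 split used here is an assumption, chosen only because it is one combination that rounds to the reported 0.029.

```python
from math import comb


def relative_reduction(before: int, after: int) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


def sign_test_one_sided(n_positive: int, n_total: int) -> float:
    """P(at least n_positive successes out of n_total) under a fair coin."""
    tail = sum(comb(n_total, k) for k in range(n_positive, n_total + 1))
    return tail / 2 ** n_total


catastrophic_drop = relative_reduction(14, 9)   # counts from the text
near_failure_drop = relative_reduction(65, 25)  # counts from the text
p_value = sign_test_one_sided(11, 14)           # assumed split, see lead-in

print(f"{catastrophic_drop:.0%} {near_failure_drop:.0%} {p_value:.3f}")
```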

Table 4: Sys\-B\+ vs\. Sys\-B score improvement on 29 worst\-performing precise\_strict\_impatient \(psi\) configurations\. Targeted fixes recover \+1\.3 \(Claude\) and \+1\.8 \(Gemini\) points consistently across all three bucket variants\.

Table 5: Sys\-B\+ vs\. Sys\-B full\-scale evaluation, all 2,011 conversations\. Score change shown as percentage of baseline\. Targeted fixes reduce near\-failures by 62% and catastrophic failures by 36%\. Gains concentrate in psi buckets; the 2\-mission variant reaches significance \(p = 0\.004, Gemini\)\. A sign test across all 14 bucket deltas gives p = 0\.029\.

### 5\.4 LLM Quality: Sys\-B vs\. Sys\-C

Having established that architecture affects quality \(Sys\-A vs\. Sys\-B\), we also verify how the underlying LLM contributes independently\. Sys\-C holds Sys\-B’s architecture constant and swaps the generative backbone from Gemini 2\.5 Pro/Flash to vLLM\-served Llama 3\.3 70B Instruct\.

#### Results\.

Table [6](https://arxiv.org/html/2606.12924#S5.T6) shows Sys\-B winning on all four metrics under both judges\. Win ratios from per\-conversation head\-to\-head comparisons: under the Gemini judge, Sys\-B wins 26\.5%, Sys\-C wins 17\.0%, tie 56\.5% \(Sys\-B wins 1\.56× more often\); under the Claude judge, Sys\-B wins 34\.0%, Sys\-C wins 22\.0%, tie 44\.0% \(Sys\-B wins 1\.55× more often\)\. Differences are ≈2\.4 standard errors from zero \(p \< 0\.02\); Cohen’s d ≈ 0\.15–0\.21\.

Table 6: Sys\-B vs\. Sys\-C \(Llama 3\.3 70B\), 200 matched conversations\. Architecture is identical; only the generative LLM differs\. Δ columns show the Sys\-B − Sys\-C advantage per judge\. Sys\-C is 13% faster but consistently lower quality across all metrics and both judges\.

The largest gap is CHAT Helpfulness \(Claude: \+0\.45\)\. Sys\-C generates generic, parametric responses \(e\.g\., “The Pulsar X2V2 is known for its lightweight design and advanced sensor technology…”\) while Sys\-B grounds responses in actual listing data \(e\.g\., “The 4K version supports a 4000 Hz polling rate; the Pro typically uses 1000 Hz, though some bundles include the 4K dongle as an upgrade”\)\. Claude penalises generic CHAT heavily; Gemini is more lenient\. Sys\-C also over\-applies CHAT: ∼75% of turns vs\. Sys\-B’s ∼25%\. This occasionally steers buyers away from their stated mission toward a different product\.

Sys\-C’s only advantage is speed\. It is 13% faster per conversation\. Across all buckets but one, the speed saving does not offset the quality loss\.

#### Architecture and LLM quality are orthogonal\.

Removing the intent\-extraction step \(Sys\-A → Sys\-B\) improved quality by reducing noise\. Swapping the backbone LLM \(Sys\-B → Sys\-C\) reduced quality by reducing generative and reasoning capability\. Gains on the two dimensions are orthogonal and do not substitute for each other\.

### 5\.5 Judge Philosophy: Gemini vs\. Claude

Both judges receive identical evaluation prompts\. Table [3](https://arxiv.org/html/2606.12924#S5.T3) shows a consistent, large gap in absolute scores across all metrics\. The disagreement is sharpest on CHAT Helpfulness \(\+0\.97\): Gemini rewards informative, well\-structured explanations; Claude requires those explanations to lead to concrete buyer actions, such as cart additions, confirmed purchases, or explicit satisfaction signals\. Table [7](https://arxiv.org/html/2606.12924#S5.T7) shows the full score distribution on this metric: Gemini awards 5/5 to 74% of conversations; Claude to only 6%, a 68\-percentage\-point gap despite identical prompts\. This is not a calibration difference but a fundamental disagreement about what CHAT responses are for\.

Table 7: CHAT Helpfulness score distributions \(600\-conversation set\)\. Gemini awards 5/5 to 74% of conversations; Claude to only 6%, a 68\-percentage\-point gap despite identical prompts and rubrics\.

Agreement analysis on the conversation union set confirms the divergence is structural\. Exact agreement on CHAT Helpfulness is only 13%, versus 57–60% on Mission Success and Intent Understanding\. The two judges are effectively applying different criteria to the same dimension\. Across all metrics, 29\.8% of conversations show a ≥2\-point difference between judges \(maximum observed: 3 points on a 5\-point scale\)\. When the judges disagree directionally, the asymmetry is striking: on Mission Success, Gemini scores higher in 90 out of 91 disagreements; on SRP Relevance, in 20 out of 21\. The divergence is not noise around a shared baseline, but rather a systematic directional bias\.
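The three agreement statistics used above \(exact agreement, share of large gaps, and directional asymmetry among disagreements\) can be computed from paired judge scores on one metric. The function name and the five sample score pairs are illustrative assumptions, not data from the paper.

```python
from typing import Dict, Sequence


def agreement_stats(judge_a: Sequence[int], judge_b: Sequence[int]) -> Dict[str, float]:
    """Compare two judges' 1-5 scores for the same conversations on one metric."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty, equally long score lists")
    pairs = list(zip(judge_a, judge_b))
    n = len(pairs)
    disagreements = [(a, b) for a, b in pairs if a != b]
    return {
        "exact_agreement": sum(a == b for a, b in pairs) / n,
        "gap_at_least_2": sum(abs(a - b) >= 2 for a, b in pairs) / n,
        "disagreements": len(disagreements),
        "a_higher": sum(a > b for a, b in disagreements),
    }


# Illustrative scores for five conversations on a single metric.
gemini_scores = [5, 5, 4, 3, 5]
claude_scores = [3, 5, 3, 4, 2]
stats = agreement_stats(gemini_scores, claude_scores)
print(stats)
```

A high `a_higher` count relative to `disagreements` is what distinguishes a systematic directional bias from symmetric noise around a shared baseline.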

This divergence has measurable consequences\. Of 2,011 conversations, 838 \(41\.7%\) score 5/5 on all metrics under Gemini; only 96 \(4\.8%\) achieve this under both judges simultaneously, a 36\.9\-percentage\-point gap\. Reasoning style is also consistent with the philosophical split: Gemini produces concise assessments \(∼512 characters\) describing process per turn; Claude produces detailed critiques \(∼1,087 characters, 2\.1× longer\) scrutinizing whether the buyer ultimately accomplished their goal\. Because these differences emerge from the same prompt, they reflect deep architectural or training characteristics of the models underlying the LLM judges\.

A particularly consequential finding emerges from the conversation\-paired validation set \(Sys\-A vs\. Sys\-B, evaluated by both judges\): the systematic Gemini–Claude gap \(≈0\.5 points\) is 50× larger than the Sys\-A vs\. Sys\-B architecture difference as measured by Claude, and 5× larger as measured by Gemini\. The choice of evaluator has more impact on reported scores than the architectural choice being evaluated\. This makes judge selection a first\-class methodological decision: all architectural comparisons must use the same judge, and conclusions drawn under one judge may not transfer to the other\.
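Taken at face value, a judge gap of ≈0\.5 points that is 50× and 5× the architecture difference implies architecture differences of roughly 0\.01 points under Claude and 0\.1 points under Gemini\. The comparison itself can be sketched as below; the function name `gap_ratios`, the nested\-dictionary layout, and the toy scores are assumptions made for illustration, and the paper does not specify how it pools scores when computing the judge gap\.

```python
from statistics import mean


def gap_ratios(scores: dict[str, dict[str, list[float]]]) -> dict[str, float]:
    """Ratio of the between-judge gap to each judge's Sys-A vs. Sys-B gap.

    `scores[judge][system]` holds per-conversation scores (hypothetical layout).
    """
    # Judge gap: difference between the judges' mean scores, pooled over both systems.
    gemini_all = scores["gemini"]["sys_a"] + scores["gemini"]["sys_b"]
    claude_all = scores["claude"]["sys_a"] + scores["claude"]["sys_b"]
    judge_gap = abs(mean(gemini_all) - mean(claude_all))

    ratios = {}
    for judge in ("gemini", "claude"):
        # Architecture gap: how far apart this judge places the two systems.
        arch_gap = abs(mean(scores[judge]["sys_b"]) - mean(scores[judge]["sys_a"]))
        ratios[judge] = judge_gap / arch_gap if arch_gap else float("inf")
    return ratios


# Toy scores, invented purely for illustration.
toy_scores = {
    "gemini": {"sys_a": [4.0, 5.0], "sys_b": [5.0, 5.0]},
    "claude": {"sys_a": [4.0, 4.0], "sys_b": [4.0, 4.5]},
}
ratios = gap_ratios(toy_scores)
```

A ratio well above 1 for either judge signals that switching evaluators would move the reported score more than the architectural change under study\.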

The methodological implication is that choosing an LLM judge is equivalent to choosing a quality criterion\. If the target production metric is session depth, return rate, or interaction satisfaction, Gemini’s philosophy is more aligned; if the target is conversion, cart additions, or task completion, Claude’s philosophy is\. Running both judges simultaneously, as we do here, is not redundancy: the gap between their verdicts directly quantifies the system’s process–outcome trade\-off in a way neither judge alone can reveal\. We leave to future work a human preference study to determine which philosophy better reflects real user outcomes\.

## 6 Ethical Considerations

Simulation\-based evaluation can surface fairness concerns that production testing cannot easily expose\. The failure taxonomy in §[5\.2](https://arxiv.org/html/2606.12924#S5.SS2) reveals that impatient, strict buyers experience 7\.5× higher failure rates than the overall average, a disparity that would be difficult to detect without controlled simulation across diverse buyer personas\. Identifying such gaps offline enables architects to address equity issues before they affect real customers\. Our approach to offline simulation enables rapid, low\-cost architectural iteration across the full buyer typology before any production commitment\.
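A disparity of this kind is a per\-bucket failure rate divided by the overall failure rate\. The following sketch shows that computation under stated assumptions: the function `failure_rate_ratios`, the `(bucket, failed)` record shape, and the bucket names are hypothetical and are not taken from the paper's failure taxonomy\.

```python
from collections import defaultdict


def failure_rate_ratios(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each persona bucket to (bucket failure rate) / (overall failure rate).

    Each record is a hypothetical (persona_bucket, failed) pair for one conversation.
    """
    total = len(records)
    overall_failures = sum(failed for _, failed in records)
    if total == 0 or overall_failures == 0:
        raise ValueError("need at least one conversation and one failure")
    overall_rate = overall_failures / total

    # counts[bucket] = [failures, conversations]
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for bucket, failed in records:
        counts[bucket][0] += int(failed)
        counts[bucket][1] += 1

    return {b: (f / n) / overall_rate for b, (f, n) in counts.items()}


# Toy data, invented for illustration: 16 conversations, 2 failures.
records = (
    [("impatient_strict", True)] * 2
    + [("impatient_strict", False)] * 2
    + [("patient_flexible", False)] * 12
)
disparity = failure_rate_ratios(records)
```

In the toy data the `impatient_strict` bucket fails at 4× the overall rate; the paper reports 7\.5× for impatient, strict buyers on its full dataset\.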

However, simulation is a laboratory, not a substitute for monitoring real user outcomes after deployment\. LLM judges embed specific assumptions about quality that may not reflect all buyer needs: as shown in §[5\.5](https://arxiv.org/html/2606.12924#S5.SS5), Gemini rewards process correctness while Claude demands concrete outcomes, and architectural decisions optimized against one judge may systematically disadvantage buyer types valued by the other\. Real users will also exhibit behavior that synthetic personas cannot fully anticipate\. Connecting simulation results to production A/B test outcomes is an important direction for future validation\.

## 7 Discussion

#### Simulation fidelity\.

To validate the two\-agent design, we ran 64 no\-responder conversations where a single LLM generates both sides of the dialogue with access to the same search API\. The contrast is stark: single\-LLM queries average 24 words vs\. 8 words in two\-agent mode; action commands \(click on X, put X in cart\) appear in <5% of single\-LLM turns vs\. 45% of two\-agent turns; natural self\-corrections \(“oops ok”\) appear consistently in two\-agent conversations and are absent in single\-LLM generation\. The single\-LLM system generates coherent conversations, but does not reproduce the casual, action\-focused patterns of real shoppers\. This is why agent independence is necessary\. Fine\-tuning the buyer LLM on real user interaction data, following the behavioral fidelity approach of Wang et al\. \([2025](https://arxiv.org/html/2606.12924#bib.bib26)\) and Lu et al\. \([2025](https://arxiv.org/html/2606.12924#bib.bib24)\), is a natural direction for future work\.
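Two of the fidelity signals above, average query length and the share of turns that are action commands, are simple to compute from buyer turns\. The sketch below is one possible way to do so; the regular expression `ACTION_PATTERN` is a crude hypothetical heuristic, and the paper does not describe how it detects action commands\.

```python
import re

# Hypothetical heuristic: a turn counts as an action command if it starts with
# one of these imperative phrases. The paper does not specify its detector.
ACTION_PATTERN = re.compile(r"(click on|put|add|buy|open)\b", re.IGNORECASE)


def fidelity_stats(buyer_turns: list[str]) -> dict[str, float]:
    """Average words per buyer turn and share of turns that look like action commands."""
    n = len(buyer_turns)
    if n == 0:
        raise ValueError("need at least one buyer turn")
    total_words = sum(len(turn.split()) for turn in buyer_turns)
    # re.match anchors at the start of the (stripped) turn.
    action_turns = sum(bool(ACTION_PATTERN.match(turn.strip())) for turn in buyer_turns)
    return {"avg_words": total_words / n, "action_share": action_turns / n}


# Toy buyer turns, invented for illustration.
turns = [
    "red running shoes size 10",
    "click on the second one",
    "put it in cart",
    "oops ok show cheaper ones",
]
stats = fidelity_stats(turns)
```

Running the same function over single\-LLM and two\-agent transcripts would give the kind of side\-by\-side comparison reported above\.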

#### Architectural lessons\.

Two clear lessons emerge from the four experiments\. First, intermediate processing steps should be validated empirically: Sys\-A’s intent\-extraction hypothesis was theoretically well\-motivated but empirically wrong\. Second, architecture and LLM quality are independent dimensions: simplifying architecture \(Sys\-A → Sys\-B\) improved quality by removing noise; downgrading the backbone LLM \(Sys\-B → Sys\-C\) reduced quality by reducing generative and reasoning capability\.

## 8 Conclusion

We present a modular two\-agent simulation framework that enables controlled, rapid evaluation of conversational shopping assistant architectures using real search API integration\. Four experiments produced concrete findings: simpler rolling\-window memory can outperform intent extraction in both quality and speed; full\-scale evaluation lays bare concentrated failure modes invisible in small\-scale testing; systematic failure analysis enables rapid targeted iteration; and LLM quality contributes independently of architecture\. An important finding is that Gemini and Claude hold fundamentally different evaluation philosophies from an identical prompt\. This highlights that LLM judge selection is itself a consequential architectural decision in any evaluation pipeline\. The configurable buyer agent and modular responder are built for continued iteration: as shopping assistant systems mature and buyer behavior evolves, both the scenarios tested and the architectures evaluated can be updated independently\. Any e\-commerce responder that returns SRP and CHAT response components can be evaluated against the buyer agent without modification to either component, making the framework applicable beyond the specific architectures evaluated here\.
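The interchangeability claim amounts to a narrow interface: a responder takes the conversation so far and returns an SRP component and a CHAT component\. The sketch below illustrates that contract\. All names \(`ResponderOutput`, `Responder`, `EchoResponder`, `run_conversation`\) are hypothetical, and the buyer is reduced to a scripted list of messages, whereas the paper's buyer is an LLM agent with personas, missions, and patience levels\.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ResponderOutput:
    """The two components a responder must return (field names are assumptions)."""
    srp: list[str]  # search results page, here just item titles
    chat: str       # conversational reply shown to the buyer


class Responder(Protocol):
    """Any object with this method can be plugged into the loop below."""

    def respond(self, history: list[str], buyer_message: str) -> ResponderOutput: ...


class EchoResponder:
    """Trivial stand-in responder; a real one would call a search API and an LLM."""

    def respond(self, history: list[str], buyer_message: str) -> ResponderOutput:
        return ResponderOutput(
            srp=[f"result for: {buyer_message}"],
            chat=f"Here is what I found for '{buyer_message}'.",
        )


def run_conversation(buyer_messages: list[str], responder: Responder) -> list[ResponderOutput]:
    """Drive a scripted buyer against any responder; the loop never changes."""
    history: list[str] = []
    outputs: list[ResponderOutput] = []
    for message in buyer_messages:
        output = responder.respond(history, message)
        history.extend([message, output.chat])
        outputs.append(output)
    return outputs


outputs = run_conversation(["blue mug", "something cheaper"], EchoResponder())
```

Swapping in a different responder means passing another object with a `respond` method; neither the buyer script nor the loop needs to change, which is the property that makes controlled comparison on identical scenarios possible\.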

## References

- Claude opus 4\.6 system card\. Technical report, Anthropic\. Note: Available at [https://www\.anthropic\.com/claude\-opus\-4\-6\-system\-card](https://www.anthropic.com/claude-opus-4-6-system-card) External Links: [Link](https://www.anthropic.com/claude-opus-4-6-system-card) Cited by: [§4\.2](https://arxiv.org/html/2606.12924#S4.SS2.SSS0.Px2.p1.1)\.
- B\. Atil, S\. Aykent, A\. Chittams, L\. Fu, R\. J\. Passonneau, E\. Radcliffe, G\. R\. Rajagopal, A\. Sloan, T\. Tudrej, F\. Ture, Z\. Wu, L\. Xu, and B\. Baldwin \(2025\) Non\-determinism of “Deterministic” LLM settings\. arXiv preprint arXiv:2408\.04667\. External Links: [Link](https://arxiv.org/abs/2408.04667) Cited by: [footnote 3](https://arxiv.org/html/2606.12924#footnote3)\.
- C\. Chiang and H\. Lee \(2023\) Can large language models be an alternative to human evaluations?\. arXiv preprint arXiv:2305\.01937\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p6.1)\.
- Y\. Dubois, X\. Li, R\. Taori, T\. Zhang, I\. Gulrajani, J\. Ba, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2024\) AlpacaFarm: a simulation framework for methods that learn from human feedback\. Advances in Neural Information Processing Systems 36\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p6.1)\.
- Y\. Lu, J\. Huang, Y\. Han, B\. Yao, S\. Bei, J\. Gesi, Y\. Xie, Z\. Wang, Q\. He, and D\. Wang \(2025\) Can llm agents simulate multi\-turn human behavior? evidence from real online customer behavior data\. arXiv preprint arXiv:2503\.20749\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p1.1), [§7](https://arxiv.org/html/2606.12924#S7.SS0.SSS0.Px1.p1.1)\.
- C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\) MemGPT: towards llms as operating systems\. arXiv preprint arXiv:2310\.08560\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p4.1)\.
- M\. Reid, N\. Savinov, D\. Teplyashin, D\. Lepikhin, T\. Lillicrap, J\. Alayrac, R\. Soricut, A\. Lazaridou, O\. Firat, J\. Schrittwieser, et al\. \(2026\) Gemini 3\.1 pro: model card\. Technical report, Google DeepMind\. Note: Available at [https://deepmind\.google/models/model\-cards/gemini\-3\-1\-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/) External Links: [Link](https://deepmind.google/models/model-cards/gemini-3-1-pro/) Cited by: [§4\.2](https://arxiv.org/html/2606.12924#S4.SS2.SSS0.Px2.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\) Toolformer: language models can teach themselves to use tools\. arXiv preprint arXiv:2302\.04761\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p5.1)\.
- N\. Shinn, F\. Cassano, B\. Labash, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. arXiv preprint arXiv:2303\.11366\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p4.1)\.
- L\. Sun, Y\. Lu, J\. Gesi, S\. Fu, W\. Li, J\. Huang, and D\. Wang \(2025\) LLM agent meets agentic ai: can llm agents simulate customers to evaluate agentic\-ai\-based shopping assistants?\. arXiv preprint arXiv:2509\.21501\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p3.1)\.
- Z\. Wang, Y\. Lu, W\. Li, A\. Amini, B\. Sun, Y\. Bart, W\. Lyu, J\. Gesi, T\. Wang, J\. Huang, Y\. Su, U\. Ehsan, M\. Alikhani, T\. J\. Li, L\. Chilton, and D\. Wang \(2025\) OPeRA: a dataset of observation, persona, rationale, and action for evaluating llms on human online shopping behavior simulation\. arXiv preprint arXiv:2506\.05606\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p1.1), [§7](https://arxiv.org/html/2606.12924#S7.SS0.SSS0.Px1.p1.1)\.
- L\. Warne, C\. Gong, A\. Bamba, and M\. Gode \(2026\) A simulation and evaluation flywheel to develop llm chatbots at scale\. Note: DoorDash Engineering Blog\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p2.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\) ReAct: synergizing reasoning and acting in language models\. arXiv preprint arXiv:2210\.03629\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p5.1)\.
- C\. Zhao, T\. Zhang, H\. Su, Y\. Zhang, S\. Su, M\. Xu, Y\. Liu, W\. Han, J\. Werner, C\. N\. Cheng, and Y\. Mehdad \(2025a\) Agent\-in\-the\-loop: a data flywheel for continuous improvement in llm\-based customer support\. arXiv preprint arXiv:2510\.06674\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p2.1)\.
- Z\. Zhao, C\. Vania, S\. Kayal, N\. Khan, S\. B\. Cohen, and E\. Yilmaz \(2025b\) PersonaLens: a benchmark for personalization evaluation in conversational ai assistants\. arXiv preprint arXiv:2506\.09902\. Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p3.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\) Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\. In Advances in Neural Information Processing Systems, Vol\. 36\. External Links: [Link](https://arxiv.org/abs/2306.05685) Cited by: [§2](https://arxiv.org/html/2606.12924#S2.p6.1)\.
