Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

arXiv cs.CL 06/12/26, 04:00 AM Papers
Summary
The Shopping Reasoning Bench is an expert-authored benchmark for evaluating multi-turn conversational shopping assistants, with 525 missions and over 10,000 binary rubrics. Evaluations of GPT, Claude, and Gemini show that current models achieve only 57-77% pass rates, revealing significant gaps in expert-level shopping reasoning.
arXiv:2606.12608v1 Announce Type: new
Cached at: 06/12/26, 08:50 AM
# Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants
Source: [https://arxiv.org/html/2606.12608](https://arxiv.org/html/2606.12608)
Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan, Rowan Musselmann, Yifan Gao, Qingyu Yin, Priyanka Nigam, Bing Yin \(Amazon\) \{fansx, seonwoom, ynhu, xiabota, jayunn, saramuss, yifangao, qingyy, nigamp, alexbyin\}@amazon\.com

###### Abstract

Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open\-ended multi\-turn reasoning, domain expertise, and criterion\-level quality that real shopping conversations demand\. Shopping reasoning is unique among language model applications\. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross\-product trade\-offs across multi\-turn dialogue, capabilities absent from previous e\-commerce and general\-purpose benchmarks\. We introduce the Shopping Reasoning Bench, an expert\-authored benchmark of 525 missions \(232 single\-turn, 293 multi\-turn\) with 10,863 importance\-weighted binary rubrics authored by retail domain experts\. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade\-off analysis, and compatibility assessment\. An evaluation of nine models across three families \(GPT, Claude, Gemini\) shows that pass rates reach only 57–77% overall\. On multi\-turn missions, all models score 13–29 points lower on optional above\-and\-beyond criteria than on required ones, and performance degrades 4–18 points as conversations progress\. These gaps show that current models handle basic shopping assistance but fall short of expert\-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development\.

Equal contribution: Shuxian Fan and Seonwoo Min\.

## 1 Introduction

Suppose a customer asks an AI shopping assistant to recommend trail running shoes sturdy enough for backpacking\. The assistant lists five popular models but never warns that cushioned soles lose stability under load and never suggests sizing up to accommodate foot swelling on long hikes\. The response answers the question, but a retail expert would call it shallow\.

Conversational shopping assistants have reached consumer scale: Amazon Rufus serves over 300 million customers \(Amazon\.com, Inc\., [2026](https://arxiv.org/html/2606.12608#bib.bib24)\), and major search platforms including Perplexity have integrated AI\-powered shopping features \(Reuters, [2024](https://arxiv.org/html/2606.12608#bib.bib25)\)\. Evaluating these assistants is harder than evaluating many conventional language model applications\. A good shopping response must balance subjective preferences, budget constraints, and product trade\-offs across a multi\-turn conversation with evolving user intent, all while drawing on product\-domain expertise\.

##### The benchmark gap\.

A useful benchmark for this setting must be \(i\) grounded in domain expertise, to capture product\-specific knowledge that crowd annotators often lack; \(ii\) rubric\-verifiable at the criterion level, to resolve the fine capability differences that coarse aggregated scores obscure; and \(iii\) open\-ended and multi\-turn, to reflect the iterative nature of real shopping conversations\. No existing shopping benchmark meets all three requirements \(§[2](https://arxiv.org/html/2606.12608#S2)\)\. ShoppingReasoningBench is, to our knowledge, the first to jointly satisfy them: it pairs domain\-expert\-authored rubric criteria with a shopping reasoning taxonomy and evaluates models across multi\-turn shopping missions\.

##### Why shopping reasoning?

The complexity above is not incidental: pre\-purchase shopping is a form of *practical reasoning* \(Bratman, [1987](https://arxiv.org/html/2606.12608#bib.bib28)\), deliberation whose output is a decision to act, not a truth to verify\. Answering “What are the best trail runners that are supportive enough to use for backpacking too?” requires decomposing the customer’s constraints, identifying candidate products, applying domain expertise to evaluate each against those constraints, and synthesizing a recommendation\. No single step is retrievable; each depends on the novel intersection of the customer’s needs with product\-specific knowledge\. Existing reasoning benchmarks, whether mathematical \(Cobbe et al\., [2021](https://arxiv.org/html/2606.12608#bib.bib1); Hendrycks et al\., [2021](https://arxiv.org/html/2606.12608#bib.bib2)\), logical and scientific \(Suzgun et al\., [2023](https://arxiv.org/html/2606.12608#bib.bib29); Rein et al\., [2023](https://arxiv.org/html/2606.12608#bib.bib4)\), or code \(Chen et al\., [2021](https://arxiv.org/html/2606.12608#bib.bib30); Jimenez et al\., [2024](https://arxiv.org/html/2606.12608#bib.bib5)\), span a wide difficulty spectrum yet share a defining property: a unique verifiable answer exists\. Shopping reasoning has no ground\-truth answer, only better and worse deliberation, which is precisely the gap ShoppingReasoningBench’s expert\-authored rubrics are designed to measure\.

##### Contributions\.

- First taxonomy of pre\-purchase shopping reasoning\. Five categories and fifteen subcategories grounded in expert\-annotated turns\. These capture shopping\-specific reasoning patterns that prior shopping\-intent \(Sondhi et al\., [2018](https://arxiv.org/html/2606.12608#bib.bib12)\) and product\-QA \(Yang and Alonso, [2024](https://arxiv.org/html/2606.12608#bib.bib13)\) taxonomies do not address \(§[3](https://arxiv.org/html/2606.12608#S3)\)\.
- Expert\-authored multi\-turn shopping dataset\. 232 single\-turn queries and 293 multi\-turn missions \(1,764 turns\) authored by retail domain experts across five product families \(§[3](https://arxiv.org/html/2606.12608#S3)\)\.
- Importance\-weighted atomic rubric framework with validated LLM\-as\-judge\. 10,863 binary criteria \(85\.0% required\) that decompose expert shopping reasoning into independently verifiable pass/fail checks\. The LLM judge is validated against expert consensus with per\-criterion macro\-F1 benchmarked against an inter\-expert ceiling \(§[4](https://arxiv.org/html/2606.12608#S4)\)\.
- Empirical study across three model families and capability tiers\. Nine models from the GPT, Claude, and Gemini families, each at frontier, mid, and small tiers\. The benchmark separates families, separates tiers within each family, and exposes multi\-turn degradation as conversations grow longer \(§[5](https://arxiv.org/html/2606.12608#S5)\)\.

Our benchmark data, judge prompts, and per\-model outputs are publicly released at [https://huggingface\.co/datasets/amazon/ShoppingReasoningBench](https://huggingface.co/datasets/amazon/ShoppingReasoningBench)\.

##### Headline findings\.

We evaluate nine models across three families and three capability tiers on ShoppingReasoningBench\. First, the benchmark is unsaturated: pass rates range from 57% to 77% across the nine models\. Second, all models score 13–29 points lower on optional rubrics than on required ones, exposing a persistent gap between basic and expert\-level shopping assistance\. Third, multi\-turn performance degrades 4–18 points over the course of a mission, paralleling the “lost\-in\-conversation” phenomenon \(Laban et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib11)\)\.

## 2 Related Work

ShoppingReasoningBench draws on shopping\-domain benchmarks, expert\-authored rubric benchmarks in other domains, query and intent taxonomies, and multi\-turn LLM evaluation\. Table [1](https://arxiv.org/html/2606.12608#S2.T1) positions ShoppingReasoningBench against the most directly comparable shopping benchmarks and against the rubric\-benchmark methodologies from which its evaluation design is adapted\.

##### Shopping and e\-commerce benchmarks\.

Evaluation of conversational shopping assistants has been fragmented across task formulations\. WebShop \(Yao et al\., [2022](https://arxiv.org/html/2606.12608#bib.bib17)\) benchmarks LLM agents on simulated web navigation, focusing on product selection rather than open\-ended reasoning\. Shopping MMLU \(Jin et al\., [2024](https://arxiv.org/html/2606.12608#bib.bib18)\) provides a broad suite of classification\-style tasks, but evaluates single\-turn closed\-form answers\. eCeLLM \(Peng et al\., [2024](https://arxiv.org/html/2606.12608#bib.bib19)\) constructs instruction\-tuning data for e\-commerce\. ShoppingBench \(Wang et al\., [2025a](https://arxiv.org/html/2606.12608#bib.bib20)\) provides intent\-grounded agent tasks against a large product sandbox, measuring end\-to\-end success rate rather than response quality\. EcomEval \(Xie et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib10)\) evaluates shopping assistants across seven languages but does not provide expert\-authored rubrics for open\-ended scoring\. SessionIntentBench \(Yang et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib8)\) models inter\-session intention shifts using a hierarchical intention tree, but evaluates with classification metrics\. On the dialogue\-dataset side, Wizard of Shopping \(Li et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib21)\) and MG\-ShopDial \(Bernard and Balog, [2023](https://arxiv.org/html/2606.12608#bib.bib22)\) provide conversational shopping dialogues but lack rubric annotations for automatic criterion\-level scoring\.

The closest direct competitors are two recent rubric\-based shopping benchmarks\. ShoppingComp \(Tou et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib23)\) introduces an expert\-curated single\-turn benchmark with rubric\-graded product retrieval, report generation, and safety\-critical decision\-making evaluation\. SmartShopBench \(Cheng et al\., [2026](https://arxiv.org/html/2606.12608#bib.bib3)\) introduces a hierarchical two\-level evaluation across shopping intent categories, designed to support RL\-based agent training\. Both are single\-turn and do not organize queries around a published shopping\-reasoning taxonomy\. ShoppingReasoningBench differs along three dimensions\. First, it adds expert\-authored multi\-turn missions structured as shopping journeys spanning exploration, comparison, and goal\-directed search, alongside single\-turn queries that themselves require decomposing customer constraints, identifying candidate products, and weighing trade\-offs against domain knowledge\. Second, it organizes queries around a published taxonomy of pre\-purchase shopping reasoning\. Third, its rubric criteria carry importance weights rather than uniform pass/fail\.

##### Expert\-authored rubric benchmarks\.

Expert\-authored rubric benchmarks have emerged in several domains as general\-purpose benchmarks approach saturation\. HealthBench \(Arora et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib14)\) evaluates models on thousands of multi\-turn health conversations scored against rubric criteria written by physicians, validating an LLM judge against physician consensus\. PRBench \(Akyürek et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib15)\) extends this methodology to finance and law\. ProfBench \(Wang et al\., [2025b](https://arxiv.org/html/2606.12608#bib.bib16)\) covers chemistry, physics, finance, and consulting domains at the PhD and MBA level with response\-criterion pairs annotated by domain experts\. Earlier expert\-authored benchmarks in science \(GPQA; Rein et al\., [2023](https://arxiv.org/html/2606.12608#bib.bib4)\) and software engineering \(SWE\-bench; Jimenez et al\., [2024](https://arxiv.org/html/2606.12608#bib.bib5)\) use verifiable answers rather than open\-ended rubrics\. ShoppingReasoningBench extends this lineage to retail, adapting the importance\-weighted atomic\-criterion protocol and LLM\-judge validation design to shopping reasoning\.

##### Query and intent taxonomies\.

Query taxonomies for e\-commerce have focused on search\-query intent \(Sondhi et al\., [2018](https://arxiv.org/html/2606.12608#bib.bib12)\) or product\-QA type \(Yang and Alonso, [2024](https://arxiv.org/html/2606.12608#bib.bib13)\) rather than conversational reasoning; general\-purpose dialogue taxonomies like INFINITY\-CHAT \(Jiang et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib7)\) are not shopping\-specific\. None of these resolves the reasoning patterns that distinguish pre\-purchase shopping conversations from search or general chat: preference refinement, cross\-product trade\-off analysis, compatibility assessment, and multi\-turn purchase\-decision progression\. ShoppingReasoningBench’s taxonomy fills this gap \(§[3](https://arxiv.org/html/2606.12608#S3)\)\.

##### Reasoning in LLM evaluation\.

Existing reasoning benchmarks, whether mathematical \(Cobbe et al\., [2021](https://arxiv.org/html/2606.12608#bib.bib1); Hendrycks et al\., [2021](https://arxiv.org/html/2606.12608#bib.bib2)\), scientific \(Suzgun et al\., [2023](https://arxiv.org/html/2606.12608#bib.bib29); Rein et al\., [2023](https://arxiv.org/html/2606.12608#bib.bib4)\), or code \(Chen et al\., [2021](https://arxiv.org/html/2606.12608#bib.bib30); Jimenez et al\., [2024](https://arxiv.org/html/2606.12608#bib.bib5)\), span a wide difficulty range but share a defining property: a unique correct answer that can be automatically checked\. Shopping reasoning fundamentally lacks this property \(§[1](https://arxiv.org/html/2606.12608#S1)\); ShoppingReasoningBench adapts rubric\-graded evaluation to this regime, decomposing deliberation into independently verifiable criteria \(§[4](https://arxiv.org/html/2606.12608#S4)\)\.

Table 1: Comparison of ShoppingReasoningBench with shopping\-domain benchmarks and rubric\-based benchmarks in other domains\. Eval\-item counts are listed for each benchmark\. “MT” = multi\-turn; “Expert” = expert\-authored; “Rubric” = expert\-authored rubric scoring\.

| Benchmark | Eval items | MT | Expert | Rubric |
| --- | --- | --- | --- | --- |
| *Shopping\-domain benchmarks* | | | | |
| WebShop \(Yao et al\., [2022](https://arxiv.org/html/2606.12608#bib.bib17)\) | 12,087 instructions | – | – | – |
| Shopping MMLU \(Jin et al\., [2024](https://arxiv.org/html/2606.12608#bib.bib18)\) | 57 tasks | – | – | – |
| eCeLLM \(Peng et al\., [2024](https://arxiv.org/html/2606.12608#bib.bib19)\) | 10 tasks | – | – | – |
| EcomEval \(Xie et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib10)\) | 37 tasks | – | ✓ | – |
| ShoppingBench \(Wang et al\., [2025a](https://arxiv.org/html/2606.12608#bib.bib20)\) | 3,310 tasks | – | – | – |
| SessionIntentBench \(Yang et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib8)\) | 8,980 trajectories | – | – | – |
| ShoppingComp \(Tou et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib23)\) | 120 tasks | – | ✓ | ✓ |
| SmartShopBench \(Cheng et al\., [2026](https://arxiv.org/html/2606.12608#bib.bib3)\) | 120 tasks | – | – | ✓ |
| *Expert rubric benchmarks \(other domains\)* | | | | |
| HealthBench \(Arora et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib14)\) | 5,000 conversations | ✓ | ✓ | ✓ |
| ProfBench \(Wang et al\., [2025b](https://arxiv.org/html/2606.12608#bib.bib16)\) | 80 tasks | – | ✓ | ✓ |
| PRBench \(Akyürek et al\., [2025](https://arxiv.org/html/2606.12608#bib.bib15)\) | 1,100 questions | – | ✓ | ✓ |
| ShoppingReasoningBench \(ours\) | 1,996 turns | ✓ | ✓ | ✓ |

## 3 A Taxonomy of Shopping Reasoning

### 3\.1 Design rationale

An expert moves through the shopping reasoning arc by understanding what a customer needs, identifying relevant options, applying domain knowledge to evaluate those options against the customer’s constraints, and synthesizing actionable guidance\. Figure [1](https://arxiv.org/html/2606.12608#S3.F1) illustrates this arc on a representative query: an expert decomposes the query through reasoning stages and produces atomic rubrics that any adequate response must satisfy\.

Existing shopping\-related taxonomies target two axes: *search\-query intent* \(Sondhi et al\., [2018](https://arxiv.org/html/2606.12608#bib.bib12)\) and *product\-QA type* \(Yang and Alonso, [2024](https://arxiv.org/html/2606.12608#bib.bib13)\)\. Both leave the actual *reasoning demand* unspecified\. A query such as “is it better to wear hiking boots or trail running shoes when thru\-hiking?” is informational in intent and comparative in form, yet the capability it probes, trade\-off analysis under implicit budget and use\-case constraints, is invisible to both axes\. Our taxonomy targets this layer directly\.

The ShoppingReasoningBench taxonomy accordingly operates at two levels\. At the *turn level* \(§[3\.2](https://arxiv.org/html/2606.12608#S3.SS2)\), each query is assigned to one of five reasoning categories that capture the cognitive task the query places on the assistant\. At the *rubric level* \(§[3\.3](https://arxiv.org/html/2606.12608#S3.SS3)\), every atomic rubric carries tags for reasoning stage and quality dimension that serve as analytical keys for fine\-grained diagnosis\. This two\-level design makes the taxonomy load\-bearing: rubrics constructed from reasoning\-stage decomposition are reusable across product domains, and a model’s capability profile across categories reveals reasoning gaps that a product\-domain split would mask\.

### 3\.2 Reasoning categories

We defined five top\-level categories to capture the dominant reasoning patterns observed in conversational shopping, and refined each into three fine\-grained subcategories by requiring every leaf to support a distinct rubric template instantiated against the shopping mission context \(Figure [2](https://arxiv.org/html/2606.12608#S3.F2); definitions in Appendix [A](https://arxiv.org/html/2606.12608#A1)\)\. Retail domain experts verified the mapping of each of the 1,996 queries and turns to a taxonomy leaf\.

Two categories cover roughly 70% of turns\. Product Recommendation \(42\.8%\) ranges from narrowly constrained requests through multi\-product curation to open\-ended discovery, loading heavily on option generation and feature assessment\. Shopping Guidance \(26\.6%\) captures queries seeking advice or education rather than product suggestions, loading on domain expertise and actionability\. Three smaller categories capture distinct reasoning patterns: Product Comparison \(10\.7%\) demands weighing trade\-offs across alternatives; Product Inquiry \(10\.4%\) demands depth on a single product, with 54% of its rubrics on feature assessment; and Conversational Navigation \(9\.5%\) steers the dialogue rather than requesting products, and is confirmed in §[5](https://arxiv.org/html/2606.12608#S5) as the hardest category for every model\.

### 3\.3 Rubric dimensions and dataset composition

The benchmark comprises 1,996 evaluation points \(232 single\-turn \+ 1,764 multi\-turn across 293 missions\) assessed against 10,863 importance\-weighted atomic rubrics \(median 5 per turn\)\. Each rubric carries three orthogonal tags\. A reasoning stage identifies which step of the expert reasoning arc the rubric tests: the top three stages, *Feature Assessment* \(23\.3%\), *Domain Expertise* \(21\.6%\), and *Option Generation* \(21\.2%\), account for two\-thirds of all rubrics\. A quality dimension identifies which property of response quality is evaluated; *Concreteness* \(26\.0%\) is the most frequently tested\. Importance marks a rubric as *required* \(85%\) or *optional* \(15%\), separating adequacy from expert\-level proactive guidance\. Queries span five product families \(*hardlines* 40\.6%, *softlines* 15\.1%, *consumables* 14\.6%, *media* 5\.6%, *mixed* 24\.1%\) and three mission types \(*Explore & Discover* 57\.0%, *Compare & Choose* 22\.5%, *Find Specific Solution* 20\.5%; length 2–10 turns, median 6\)\. Full definitions appear in Appendix [A](https://arxiv.org/html/2606.12608#A1)\.
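As an illustration of the three tags, one rubric could be represented roughly as follows. This is a minimal sketch, not the released data schema: the class and field names are hypothetical, the stage and quality names are the ones listed in Table 6 (§5\.4), and the example rubric is invented from the trail\-running scenario in the introduction.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Reasoning stages, named as in Table 6 of the paper."""
    USER_CONTEXT = "User Context"
    TRADE_OFFS = "Trade-offs"
    OPTION_GENERATION = "Option Generation"
    DOMAIN_EXPERTISE = "Domain Expertise"
    FEATURE_ASSESSMENT = "Feature Assessment"
    ACTIONABILITY = "Actionability"


class Quality(Enum):
    """Quality dimensions, named as in Table 6 of the paper."""
    CLARITY = "Clarity"
    RELEVANCE = "Relevance"
    ACCURACY = "Accuracy"
    COMPLETENESS = "Completeness"
    CONCRETENESS = "Concreteness"
    INSIGHTFULNESS = "Insightfulness"


@dataclass(frozen=True)
class Rubric:
    """Hypothetical container for one atomic rubric and its three tags."""
    text: str
    stage: Stage
    quality: Quality
    required: bool

    @property
    def weight(self) -> int:
        # Importance weights from Section 4.1: 5 if required, 1 if optional.
        return 5 if self.required else 1


# Invented example; the tag assignment here is an assumption for illustration.
example = Rubric(
    text="Warns that cushioned soles lose stability under load",
    stage=Stage.DOMAIN_EXPERTISE,
    quality=Quality.INSIGHTFULNESS,
    required=False,
)
```

Keeping the tags as separate fields is what allows the same set of judgments to be sliced by stage, by quality dimension, or by importance without re\-scoring.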

![Refer to caption](https://arxiv.org/html/2606.12608v1/x1.png)

Figure 1: Expert annotation pipeline illustrated on a Constrained Recommendation query from ShoppingReasoningBench\. The customer query is analyzed through structured reasoning stages, producing atomic rubrics: binary, independently verifiable evaluation criteria\.

![Refer to caption](https://arxiv.org/html/2606.12608v1/x2.png)

Figure 2: The ShoppingReasoningBench reasoning taxonomy: five top\-level categories and fifteen fine\-grained subcategories with occurrence frequencies and per\-category lexical word clouds\.

## 4 Evaluation Framework

ShoppingReasoningBench aggregates atomic rubric judgments \(§[3\.3](https://arxiv.org/html/2606.12608#S3.SS3)\) into per\-turn, per\-mission, and dataset\-level scores via importance\-weighted pass rates\. Each rubric is scored by a single LLM judge whose reliability is validated against expert annotations \(§[4\.3](https://arxiv.org/html/2606.12608#S4.SS3)\)\.

### 4\.1 Pass rate scoring

The weighted pass rate for a model response is

$$\text{WPR}=\frac{\sum_{i=1}^{N} w_{i}\cdot\mathbf{1}[\text{rubric}_{i}\text{ passes}]}{\sum_{i=1}^{N} w_{i}} \tag{1}$$

where $N$ is the number of rubrics for a single turn, $w_{i}$ is the importance weight \($w_{i}=5$ for required rubrics, $w_{i}=1$ for optional\), and $\mathbf{1}[\cdot]$ is the indicator function\. Scores aggregate hierarchically: Eq\. [1](https://arxiv.org/html/2606.12608#S4.E1) produces a per\-turn score; the per\-mission score is the arithmetic mean of its per\-turn scores; the dataset\-level score is the arithmetic mean of per\-mission scores\. This macro\-average weights each turn equally within its mission and each mission equally within the dataset, so longer missions and turns with more rubrics do not dominate the aggregate\.
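The hierarchical aggregation can be sketched in a few lines of Python. This is an illustrative reading of Eq\. 1 and the macro\-averaging described above, not the released scoring code; the function names and the `(is_required, passed)` pair representation are assumptions, and every turn is assumed to have at least one rubric.

```python
from statistics import mean

REQUIRED_WEIGHT = 5  # w_i for required rubrics (Eq. 1)
OPTIONAL_WEIGHT = 1  # w_i for optional rubrics (Eq. 1)


def weighted_pass_rate(rubrics):
    """Eq. 1 for one turn; `rubrics` is a list of (is_required, passed) pairs."""
    weights = [REQUIRED_WEIGHT if req else OPTIONAL_WEIGHT for req, _ in rubrics]
    earned = sum(w for w, (_, passed) in zip(weights, rubrics) if passed)
    return earned / sum(weights)


def mission_score(turns):
    """Arithmetic mean of the per-turn weighted pass rates in one mission."""
    return mean(weighted_pass_rate(turn) for turn in turns)


def dataset_score(missions):
    """Arithmetic mean of per-mission scores (each mission counts once)."""
    return mean(mission_score(mission) for mission in missions)


# Toy data, invented for illustration (not taken from the benchmark).
turn_a = [(True, True), (True, False), (False, True)]  # (5 + 1) / (5 + 5 + 1)
turn_b = [(True, True), (False, False)]                # 5 / (5 + 1)
missions = [[turn_a, turn_b], [turn_b]]
score = dataset_score(missions)
```

Because averaging happens per turn and then per mission, the single\-turn mission in the toy data contributes exactly as much to `score` as the two\-turn mission, which is the behavior the macro\-average is meant to guarantee.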

### 4\.2 LLM\-as\-judge

ShoppingReasoningBench uses Claude Sonnet 4\.5 as the judge with fixed inference parameters \(temperature 0, single sample per rubric\)\. A single judge applies uniform decision criteria across the benchmark and permits direct validation against expert annotations \(§[4\.3](https://arxiv.org/html/2606.12608#S4.SS3)\)\.

The judge produces a binary pass/fail decision with a brief rationale per rubric\. For single\-turn queries, it receives the query, model response, and rubric text\. For multi\-turn evaluation, it additionally receives the conversation history through the current turn\. Prompts and output schema are in Appendix [D](https://arxiv.org/html/2606.12608#A4); full judge and generation parameters are in [Appendix C](https://arxiv.org/html/2606.12608#A3)\.
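The difference between the single\-turn and multi\-turn judge inputs can be sketched as below. The actual prompt and output schema live in Appendix D and are not reproduced here, so the dictionary keys, the function name, and the choice to represent history as the list of earlier \(query, response\) pairs are all assumptions made for illustration.

```python
def build_judge_input(turns, turn_index, rubric_text):
    """Assemble the fields one rubric judgment would need.

    turns: list of (query, response) pairs for a mission, in order.
    turn_index: 0-based index of the turn being judged.
    The key names are hypothetical, not the benchmark's schema.
    """
    query, response = turns[turn_index]
    payload = {"query": query, "response": response, "rubric": rubric_text}
    if len(turns) > 1:
        # Multi-turn missions: also supply the exchanges before this turn.
        payload["history"] = turns[:turn_index]
    return payload


# Invented three-turn mission for illustration.
mission = [("q1", "r1"), ("q2", "r2"), ("q3", "r3")]
second_turn = build_judge_input(mission, 1, "Suggests sizing up for long hikes")
single_turn = build_judge_input([("q", "r")], 0, "Names at least one product")
```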

### 4\.3 Judge validation

Two retail\-domain experts independently labeled a stratified sample of 1,457 rubric instances \(details in Appendix [B](https://arxiv.org/html/2606.12608#A2)\)\. Table [2](https://arxiv.org/html/2606.12608#S4.T2) reports agreement at two levels\.

##### Rubric level\.

Each rubric is a binary *met* / *not\-met* judgment\. We report macro\-F1 \(mean of per\-class F1, insensitive to class imbalance\) and Cohen’s $\kappa$\. Overall macro\-F1 is 0\.749 \($\kappa=0.498$, moderate; Landis and Koch, [1977](https://arxiv.org/html/2606.12608#bib.bib49)\)\. The judge approaches the inter\-expert ceiling \(agreement between the two human experts on the same sample\) for Product Recommendation and Conversational Navigation, where ceiling F1 is itself low \($\leq 0.764$\)\. The largest gaps appear in Product Comparison \(0\.721 vs\. ceiling 0\.852\), which suggests comparative rubrics admit greater annotator subjectivity, and in Product Inquiry \(0\.751 vs\. ceiling 0\.883\)\.
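For readers who want to reproduce the two rubric\-level agreement statistics on their own labels, a minimal pure\-Python sketch follows. It uses the standard definitions of macro\-F1 and Cohen's kappa for binary labels \(1 = met, 0 = not\-met\); the function names and toy labels are invented, and kappa is undefined when chance agreement equals 1.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores of the two classes."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2


def cohen_kappa(a, b):
    """Cohen's kappa for two binary raters (assumes expected agreement < 1)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration.
expert = [1, 1, 1, 1, 0, 0, 1, 0]
judge = [1, 1, 0, 1, 0, 1, 1, 0]
toy_f1 = macro_f1(expert, judge)
toy_kappa = cohen_kappa(expert, judge)
```

The same two functions applied to the second expert's labels in place of the judge's would give the inter\-expert ceiling.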

##### Aggregate level\.

We correlate the judge’s importance\-weighted pass\-rates with experts’ holistic 1–5 Likert ratings \(collected per turn and per shopping mission\) via Spearman’s $\rho$; the inter\-expert baseline replaces the judge’s scores with the second expert’s pass\-rates\. Response\-level $\rho=0.444$ \($n=305$; baseline 0\.398\); mission\-level $\rho=0.469$ \($n=30$; baseline 0\.389\)\. The judge slightly exceeds the inter\-expert baseline at both levels\.
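Spearman's rank correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below implements that standard definition in plain Python; the function names and the toy pass\-rates and Likert ratings are invented, and the inputs are assumed not to be constant.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation (assumes neither input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data, invented for illustration.
pass_rates = [0.9, 0.4, 0.7, 0.7, 0.2]  # judge weighted pass-rates
likert = [5, 2, 4, 3, 1]                # expert holistic ratings
toy_rho = spearman_rho(pass_rates, likert)
```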

Table 2: Judge validation against expert annotations by reasoning category\. Binary agreement columns: rubric\-level macro\-F1 and Cohen’s $\kappa$ \(judge vs\. mission\-owner expert; ceiling = inter\-expert\)\. Rank correlation columns: Spearman $\rho$ between judge weighted pass\-rates and expert Likert \(baseline = second expert’s pass\-rates\)\. $N$ = rubrics; $n$ = responses\.

| Category | $N$ | Judge F1 | Judge $\kappa$ | Ceiling F1 / $\kappa$ | $n$ | Judge $\rho$ | Inter\-expert $\rho$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Product Recommendation | 637 | 0.761 | 0.523 | 0.760 / 0.521 | 139 | 0.381 | 0.247 |
| Shopping Guidance | 251 | 0.731 | 0.464 | 0.766 / 0.534 | 57 | 0.345 | 0.313 |
| Product Comparison | 225 | 0.721 | 0.442 | 0.852 / 0.704 | 44 | 0.362 | 0.578 |
| Product Inquiry | 130 | 0.751 | 0.503 | 0.883 / 0.765 | 28 | 0.618 | 0.539 |
| Conversational Navigation | 214 | 0.738 | 0.476 | 0.764 / 0.528 | 37 | 0.608 | 0.451 |
| Overall | 1,457 | 0.749 | 0.498 | 0.787 / 0.573 | 305 | 0.444 | 0.398 |

## 5 Results

We evaluate nine commercial LLMs spanning three model families \(GPT, Claude, Gemini\) and three capability tiers \(frontier, mid, small\) on ShoppingReasoningBench\. Each model generates responses using its native web search tool at default inference parameters \([Appendix C](https://arxiv.org/html/2606.12608#A3)\)\. A single Claude Sonnet 4\.5 judge scores all responses against the atomic rubrics; its reliability is validated against expert annotations \([Section 4\.3](https://arxiv.org/html/2606.12608#S4.SS3)\), and a cross\-judge comparison with DeepSeek V3\.2 confirms that the reported rankings are robust to judge choice \([Section E\.4](https://arxiv.org/html/2606.12608#A5.SS4)\)\. All results use weighted pass rates \(Eq\. [1](https://arxiv.org/html/2606.12608#S4.E1)\)\. Ablations on system prompt conditioning are in [Section E\.5](https://arxiv.org/html/2606.12608#A5.SS5)\.

### 5\.1 Main Results

Table 3: Main results on ShoppingReasoningBench\. Weighted pass rate \(%\) on single\-turn \(ST, 232 missions\) and multi\-turn \(MT, 293 missions\) subsets\. Overall is the average of ST and MT weighted by mission count\.

| Family | Model | ST | MT | Overall |
| --- | --- | --- | --- | --- |
| GPT | GPT\-5\.4 | 69.2 | 71.0 | 70.2 |
| GPT | GPT\-5\.4 mini | 61.6 | 65.8 | 63.9 |
| GPT | GPT\-5\.4 nano | 65.2 | 61.9 | 63.4 |
| Claude | Claude Opus 4\.7 | 75.1 | 78.5 | 77.0 |
| Claude | Claude Sonnet 4\.5 | 65.1 | 71.3 | 68.6 |
| Claude | Claude Haiku 4\.5 | 55.3 | 59.1 | 57.4 |
| Gemini | Gemini 3\.1 Pro | 76.5 | 77.7 | 77.2 |
| Gemini | Gemini 3 Flash | 75.2 | 75.7 | 75.5 |
| Gemini | Gemini 3\.1 Flash\-Lite | 71.1 | 73.5 | 72.4 |

Three properties of the benchmark emerge from the primary evaluation \([Table 3](https://arxiv.org/html/2606.12608#S5.T3)\)\. First, ShoppingReasoningBench is unsaturated: overall pass rates range from 57\.4% to 77\.2%, and no model exceeds 79% on either split\. Second, the benchmark separates capability tiers: within every family, the overall pass rate falls from the frontier model to the mid\-tier model to the small\-tier model \(the one split\-level exception is single\-turn, where GPT\-5\.4 nano scores above GPT\-5\.4 mini, 65\.2 vs\. 61\.6\)\. Third, two of the frontier models, Claude Opus 4\.7 \(77\.0%\) and Gemini 3\.1 Pro \(77\.2%\), achieve comparable performance at the top of the range, while the GPT family trails at the frontier tier \(70\.2%\); substantial room for improvement remains across all families\.
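The Overall column can be checked directly from the ST and MT columns. The short script below recomputes it as a mission\-count\-weighted average using the values reported in Table 3; the helper name `overall` is ours, and rounding to one decimal place is assumed to match the table's precision.

```python
ST_MISSIONS = 232  # single-turn missions
MT_MISSIONS = 293  # multi-turn missions


def overall(st, mt):
    """Average of ST and MT pass rates, weighted by mission count."""
    return (ST_MISSIONS * st + MT_MISSIONS * mt) / (ST_MISSIONS + MT_MISSIONS)


# (ST, MT, reported Overall) as listed in Table 3.
table3 = {
    "GPT-5.4": (69.2, 71.0, 70.2),
    "GPT-5.4 mini": (61.6, 65.8, 63.9),
    "GPT-5.4 nano": (65.2, 61.9, 63.4),
    "Claude Opus 4.7": (75.1, 78.5, 77.0),
    "Claude Sonnet 4.5": (65.1, 71.3, 68.6),
    "Claude Haiku 4.5": (55.3, 59.1, 57.4),
    "Gemini 3.1 Pro": (76.5, 77.7, 77.2),
    "Gemini 3 Flash": (75.2, 75.7, 75.5),
    "Gemini 3.1 Flash-Lite": (71.1, 73.5, 72.4),
}

# Recompute Overall for every model and compare with the reported value.
recomputed = {name: round(overall(st, mt), 1) for name, (st, mt, _) in table3.items()}
mismatches = [name for name, (_, _, rep) in table3.items() if recomputed[name] != rep]
```

All nine rows reproduce the reported Overall value to one decimal place, so `mismatches` is empty.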

### 5\.2 Where Do Models Struggle?

Table 4: Multi\-turn weighted pass rate \(%\) by taxonomy dimension for the frontier model of each family, averaged across turns within each group\.

| Dimension | GPT\-5\.4 | Claude Opus 4\.7 | Gemini 3\.1 Pro |
| --- | --- | --- | --- |
| *By reasoning category* | | | |
| Product Recommendation | 69.3 | 76.2 | 76.9 |
| Shopping Guidance | 74.4 | 81.5 | 79.2 |
| Product Comparison | 72.8 | 80.9 | 77.7 |
| Product Inquiry | 69.1 | 79.1 | 76.7 |
| Conversational Navigation | 65.2 | 73.8 | 75.5 |
| *By product family* | | | |
| Hardlines | 68.6 | 78.4 | 76.7 |
| Softlines | 72.9 | 76.6 | 78.7 |
| Consumables | 71.8 | 80.3 | 78.1 |
| Media | 76.5 | 75.4 | 81.0 |
| Mixed | 70.4 | 78.0 | 76.9 |
| *By mission type* | | | |
| Explore & Discover | 72.4 | 78.8 | 79.1 |
| Compare & Choose | 67.2 | 78.0 | 74.8 |
| Find Specific Solution | 69.1 | 76.8 | 75.6 |

Reasoning category produces a consistent difficulty ordering across all nine models \(Table [4](https://arxiv.org/html/2606.12608#S5.T4)\)\. Shopping Guidance sits at the easy end: its queries lean toward advisory or educational responses, which models handle reliably\. Conversational Navigation sits at the hard end: its turns mark shifts in the shopping journey, such as refining preferences as new products surface or narrowing toward a final decision, where the assistant has to re\-anchor its recommendation to the customer’s evolving intent\. Product family shows no consistent difficulty ordering across models: each model has its own strongest and weakest product families, and no product family is uniformly hard for all nine models\. Mission type yields small within\-model gaps, suggesting difficulty comes from individual turns rather than mission shape\.

### 5\.3 Required vs\. Optional Criteria

ShoppingReasoningBench’s rubrics separate *required* criteria \(baseline shopping correctness\) from *optional* criteria \(expert\-flagged above\-and\-beyond advice, §[3\.3](https://arxiv.org/html/2606.12608#S3.SS3)\)\. Every model scores 13 to 29 points lower on optional rubrics than on required ones \([Table 5](https://arxiv.org/html/2606.12608#S5.T5)\)\. Current models cover the basics of shopping assistance at a reasonable rate but less consistently produce the kind of above\-and\-beyond advice that domain experts consider the mark of high\-quality assistance\.

Table 5: Multi\-turn required vs\. optional rubric pass rates \(%\), averaging the per\-turn fraction of rubrics met within each importance class\. Gap = Optional − Required; negative values indicate that models perform worse on above\-and\-beyond criteria\.

| Family | Model | Req\. | Opt\. | Gap |
| --- | --- | --- | --- | --- |
| GPT | GPT\-5\.4 | 71.6 | 46.5 | −25.1 |
| GPT | GPT\-5\.4 mini | 66.8 | 37.8 | −29.0 |
| GPT | GPT\-5\.4 nano | 62.6 | 37.2 | −25.4 |
| Claude | Claude Opus 4\.7 | 78.8 | 66.0 | −12.8 |
| Claude | Claude Sonnet 4\.5 | 72.0 | 50.8 | −21.2 |
| Claude | Claude Haiku 4\.5 | 60.2 | 36.9 | −23.3 |
| Gemini | Gemini 3\.1 Pro | 78.3 | 58.0 | −20.3 |
| Gemini | Gemini 3 Flash | 76.0 | 58.7 | −17.3 |
| Gemini | Gemini 3\.1 Flash\-Lite | 73.9 | 56.8 | −17.1 |
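The per\-class pass rates in Table 5 are unweighted within each importance class, unlike Eq\. 1. A minimal sketch of that computation follows; the function name and toy data are invented, and skipping turns that have no rubric in a given class is our assumption, since the caption does not say how such turns are handled.

```python
from statistics import mean


def class_pass_rate(turns, required):
    """Mean over turns of the fraction of rubrics met in one importance class.

    turns: list of turns, each a list of (is_required, passed) pairs.
    Turns without any rubric in the class are skipped (an assumption).
    Returns a percentage.
    """
    fractions = []
    for rubrics in turns:
        outcomes = [passed for is_req, passed in rubrics if is_req == required]
        if outcomes:
            fractions.append(sum(outcomes) / len(outcomes))
    return 100 * mean(fractions)


# Toy data, invented for illustration.
turns = [
    [(True, True), (True, False), (False, False)],
    [(True, True), (False, True)],
    [(True, True), (True, True)],  # no optional rubric on this turn
]
req = class_pass_rate(turns, required=True)
opt = class_pass_rate(turns, required=False)
gap = opt - req  # negative when optional rubrics are met less often
```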
### 5\.4 Rubric Difficulty Distribution by Reasoning Dimension

The previous sections show *where* models struggle by category and by importance class\. Table [6](https://arxiv.org/html/2606.12608#S5.T6) provides a finer\-grained view, breaking down rubric difficulty by reasoning stage and quality dimension \(definitions in [Appendix A](https://arxiv.org/html/2606.12608#A1)\)\. Of the 10,042 multi\-turn rubrics, 28\.3% are passed by all nine models \(ceiling\), 5\.3% are passed by none \(floor\), and the remaining 66\.4% discriminate between models\.

Table 6: Rubric difficulty by dimension\. Floor = fraction passed by no model; Ceiling = fraction passed by all nine models\. Higher ceiling indicates easier rubrics; higher floor indicates harder rubrics\.

| Dimension | N | Floor \(%\) | Ceiling \(%\) |
| --- | --- | --- | --- |
| *By importance* | | | |
| Required | 8,502 | 4\.0 | 31\.3 |
| Optional | 1,540 | 12\.3 | 11\.6 |
| *By reasoning stage* | | | |
| User Context | 699 | 3\.1 | 47\.1 |
| Trade\-offs | 806 | 4\.8 | 32\.9 |
| Option Generation | 2,143 | 5\.6 | 28\.9 |
| Domain Expertise | 2,063 | 4\.8 | 26\.7 |
| Feature Assessment | 2,303 | 4\.1 | 21\.8 |
| Actionability | 2,028 | 7\.8 | 28\.4 |
| *By reasoning quality* | | | |
| Clarity | 744 | 6\.3 | 49\.5 |
| Relevance | 2,192 | 4\.0 | 35\.8 |
| Accuracy | 907 | 3\.6 | 28\.6 |
| Completeness | 1,905 | 5\.8 | 29\.0 |
| Concreteness | 2,754 | 5\.9 | 21\.4 |
| Insightfulness | 1,540 | 5\.9 | 18\.8 |

The importance split reinforces the required–optional gap from §[5\.3](https://arxiv.org/html/2606.12608#S5.SS3): 31\.3% of required rubrics are passed by all nine models, compared to only 11\.6% of optional rubrics\.

Among reasoning stages, *user context* rubrics are the easiest \(47\.1% passed by all models\): models reliably identify what the customer is asking for\. *Feature assessment* has the lowest all\-pass rate \(21\.8%\), indicating that evaluating specific product attributes \(materials, compatibility, specifications\) is where models diverge most\. *Actionability* rubrics show a distinct pattern: despite a moderate all\-pass rate \(28\.4%\), they have the highest rate of rubrics that no model passes \(7\.8%\), suggesting that giving concrete, usable recommendations is an area of consistent difficulty\.

Among quality dimensions, *clarity* \(49\.5% all\-pass\) and *relevance* \(35\.8%\) are well\-handled, while *insightfulness* \(18\.8%\) and *concreteness* \(21\.4%\) show the lowest all\-pass rates\. Models produce organized, on\-topic responses but less consistently demonstrate expert\-level depth or provide specific, tangible details\. In the shopping domain, this gap matters: the difference between a generic answer and a useful one often lies in the concrete product knowledge that requires domain expertise to produce\.

### 5\.5 Multi\-Turn Degradation

All three frontier models show declining pass rates as missions progress \([Section E\.3](https://arxiv.org/html/2606.12608#A5.SS3)\)\. GPT\-5\.4 drops 10\.3 points \(77\.4 → 67\.1\), Gemini 3\.1 Pro drops 7\.3 points \(81\.5 → 74\.2\), and Claude Opus 4\.7 drops least at 4\.5 points \(78\.5 → 74\.0\)\. That three frontier models degrade at different rates on the same missions indicates that sustained multi\-turn coherence is a distinct capability, not a function of single\-turn quality\.

## 6 Discussion

##### The taxonomy as an evaluation lens\.

Organizing evaluation by reasoning category reveals capability gaps that simpler breakdowns obscure\. Conversational Navigation is consistently the hardest category across all three families and Shopping Guidance the easiest, and this ordering holds despite large differences in overall score\. A product\-family breakdown, by contrast, produces model\-specific patterns with no consistent ordering \(Table [4](https://arxiv.org/html/2606.12608#S5.T4)\)\. The taxonomy\-based decomposition is more diagnostic because it groups turns by the cognitive demand they place on the assistant \(e\.g\., trade\-off reasoning, product knowledge retrieval\) rather than by surface\-level topic\. Rubric\-level tags extend this further: breaking down pass rates by reasoning stage and quality dimension \([Section E\.2](https://arxiv.org/html/2606.12608#A5.SS2)\) shows that models handle user context recognition well but struggle with actionability and insightfulness, a distinction invisible in category\-level or product\-level aggregates\.

##### What importance weighting reveals\.

The 13–29 point gap between required and optional rubric pass rates \([Section 5\.3](https://arxiv.org/html/2606.12608#S5.SS3)\) is ShoppingReasoningBench’s most distinctive empirical finding\. Required rubrics test whether a response covers what the customer asked for; optional rubrics test whether it goes further with proactive advice, complementary suggestions, or decision frameworks\. Without importance weighting, these two classes would be averaged together, compressing the quality spectrum and making adequate responses look closer to expert\-level ones than they are\. The gap shows that current models reliably cover the basics of shopping assistance but less consistently produce the above\-and\-beyond advice that domain experts consider the mark of high quality\. This has a design implication for rubric benchmarks beyond shopping: distinguishing must\-have from nice\-to\-have criteria, and weighting them differently, exposes a capability dimension that binary pass/fail alone would miss\. A within\-stage breakdown of this gap is in [Section E\.1](https://arxiv.org/html/2606.12608#A5.SS1)\.

## 7 Conclusion

We introduced ShoppingReasoningBench, an expert\-authored benchmark for evaluating multi\-turn shopping assistance\. Retail shopping poses distinctive evaluation challenges: subjective preference resolution, cross\-product trade\-off reasoning, and multi\-turn purchase\-decision progression\. These require dedicated expert\-crafted criteria rather than general\-purpose metrics\. ShoppingReasoningBench addresses this with 525 missions across five product families, scored against 10,863 importance\-weighted atomic rubric criteria organized around the first taxonomy of pre\-purchase shopping reasoning \(5 categories, 15 subcategories\)\.

Our evaluation of nine models from three families shows that the benchmark is both unsaturated and discriminative\. Pass rates range from 57% to 77%, and all models score 13–29 points lower on optional rubrics than on required ones\. The hardest shopping\-specific skills remain far from solved\. The rubric decomposition reveals that models handle the basics but fall short on the above\-and\-beyond criteria that domain experts consider the mark of high\-quality advice\. By grounding evaluation in expert\-authored atomic rubrics with importance weights, ShoppingReasoningBench provides the resolution needed to measure capability gains from domain\-specific post\-training of shopping assistants\. We release the full benchmark together with a focused ShoppingReasoningBench\-Hard subset of the 108 hardest missions \(Appendix [F](https://arxiv.org/html/2606.12608#A6)\)\.

## Limitations

On the evaluation side, each model generated one response per query \(no repeated sampling\), so scores reflect a single draw from each model’s output distribution\. Variance across samples could affect individual mission scores, though dataset\-level averages over 525 missions mitigate this\. Some per\-category and per\-family breakdowns are based on small subsets \(e\.g\., Media with 14 multi\-turn missions\) and should be interpreted cautiously\.

On the data side, the expert team covers five product families but several domains \(grocery, automotive, industrial supplies, digital services\) are not represented\. Some taxonomy labels \(e\.g\., query type, shopping funnel stage\) were calibrated through iterative expert review rather than single\-pass annotation\. While all labels were verified for accuracy, systematic biases from the calibration process may propagate into the taxonomy distribution\. Product knowledge evolves over time: new products launch, prices change, and availability fluctuates\. The 17\.7% of single\-turn queries and 38\.9% of multi\-turn turns flagged as time\-sensitive may become outdated, requiring periodic benchmark updates\.

## Ethics Statement

##### Data collection\.

All benchmark queries were authored by expert annotators who were informed of the research purpose\. No customer data, personally identifiable information, or proprietary product data was used\.

##### Potential misuse\.

While ShoppingReasoningBench is designed to evaluate and improve shopping assistants, the rubric framework could in principle be used to optimize models for persuasive or manipulative product recommendations\. We encourage responsible use focused on improving response accuracy and helpfulness rather than maximizing conversion\.

##### Bias considerations\.

The benchmark reflects the knowledge and perspectives of 7 domain experts and may inherit biases related to product preferences, brand familiarity, and cultural context\. We encourage users to consider these limitations when interpreting evaluation results\.

##### Environmental impact\.

LLM\-based evaluation requires significant computational resources\. The single\-turn subset \(232 queries, 821 rubrics\) can serve as a lightweight evaluation where full multi\-turn assessment is not required\.

##### Use of AI assistants\.

AI writing assistants were used for editorial refinement of the manuscript\. All content has been audited and modified by the authors\.

## Data Availability

The benchmark data \(queries, missions, rubrics, and taxonomy\), annotation guidelines, and judge prompts will be released under the Creative Commons Attribution\-NonCommercial 4\.0 International License \(CC\-BY\-NC\-4\.0\) upon acceptance\. An access\-gated release ensures responsible use while maintaining reproducibility\.

## Acknowledgments

We thank our domain expert annotation team—Rowan Musselmann, Elizabeth Gongliewski, Jastine Sanchez, Kenneth Young, Laura Santana, Tom Knee, and others—for their meticulous work in constructing the evaluation rubrics and expert reasoning traces that underpin this benchmark\.

## References

- A\. F\. Akyürek, A\. Gosai, C\. B\. C\. Zhang, V\. Gupta, J\. Jeong, A\. Gunjal, T\. Rabbani, M\. Mazzone, D\. Randolph, M\. M\. Meymand, G\. Chattha, P\. Rodriguez, D\. Mares, P\. Singh, M\. Liu, S\. Chawla, P\. Cline, L\. Ogaz, E\. Hernandez, Z\. Wang, P\. Bhatter, M\. Ayestaran, B\. Liu, and Y\. He \(2025\) PRBench: large\-scale expert rubrics for evaluating high\-stakes professional reasoning\. arXiv preprint arXiv:2511\.11562\. [Link](https://arxiv.org/abs/2511.11562)
- Amazon\.com, Inc\. \(2026\) Amazon\.com announces fourth quarter results\. Note: SEC Filing, Exhibit 99\.1, fiscal year ended December 31, 2025\. [Link](https://www.sec.gov/Archives/edgar/data/1018724/000101872426000002/amzn-20251231xex991.htm)
- Anthropic \(2026\) Claude models documentation\. Accessed: 2026\-05\-22\. [Link](https://docs.anthropic.com/en/docs/about-claude/models)
- R\. K\. Arora, J\. Wei, R\. Soskin Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas, M\. Sharman, M\. Shah, A\. Vallone, A\. Beutel, J\. Heidecke, and K\. Singhal \(2025\) HealthBench: evaluating large language models towards improved human health\. arXiv preprint arXiv:2505\.08775\. [Link](https://arxiv.org/abs/2505.08775)
- N\. Bernard and K\. Balog \(2023\) MG\-ShopDial: a multi\-goal conversational dataset for e\-commerce\. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval \(SIGIR\), pp\. 2775–2785\. [Document](https://dx.doi.org/10.1145/3539618.3591883)
- M\. E\. Bratman \(1987\) Intention, plans, and practical reason\. Harvard University Press, Cambridge, MA\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. d\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, et al\. \(2021\) Evaluating large language models trained on code\. arXiv preprint arXiv:2107\.03374\.
- Y\. Cheng, K\. Mao, T\. Li, J\. Tan, J\. Wen, and Z\. Dou \(2026\) ChatShopBuddy: towards reliable conversational shopping agents via reinforcement learning\. arXiv preprint arXiv:2603\.06065\. [Link](https://arxiv.org/abs/2603.06065)
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\.
- J\. Cohen \(1960\) A coefficient of agreement for nominal scales\. Educational and Psychological Measurement 20\(1\), pp\. 37–46\.
- DeepSeek AI \(2025\) DeepSeek\-V3 technical report\. arXiv preprint arXiv:2412\.19437\.
- Google DeepMind \(2026\) Gemini models documentation\. Accessed: 2026\-05\-22\. [Link](https://ai.google.dev/gemini-api/docs/models)
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\) Measuring mathematical problem solving with the MATH dataset\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. [Link](https://arxiv.org/abs/2103.03874)
- L\. Jiang, Y\. Chai, M\. Li, M\. Liu, R\. Fok, N\. Dziri, Y\. Tsvetkov, M\. Sap, A\. Albalak, and Y\. Choi \(2025\) Artificial hivemind: the open\-ended homogeneity of language models \(and beyond\)\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. [Link](https://arxiv.org/abs/2510.22954)
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan \(2024\) SWE\-bench: can language models resolve real\-world GitHub issues? In International Conference on Learning Representations \(ICLR\)\. [Link](https://openreview.net/forum?id=VTF8yNQM66)
- Y\. Jin, Z\. Li, C\. Zhang, T\. Cao, Y\. Gao, P\. Jayarao, M\. Li, X\. Liu, R\. Sarkhel, X\. Tang, H\. Wang, Z\. Wang, W\. Xu, J\. Yang, Q\. Yin, X\. Li, P\. Nigam, Y\. Xu, K\. Chen, Q\. Yang, M\. Jiang, and B\. Yin \(2024\) Shopping MMLU: a massive multi\-task online shopping benchmark for large language models\. In Advances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track\. [Link](https://arxiv.org/abs/2410.20745)
- P\. Laban, H\. Hayashi, Y\. Zhou, and J\. Neville \(2025\) LLMs get lost in multi\-turn conversation\. arXiv preprint arXiv:2505\.06120\. [Link](https://arxiv.org/abs/2505.06120)
- J\. R\. Landis and G\. G\. Koch \(1977\) The measurement of observer agreement for categorical data\. Biometrics 33\(1\), pp\. 159–174\.
- X\. Li, Z\. Chen, J\. I\. Choi, N\. Vedula, B\. Fetahu, O\. Rokhlenko, and S\. Malmasi \(2025\) Wizard of shopping: target\-oriented E\-commerce dialogue generation with decision tree branching\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL\), pp\. 13095–13120\. [Link](https://aclanthology.org/2025.acl-long.641/)
- OpenAI \(2026\) GPT models documentation\. Accessed: 2026\-05\-22\. [Link](https://platform.openai.com/docs/models)
- B\. Peng, X\. Ling, Z\. Chen, H\. Sun, and X\. Ning \(2024\) eCeLLM: generalizing large language models for E\-commerce from large\-scale, high\-quality instruction data\. In Proceedings of the 41st International Conference on Machine Learning \(ICML\)\. [Link](https://arxiv.org/abs/2402.08831)
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2023\) GPQA: a graduate\-level google\-proof Q&A benchmark\. arXiv preprint arXiv:2311\.12022\. [Link](https://arxiv.org/abs/2311.12022)
- Reuters \(2024\) AI startup Perplexity adds shopping features as search competition tightens\. [Link](https://www.tradingview.com/news/reuters.com,2024:newsml_L4N3MP0XZ:0-ai-startup-perplexity-adds-shopping-features-search-competition-tightens/)
- P\. Sondhi, M\. Sharma, P\. Kolari, and C\. Zhai \(2018\) A taxonomy of queries for E\-commerce search\. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval \(SIGIR\), pp\. 1245–1248\. [Document](https://dx.doi.org/10.1145/3209978.3210152)
- M\. Suzgun, N\. Scales, N\. Schärli, S\. Gehrmann, Y\. Tay, H\. W\. Chung, A\. Chowdhery, Q\. V\. Le, E\. H\. Chi, D\. Zhou, and J\. Wei \(2023\) Challenging BIG\-Bench tasks and whether chain\-of\-thought can solve them\. Findings of ACL\.
- H\. Tou, Y\. Zeng, Y\. Li, C\. Ma, M\. Li, M\. Li, W\. Yuan, H\. Zhang, and K\. Jia \(2025\) ShoppingComp: are LLMs really ready for your shopping cart? Note: Version 2, February 2026\. arXiv:2511\.22978\. [Link](https://arxiv.org/abs/2511.22978v2)
- J\. Wang, K\. Xiao, Q\. Sun, H\. Zhao, T\. Luo, J\. D\. Zhang, and X\. Zeng \(2025a\) ShoppingBench: a real\-world intent\-grounded shopping benchmark for LLM\-based agents\. arXiv preprint arXiv:2508\.04266\. [Link](https://arxiv.org/abs/2508.04266)
- Z\. Wang, J\. Jung, X\. Lu, S\. Diao, E\. Evans, J\. Zeng, P\. Molchanov, Y\. Choi, J\. Kautz, and Y\. Dong \(2025b\) ProfBench: multi\-domain rubrics requiring professional knowledge to answer and judge\. arXiv preprint arXiv:2510\.18941\. [Link](https://arxiv.org/abs/2510.18941)
- S\. Xie, Z\. Liew, H\. Zhang, H\. Zhang, L\. Hu, Z\. Zhou, S\. Liu, and A\. Zeng \(2025\) Towards reliable evaluation of large language models for multilingual and multimodal E\-commerce applications\. arXiv preprint arXiv:2510\.20632\. [Link](https://arxiv.org/abs/2510.20632)
- D\. Yang and O\. Alonso \(2024\) A bespoke question intent taxonomy for E\-commerce\. In Proceedings of the SIGIR 2024 Workshop on eCommerce \(eCom’24\), Washington, DC, USA\. [Link](https://ceur-ws.org/Vol-3843/)
- Y\. Yang, W\. Wang, B\. Xu, W\. Fan, Q\. Zong, C\. Chan, Z\. Deng, X\. Liu, Y\. Gao, C\. Yu, C\. Luo, Y\. Li, Z\. Li, Q\. Yin, B\. Yin, and Y\. Song \(2025\) SessionIntentBench: a multi\-task inter\-session intention\-shift modeling benchmark for E\-commerce customer behavior understanding\. arXiv preprint arXiv:2507\.20185\. [Link](https://arxiv.org/abs/2507.20185)
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022\) WebShop: towards scalable real\-world web interaction with grounded language agents\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. [Link](https://arxiv.org/abs/2207.01206)

## Appendix A Taxonomy and Rubric Definitions

This appendix provides complete definitions for all taxonomy dimensions and rubric design principles used in ShoppingReasoningBench\.

### A\.1 Product Families

Hardlines: Durable goods including electronics, appliances, tools, sports equipment, furniture, and home improvement\.

Softlines: Apparel, footwear, accessories, textiles, and fashion items\.

Consumables: Food, beverages, health and beauty products, household supplies, and other consumable goods\.

Media: Books, music, movies, video games, and digital content\.

Mixed: Queries spanning multiple product families\.

### A\.2 Mission Types \(Multi\-Turn Only\)

Explore & Discover: Open\-ended shopping journeys where customers browse, learn about options, and gradually narrow their preferences over multiple turns\.

Compare & Choose: Focused comparison shopping between specific products or categories, leading to a selection decision\.

Find Specific Solution: Goal\-directed shopping for a particular need, problem, or use case with relatively clear requirements\.

### A\.3 Shopping Funnel Stages

Each turn in a multi\-turn mission is labeled with one of three funnel stages reflecting the customer’s shopping intent at that point in the conversation\. Percentages below are computed over all 1,764 multi\-turn turns\. At the mission level, the ordered sequence of per\-turn labels is stored as the shopping\_funnel\_flow array\.

Discover \(31\.4%\): Broadly exploring needs; the customer is in the early stage with undefined or loosely defined intent\.

Explore \(62\.9%\): Evaluating specific options; the customer has narrowed the space and is comparing or learning about particular products or categories\.

Ready\-to\-Transact \(5\.7%\): Finalizing a decision; the customer is close to or at the point of purchase\.
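
As a rough illustration of how the stage percentages relate to the stored flows, the sketch below pools the per\-turn labels from each mission’s `shopping_funnel_flow` array and reports the share of turns per stage\. The exact label strings and the `mission_id` key are assumptions about the released format, and the two toy missions are invented\.

```python
from collections import Counter

# Assumed label strings; the released data may encode stages differently.
STAGES = ("Discover", "Explore", "Ready-to-Transact")


def funnel_distribution(missions):
    """Percentage of turns carrying each funnel stage, pooled over all missions."""
    counts = Counter(
        stage for mission in missions for stage in mission["shopping_funnel_flow"]
    )
    total = sum(counts.values())
    return {stage: round(100 * counts[stage] / total, 1) for stage in STAGES}


# Two hypothetical four-turn missions.
missions = [
    {
        "mission_id": "m1",
        "shopping_funnel_flow": ["Discover", "Explore", "Explore", "Ready-to-Transact"],
    },
    {
        "mission_id": "m2",
        "shopping_funnel_flow": ["Discover", "Explore", "Explore", "Explore"],
    },
]
distribution = funnel_distribution(missions)
print(distribution)
```

Because the percentages are taken over turns rather than missions, longer missions contribute proportionally more to each stage’s share\.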

### A\.4 Reasoning Stage and Quality Dimensions

Table [7](https://arxiv.org/html/2606.12608#A1.T7) gives the full definitions of the six reasoning stages and six quality dimensions used to tag each of the 10,863 rubrics in ShoppingReasoningBench\.

Table 7: Definitions of the six reasoning stages and six quality dimensions used to tag each rubric\. Percentages give the share of ShoppingReasoningBench’s 10,863 rubrics carrying each tag\.

| Tag | Definition | % |
| --- | --- | --- |
| *Reasoning stage* | | |
| User Context | Interprets the customer’s situation, constraints, and intent | 7\.1 |
| Option Generation | Surfaces relevant products or categories | 21\.2 |
| Domain Expertise | Requires specialized product knowledge | 21\.6 |
| Feature Assessment | Evaluates specific attributes and specifications | 23\.3 |
| Trade Offs | Comparison and prioritization reasoning | 8\.0 |
| Actionability | Recommendations are concrete and usable | 18\.9 |
| *Reasoning quality* | | |
| Concreteness | Recommendations include specific, tangible details | 26\.0 |
| Relevance | Content addresses actual customer needs | 22\.3 |
| Completeness | All important aspects are covered | 20\.0 |
| Insightfulness | Expert\-level understanding is demonstrated | 15\.6 |
| Accuracy | Facts and specifications are correct | 9\.2 |
| Clarity | Information is well\-organized | 6\.8 |

### A\.5 Rubric Taxonomy

The four rubric tag dimensions are defined in §[3\.3](https://arxiv.org/html/2606.12608#S3.SS3); this subsection provides design principles, the stage–quality cross\-tabulation, and per\-category reasoning stage profiles\.

#### A\.5\.1 Rubric Design Principles

Drawing on operational experience from expert annotation across all five reasoning categories, we identify seven principles that characterize effective evaluation rubrics:

1. Clear & Unambiguous\. Each rubric uses precise language so that both human annotators and LLM judges reach the same pass/fail decision\. Vague terms like “good” or “appropriate” are replaced with specific, measurable criteria\.
2. Actionable\. Rubrics describe observable response behaviors rather than internal model states\. A judge can determine pass/fail by examining the response text alone\.
3. Comprehensive\. The rubric set for a query covers the reasoning stages that the query’s reasoning category demands; for example, Product Comparison rubrics emphasize feature assessment and trade\-off reasoning, while Shopping Guidance rubrics emphasize domain expertise and actionability\.
4. Aligned with System Capabilities\. Rubrics do not penalize responses for limitations outside the model’s control \(e\.g\., real\-time inventory checks\) and account for what a text\-based assistant can reasonably provide\.
5. Balanced\. Rubrics test both the presence of required information \(recall\) and the absence of harmful or irrelevant content \(precision\), avoiding over\-emphasis on either direction\.
6. Fair\. Multiple valid response strategies can satisfy the same rubric\. Rubrics avoid requiring a single “correct” phrasing or product ordering unless specificity is essential\.
7. Atomic\. Each rubric tests exactly one aspect of the response\. Compound criteria \(e\.g\., “Recommend a durable and affordable product”\) are split into separate rubrics\.

In addition, we enforce the following operational guidelines:

- Rubrics are ordered by importance, with required rubrics listed before optional ones\.
- Each rubric is written as a complete sentence describing the expected response behavior\.
- Queries contain between 2 and 10 rubrics\.
- Rubric language targets non\-expert comprehension to ensure consistent LLM judge interpretation\.
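
Several of these operational guidelines are mechanical enough to lint automatically\. The following sketch, which is not part of the authors’ tooling, checks the rubric count, the required\-before\-optional ordering, and a crude proxy for the complete\-sentence rule \(a trailing period\)\. The `rubric_text` and `importance` field names follow the annotation template; `check_rubric_set` and the sample rubrics are hypothetical\. Principles such as atomicity and fairness still need human review\.

```python
def check_rubric_set(rubrics, min_count=2, max_count=10):
    """Return a list of guideline violations for one query's ordered rubric list."""
    problems = []
    if not min_count <= len(rubrics) <= max_count:
        problems.append(f"expected {min_count}-{max_count} rubrics, got {len(rubrics)}")

    seen_optional = False
    for i, rubric in enumerate(rubrics):
        # Required rubrics must all come before the first optional rubric.
        if rubric["importance"] == "optional":
            seen_optional = True
        elif seen_optional:
            problems.append(f"rubric {i} is required but follows an optional rubric")
        # Crude proxy for "complete sentence": the text ends with a period.
        if not rubric["rubric_text"].strip().endswith("."):
            problems.append(f"rubric {i} does not end with a period")
    return problems


# Hypothetical rubric sets for one query.
good = [
    {"rubric_text": "The response recommends a waterproof jacket.", "importance": "required"},
    {"rubric_text": "The response mentions care instructions.", "importance": "optional"},
]
bad = [
    {"rubric_text": "The response mentions care instructions", "importance": "optional"},
    {"rubric_text": "The response recommends a waterproof jacket.", "importance": "required"},
]
print(check_rubric_set(good))
print(check_rubric_set(bad))
```

The first set passes cleanly; the second triggers both the ordering check and the trailing\-period check\.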

Table 8: Cross\-tabulation of reasoning stage and reasoning quality across all 10,863 rubrics\. Each cell shows the number of rubrics at the intersection\.

| Stage \\ Quality | Accuracy | Complete | Concrete | Relevance | Insightful | Clarity | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| User Context | 60 | 55 | 7 | 567 | 39 | 40 | 768 |
| Option Generation | 22 | 727 | 877 | 626 | 17 | 32 | 2,301 |
| Domain Expertise | 584 | 526 | 274 | 40 | 803 | 117 | 2,344 |
| Feature Assessment | 250 | 455 | 738 | 662 | 392 | 29 | 2,526 |
| Trade Offs | 33 | 201 | 167 | 152 | 228 | 92 | 873 |
| Actionability | 48 | 211 | 764 | 374 | 220 | 434 | 2,051 |
| Total | 997 | 2,175 | 2,827 | 2,421 | 1,699 | 744 | 10,863 |

#### A\.5\.2 Reasoning Stage Profiles by Category

Different reasoning categories produce distinct rubric distributions across the six reasoning stages\. [Table 9](https://arxiv.org/html/2606.12608#A1.T9) summarizes the stage profiles computed from all 10,863 published rubrics\. Across all categories, 84–87% of rubrics are marked required and 13–16% optional, reflecting the benchmark’s emphasis on must\-have evaluation criteria\.

##### Product Recommendation

\(4,649 rubrics\)\. Covers Constrained, Multi\-Product, and Open\-Ended subcategories\. Option generation \(32%\) and feature assessment \(24%\) dominate, reflecting the need to surface relevant product categories and evaluate their fit\. Actionability \(16%\) and domain expertise \(15%\) provide supporting depth\.

##### Shopping Guidance

\(2,995 rubrics\)\. Covers Decision\-Factor, Domain Knowledge, and Usage & Setup subcategories\. Domain expertise \(38%\) leads by a wide margin, followed by actionability \(27%\), consistent with queries that seek educational content and practical next steps rather than product lists\.

##### Product Comparison

\(1,195 rubrics\)\. Covers Product\-Level, Category\-Level, and Trade\-off Analysis subcategories\. Trade Offs \(36%\) is the single largest stage—uniquely high among all categories—paired with feature assessment \(22%\) and domain expertise \(20%\), capturing the structured comparative reasoning these queries require\.

##### Product Inquiry

\(1,057 rubrics\)\. Covers Feature & Spec, Compatibility, and Value & Market subcategories\. Feature assessment \(54%\) dominates, the highest single\-stage concentration in the benchmark, reflecting queries that target specific product attributes and specifications\.

##### Conversational Navigation

\(967 rubrics\)\. Covers Preference Refinement, Scope Expansion, and Decision Finalization subcategories\. User context \(15%\) is uniquely elevated for this category, while actionability \(25%\), option generation \(24%\), and feature assessment \(24%\) ensure multi\-turn coherence translates into concrete guidance\.

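Table 9: Reasoning stage distribution per category \(percentage of rubrics\) and importance split\.

| Category | N | User Context | Option Generation | Domain Expertise | Feature Assessment | Trade Offs | Actionability | Required % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Product Recommendation | 4,649 | 7 | 32 | 15 | 24 | 5 | 16 | 84 |
| Shopping Guidance | 2,995 | 5 | 14 | 38 | 11 | 5 | 27 | 84 |
| Product Comparison | 1,195 | 4 | 7 | 20 | 22 | 36 | 11 | 87 |
| Product Inquiry | 1,057 | 8 | 6 | 15 | 54 | 4 | 13 | 87 |
| Conversational Navigation | 967 | 15 | 24 | 9 | 24 | 3 | 25 | 87 |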
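
The stage\-by\-quality counts and the per\-category stage profiles are tallies over rubric tags, so they can be recomputed from any rubric dump that carries those tags\. The sketch below assumes each rubric record exposes `reasoning_category`, `reasoning_stage`, and `reasoning_quality` keys \(in the annotation templates the category is a turn\-level tag, so copying it onto each rubric is an assumption, as are the exact tag strings\)\. The four toy rubrics are invented\.

```python
from collections import Counter, defaultdict


def cross_tab(rubrics):
    """Count rubrics at each (reasoning_stage, reasoning_quality) intersection."""
    return Counter((r["reasoning_stage"], r["reasoning_quality"]) for r in rubrics)


def stage_profile(rubrics):
    """Per category: rounded percentage of rubrics tagged with each reasoning stage."""
    by_category = defaultdict(Counter)
    for r in rubrics:
        by_category[r["reasoning_category"]][r["reasoning_stage"]] += 1
    return {
        category: {
            stage: round(100 * n / sum(stages.values())) for stage, n in stages.items()
        }
        for category, stages in by_category.items()
    }


# Hypothetical tagged rubrics for a single category.
rubrics = [
    {"reasoning_category": "Product Inquiry", "reasoning_stage": "feature_assessment", "reasoning_quality": "accuracy"},
    {"reasoning_category": "Product Inquiry", "reasoning_stage": "feature_assessment", "reasoning_quality": "concreteness"},
    {"reasoning_category": "Product Inquiry", "reasoning_stage": "feature_assessment", "reasoning_quality": "accuracy"},
    {"reasoning_category": "Product Inquiry", "reasoning_stage": "actionability", "reasoning_quality": "concreteness"},
]
table = cross_tab(rubrics)
profile = stage_profile(rubrics)
print(table)
print(profile)
```

Rounding each share independently means a category’s percentages need not sum to exactly 100\.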

## Appendix B Expert Panel, Authoring, and Annotation

### B\.1 Expert panel and authoring

We recruited retail domain experts, each with product knowledge spanning at least three categories\. Selection criteria were \(i\) ability to produce accurate product recommendations grounded in technical product attributes, and \(ii\) familiarity with real customer shopping patterns across multiple price points and use cases\. The panel collectively covers the five product families\.

Rather than writing evaluation criteria directly, each expert follows a structured reasoning process that decomposes ambiguous customer needs into specific technical considerations \(fit, materials, compatibility, trade\-offs\) before deriving concrete rubric criteria that any adequate response must address \(Figure [1](https://arxiv.org/html/2606.12608#S3.F1)\)\. This decomposition makes explicit the domain knowledge that distinguishes expert shopping assistance from keyword matching: a “best trail runners for backpacking” query is not answered by retrieving popular trail\-runner SKUs but by reasoning about dual\-use load\-bearing requirements, midsole stiffness, and the trade\-off between trail agility and backpacking support\.

![Refer to caption](https://arxiv.org/html/2606.12608v1/x3.png)

Figure 3: Expert domain coverage across the five product families in ShoppingReasoningBench, with representative subtopics illustrating the breadth of retail expertise required\.

### B\.2 Annotation Guidelines & Templates

This subsection describes the annotation guidelines provided to expert annotators\.

#### B\.2\.1 Single\-Turn Annotation Template

Each single\-turn annotation consists of the following fields:

1. Query: A natural language shopping question or request that a customer might ask a conversational assistant\.
2. Rubric Dimensions: A list of atomic, binary criteria, each with:
   - rubric\_text: A clear, verifiable statement of what the response should include\.
   - importance: required \(must be satisfied\) or optional \(bonus quality\)\.
   - scope: instance \(specific to this query\) or cluster \(category\-level\)\.
   - reasoning\_stage: The expert reasoning phase the rubric tests \(e\.g\., user\_context, option\_generation, actionability, trade\_offs\)\.
   - reasoning\_quality: The quality dimension the rubric targets \(e\.g\., relevance, insightfulness, completeness, accuracy\)\.
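
One possible in\-memory representation of this template is sketched below with Python dataclasses\. Field names mirror the template; the class names, the validation in `__post_init__`, the `domain_expertise` tag string, and the sample rubric texts are hypothetical and not taken from the released data\.

```python
from dataclasses import dataclass, field
from typing import List

IMPORTANCE = {"required", "optional"}
SCOPE = {"instance", "cluster"}


@dataclass
class Rubric:
    rubric_text: str
    importance: str         # "required" or "optional"
    scope: str              # "instance" or "cluster"
    reasoning_stage: str    # e.g. "user_context", "trade_offs"
    reasoning_quality: str  # e.g. "relevance", "accuracy"

    def __post_init__(self):
        # Reject values outside the two closed vocabularies named in the template.
        if self.importance not in IMPORTANCE:
            raise ValueError(f"unknown importance: {self.importance!r}")
        if self.scope not in SCOPE:
            raise ValueError(f"unknown scope: {self.scope!r}")


@dataclass
class SingleTurnAnnotation:
    query: str
    rubrics: List[Rubric] = field(default_factory=list)

    def required(self):
        """Rubrics that every adequate response must satisfy."""
        return [r for r in self.rubrics if r.importance == "required"]


# Hypothetical annotation; the rubric texts are invented for illustration.
annotation = SingleTurnAnnotation(
    query="best trail runners for backpacking",
    rubrics=[
        Rubric("The response addresses midsole stiffness under load.",
               "required", "instance", "domain_expertise", "insightfulness"),
        Rubric("The response suggests compatible gaiters.",
               "optional", "instance", "option_generation", "relevance"),
    ],
)
print(len(annotation.required()))
```

Validating the two closed vocabularies at construction time catches mistyped tags early; the stage and quality tags are left as free strings here because only examples of their values are listed\.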

#### B\.2\.2 Multi\-Turn Mission Template

Each multi\-turn mission consists of:

1. Mission Tags: Metadata including mission ID, name, type, objective, product family, and length\.
2. Turn Sequence: An ordered list of customer utterances representing a realistic shopping conversation flow\.
3. Per\-Turn Annotations: For each turn:
   - Turn\-level tags: reasoning\_category, reasoning\_subcategory, shopping\_funnel\_stage\.
   - Rubric dimensions specific to the turn’s expected response, each carrying the four LLM\-assigned tag dimensions: scope, importance, reasoning\_stage, reasoning\_quality\.

#### B\.2\.3 Rubric Writing Guidelines

Annotators were instructed to:

- Write rubrics that are atomic: each rubric tests exactly one aspect\.
- Ensure rubrics are objectively verifiable: an LLM judge should be able to determine pass/fail\.
- Avoid rubrics that are too vague \(e\.g\., “Response is helpful”\) or too specific \(e\.g\., “Response contains exactly 5 product recommendations”\)\.
- Tag rubrics as required if failing them would make the response fundamentally inadequate, and optional if they represent desirable but non\-essential qualities\.
- Aim for 2–7 rubrics per single\-turn query and 3–11 per multi\-turn turn\.
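The count guideline in the last bullet is easy to check mechanically. A small sketch follows, assuming the ranges are treated as soft targets rather than hard constraints (the guideline says "aim for"); the function and constant names are hypothetical.

```python
# Target rubric counts from the writing guidelines (inclusive ranges).
SINGLE_TURN_RANGE = (2, 7)
MULTI_TURN_RANGE = (3, 11)


def rubric_count_in_target(n_rubrics: int, multi_turn: bool) -> bool:
    """Return True if a query or turn has a rubric count inside the target range."""
    low, high = MULTI_TURN_RANGE if multi_turn else SINGLE_TURN_RANGE
    return low <= n_rubrics <= high
```

For instance, a single-turn query with four rubrics and a multi-turn turn with five rubrics both fall inside their target ranges.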

### B\.3 Judge validation protocol

##### Model and sample\.

Judge validation is conducted on responses produced by an evaluated model, Gemini 2\.5 Pro\. A two\-stage stratified sampling protocol draws 124 single\-turn queries and 30 multi\-turn missions, yielding 1,457 rubric\-level validation instances that span all five reasoning categories and both benchmark splits\.

##### Annotator assignment and blinding\.

Each instance is labeled independently by two experts who are blinded to \(i\) the identity of the model that produced the response, \(ii\) the LLM judge’s label, and \(iii\) each other’s label\. The mission\-owner expert’s labels serve as ground truth; the second expert’s labels provide an inter\-expert reference\.

##### Reference interpretation\.

At the rubric level, inter\-expert agreement defines a *ceiling*: the maximum binary agreement attainable given inherent annotator subjectivity\. At the aggregate level, inter\-expert correlation defines a *baseline*: it quantifies how well *any* rubric\-based aggregation predicts the mission\-owner’s holistic Likert rating, since the second expert also works from rubrics rather than holistic impression\.

##### Metrics\.

The primary rubric\-level metric is macro\-F1 \(MF1\), defined as the unweighted average of F1 on the *met* class and F1 on the *not\-met* class:

$$
\mathrm{MF1} = \tfrac{1}{2}\bigl(F1_{\text{met}} + F1_{\text{not-met}}\bigr), \quad \text{where } F1_{c} = \frac{2\,TP_{c}}{2\,TP_{c} + FP_{c} + FN_{c}}. \tag{2}
$$

This choice ensures equal weight to both classes despite ShoppingReasoningBench’s pass\-heavy distribution \(71\.9% *met* in the validation sample\)\. As a secondary metric we report Cohen’s $\kappa$ \(Cohen, [1960](https://arxiv.org/html/2606.12608#bib.bib48)\), which adjusts for chance agreement\. At the aggregate level, we report Spearman’s $\rho$ between the judge’s importance\-weighted pass rates and the mission\-owner expert’s 1–5 Likert ratings, computed separately at response level \($n=305$\) and mission level \($n=30$\)\.
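The two rubric-level metrics can be computed in a few lines. The sketch below is a plain-Python illustration rather than the authors' evaluation code; it takes the expert labels and judge labels as equal-length lists of booleans (`True` for *met*), and the function names and toy labels are chosen here for illustration.

```python
def f1_for_class(gold, pred, cls):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred):
    """Unweighted mean of F1 on the met (True) and not-met (False) classes."""
    return 0.5 * (f1_for_class(gold, pred, True) + f1_for_class(gold, pred, False))


def cohens_kappa(gold, pred):
    """Chance-corrected agreement between two binary label sequences."""
    n = len(gold)
    p_observed = sum(g == p for g, p in zip(gold, pred)) / n
    p_gold_met = sum(gold) / n
    p_pred_met = sum(pred) / n
    p_chance = p_gold_met * p_pred_met + (1 - p_gold_met) * (1 - p_pred_met)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Toy labels: the judge misses one "met" rubric out of four instances.
expert = [True, True, True, False]
judge = [True, True, False, False]
mf1 = macro_f1(expert, judge)
kappa = cohens_kappa(expert, judge)
```

On the toy labels, F1 is 0.8 on the *met* class and 2/3 on the *not-met* class, so MF1 is about 0.733, and Cohen's kappa is 0.5.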

##### Coverage note\.

The judge applies the full rubric set to every response, whereas each human expert annotates only their assigned subset\. This asymmetry explains why the judge slightly exceeds the inter\-expert baseline for rank correlation: the judge aggregates over more rubric judgments per response than any single expert\.

## Appendix C Inference and Judge Parameters

This appendix documents the inference parameters for evaluated models and the judge model used in ShoppingReasoningBench\. The nine evaluated models \(three families, three capability tiers each\) are listed in [Table 3](https://arxiv.org/html/2606.12608#S5.T3)\. For the Claude family, Opus 4\.7 is used at the frontier tier while the mid and small tiers use Sonnet 4\.5 and Haiku 4\.5, as no 4\.7\-generation models are available at those tiers\.

##### Generation parameters\.

All nine models, spanning the GPT\-5\.4 family \(OpenAI, [2026](https://arxiv.org/html/2606.12608#bib.bib43)\), the Claude family \(Anthropic, [2026](https://arxiv.org/html/2606.12608#bib.bib44)\), and the Gemini family \(Google DeepMind, [2026](https://arxiv.org/html/2606.12608#bib.bib45)\), are evaluated at temperature 1\.0 \(the API default for all providers\), with top\-$p$ and maximum output tokens left at API defaults\. Each model generates one response per query \(single\-turn\) or per turn \(multi\-turn\)\. All models have web search enabled via each provider’s native tool integration \(OpenAI web search, Google grounding, Anthropic web search\)\. The model decides autonomously whether to invoke search on each turn; no forced\-search or no\-search constraint is applied\.

##### Judge model\.

ShoppingReasoningBench uses Claude Sonnet 4\.5 as the single LLM judge at temperature 0, producing one binary pass/fail decision with a brief rationale per rubric\. For single\-turn queries, the judge receives the query, model response, and rubric text\. For multi\-turn evaluation, it additionally receives the full conversation history through the current turn\. The prompt templates and output schema are documented in [Appendix D](https://arxiv.org/html/2606.12608#A4)\. Self\-judging bias toward the Claude\-family evaluated models is an inherent risk of this design; a cross\-judge comparison in [Section E\.4](https://arxiv.org/html/2606.12608#A5.SS4) finds no self\-preference effect on rankings\.

## Appendix D LLM Judge Prompt

ShoppingReasoningBench uses a single prompt template for both single\-turn and multi\-turn evaluation\. For single\-turn queries, the conversation history field is left blank\. For multi\-turn evaluation, it contains all prior turns in chronological order\. Figure [4](https://arxiv.org/html/2606.12608#A4.F4) reproduces the prompt verbatim; five worked examples are omitted for space but are included in the released evaluation code\.

Your task is to evaluate the assistant's latest response in the \*\*current conversation\*\* between the user and the AI shopping assistant based on the given \*\*rubric\*\* to determine whether it meets the rubric's requirements\.

During the evaluation, you may refer to the \*\*conversation history\*\* if the rubric requires context from earlier turns\.

\#\# Key Information to Focus On

\#\#\# Current Conversation

<<current\_conversation\>\>

\#\#\# Rubric

<<rubric\_text\>\>

\#\#\# Reference Evidence

\- \*\*Conversation history\*\*

<<conversation\_history\>\>

\-\-\-

\#\# Instructions

\#\#\# Evaluation Scope

\- \*\*PRIMARY FOCUS\*\*: The assistant's latest response in the "Current Conversation"\.

\- \*\*EVALUATION BASIS\*\*: Determine whether the assistant's latest response satisfies ALL criteria specified in the "Rubric"\.

\- \*\*SCOPE LIMITATION\*\*: Only evaluate what is explicitly required by the rubric, do NOT add or consider additional criteria\.

\#\#\# When to Use Reference Evidence

\- \*\*General principle\*\*:

\- Only reference the conversation history when the rubric explicitly requires verification against earlier turns\.

\- If the rubric can be evaluated solely from the current conversation, consulting conversation history is optional\.

\- \*\*If the conversation history is blank\*\*:

\- This indicates the rubric can be evaluated purely from the current conversation\.

\- Evaluate based solely on the assistant's response content and structure\.

\#\#\# Return a json object with the following fields: "explanation" and "rubric\_met"

\- The "explanation" field should be a string explaining why the response does or does not meet the rubric\.

\- The "rubric\_met" field should be a boolean indicating whether the response meets the rubric\.

\- If rubric has multiple requirements connected by "and": ALL must be met for true\. If rubric has requirements connected by "or": ANY ONE met results in true\. If rubric asks for "at least X": meeting or exceeding X results in true\.

\- One important exception to the above bullet point is that if a rubric says "such as", "for example", "including", or "e\.g\.", the rubric does not have to include all of the examples listed to meet the rubric\.

\- For rubrics about avoiding behaviors: the response should be classified as true if it successfully avoids the undesirable behavior, and false if it exhibits the undesirable behavior\.

\#\#\# Edge Cases

\- If the assistant's response is empty or error message: rubric\_met=false\.

\-\-\-

\# Examples

Five worked examples omitted for space\. Examples cover:

\(1\) cross\-turn duplication detection

\(2\) missing concrete details

\(3\) context\-awareness failure

\(4\) incorrect product identification from history

\(5\) successful workflow explanation\.

\# Output Format

\- Return just the JSON object in markdown format\. Do not include any other text in the response\.

\- "explanation": String, 1\-3 sentences focusing on WHY the rubric is/isn't met

\- "rubric\_met": Boolean \(true/false only\)

Figure 4: Full LLM judge prompt used by ShoppingReasoningBench\. The <<…\>\> tokens are filled at evaluation time with the current turn, rubric text, and \(for multi\-turn\) conversation history through the preceding turn\.
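Filling the template and reading back the judge's verdict can be sketched as follows. This is an illustration under stated assumptions, not the released evaluation code: the helper names are hypothetical, the call to the judge model is omitted, and "JSON object in markdown format" is interpreted here as a reply that may or may not be wrapped in a fenced `json` block.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the listing readable
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, flags=re.DOTALL)


def fill_prompt(template, current_conversation, rubric_text, conversation_history=""):
    """Substitute the three <<...>> placeholders; history stays blank for single-turn."""
    return (
        template
        .replace("<<current_conversation>>", current_conversation)
        .replace("<<rubric_text>>", rubric_text)
        .replace("<<conversation_history>>", conversation_history)
    )


def parse_judge_output(raw):
    """Extract the explanation and the boolean verdict from a judge reply."""
    match = FENCED_JSON.search(raw)
    payload = match.group(1) if match else raw.strip()
    data = json.loads(payload)
    if not isinstance(data, dict) or not isinstance(data.get("rubric_met"), bool):
        raise ValueError("judge reply must be an object with a boolean 'rubric_met'")
    return {
        "explanation": str(data.get("explanation", "")),
        "rubric_met": data["rubric_met"],
    }


# Usage with a stand-in template and a stand-in judge reply.
toy_template = "Conversation: <<current_conversation>> | Rubric: <<rubric_text>> | History: <<conversation_history>>"
prompt = fill_prompt(toy_template, "user: hi", "Greets the customer.")
reply = FENCE + 'json\n{"explanation": "Greets politely.", "rubric_met": true}\n' + FENCE
verdict = parse_judge_output(reply)
```

Rejecting replies whose `rubric_met` is not a strict boolean avoids silently counting strings such as `"true"` as passes; how the released code handles malformed replies is not specified in the paper.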
## Appendix E Extended Results

This appendix extends the main evaluation \(§ [5](https://arxiv.org/html/2606.12608#S5)\) with a within\-stage breakdown of the required–optional gap, per\-family breakdowns by reasoning stage and quality, and multi\-turn degradation for all nine models\. Per\-category and per\-mission\-type breakdowns for the three frontier models are reported in [Table 4](https://arxiv.org/html/2606.12608#S5.T4)\.

### E\.1 Required–Optional Gap by Reasoning Stage

A natural concern is whether optional rubrics are simply harder because they test more demanding reasoning stages\. Table [10](https://arxiv.org/html/2606.12608#A5.T10) breaks down the required–optional gap by reasoning stage, pooling rubric judgments from all nine standard models on multi\-turn missions\. The gap persists within every stage, ranging from −17\.2 points \(Feature Assessment\) to −22\.5 points \(Domain Expertise\), indicating that the gap is not an artifact of stage difficulty\. The widest gaps appear in Domain Expertise \(−22\.5 points\) and Actionability \(−20\.7 points\), while Feature Assessment shows the narrowest \(−17\.2 points\)\.

Table 10: Required vs\. optional pass rates by reasoning stage \(MT, pooled across 9 models\)\.

| Stage | Required % | Optional % |
| --- | --- | --- |
| User Context | 79\.1 | 60\.8 |
| Trade Offs | 72\.4 | 55\.0 |
| Option Generation | 69\.9 | 51\.2 |
| Domain Expertise | 70\.3 | 47\.8 |
| Feature Assessment | 69\.8 | 52\.6 |
| Actionability | 67\.2 | 46\.5 |
| Overall | 70\.4 | 49\.7 |

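The per-stage gaps quoted in this subsection follow directly from Table 10. A minimal sketch that recomputes them, with the values transcribed from the table and the variable names chosen here for illustration:

```python
# (required %, optional %) per reasoning stage, transcribed from Table 10.
TABLE_10 = {
    "User Context": (79.1, 60.8),
    "Trade Offs": (72.4, 55.0),
    "Option Generation": (69.9, 51.2),
    "Domain Expertise": (70.3, 47.8),
    "Feature Assessment": (69.8, 52.6),
    "Actionability": (67.2, 46.5),
}

# Gap = optional minus required, in percentage points (negative means optional is lower).
gaps = {
    stage: round(optional - required, 1)
    for stage, (required, optional) in TABLE_10.items()
}

widest = min(gaps, key=gaps.get)     # most negative gap
narrowest = max(gaps, key=gaps.get)  # least negative gap

for stage, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{stage:<20} {gap:+.1f}")
```

Every stage shows a negative gap, which is the pattern the text relies on when arguing that the required–optional difference is not driven by a single hard stage.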
### E\.2 Pass Rates by Reasoning Stage and Quality

Table [11](https://arxiv.org/html/2606.12608#A5.T11) reports weighted pass rates by reasoning stage and quality for the frontier model of each family, separately for single\-turn \(ST\) and multi\-turn \(MT\) missions\. Reasoning stage and quality are rubric\-level tags, so each rubric within a turn may carry a different tag\. Among reasoning stages, Actionability is lower in MT than in ST for all three families \(e\.g\., GPT\-5\.4: 82\.6 ST vs\. 61\.8 MT; Claude Opus 4\.7: 87\.0 ST vs\. 68\.7 MT\)\. Among quality dimensions, Insightfulness is the lowest\-scoring dimension in ST for all three frontier models; in MT it remains the lowest only for GPT\-5\.4\.

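Table 11: ST vs\. MT weighted pass rate \(%\) by reasoning stage and quality for the frontier models\.

| Dimension | GPT\-5\.4 ST | GPT\-5\.4 MT | Opus 4\.7 ST | Opus 4\.7 MT | Gem\. 3\.1 Pro ST | Gem\. 3\.1 Pro MT |
| --- | --- | --- | --- | --- | --- | --- |
| *By reasoning stage* | | | | | | |
| User Context | 75\.4 | 79\.0 | 73\.9 | 81\.8 | 79\.7 | 82\.5 |
| Option Generation | 77\.8 | 68\.3 | 82\.9 | 78\.1 | 80\.4 | 73\.1 |
| Domain Expertise | 60\.9 | 63\.8 | 71\.9 | 77\.8 | 73\.3 | 76\.6 |
| Feature Assessment | 59\.6 | 67\.4 | 63\.2 | 77\.1 | 70\.4 | 78\.7 |
| Trade Offs | 58\.2 | 74\.7 | 68\.7 | 79\.4 | 58\.2 | 73\.1 |
| Actionability | 82\.6 | 61\.8 | 87\.0 | 68\.7 | 73\.9 | 66\.2 |
| *By reasoning quality* | | | | | | |
| Accuracy | 61\.1 | 69\.5 | 66\.7 | 78\.9 | 75\.6 | 75\.5 |
| Clarity | — | 77\.4 | — | 75\.9 | — | 78\.9 |
| Completeness | 71\.9 | 67\.6 | 75\.6 | 82\.4 | 75\.2 | 70\.9 |
| Concreteness | 60\.3 | 61\.6 | 82\.2 | 78\.8 | 75\.3 | 75\.5 |
| Insightfulness | 49\.7 | 60\.1 | 64\.8 | 72\.6 | 66\.0 | 73\.4 |
| Relevance | 72\.1 | 74\.2 | 71\.6 | 69\.3 | 74\.2 | 74\.6 |

### E\.3 Multi\-Turn Performance Degradation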

![Refer to caption](https://arxiv.org/html/2606.12608v1/x4.png)

Figure 5: Per\-turn pass rate for the three frontier models across turns T1 through T7\.

Table 12: First\-turn vs\. last\-turn weighted pass rate \(%\) on multi\-turn missions\. For each of the 293 missions the first\-turn and last\-turn weighted pass rates are computed; the drop is averaged across all missions\.

| Family | Model | First\-turn | Last\-turn |
| --- | --- | --- | --- |
| GPT | GPT\-5\.4 | 77\.4 | 67\.1 |
| GPT | GPT\-5\.4 mini | 71\.3 | 63\.6 |
| GPT | GPT\-5\.4 nano | 72\.0 | 53\.8 |
| Claude | Claude Opus 4\.7 | 78\.5 | 74\.0 |
| Claude | Claude Sonnet 4\.5 | 73\.2 | 68\.6 |
| Claude | Claude Haiku 4\.5 | 63\.0 | 57\.0 |
| Gemini | Gemini 3\.1 Pro | 81\.5 | 74\.2 |
| Gemini | Gemini 3 Flash | 80\.2 | 70\.3 |
| Gemini | Gemini 3\.1 Flash\-Lite | 76\.9 | 70\.2 |

Every model scores lower on the final turn of a mission than on the first turn \([Figure 5](https://arxiv.org/html/2606.12608#A5.F5), [Table 12](https://arxiv.org/html/2606.12608#A5.T12)\)\. The Claude family degrades least \(4\.5–6\.0 points\), while GPT\-5\.4 nano is an outlier whose last\-turn quality collapses by over 18 points\. Within the GPT family, the frontier model degrades more than the mid\-tier model, though this pattern does not hold across all families\.
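The first-turn versus last-turn comparison can be sketched as a per-mission drop averaged over missions, as the Table 12 caption describes. The function name and the toy per-turn rates below are illustrative assumptions; the second half simply differences the published first-turn and last-turn columns.

```python
from statistics import mean


def mean_first_last_drop(per_mission_turn_rates):
    """Average (first-turn minus last-turn) weighted pass rate over missions.

    per_mission_turn_rates: one list of per-turn pass rates (%) per mission.
    """
    drops = [
        rates[0] - rates[-1]
        for rates in per_mission_turn_rates
        if len(rates) >= 2  # a drop is only defined for multi-turn missions
    ]
    if not drops:
        raise ValueError("need at least one mission with two or more turns")
    return mean(drops)


# Toy data (not benchmark results): three missions with 3, 2 and 4 turns.
toy_missions = [[80.0, 75.0, 70.0], [90.0, 85.0], [60.0, 62.0, 58.0, 50.0]]
toy_drop = mean_first_last_drop(toy_missions)  # drops are 10, 5 and 10 points

# Differences of the (first-turn, last-turn) columns reported in Table 12.
TABLE_12 = {
    "GPT-5.4": (77.4, 67.1),
    "GPT-5.4 mini": (71.3, 63.6),
    "GPT-5.4 nano": (72.0, 53.8),
    "Claude Opus 4.7": (78.5, 74.0),
    "Claude Sonnet 4.5": (73.2, 68.6),
    "Claude Haiku 4.5": (63.0, 57.0),
    "Gemini 3.1 Pro": (81.5, 74.2),
    "Gemini 3 Flash": (80.2, 70.3),
    "Gemini 3.1 Flash-Lite": (76.9, 70.2),
}
table_drops = {
    model: round(first - last, 1) for model, (first, last) in TABLE_12.items()
}
```

Differencing the table columns reproduces the ranges quoted in the text: 4.5 to 6.0 points for the Claude family and 18.2 points for GPT-5.4 nano.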

### E\.4 Cross\-Judge Validation

To assess whether self\-preference bias affects the reported rankings, we re\-evaluate the three frontier models with DeepSeek V3\.2 \(DeepSeek AI, [2025](https://arxiv.org/html/2606.12608#bib.bib46)\) as an alternative judge\. DeepSeek is unrelated to any of the three evaluated model families, eliminating potential same\-family bias in either direction\.

Table 13: Cross\-judge comparison on the three frontier models\. Weighted pass rate \(%\) under two judges\. Rankings are preserved across judges\.

| Judge | Model | ST | MT | Overall |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4\.5 | GPT\-5\.4 | 69\.2 | 71\.0 | 70\.2 |
| Claude Sonnet 4\.5 | Claude Opus 4\.7 | 75\.1 | 78\.5 | 77\.0 |
| Claude Sonnet 4\.5 | Gemini 3\.1 Pro | 76\.5 | 77\.7 | 77\.2 |
| DeepSeek V3\.2 | GPT\-5\.4 | 74\.8 | 82\.1 | 78\.9 |
| DeepSeek V3\.2 | Claude Opus 4\.7 | 80\.3 | 89\.3 | 85\.4 |
| DeepSeek V3\.2 | Gemini 3\.1 Pro | 81\.9 | 87\.9 | 85\.3 |

Table [13](https://arxiv.org/html/2606.12608#A5.T13) shows that the two judges produce the same relative picture: Claude Opus 4\.7 and Gemini 3\.1 Pro achieve comparable performance at the top \(within 0\.2 points of each other overall under either judge\), while GPT\-5\.4 trails both\. DeepSeek is a more lenient grader overall \(overall scores 8\.1 to 8\.7 points higher\), but this shift is nearly uniform across all three evaluated models and does not alter the ranking\. These results show no evidence of self\-preference bias in the reported evaluation\.
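The cross-judge comparison reduces to two checks on the Overall column of Table 13: how much each model's score shifts between judges, and whether the same model trails under both. A minimal sketch with the values transcribed from the table; the variable names are chosen here for illustration.

```python
# Overall weighted pass rate (%) per judge and evaluated model, from Table 13.
TABLE_13_OVERALL = {
    "Claude Sonnet 4.5": {"GPT-5.4": 70.2, "Claude Opus 4.7": 77.0, "Gemini 3.1 Pro": 77.2},
    "DeepSeek V3.2": {"GPT-5.4": 78.9, "Claude Opus 4.7": 85.4, "Gemini 3.1 Pro": 85.3},
}

sonnet = TABLE_13_OVERALL["Claude Sonnet 4.5"]
deepseek = TABLE_13_OVERALL["DeepSeek V3.2"]

# Leniency shift per model: DeepSeek score minus Sonnet score.
shift = {model: round(deepseek[model] - sonnet[model], 1) for model in sonnet}

# Which model trails under each judge, and how far apart the top two are.
trailing = {judge: min(scores, key=scores.get) for judge, scores in TABLE_13_OVERALL.items()}
top_two_gap = {
    judge: round(abs(scores["Claude Opus 4.7"] - scores["Gemini 3.1 Pro"]), 1)
    for judge, scores in TABLE_13_OVERALL.items()
}
```

The shifts are 8.7, 8.4 and 8.1 points, GPT-5.4 trails under both judges, and the top two models are separated by at most 0.2 points, so their relative order should be read as a tie rather than a ranking.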

### E\.5 System Prompt Ablation

The main evaluation uses each model’s default behavior without a task\-specific system prompt\. To explore whether a shopping\-domain system prompt affects performance, we test the three frontier models with a system prompt that defines general conditions for behaving as a shopping assistant\. Results are mixed \([Table 14](https://arxiv.org/html/2606.12608#A5.T14)\): GPT\-5\.4 and Claude Opus 4\.7 improve \(\+2\.2 and \+2\.4 points overall, respectively\), while Gemini 3\.1 Pro decreases by 1\.8 points\. The divergent response suggests that base model behaviors differ in ways that interact with prompt conditioning, and that model\-specific system prompt development may be needed to achieve optimal shopping assistant performance\.

Table 14: System prompt ablation for the three frontier models\. $\Delta$ is the difference relative to the default \(no system prompt\) condition\.

| Model | Default | Sysprompt | $\Delta$ |
| --- | --- | --- | --- |
| GPT\-5\.4 | 70\.2 | 72\.4 | \+2\.2 |
| Claude Opus 4\.7 | 77\.0 | 79\.4 | \+2\.4 |
| Gemini 3\.1 Pro | 77\.2 | 75\.4 | −1\.8 |

You are an expert shopping assistant\. Your role is to help customers find the right products by combining deep product knowledge with careful attention to their specific situation\.

How to reason about queries:

Before responding, identify what the customer actually needs \-\- not just what they asked for\. Infer constraints from context: their use case, environment, experience level, timeline, and budget signals\. Work with what you have and give your best recommendation based on the information available\.

How to respond:

\- Be concrete\. Name specific products, brands, models, and relevant specs\. Avoid generic advice that could apply to any product\.

\- Commit to recommendations\. Use your expertise to make judgment calls rather than deferring decisions back to the customer\.

\- Explain the why behind recommendations\. Connect product features to the customer's actual use case and constraints\.

\- When multiple options exist, present them with clear trade\-offs so the customer can make an informed choice\.

\- Match response depth to query scope\. A narrow question gets a focused answer\. A broad discovery question gets structured categories with examples\.

\- When domain knowledge is relevant \(how a technology works, what makes a material durable, why a spec matters\), weave it in naturally to help the customer understand their options\.

In multi\-turn conversations:

\- Track the customer's evolving preferences and constraints across turns\. Don't re\-explain what's already been established\.

\- Build on prior context\. If the customer narrows their interest, go deeper on that path rather than restating the full option space\.

\- When the customer changes direction or asks to reconsider, acknowledge their current position before exploring alternatives\.

\- If the customer is close to a decision, help them finalize with confidence \-\- address remaining concerns, suggest complementary items, or validate their choice\.

Figure 6: System prompt used in the system prompt ablation\. This prompt is prepended as the system message for all three frontier models in the ablation condition\.

## Appendix F Benchmark Variants

We release two variants of the benchmark:

- ShoppingReasoningBench\-Full \(525 missions, 10,863 rubrics\): The complete benchmark, comprising 232 single\-turn queries and 293 multi\-turn missions \(1,764 turns\) across five product families\.
- ShoppingReasoningBench\-Hard \(108 missions, 1,663 rubrics\): A subset of ShoppingReasoningBench\-Full containing missions where the nine\-model average weighted pass rate falls below 60%\. This yields 69 single\-turn queries and 39 multi\-turn missions \(304 turns\), representing the missions that current models collectively struggle with\. The Hard variant is designed for tracking progress on the most demanding shopping reasoning problems\.

Table 15: Summary of the two ShoppingReasoningBench variants\.

| Variant | Missions \(ST / MT\) | Turns | Rubrics |
| --- | --- | --- | --- |
| Full | 232 / 293 | 1,996 | 10,863 |
| Hard | 69 / 39 | 304 | 1,663 |

##### Selection criteria for ShoppingReasoningBench\-Hard\.

For each mission, we compute the weighted pass rate \(Eq\.[1](https://arxiv.org/html/2606.12608#S4.E1)\) per model, average across all nine evaluated models, and select missions where this average falls below 60%\. The threshold is applied uniformly across both splits\. All five reasoning categories, all fifteen subcategories, all five product families, all six reasoning stages, and all six reasoning quality dimensions are represented in the Hard subset\.
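The selection rule can be sketched as a filter over per-mission, per-model weighted pass rates. The sketch assumes those rates have already been computed with Eq. 1 (not reproduced here); the function name, the model labels and the toy numbers are hypothetical.

```python
HARD_THRESHOLD = 60.0  # percent; missions strictly below this average are kept


def select_hard_missions(pass_rates, threshold=HARD_THRESHOLD):
    """Return IDs of missions whose cross-model mean weighted pass rate is below threshold.

    pass_rates: mission_id -> {model_name: weighted pass rate in %}
    """
    hard = []
    for mission_id, per_model in pass_rates.items():
        average = sum(per_model.values()) / len(per_model)
        if average < threshold:
            hard.append(mission_id)
    return sorted(hard)


# Toy input with three placeholder models (the benchmark averages over nine).
toy_rates = {
    "st-1": {"model_a": 55.0, "model_b": 62.0, "model_c": 58.0},   # mean 58.33
    "st-2": {"model_a": 60.0, "model_b": 60.0, "model_c": 60.0},   # mean 60.00
    "mt-1": {"model_a": 40.0, "model_b": 75.0, "model_c": 50.0},   # mean 55.00
    "mt-2": {"model_a": 90.0, "model_b": 85.0, "model_c": 80.0},   # mean 85.00
}
hard_ids = select_hard_missions(toy_rates)
```

Because the rule averages across models before thresholding, a mission that one model handles well can still be selected if the others struggle (as with `mt-1` in the toy input), and a mission sitting exactly at 60% is excluded.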

## Appendix G Data Format and Examples

This appendix provides example benchmark data in the released JSON format\.

### G\.1 Single\-Turn Example

##### Example: Long\-Haul Flight Headphones \(st\-10\)\.

A single\-turn mission with four rubrics \(three required, one optional\), all testing feature assessment\.

\{

"mission\_id":"st\-10",

"mission\_name":"Long\-Haul Flight Headphones",

"mission\_type":"Find Specific Solution",

"mission\_objective":"Customer is shopping for comfortable, long\-battery\-life, noise\-canceling headphones for a 14 hour flight\.",

"product\_family":"Hardlines",

"time\_sensitive":"Yes",

"shopping\_funnel\_flow":\["Discover"\],

"turns":\[

\{

"reasoning\_category":"Product Recommendation",

"reasoning\_subcategory":"Constrained Recommendation",

"shopping\_funnel\_stage":"Discover",

"messages":\[

\{"role":"user","content":"I need a pair of headphones for a 14 hour flight"\}

\],

"rubrics":\[

\{

"text":"Discuss headphones that are comfortable enough for the customer to wear for 10\+ hours\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"feature\_assessment",

"reasoning\_quality":"relevance"

\},

\{

"text":"Discuss headphones with battery life that will last 10\+ hours with noise cancelation and have quick charging\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"feature\_assessment",

"reasoning\_quality":"relevance"

\},

\{

"text":"Discuss headphones that can cancel out chatter and engine noises\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"feature\_assessment",

"reasoning\_quality":"relevance"

\},

\{

"text":"Discuss portable headphones or portability features on headphones\.",

"scope":"instance",

"importance":"optional",

"reasoning\_stage":"feature\_assessment",

"reasoning\_quality":"relevance"

\}

\]

\}

\]

\}
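Reading a mission in this format and tallying its rubrics takes only the standard library. The sketch below embeds an abbreviated copy of the st-10 record (metadata fields trimmed, rubric texts kept) instead of reading the released file, whose path is not given here.

```python
import json
from collections import Counter

# Abbreviated copy of the st-10 mission shown above.
MISSION_JSON = """
{
  "mission_id": "st-10",
  "turns": [
    {
      "messages": [
        {"role": "user", "content": "I need a pair of headphones for a 14 hour flight"}
      ],
      "rubrics": [
        {"text": "Discuss headphones that are comfortable enough for the customer to wear for 10+ hours.",
         "importance": "required", "reasoning_stage": "feature_assessment"},
        {"text": "Discuss headphones with battery life that will last 10+ hours with noise cancelation and have quick charging.",
         "importance": "required", "reasoning_stage": "feature_assessment"},
        {"text": "Discuss headphones that can cancel out chatter and engine noises.",
         "importance": "required", "reasoning_stage": "feature_assessment"},
        {"text": "Discuss portable headphones or portability features on headphones.",
         "importance": "optional", "reasoning_stage": "feature_assessment"}
      ]
    }
  ]
}
"""

mission = json.loads(MISSION_JSON)

# Flatten rubrics across turns, then count by importance and by reasoning stage.
rubrics = [rubric for turn in mission["turns"] for rubric in turn["rubrics"]]
by_importance = Counter(rubric["importance"] for rubric in rubrics)
by_stage = Counter(rubric["reasoning_stage"] for rubric in rubrics)
```

The tallies match the description of the example: three required rubrics, one optional, all tagged `feature_assessment`.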

### G\.2 Multi\-Turn Example

##### Example: Chocolate Making \(mt\-91, first 2 of 4 turns\)\.

A multi\-turn mission progressing from Shopping Guidance to Product Recommendation across turns\. Each turn carries independent rubrics with taxonomy tags\.

\{

"mission\_id":"mt\-91",

"mission\_name":"how to make my own chocolate",

"mission\_type":"Explore & Discover",

"mission\_objective":"Customer is shopping for essential tools and supplies to begin making filled chocolates at home, seeking beginner\-friendly chocolate\-making equipment\.",

"product\_family":"Consumables",

"time\_sensitive":"No",

"shopping\_funnel\_flow":\["Discover","Discover","Explore","Explore"\],

"turns":\[

\{

"reasoning\_category":"Shopping Guidance",

"reasoning\_subcategory":"Decision\-Factor Guidance",

"shopping\_funnel\_stage":"Discover",

"messages":\[

\{"role":"user","content":"I want to learn how to make my own chocolates, I want them to be able to have some sort of filling in it, what are some items that can help me get started?"\}

\],

"rubrics":\[

\{

"text":"List at least 5 essential items for making filled chocolates\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"option\_generation",

"reasoning\_quality":"completeness"

\},

\{

"text":"For each item, briefly explain why it's necessary for making filled chocolates\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"domain\_expertise",

"reasoning\_quality":"insightfulness"

\},

\{

"text":"Recommend specific types of chocolate \(e\.g\., couverture, candy melts\) suitable for molding, explaining why they are preferred over regular chocolate chips\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"domain\_expertise",

"reasoning\_quality":"insightfulness"

\},

\{

"text":"Provide a brief explanation of tempering if mentioning couverture chocolate\.",

"scope":"instance",

"importance":"optional",

"reasoning\_stage":"domain\_expertise",

"reasoning\_quality":"insightfulness"

\},

\{

"text":"Include a mention of a candy thermometer as an essential item\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"option\_generation",

"reasoning\_quality":"completeness"

\}

\]

\},

\{

"reasoning\_category":"Shopping Guidance",

"reasoning\_subcategory":"Decision\-Factor Guidance",

"shopping\_funnel\_stage":"Discover",

"messages":\[

\{"role":"user","content":"what kind of filling is more beginner friendly, in terms of time and effort?"\}

\],

"rubrics":\[

\{

"text":"Identify five beginner\-friendly chocolate filling types\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"option\_generation",

"reasoning\_quality":"concreteness"

\},

\{

"text":"Explain why each recommended filling is suitable for beginners, focusing on ease of preparation, time, and effort\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"trade\_offs",

"reasoning\_quality":"insightfulness"

\},

\{

"text":"Provide concrete examples of how to flavor these fillings\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"actionability",

"reasoning\_quality":"concreteness"

\},

\{

"text":"Do not recommend fillings that require complex techniques or many ingredients\.",

"scope":"instance",

"importance":"required",

"reasoning\_stage":"option\_generation",

"reasoning\_quality":"relevance"

\}

\]

\}

\]

\}
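For multi-turn missions, each rubric is judged against the response to its own turn, with earlier turns supplied as conversation history. The sketch below shows one way to assemble those judge inputs from a mission record plus a list of model responses (which are generated at evaluation time and are not part of the released data). The function name and the plain `user:` / `assistant:` transcript format are assumptions, since the paper does not specify how turns are serialized.

```python
def build_judge_inputs(mission, responses):
    """Yield one judge input per (turn, rubric) pair of a multi-turn mission.

    mission: dict in the released format (needs "turns" with "messages" and "rubrics").
    responses: one assistant reply per turn, in turn order.
    """
    history = []  # transcripts of completed turns, oldest first
    for turn, reply in zip(mission["turns"], responses):
        user_message = turn["messages"][-1]["content"]
        current = f"user: {user_message}\nassistant: {reply}"
        for rubric in turn["rubrics"]:
            yield {
                "current_conversation": current,
                "rubric_text": rubric["text"],
                "conversation_history": "\n".join(history),
                "importance": rubric["importance"],
            }
        history.append(current)


# Toy mission with placeholder content (not benchmark data).
toy_mission = {
    "turns": [
        {"messages": [{"role": "user", "content": "first question"}],
         "rubrics": [{"text": "rubric A", "importance": "required"}]},
        {"messages": [{"role": "user", "content": "follow-up"}],
         "rubrics": [{"text": "rubric B", "importance": "required"},
                     {"text": "rubric C", "importance": "optional"}]},
    ]
}
judge_inputs = list(build_judge_inputs(toy_mission, ["first answer", "second answer"]))
```

The first turn is judged with a blank history, and every rubric of the second turn sees the first exchange as history, which mirrors the blank-history convention used for single-turn queries.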