Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

arXiv cs.AI 06/02/26, 04:00 AM Papers
Summary
This paper introduces a multi-turn interactive framework for reasoning evaluation where LLMs must query a hidden environment and integrate partial observations, instantiated as a benchmark of 474 executable games across five difficulty levels, showing discriminative power and exposing differences in reasoning.
arXiv:2606.00103v1 Announce Type: new Abstract: We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. Wherein, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer. Beyond standard success rate and interaction efficiency, we evaluate contextual robustness under controlled contextual perturbations, and metacognitive adaptation through counterfactual revision and necessity judgment. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs. Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency. Moreover, we empirically show that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops.
Original Article
View Cached Full Text
Cached at: 06/02/26, 03:45 PM
# Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games
Source: [https://arxiv.org/html/2606.00103](https://arxiv.org/html/2606.00103)
Mingyuan Fan1,Weiguang Han2,Daixin Wang2,Cen Chen1, Zhiqiang Zhang2,Jun Zhou2

1East China Normal University,2Ant Group

###### Abstract

We introduce a multi\-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating\. Wherein, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer\. Beyond standard success rate and interaction efficiency, we evaluate*contextual robustness*under controlled contextual perturbations, and*metacognitive adaptation*through*counterfactual revision*and*necessity judgment*\. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs\. Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency\. Moreover, we empirically show that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops\.

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

Mingyuan Fan1, Weiguang Han2, Daixin Wang2, Cen Chen1,Zhiqiang Zhang2,Jun Zhou21East China Normal University,2Ant Group

## 1Introduction

Large language models \(LLMs\) have recently achieved impressive results on reasoning benchmarksComaniciet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib20)\); Liuet al\.\([2024](https://arxiv.org/html/2606.00103#bib.bib19)\)\. Yet most existing evaluations remain fundamentally staticYuet al\.\([2020](https://arxiv.org/html/2606.00103#bib.bib2)\); Cobbeet al\.\([2021](https://arxiv.org/html/2606.00103#bib.bib1)\); Chenet al\.\([2021](https://arxiv.org/html/2606.00103#bib.bib8)\): the model is given a fully specified problem and asked to produce a final answer in a single shot\. This setup is increasingly insufficient for evaluating reasoning itself\. On the one hand, it does not test whether a model can actively seek missing information, update beliefs over time, and decide when evidence is sufficient\. On the other hand, failures on static benchmarks often conflate two distinct sourcesBanget al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib21)\); Yinet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib22)\): a knowledge deficit, where the model lacks the necessary facts, and a reasoning deficit, where it has enough information but fails to use it correctly\. Because static evaluations rarely separate these cases, their diagnostic value is limited\.

To address this limitation, we propose a hierarchical framework for interactive reasoning evaluation that treats the model as an active agent under partial observability\. Instead of receiving the full problem specification upfront, the model is given only the task rules and must iteratively issue targeted queries, gather partial evidence, update its beliefs, and decide when to submit an answer\. This setting more directly measures reasoning as sequential information acquisition and evidence integration, while also reducing contamination concerns because there is no single fixed prompt\-answer pair to memorize\.

To isolate reasoning from factual recall and semantic priors, we build the games from minimal structural primitives\. Specifically, we define hidden state spaces over four canonical data structures \(sets, sequences, trees, and graphs\) and instantiate them across three inference modes: deductive, inductive, and abductive\. This structural design yields a controlled testbed in which performance can be attributed more cleanly to algorithmic reasoning rather than real\-world knowledge\.

On top of this interactive backbone, we introduce two higher\-order evaluation layers to obtain a more fine\-grained view of model capability\. The first,*contextual robustness*, tests whether reasoning is preserved under semantic perturbation, irrelevant context, and shifts in episode boundaries\. The second,*metacognitive adaptation*, examines whether models can revise beliefs when earlier evidence is corrected and whether they can distinguish logically necessary information from merely sufficient information\.

We instantiate this framework as a benchmark of 474 executable games, each evaluated under five difficulty levels \(i\.e\., five different configuration search spaces\), for a total of 2370 instances\. The benchmark covers all combinations of the four data structures and three inference modes, and includes controlled probes for contextual robustness and metacognitive adaptation\. We further note that several recent benchmarksLiet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib14)\); Badolaet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib16)\); Duanet al\.\([2024](https://arxiv.org/html/2606.00103#bib.bib17)\); Huet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib18)\)have explored multi\-turn reasoning in interactive settings such as games or dialogue\. However, these benchmarks are often constrained in terms of scale, structural diversity, or the extent to which they can disentangle distinct components of reasoning ability\.

Empirically, we find that the benchmark is highly discriminative: frontier models differ substantially not only in final success rate but also in interaction efficiency\. Performance varies systematically across reasoning regimes, with deductive tasks generally easier than abductive ones and set\-based games emerging as the most challenging family\. We also find that contextual perturbations cause moderate degradation, whereas counterfactual revision and evidence\-pruning probes reveal much larger weaknesses\. These results suggest that current LLMs are better at solving problems under fixed evidence than at maintaining a revisable account of why an answer is correct\. Our main contributions are as follows:

- •We propose a hierarchical framework for evaluating interactive reasoning, casting reasoning as active information acquisition under partial observability rather than one\-shot answer generation\. Moreover, we introduce two higher\-order evaluation layers, contextual robustness and metacognitive adaptation, to probe abstraction invariance, noise tolerance, boundary control, belief revision, and necessity judgment\.
- •We construct a benchmark of 474 executable games spanning four canonical data structures and three inference modes, with five fixed configuration search spaces corresponding to five difficulty levels, enabling controlled, comparable, and contamination\-resistant evaluation\.
- •We provide a systematic empirical study of frontier LLMs, showing that while current models can often perform interactive search, they remain substantially weaker in robustness, belief revision, and evidence attribution\.

## 2Related Work

A large body of prior work evaluates LLM reasoning in static, single\-turn settings, where the model receives a fully specified problem and produces a final answer\. Logical reasoning benchmarks such as ReClor and LogiQA expose the gap between shallow pattern matching and deductive inferenceYuet al\.\([2020](https://arxiv.org/html/2606.00103#bib.bib2)\); Liuet al\.\([2020](https://arxiv.org/html/2606.00103#bib.bib3)\)\. Mathematical benchmarks such as GSM8K, MATH, MathVista, FrontierMath, and Humanity’s Last Exam test increasingly difficult forms of problem solvingCobbeet al\.\([2021](https://arxiv.org/html/2606.00103#bib.bib1)\); Hendryckset al\.\([2021b](https://arxiv.org/html/2606.00103#bib.bib4)\); Luet al\.\([2024](https://arxiv.org/html/2606.00103#bib.bib5)\); Phanet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib6)\); Glazeret al\.\([2024](https://arxiv.org/html/2606.00103#bib.bib7)\)\. Code reasoning has similarly progressed from function synthesis benchmarks such as HumanEval and APPSChenet al\.\([2021](https://arxiv.org/html/2606.00103#bib.bib8)\); Hendryckset al\.\([2021a](https://arxiv.org/html/2606.00103#bib.bib9)\)to more realistic software\-engineering tasks such as SWE\-benchJimenezet al\.\([2024](https://arxiv.org/html/2606.00103#bib.bib10)\)\.

Recent benchmarks move toward multi\-turn evaluation\. MT\-Bench and follow\-up datasets such as MT\-Bench\-101, MultiChallenge, and TurnBench\-MS assess abilities including long\-horizon instruction following and multi\-step reasoningZhenget al\.\([2023](https://arxiv.org/html/2606.00103#bib.bib11)\); Baiet al\.\([2024](https://arxiv.org/html/2606.00103#bib.bib12)\); Deshpandeet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib13)\); Liet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib14)\); Zhanget al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib15)\)\. However, these settings are mostly conversation\-centric: models respond to evolving prompts rather than actively querying a hidden environment for evidence\. They also often mix reasoning with world knowledge, dialogue ability, and instruction following\.

Closest to our work are game\-based and interactive benchmarks\. MTR\-BenchLiet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib14)\)evaluates multi\-turn interaction across 40 automated tasks\. Multi\-Turn PuzzlesBadolaet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib16)\)studies dialogue\-based puzzle solving and logical consistency\. GTBenchDuanet al\.\([2024](https://arxiv.org/html/2606.00103#bib.bib17)\)evaluates strategic reasoning in board and card games, while GameArenaHuet al\.\([2025](https://arxiv.org/html/2606.00103#bib.bib18)\)uses live human\-LLM interaction in Roblox mini\-games to probe deductive and inductive reasoning\.

Overall, these work has begun to study multi\-turn reasoning, but rarely isolates reasoning as sequential evidence acquisition and belief updating\. It also seldom tests robustness to controlled contextual perturbations or adaptation after earlier evidence is revised\. Many benchmarks further conflate reasoning with factual knowledge, dialogue skill, or task\-specific strategy, or remain limited in scale\. Our work is designed to address these limitations\.

## 3A Hierarchical Evaluation Framework for Interactive Reasoning

We present a hierarchical framework for interactive reasoning evaluation\. At its core is a clean interactive backbone that formulates reasoning as state search under partial observability\. On top of this backbone, we introduce two higher\-level evaluation layers, namely contextual robustness and metacognitive adaptation, to probe increasingly advanced capabilities\. Sections[3\.1](https://arxiv.org/html/2606.00103#S3.SS1)∼\\sim[3\.4](https://arxiv.org/html/2606.00103#S3.SS4)focus on the conceptual design of these layers; the executable benchmark instantiation and construction pipeline are described in Section[3\.5](https://arxiv.org/html/2606.00103#S3.SS5)\.

### 3\.1Design Principles

Our framework is guided by four principles\.

- •Resistance to contamination\.Static benchmarks expose fixed prompt\-answer pairs and are therefore vulnerable to memorization\. By contrast, our evaluation is interactive and dynamically instantiated: each episode is generated from a hidden configuration, and the evidence revealed depends on the model’s own actions\. This greatly reduces the chance of memorizing fixed trajectories or answer patterns\.
- •Hierarchical diagnostic design\.We begin with basic interactive search under uncertainty, then add progressively harder conditions\. This layered design helps distinguish failures of basic search, contextual robustness, and adaptive revision\.
- •Discriminative evaluation\.A useful benchmark should separate models of different capability levels\. We vary difficulty through five fixed configuration search spaces, which control search\-space size and hidden\-state complexity while preserving comparability across models, methods, and evaluation settings\.
- •Structural diversity and coverage\.Rather than relying on a single task template, we build games across multiple data structures and inference types to test whether reasoning generalizes across qualitatively different interactive settings\.

Algorithm 1Interactive ProtocolInput:Game Type𝒬\\mathcal\{Q\}, Game Configuration𝒞\\mathcal\{C\}, LLMπ\\pi, Max turn budgetTmaxT\_\{\\max\} Output:Final StatusEstatus∈\{Success, Failure, FormatError, Timeout\}E\_\{\\text\{status\}\}\\in\\\{\\text\{Success, Failure, FormatError, Timeout\}\\\}, Interaction CountNN

1:

ℰ←InstantiateGame\(𝒬,𝒞\)\\mathcal\{E\}\\leftarrow\\text\{InstantiateGame\}\(\\mathcal\{Q\},\\mathcal\{C\}\)
2:

p0←ℰ\.getRules\(\)p\_\{0\}\\leftarrow\\mathcal\{E\}\.\\text\{getRules\}\(\)
3:p0←ContextWrapper\(p0\)p\_\{0\}\\leftarrow\\text\{ContextWrapper\}\(p\_\{0\}\)⊳\\trianglerightOptional: isomorphic perturbation / rule noise

4:

H0←\[p0\]H\_\{0\}\\leftarrow\[p\_\{0\}\]⊳\\trianglerightInitialize interaction history

5:H0←HistoryWrapper\(H0\)H\_\{0\}\\leftarrow\\text\{HistoryWrapper\}\(H\_\{0\}\)⊳\\trianglerightOptional: inter\-game boundary control

6:for

t=1,2,…,Tmaxt=1,2,\\dots,T\_\{\\max\}do

7:

rtagent←π\(Ht−1\)r\_\{t\}^\{\\text\{agent\}\}\\leftarrow\\pi\(H\_\{t\-1\}\)⊳\\trianglerightLLM acts on full history

8:if

ℰ\.isQuery\(rtagent\)\\mathcal\{E\}\.\\text\{isQuery\}\(r\_\{t\}^\{\\text\{agent\}\}\)then

9:if

ℰ\.is\_invalid\_format\(rtagent\)\\mathcal\{E\}\.\\text\{is\\\_invalid\\\_format\}\(r\_\{t\}^\{\\text\{agent\}\}\)then

10:return\(FormatError,

tt\)

11:endif

12:

rtenv←ℰ\.respondToQuery\(rtagent\)r\_\{t\}^\{\\text\{env\}\}\\leftarrow\\mathcal\{E\}\.\\text\{respondToQuery\}\(r\_\{t\}^\{\\text\{agent\}\}\)
13:rtenv←NoiseWrapper\(rtenv\)r\_\{t\}^\{\\text\{env\}\}\\leftarrow\\text\{NoiseWrapper\}\(r\_\{t\}^\{\\text\{env\}\}\)⊳\\trianglerightOptional: intra\-game noise injection

14:\(rtenv,Ht−1\)←RevisionWrapper\(rtenv,Ht−1\)\(r\_\{t\}^\{\\text\{env\}\},H\_\{t\-1\}\)\\leftarrow\\text\{RevisionWrapper\}\(r\_\{t\}^\{\\text\{env\}\},H\_\{t\-1\}\)⊳\\trianglerightOptional: counterfactual revision

15:

Ht←Ht−1⊕\[rtagent,rtenv\]H\_\{t\}\\leftarrow H\_\{t\-1\}\\oplus\[r\_\{t\}^\{\\text\{agent\}\},r\_\{t\}^\{\\text\{env\}\}\]
16:elseif

ℰ\.isSubmit\(rtagent\)\\mathcal\{E\}\.\\text\{isSubmit\}\(r\_\{t\}^\{\\text\{agent\}\}\)then

17:if

ℰ\.is\_invalid\_format\(rtagent\)\\mathcal\{E\}\.\\text\{is\\\_invalid\\\_format\}\(r\_\{t\}^\{\\text\{agent\}\}\)then

18:return\(FormatError,

tt\)

19:endif

20:if

ℰ\.checkAnswer\(rtagent\)\\mathcal\{E\}\.\\text\{checkAnswer\}\(r\_\{t\}^\{\\text\{agent\}\}\)then

21:return\(Success,

tt\)

22:else

23:return\(Failure,

tt\)

24:endif

25:else

26:return\(FormatError,

tt\)

27:endif

28:endfor

29:return\(Timeout,

TmaxT\_\{\\max\}\)

### 3\.2Interactive Reasoning Backbone

At the core of our framework, each task is formulated as a multi\-turn interactive game under partial observability\. An LLM interacts with an environment containing a hidden state and must infer the target answer by actively gathering information\. Unlike standard reasoning benchmarks, where all information is given upfront, our setting initially provides only the game rules; the model must reduce uncertainty through querying and evidence integration\.

As shown in Algorithm[1](https://arxiv.org/html/2606.00103#alg1), an episode begins by instantiating an environmentℰ\\mathcal\{E\}from game type𝒬\\mathcal\{Q\}and configuration𝒞\\mathcal\{C\}\. This separation lets us vary difficulty while preserving task structure\. The environment returns an initial promptp0=ℰ\.getRules\(\)p\_\{0\}=\\mathcal\{E\}\.\\mathrm\{getRules\}\(\), which specifies the objective and the valid action format, and the interaction history is initialized asH0=\[p0\]H\_\{0\}=\[p\_\{0\}\]\. At each steptt, the model generates a responsertagent=π\(Ht−1\)r\_\{t\}^\{\\text\{agent\}\}=\\pi\(H\_\{t\-1\}\)\. The environment parses this response as either a query or a final submission\. Queries are answered byℰ\.respondToQuery\(⋅\)\\mathcal\{E\}\.\\mathrm\{respondToQuery\}\(\\cdot\)and appended to the history; submissions are evaluated byℰ\.checkAnswer\(⋅\)\\mathcal\{E\}\.\\mathrm\{checkAnswer\}\(\\cdot\), yieldingSuccessorFailure\. Unparseable responses are labeledFormatError\. If no valid submission is made within the budgetTmaxT\_\{\\max\}, the episode ends withTimeout\.

Note that our parsing approach follows a structured\-action design to simplify parsing and ensure consistent evaluation, as it allows the model’s responses to be processed reliably and efficiently\. For example, a submission takes the form<submit\>answer</submit\>, which allows the environment to extract the proposed answer through exact pattern matching rather than brittle free\-form interpretation\. We report two main metrics:*Success Rate*, i\.e\., the fraction of episodes that end inSuccess, and*Avg\. Turns*computed over successful episodes\.

Remark\.The backbone defines a clean reference setting: rules are explicit, history contains only task\-relevant content, and environment responses are faithful and noise\-free\. The two higher\-order layers are systematic departures from this baseline along two axes\. Contextual robustness preserves the latent search problem while perturbing surface form or surrounding context\. Metacognitive adaptation changes the informational dynamics of the protocol itself, for example by revising earlier evidence or asking which observations are actually necessary for the conclusion\.

### 3\.3Contextual Robustness

Real\-world reasoning rarely appears in such purified form\. To this end, we introduce a first higher\-order evaluation layer centered on contextual robustness in the following probes:

- •Isomorphic perturbation\. Starting from the clean symbolic games, we construct semantically enriched variants that preserve the underlying data structure while changing only the surface interpretation of entities and relations\. For example, graph nodes may become patients in a contact network\. As these variants are structurally isomorphic to the original game, the search problem and the correct solution remain unchanged\.
- •Intra\-game noise injection\. We augment the rules and environment responses with non\-informative text or irrelevant background descriptions while preserving the evidential content of the interaction\. This probe tests whether redundant context degrades evidence tracking and weakens reasoning consistency\.
- •Inter\-game boundary control\. After the completion of one game, a second game is introduced within the same ongoing context without resetting the interaction history\. The model must recognize the episode boundary, discard no\-longer\-relevant prior context, and reason about the new game from a clean latent state\.

We measure contextual robustness by comparing performance under each perturbation against the corresponding clean condition\.

### 3\.4Metacognitive Adaptation

Beyond effective search and contextual invariance, we ask whether a model can revise its reasoning when accepted information changes, and whether it can distinguish sufficient from necessary evidence\. These abilities probe whether the model maintains an internal account of why its conclusion is justified\.

- •Counterfactual revision\. During an ongoing game, after the model has already accumulated a nontrivial body of evidence, the environment issues a correction to one earlier observation\. The model must then update its beliefs and continue the search accordingly\. A strong model should revise only the affected portion of its reasoning while preserving deductions that remain valid\.
- •Necessity judgment\.The model is first given a complete set of information sufficient to derive the correct answer\. It is then asked to identify pieces of information that it believes are redundant, that is, observations that can be removed without changing the final conclusion\. The goal is not merely to solve the original problem, but to infer a minimal supporting set for the answer according to the model’s own dependency analysis\.

For counterfactual revision, we measure the success\-rate drop relative to the clean backbone\. For the necessity\-judgment probe, we measure how many queried premises the model judges removable, and whether the pruned evidence set still preserves the correct answer\.

### 3\.5Benchmark Instantiation and Construction Pipeline

We instantiate the framework as a benchmark suite of 474 multi\-turn interactive games spanning four canonical data structures and three inference modes, with five fixed configuration search spaces corresponding to five standardized difficulty levels\. The difficulty is parameterized primarily by monotonically increasing the size of the latent search space \(e\.g\., cardinality of sets, length of sequences, depth of trees, and number of nodes/edges in graphs\)\. Table[1](https://arxiv.org/html/2606.00103#S3.T1)summarizes the distribution of games across categories\.

Constructing a benchmark of this scale presents a methodological tension\. Purely manual construction provides high fidelity but is prohibitively labor\-intensive, while unconstrained LLM generation often produces repetitive, inconsistent, or poorly specified games\. We therefore adopt a hybrid human\-LLM construction strategy that seeks an effective balance between scalability and control\.

- •Rule generation under fine\-grained constraints\.We begin by generating candidate game specifications at the level of rules\. We define a structured design space by crossing four data\-structure families \(set, sequence, tree, graph\) with three inference modes \(deductive, inductive, abductive\)\. Within each cell, candidate games are generated subject to constraints on hidden state, query types, observability, answer type, bounded\-budget solvability, and difficulty variation\. We also manually enumerate possible reasoning targets and probing styles for each structure\. Conditioned on these constraints, multiple frontier models, including GPT\-5, Claude 4\.5, and Gemini 3, generated an initial pool of over 2000 candidate games\.
- •Program generation with templates\.The next stage converts selected rule specifications into executable environments\. Here, directly asking an LLM to generate each environment from scratch proved error\-prone\. To reduce implementation failures, we instead use a templated environment interface and abstract away common functionality that is shared across games, including action parsing, format validation, turn\-budget management, response serialization, and answer checking\. LLMs then implement only the game\-specific logic, including hidden\-state initialization, query semantics, and observation generation\. We further apply iterative execution\-based repair using runtime diagnostics and failed test cases\.
- •Difficulty and discrimination filtering\.Using preliminary evaluation with GPT\-5, Claude 4\.5, and Gemini 3, we remove games that are too easy \(success rate above 90%\) or too hard \(below 10%\), as well as games with suspicious cross\-model discrepancies\. After this stage, roughly 550 games remained\.
- •Fine\-grained validation and manual review\.We then conduct a final pass focused on well\-posedness, solvability, rule clarity, and overall benchmark quality\. Automated assessment is complemented by manual inspection, through which we removed approximately 70 additional games judged to be ambiguous, ill\-calibrated, or otherwise unsuitable for reliable evaluation\. The resulting benchmark contains 474 games\.

Construction of robustness and adaptation variants\.For contextual robustness, we generate three perturbation types\. For isomorphic perturbation, LLMs rewrite symbolic entities, relations, and rule descriptions into semantically grounded variants across five domains: education, healthcare, transportation, manufacturing, and law\. For intra\-game noise injection, we insert irrelevant natural\-language sentences into rules and meaningless character\-level or semantically vacuous fragments into environment responses while preserving evidential content\. For inter\-game boundary control, each game is initialized with the full interaction history of another randomly sampled game\.

For metacognitive adaptation, we similarly derive two controlled variants\. In counterfactual revision, the environment intentionally answers the model’s second query incorrectly and explicitly corrects it on the following turn111This ensures that any subsequent failures are caused by an inability to update beliefs properly rather than by an exhausted budget or an irreversibly broken trajectory\.\. In the necessity\-judgment probe, we enumerate all valid queries for a game instance, collect the complete evidence pool, ask the model to solve the game from this full information state, and then ask it to identify unnecessary observations\. Those observations are removed, and the model is re\-evaluated on the pruned evidence set\. Because the benchmark consists of executable environments under a unified protocol, all these variants can be generated simply at scale\.

Table 1:Number of instantiated games by data structure and reasoning type\.

## 4Evaluation

### 4\.1Setup

Models\.We report results for Qwen3\-max, Deepseek\-3\.2, Claude\-4\.5, GPT\-5\.4, Gemini\-2\.5\-flash, Gemini\-3\.1\-flash\-lite, and Gemini\-3\.1\-pro on the clean backbone benchmark\. For several higher\-order stress tests, we focus on a representative subset of models \(Qwen3\-max, Deepseek\-3\.2, GPT\-5\.4, and Gemini\-3\.1\-pro\) to control evaluation cost while retaining diversity in capability levels and interaction styles\.

Metrics\.We report Success Rate \(%\) and Avg\. Turns, where Avg\. Turns is computed over successful episodes only\. We also report Efficiency = Success Rate / Avg\. Turns\. For robustness and adaptation settings, we report the corresponding Success Rate and its change relative to the clean backbone\.We will open\-source all the code upon the acceptance of this paper\.

### 4\.2Results and Analysis

Table 2:Overall performance on the clean interactive reasoning backbone\. Success Rate measures the fraction of successfully solved instances, Avg\. Turns is computed over successful episodes, and Efficiency is defined as Success Rate / Avg\. Turns\.Table 3:Success Rate under clean and contextualized \(isomorphic perturbation\) conditions across the five fixed difficulty levels\.Dropdenotes the performance change from the clean setting to the contextualized setting, so larger positive values indicate weaker contextual robustness\.Table 4:Fine\-grained success rate \(%\) of four models across data structures and inference modes on the clean backbone benchmark\.Table 5:Success Rate before and after inter\-game boundary control, where a new game is introduced without resetting the prior interaction context\.Table 6:Success Rate before and after intra\-game noise injection, where irrelevant or non\-informative content is inserted into rules or environment responses\.Table 7:Success Rate before and after counterfactual revision, where one earlier observation is revised during the interaction\.Table 8:Results on the necessity\-judgment probe\.Clean Backboneis the success rate on the clean interactive backbone\.Complete Evidenceis the success rate when the model is given the full evidence pool\.After Pruningis the success rate after removing observations that the model judged unnecessary\.Pruning Ratiois the average fraction of observations removed by the model\.#### Overall performance on the clean backbone\.

Table[2](https://arxiv.org/html/2606.00103#S4.T2)reports average results on the clean interactive reasoning backbone across the five fixed difficulty levels\. The benchmark is strongly discriminative: model accuracy ranges from 43\.67% to 74\.21%\. Gemini\-3\.1\-pro achieves the best overall performance, with 74\.21% accuracy, the lowest average turn count among the strongest models \(6\.77\), and the highest efficiency score \(10\.96\)\. Claude\-4\.5 reaches a comparable accuracy of 73\.63%, but requires nearly twice as many turns \(12\.93\), yielding a much lower efficiency score of 5\.69\. This contrast suggests that both models can solve a large fraction of games, but Gemini\-3\.1\-pro does so more directly and with stronger interaction efficiency\. GPT\-5\.4 occupies a different regime where the accuracy is only moderate \(49\.17%\), but it is relatively concise in interaction \(7\.92 turns\), indicating a more aggressive strategy that is efficient when correct but less reliable overall\. Qwen3\-max, Deepseek\-3\.2, and Gemini\-3\.1\-flash\-lite cluster in a lower\-performance band around 44–45% accuracy\. Moreover, we see that Deepseek\-3\.2 uses substantially more turns than the others, suggesting that additional interaction alone is insufficient to close the reasoning gap\.

#### Difficulty scaling and contextual robustness\.

Table[3](https://arxiv.org/html/2606.00103#S4.T3)compares clean and contextualized performance across five difficulty levels\. As expected, success rates decrease as difficulty increases, confirming that the fixed configuration spaces provide a meaningful difficulty gradient\. For example, Gemini\-3\.1\-pro drops from 81\.31% at difficulty 1 to 67\.93% at difficulty 5 in the clean setting, while Claude\-4\.5 shows a similar decline from 81\.86% to 67\.09%\.

Isomorphic contextualization introduces an additional robustness challenge\. Although the underlying latent problem is unchanged, most models suffer performance drops under semantically enriched surface forms\. Claude\-4\.5 is the most stable model in this setting, with small drops across all difficulty levels\. Gemini\-2\.5\-flash and Gemini\-3\.1\-pro, despite strong clean performance, are more sensitive to contextualization, often losing more than five percentage points\. Some models occasionally improve under contextualization, suggesting that richer semantic descriptions may provide useful priors in isolated cases\. However, the dominant trend is degradation, indicating that preserving abstract reasoning under surface\-level semantic shifts remains nontrivial\.

#### Structure\- and inference\-level breakdown\.

Table[4](https://arxiv.org/html/2606.00103#S4.T4)shows clear differences across inference modes and data structures\. Deductive reasoning is consistently the strongest setting for all models\. We attribute this to the dominant reasoning\-oriented training data for current LLMs, such as mathematics and code, which mainly emphasize deriving conclusions from explicit premises\. In contrast, inductive and abductive reasoning require pattern formation or explanation search from partial evidence, and are less directly supported by such training signals\.

We also observe that set\-based tasks are often harder than tree and graph tasks\. Unlike trees and graphs, which provide explicit structural relations, sets offer fewer organizational cues and require models to maintain more unordered information internally\. This may make evidence tracking and hypothesis management more difficult\.

#### Boundary control and noise robustness\.

Tables[5](https://arxiv.org/html/2606.00103#S4.T5)and Table[6](https://arxiv.org/html/2606.00103#S4.T6)examine two additional contextual robustness probes\. Under inter\-game boundary control, where a new game begins without clearing the previous interaction history, performance remains stable and even improves slightly for all tested models\. This suggests that stale context alone is not a major obstacle, i\.e\., models can usually recognize the new task boundary and reinitialize their reasoning state\. In contrast, intra\-game noise injection consistently reduces performance\. GPT\-5\.4 drops from 49\.17% to 42\.84%, and Gemini\-3\.1\-pro drops from 74\.21% to 70\.30%\. Qwen3\-max and Deepseek\-3\.2 also decline\. The contrast between boundary control and noise injection indicates that the main robustness challenge is not simply long\-context interference, but the ability to filter irrelevant content while tracking active evidence within the current game\.

#### Counterfactual revision\.

Table[7](https://arxiv.org/html/2606.00103#S4.T7)shows that counterfactual revision is substantially more difficult than the contextual perturbations\. Revising one earlier observation causes large drops for all models: Qwen3\-max decreases from 43\.67% to 35\.99%, Deepseek\-3\.2 from 45\.49% to 40\.21%, etc\. These results suggest that current models often fail to update their internal reasoning state after evidence changes\. Belief revision under partial observability therefore appears to be a qualitatively harder capability than robustness to surface\-level contextual variation\.

#### Necessity judgment\.

Table[8](https://arxiv.org/html/2606.00103#S4.T8)evaluates whether models can distinguish necessary evidence from merely available evidence\. Providing the complete evidence pool does not uniformly improve performance: GPT\-5\.4 and Qwen3\-max benefit modestly, whereas Deepseek\-3\.2 and Gemini\-3\.1\-pro perform slightly worse than in the clean interactive setting\. This indicates that full information does not remove the need for coherent evidence integration\. After pruning observations that the model itself judged unnecessary, success rates drop sharply for all models\. Gemini\-3\.1\-pro falls from 70\.76% to 48\.87%, GPT\-5\.4 from 54\.80% to 37\.89%, Deepseek\-3\.2 from 42\.48% to 18\.21%, and Qwen3\-max from 46\.25% to 32\.97%\. At the same time, pruning ratios are high, ranging from 0\.62 to 0\.83\. Thus, models frequently remove a large fraction of observations while mistakenly discarding information required for the conclusion\. This gap suggests that current LLMs are better at using evidence when it is present than at identifying which evidence is indispensable\.

Table 9:Failure attribution\.Timeoutdenotes episodes that exceed the maximum turn budget, andFormatErrordenotes episodes in which the model fails to follow the required query or submission format\.
#### Failure modes\.

Table[9](https://arxiv.org/html/2606.00103#S4.T9)shows that timeouts are rare across models, suggesting that the turn budget is generally sufficient\. The dominant non\-solution failure mode is insteadFormatError\. This highlights a distinctive challenge of interactive evaluation: models must not only reason over hidden states, but also maintain reliable protocol compliance across multiple turns\. The effect is especially visible for Gemini\-3\.1\-pro, which has very few timeouts but many format\-related failures\.

## 5Conclusion

We introduced a hierarchical framework and benchmark for interactive reasoning that evaluates LLMs as active agents under partial observability\. The benchmark combines a clean interactive backbone with controlled probes for contextual robustness and metacognitive adaptation\. Our results show that current frontier models already exhibit nontrivial interactive reasoning ability, but their performance remains uneven across settings\. More broadly, we hope this benchmark helps future work on reasoning diagnostics, training\-time intervention, tool\-augmented agents, and interactive reinforcement learning, because the environment is executable and the interaction protocol is explicit\.

## Limitations

Several limitations of this work should be acknowledged\. First, the five fixed configuration search spaces, though carefully calibrated for difficulty scaling, cannot fully capture the continuous and unbounded nature of reasoning challenges encountered in practice\. Second, the interactive protocol employs a structured XML\-based action format that, while ensuring reliable parsing and evaluation consistency, may oversimplify the interaction dynamics compared to free\-form natural language communication, potentially overestimating models’ ability to maintain protocol compliance in less constrained settings\.

## Ethics Statement

This work introduces a benchmark for evaluating interactive reasoning in LLMs and does not involve any human subjects, user studies, or collection of personally identifiable information\. All benchmark games are constructed from abstract symbolic primitives \(sets, sequences, trees, and graphs\) across deductive, inductive, and abductive inference modes, and do not contain any offensive, biased, or culturally sensitive content\. The isomorphic perturbation variants are drawn from five neutral domains \(education, healthcare, transportation, manufacturing, and law\) and are designed solely to test structural reasoning invariance rather than to make any substantive claims about these domains\.

## References

- K\. Badola, J\. Simon, A\. Hosseini, S\. M\. M\. Carthy, T\. Munkhdalai, A\. Goyal, T\. Kočiskỳ, S\. Upadhyay, B\. Fatemi, and M\. Kazemi \(2025\)Multi\-turn puzzles: evaluating interactive reasoning and strategic dialogue in llms\.arXiv preprint arXiv:2508\.10142\.Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p5.1),[§2](https://arxiv.org/html/2606.00103#S2.p3.1)\.
- Mt\-bench\-101: a fine\-grained benchmark for evaluating large language models in multi\-turn dialogues\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 7421–7454\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p2.1)\.
- Y\. Bang, Z\. Ji, A\. Schelten, A\. Hartshorn, T\. Fowler, C\. Zhang, N\. Cancedda, and P\. Fung \(2025\)Hallulens: llm hallucination benchmark\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 24128–24156\.Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p1.1),[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.CoRRabs/2110\.14168\.External Links:[Link](https://arxiv.org/abs/2110.14168),2110\.14168Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p1.1),[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen,et al\.\(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.arXiv preprint arXiv:2507\.06261\.Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p1.1)\.
- K\. Deshpande, V\. Sirdeshmukh, J\. B\. Mols, L\. Jin, E\. Hernandez\-Cardona, D\. Lee, J\. Kritz, W\. E\. Primack, S\. Yue, and C\. Xing \(2025\)Multichallenge: a realistic multi\-turn conversation evaluation benchmark challenging to frontier llms\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 18632–18702\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p2.1)\.
- J\. Duan, R\. Zhang, J\. Diffenderfer, B\. Kailkhura, L\. Sun, E\. Stengel\-Eskin, M\. Bansal, T\. Chen, and K\. Xu \(2024\)Gtbench: uncovering the strategic reasoning capabilities of llms via game\-theoretic evaluations\.Advances in Neural Information Processing Systems37,pp\. 28219–28253\.Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p5.1),[§2](https://arxiv.org/html/2606.00103#S2.p3.1)\.
- E\. Glazer, E\. Erdil, T\. Besiroglu, D\. Chicharro, E\. Chen, A\. Gunning, C\. F\. Olsson, J\. Denain, A\. Ho, E\. d\. O\. Santos,et al\.\(2024\)Frontiermath: a benchmark for evaluating advanced mathematical reasoning in ai\.arXiv preprint arXiv:2411\.04872\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- D\. Hendrycks, S\. Basart, S\. Kadavath, M\. Mazeika, A\. Arora, E\. Guo, C\. Burns, S\. Puranik, H\. He, D\. Song,et al\.\(2021a\)Measuring coding challenge competence with apps\.arXiv preprint arXiv:2105\.09938\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021b\)Measuring mathematical problem solving with the math dataset\.arXiv preprint arXiv:2103\.03874\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- L\. Hu, Q\. Li, A\. Xie, N\. Jiang, I\. Stoica, H\. Jin, and H\. Zhang \(2025\)Gamearena: evaluating llm reasoning through live computer games\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 33278–33309\.Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p5.1),[§2](https://arxiv.org/html/2606.00103#S2.p3.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\)Swe\-bench: can language models resolve real\-world github issues?\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 54107–54157\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- X\. Li, K\. Bao, Y\. Ma, M\. Li, W\. Wang, R\. Men, Y\. Zhang, F\. Feng, D\. Liu, and J\. Lin \(2025\)Mtr\-bench: a comprehensive benchmark for multi\-turn reasoning evaluation\.arXiv preprint arXiv:2505\.17123\.Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p5.1),[§2](https://arxiv.org/html/2606.00103#S2.p2.1),[§2](https://arxiv.org/html/2606.00103#S2.p3.1)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p1.1)\.
- J\. Liu, L\. Cui, H\. Liu, D\. Huang, Y\. Wang, and Y\. Zhang \(2020\)Logiqa: a challenge dataset for machine reading comprehension with logical reasoning\.arXiv preprint arXiv:2007\.08124\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- P\. Lu, H\. Bansal, T\. Xia, J\. Liu, C\. Li, H\. Hajishirzi, H\. Cheng, K\. Chang, M\. Galley, and J\. Gao \(2024\)Mathvista: evaluating mathematical reasoning of foundation models in visual contexts\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 23439–23554\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- L\. Phan, A\. Gatti, Z\. Han, N\. Li, J\. Hu, H\. Zhang, C\. B\. C\. Zhang, M\. Shaaban, J\. Ling, S\. Shi,et al\.\(2025\)Humanity’s last exam\.arXiv preprint arXiv:2501\.14249\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- C\. Yin, Z\. Sha, S\. Cui, C\. Meng, and Z\. Li \(2025\)The reasoning trap: how enhancing llm reasoning amplifies tool hallucination\.arXiv preprint arXiv:2510\.22977\.Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p1.1)\.
- W\. Yu, Z\. Jiang, Y\. Dong, and J\. Feng \(2020\)ReClor: A reading comprehension dataset requiring logical reasoning\.In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26\-30, 2020,External Links:[Link](https://openreview.net/forum?id=HJgJtT4tvB)Cited by:[§1](https://arxiv.org/html/2606.00103#S1.p1.1),[§2](https://arxiv.org/html/2606.00103#S2.p1.1)\.
- Y\. Zhang, M\. Wang, X\. Li, K\. Ren, C\. Zhu, and U\. Naseem \(2025\)TurnBench\-ms: a benchmark for evaluating multi\-turn, multi\-step reasoning in large language models\.Findings of the Association for Computational Linguistics: EMNLP,pp\. 19892–19924\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p2.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[§2](https://arxiv.org/html/2606.00103#S2.p2.1)\.

## Appendix AStructural Coverage and Inference Type Definitions

The framework covers four canonical data structures, chosen for their foundational role in symbolic reasoning:

- •Set:Unordered collection of distinct elements\. Reasoning tasks involve membership, subset relations, intersections, and cardinality constraints\.
- •Sequence:Ordered list of elements\. Reasoning tasks involve ordering, adjacency, indexing, pattern completion, and transformation rules\.
- •Tree:Hierarchical structure with parent\-child relationships\. Reasoning tasks involve ancestry, subtree properties, path constraints, and recursive definitions\.
- •Graph:Network of nodes connected by edges\. Reasoning tasks involve connectivity, paths, cycles, reachability, and constraint propagation\.

Each data structure supports tasks across three inference types, defined as follows:

- •Deductive inference:The model is provided with complete rules and initial conditions\. The hidden state is fully determined by these premises, and the task is to derive the necessary conclusion\. Success requires correct rule application rather than hypothesis generation\.
- •Inductive inference:The model is provided with partial observations\. The hidden state is a latent rule or pattern that must be inferred from examples\. Multiple candidate rules may be consistent with the observed data, and the model must identify the one intended by the game designer\.
- •Abductive inference:The model is provided with an observed outcome\. The hidden state is the most plausible cause or configuration that could have produced this outcome\. Multiple hidden states may be consistent with the same observation, and the model must reason backward to identify the most likely explanation\.

## Appendix BImplementation Overview

All benchmark instances are implemented under a shared executableGameinterface\. The common runtime handles prompt construction, XML\-based action parsing, transcript management, and episode status updates\. Each concrete game only needs to define three components: \(1\) how the hidden state is initialized, \(2\) how the environment answers valid queries, and \(3\) how final answers are verified\.

Shared runtime\.At initialization, a game instance is created from a configuration specifying one of five fixed difficulty levels and the context condition\. The environment first constructs the hidden state, then selects either the clean rule template or one of its contextualized rewrites, and finally appends the rendered rule prompt to the interaction history\. During interaction, each model response is parsed into either a query action or a final answer\. Query actions are executed by the environment, while final answers are passed to a task\-specific verifier\.

classGame\(ABC\):

definitialize\_game\(self\):

defstep\(self,response\):

defevaluate\(self,parsed\_info\):

defproduce\(self,parsed\_info\):

defget\_all\_possible\_queries\(self\):

defcf\_make\_wrong\(self,correct\):

The same abstraction also supports the two higher\-order probes\.*Counterfactual revision*is implemented by wrapping the ordinary response function: the environment returns one controlled wrong observation and then explicitly corrects it on the next turn\.*Necessity\-judgment probing*is implemented by exposing a function that enumerates all legal queries and their answers for a fixed hidden instance, which allows us to construct a complete evidence pool and then remove the observations that the model judges unnecessary\. In other words, both probes are realized through additional functions on top of the same executable game class, rather than through separate benchmark implementations\.

Rule Example\.A minimal example is a hidden\-number game\. The environment secretly chooses one number from\{1,2,3,4\}\\\{1,2,3,4\\\}\. The model may ask whether the number is odd, whether it is greater than 2, or whether it equals a specific candidate\. It must then submit the hidden number\. A shortened rule prompt is as follows:

> I have selected a hidden number from \{1,2,3,4\}\. Your goal is to find the number\. You may ask one question per turn using the following format:<query\_odd\></query\_odd\> <query\_greater\>2</query\_greater\> <query\_equal\>3</query\_equal\>When you know the answer, submit:<answer\>3</answer\>

This design keeps the benchmark modular: all tasks share the same runtime protocol, while individual games only define hidden\-state construction, query semantics, and answer verification\. It also makes contextualization,*counterfactual revision*, and*necessity\-judgment*pruning easy to implement as lightweight wrappers over the same executable environment\.
Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

Similar Articles

Evaluating Large Language Models in a Complex Hidden Role Game

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

@burny_tech: A Survey on Latent Reasoning "Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especia…

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Submit Feedback

Similar Articles

Evaluating Large Language Models in a Complex Hidden Role Game
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges
GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models
@burny_tech: A Survey on Latent Reasoning "Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especia…
Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games