BALAR : A Bayesian Agentic Loop for Active Reasoning

arXiv cs.AI Papers

Summary

This paper introduces BALAR, a training-free Bayesian agentic loop algorithm that enables large language models to actively reason and ask clarifying questions in multi-turn interactions. It demonstrates significant performance improvements over baselines on detective, puzzle, and clinical diagnosis benchmarks.

arXiv:2605.05386v1 Announce Type: new Abstract: Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled mechanism to reason about what information is missing and which question should be asked next. We propose BALAR (Bayesian Agentic Loop for Active Reasoning), a task-agnostic outer-loop algorithm that requires no fine-tuning and enables structured multi-turn interaction between an LLM agent and a user. BALAR maintains a structured belief over latent states, selects clarifying questions by maximizing expected mutual information, and dynamically expands its state representation when the current one proves insufficient. We evaluate BALAR on three diverse benchmarks: AR-Bench-DC (detective cases), AR-Bench-SP (thinking puzzles), and iCraft-MD (clinical diagnosis). BALAR significantly outperforms all baselines across all three benchmarks, with $14.6\%$ higher accuracy on AR-Bench-DC, $38.5\%$ on AR-Bench-SP, and $30.5\%$ on iCraft-MD.

Cached at: 05/08/26, 08:10 AM

# BALAR : A Bayesian Agentic Loop for Active Reasoning
Source: [https://arxiv.org/html/2605.05386](https://arxiv.org/html/2605.05386)
Aymen Echarghaoui (Department of Statistics, Stanford University), Dongxia Wu (Department of Computer Science, Stanford University), Emily B. Fox (Department of Statistics and Department of Computer Science, Stanford University)

###### Abstract

Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled mechanism to reason about what information is missing and which question should be asked next. We propose BALAR (Bayesian Agentic Loop for Active Reasoning), a task-agnostic outer-loop algorithm that requires no fine-tuning and enables structured multi-turn interaction between an LLM agent and a user. BALAR maintains a structured belief over latent states, selects clarifying questions by maximizing expected mutual information, and dynamically expands its state representation when the current one proves insufficient. We evaluate BALAR on three diverse benchmarks: AR-Bench-DC (detective cases), AR-Bench-SP (thinking puzzles), and iCraft-MD (clinical diagnosis). BALAR significantly outperforms all baselines across all three benchmarks, with $\mathbf{14.6\%}$ higher accuracy on AR-Bench-DC, $\mathbf{38.5\%}$ on AR-Bench-SP, and $\mathbf{30.5\%}$ on iCraft-MD.

## 1 Introduction

Modern deployments of large language models (LLMs) span domains where user intent is rarely unambiguous: a patient asking a medical AI for advice might omit critical symptoms, and a customer service agent must work out which product a user means from a vague description. In such cases, the bottleneck is not the raw reasoning capacity of the LLM, but the absence of a principled mechanism that (1) detects when a prompt is ambiguous, (2) formulates targeted clarifying questions, (3) integrates responses coherently, and (4) decides when enough information has been gathered to commit to an answer.

Existing approaches address this challenge in different ways. Methods such as Tree-of-Thoughts (ToT) (Yao et al., [2023](https://arxiv.org/html/2605.05386#bib.bib6)) and Uncertainty-of-Thoughts (UoT) (Hu et al., [2024](https://arxiv.org/html/2605.05386#bib.bib3)) both support multi-step reasoning beyond a single forward pass, but they target different forms of interaction. ToT focuses on internal search over intermediate reasoning states using BFS/DFS to maintain multiple partial solutions, whereas UoT relies on forward simulation of interaction trees to select follow-up questions. Interactive methods such as CollabLLM (Wu et al., [2025](https://arxiv.org/html/2605.05386#bib.bib2)) learn to select questions by fine-tuning on simulated trajectories, introducing costly training dependencies. The MediQ Expert system (Li et al., [2024](https://arxiv.org/html/2605.05386#bib.bib5)) operates through a fixed pipeline of LLM calls without a formal model of user intent. Benchmarks such as AR-Bench (Zhou et al., [2025](https://arxiv.org/html/2605.05386#bib.bib1)) and MediQ (Li et al., [2024](https://arxiv.org/html/2605.05386#bib.bib5)) expose the gap: state-of-the-art LLMs at the time of their publication fall far below an oracle that has access to all private information.

We address this gap with BALAR (Bayesian Agentic Loop for Active Reasoning), a *task-agnostic*, *training-free* Bayesian outer loop. The key insight is to model user intent as a latent discrete variable over a structured product space of disambiguating dimensions, where each dimension captures one facet of potential ambiguity (e.g. *severity level*, *product type*). BALAR initializes this belief at “sleep time” (before interaction begins) using a number of parallel LLM calls, then iteratively selects the unasked (question, user) pair with highest mutual information with the belief, updates the posterior via Bayes’ rule upon receiving each response, and dynamically expands the state space by proposing new dimensions when the current representation is insufficient. We illustrate the full pipeline in [Figure 1](https://arxiv.org/html/2605.05386#S1.F1) using a running medical example that we revisit throughout [Section 4](https://arxiv.org/html/2605.05386#S4).

##### Contributions\.

We make the following contributions:

- We propose BALAR, a task-agnostic, training-free Bayesian outer loop that enables LLM agents to engage in structured multi-turn interaction, actively selecting informative questions and updating a belief over latent task states during dialogue ([Section 4](https://arxiv.org/html/2605.05386#S4)).
- We introduce a *dynamic state expansion* mechanism, combining ASK and EXPAND actions guided by an entropy gap criterion ([Section 4.5](https://arxiv.org/html/2605.05386#S4.SS5)).
- We evaluate BALAR on three diverse benchmarks across open-weight LLMs. Results show BALAR outperforms baselines by $14.6\%$ on AR-Bench-DC, $38.5\%$ on AR-Bench-SP, and $30.5\%$ on iCraft-MD ([Section 6](https://arxiv.org/html/2605.05386#S6)).

![Refer to caption](https://arxiv.org/html/2605.05386v1/x1.png) Figure 1: BALAR overview. Given an ambiguous query, BALAR performs structured multi-turn reasoning in two stages. *Sleep-time initialization* (left): the agent constructs a latent state representation by proposing disambiguating dimensions $\{\theta_j\}$, eliciting priors $\pi^{(j)}$, generating candidate questions $\mathcal{Q}$, and estimating likelihood tables $L_{q,u,\theta_j}(y\mid\theta_j)$. *Interaction loop* (center): the agent maintains a belief $\pi_t(\theta)$ and iteratively selects the unasked $(q,u)$ pair maximizing mutual information $I_t(\theta;Y\mid\mathcal{H}_t)$. User responses are incorporated via a Bayesian update. When the entropy gap cannot be closed within the remaining budget, BALAR triggers Expand, introducing new dimensions and targeted questions. *Final answer* (right): once the belief concentrates, the MAP state $\hat{\theta}=\arg\max_{\theta}\pi_T(\theta)$ and the history $\mathcal{H}_T$ condition a final LLM call to produce the answer.

## 2 Related Work

##### Active reasoning benchmarks.

In *active reasoning*, a model must iteratively acquire missing information through interaction rather than solving a problem from a fully specified prompt. The AR-Bench benchmark (Zhou et al., [2025](https://arxiv.org/html/2605.05386#bib.bib1)) evaluates this capability by placing language models in multi-turn environments where they must ask informative questions to uncover hidden facts before producing a final answer. Empirical results reveal a substantial gap between passive and active reasoning performance: even state-of-the-art models achieve relatively low accuracy, and models frequently ask vague or redundant questions while struggling to accumulate useful information across turns. These findings highlight the need for architectures that explicitly reason about uncertainty and guide question selection in a principled manner. BALAR addresses this by maintaining a structured posterior over latent user intent and selecting clarifying questions by maximizing expected information gain, providing a principled mechanism for strategic information gathering across interaction rounds.

##### Interactive medical dialogue\.

The MEDIQ framework (Li et al., [2024](https://arxiv.org/html/2605.05386#bib.bib5)) introduces a benchmark for evaluating the ability of LLMs to proactively seek missing information in clinical decision-making tasks. MEDIQ converts existing datasets into interactive tasks by revealing only limited initial patient information and requiring the model to iteratively gather missing evidence before making a decision. Experiments show that prompting state-of-the-art LLMs to ask questions often *degrades* performance relative to answering directly with partial information, highlighting the difficulty of proactive information-seeking for current models. In contrast to task-specific pipelines such as MEDIQ-Expert, BALAR provides a general-purpose Bayesian outer loop that operates across domains without relying on specialized heuristics.

##### Search\-based and uncertainty\-aware reasoning\.

Tree-of-Thoughts (ToT) (Yao et al., [2023](https://arxiv.org/html/2605.05386#bib.bib6)) extends chain-of-thought reasoning by organizing intermediate reasoning steps as a search tree, combining LM generation with BFS/DFS to maintain multiple partial solutions and backtrack when necessary. Uncertainty of Thoughts (UoT) (Hu et al., [2024](https://arxiv.org/html/2605.05386#bib.bib3)) similarly relies on forward simulation of interaction trees, propagating information-theoretic rewards across hypothetical dialogue trajectories to select the question with highest expected uncertainty reduction. While both approaches improve reasoning through structured exploration, they incur substantial computational cost and do not maintain an explicit probabilistic model of the problem state. The idea of selecting questions by maximizing expected entropy reduction over a maintained belief state has classical precedent in expert systems. Horvitz et al. ([1984](https://arxiv.org/html/2605.05386#bib.bib13)) introduced this strategy in the PATHFINDER system for lymph-node pathology diagnosis, where a probability distribution over diseases is updated after each observation. BALAR maintains and updates a posterior belief over a structured latent intent space, selecting questions through closed-form mutual information. Although it also requires multiple LLM calls, these are decomposed into independent computations that can be executed in parallel, avoiding the sequential overhead of trajectory-based methods.

##### Proactive dialogue through learning and prompting\.

CollabLLM (Wu et al., [2025](https://arxiv.org/html/2605.05386#bib.bib2)) uses reinforcement learning to fine-tune LLMs to optimize long-term collaboration outcomes, estimating a multiturn-aware reward by simulating future conversations and evaluating trajectories for task success and efficiency. Proactive Chain-of-Thought (ProCoT) (Deng et al., [2023](https://arxiv.org/html/2605.05386#bib.bib7)) instead induces proactive behaviors through prompting, augmenting standard prompting with intermediate reasoning steps that describe the dialogue state and plan the next action. STaR-GATE (Andukuri et al., [2024](https://arxiv.org/html/2605.05386#bib.bib10)) takes a self-improvement approach: starting from a pretrained model, it iteratively fine-tunes on questions that increase the likelihood of high-quality task responses, bootstrapping better clarification-seeking behavior without requiring human-labeled trajectories. These approaches demonstrate that models can be made to ask clarifying questions, but they rely on either additional training or heuristic prompt engineering. BALAR requires neither: its Bayesian formulation directly identifies informative queries from a maintained posterior, enabling principled proactive dialogue without fine-tuning or prompt heuristics.

## 3 Problem Setup

We consider an LLM agent interacting with a set $\mathcal{U}$ of users to resolve an ambiguous prompt.

###### Definition 3.1 (Interaction instance).

An interaction instance is a tuple $(\mathbf{p},\mathbf{c},\mathcal{U})$ where $\mathbf{p}$ is an *ambiguous prompt*, $\mathbf{c}$ is an optional *meta-context* (publicly known background), and $\mathcal{U}=\{u_1,\ldots,u_N\}$ is a set of users each holding *private information* $\mathbf{f}_i$ not visible to the agent.

The agent’s goal is to produce the correct answer $y^{*}$ to $\mathbf{p}$, where correctness depends on the users’ private information. The agent may ask questions $q\in\mathcal{Q}$ to any user $u\in\mathcal{U}$, receiving a natural language response $r$. We treat this as a *Bayesian active information gathering* problem: the agent maintains a belief over a dynamic latent state $\theta$ representing user intent, selects actions to reduce uncertainty, and commits to an answer when sufficiently confident.

## 4 BALAR: Bayesian Agentic Loop for Active Reasoning

BALAR operates in two phases: a *sleep-time initialization* (Lin et al., [2025](https://arxiv.org/html/2605.05386#bib.bib11)) that constructs a structured belief and a question bank from LLM calls, and an *interaction loop* that adaptively selects questions, updates the belief, and expands the state space as needed.

Running example. Throughout [Section 4](https://arxiv.org/html/2605.05386#S4), we trace a concrete instance: a patient submits the ambiguous prompt *“I’ve been having headaches lately. What could I do?”* The same complaint could indicate migraine, tension headache, cluster headache, or hypertensive crisis, and the patient cannot be expected to use clinical terminology such as *vascular involvement* or *trigger pattern*. BALAR’s goal is to infer these latent medical concepts from patient-friendly conversational questions, without the patient ever encountering medical jargon.

### 4.1 Sleep-Time Initialization

Given $(\mathbf{p},\mathbf{c},\mathcal{U})$, the agent performs four initialization steps before any user interaction.

#### Step 1: Disambiguating dimensions\.

A single LLM call proposes $p$ *disambiguating dimensions* $(\theta_1,\ldots,\theta_p)$, each with a finite value set $\Theta_j=\{v_{j,1},\ldots,v_{j,n_j}\}$. A dimension captures one axis of potential variation in user intent such that, once its value is fixed, the ambiguity of $\mathbf{p}$ is reduced. See [Figure 2](https://arxiv.org/html/2605.05386#S4.F2) for the running example.

![Refer to caption](https://arxiv.org/html/2605.05386v1/x2.png) Figure 2: A single LLM call proposes two dimensions: $\theta_1=$ *Vascular Involvement* ($\Theta_1=\{\text{vascular},\text{non-vascular}\}$) and $\theta_2=$ *Trigger Pattern* ($\Theta_2=\{\text{episodic},\text{chronic},\text{acute}\}$), yielding a joint state space of $|\Theta|=6$ states.

#### Step 2: Prior elicitation.

Let $\mathcal{L}=\{\ell_1,\ldots,\ell_r\}$ be a finite label set (e.g., “likely”, “neutral”, “unlikely”) and $\phi:\mathcal{L}\to\Delta^{r-1}$ a fixed *label-to-probability map*, where $\phi(\ell_i)$ denotes the probability mass assigned to label $\ell_i$, with $\sum_{i=1}^{r}\phi(\ell_i)=1$. For each dimension value $v_{j,k}$ (e.g., “vascular”, “non-vascular”), a separate LLM call returns a label $\ell_{j,k}\in\mathcal{L}$. The per-dimension prior is then $\pi^{(j)}(v_{j,k})\;=\;\phi(\ell_{j,k})/\sum_{k'=1}^{n_j}\phi(\ell_{j,k'})$. The label set $\mathcal{L}$ and the map $\phi$ are treated as hyperparameters and specified in [Appendix E](https://arxiv.org/html/2605.05386#A5). See [Figure 3](https://arxiv.org/html/2605.05386#S4.F3) for the running example.
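
The normalization in Step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the numeric values in `PHI` are an assumption (the paper defers the actual map to its Appendix E), chosen here only because they happen to reproduce the priors reported for the running example in Figure 3. The function name `elicit_prior` is likewise hypothetical, and the labels are passed in directly rather than obtained from LLM calls.

```python
# Hypothetical label-to-probability map phi; the real values live in the
# paper's Appendix E and may differ. The three masses sum to 1.
PHI = {"likely": 8 / 15, "neutral": 5 / 15, "unlikely": 2 / 15}

def elicit_prior(labels):
    """Turn one label per dimension value into a normalized prior pi^(j)."""
    weights = [PHI[label] for label in labels]
    total = sum(weights)
    return [w / total for w in weights]

# Labels as described for the running example (vascular, non-vascular).
prior_vascular = elicit_prior(["neutral", "likely"])
# Labels for (episodic, chronic, acute).
prior_trigger = elicit_prior(["likely", "neutral", "unlikely"])

print([round(x, 2) for x in prior_vascular])  # [0.38, 0.62]
print([round(x, 2) for x in prior_trigger])   # [0.53, 0.33, 0.13]
```

Because only the ratios between label masses matter after normalization, any rescaling of `PHI` would give the same priors.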

![Refer to caption](https://arxiv.org/html/2605.05386v1/x3.png) Figure 3: Parallel LLM calls assign a label $\ell\in\mathcal{L}$ to each dimension value. Here, the LLM judges *vascular* as neutral and *non-vascular* as likely, yielding $\pi^{(1)}=[0.38,0.62]$, while *episodic* is likely, *chronic* neutral, *acute* unlikely, giving $\pi^{(2)}=[0.53,0.33,0.13]$.

#### Step 3: Question generation.

A single LLM call generates $|\mathcal{Q}|$ candidate clarifying questions, each with a discrete answer set $\mathcal{Y}_q$. Questions are designed to be informative about the dimensions from Step 1. See [Figure 4](https://arxiv.org/html/2605.05386#S4.F4) for the running example. Note that the user provides free-form responses to these questions and is not shown the discrete answer set $\mathcal{Y}_q$. Instead, $\mathcal{Y}_q$ is used internally for tractable belief updates (see [Section 4.4](https://arxiv.org/html/2605.05386#S4.SS4)).

![Refer to caption](https://arxiv.org/html/2605.05386v1/x4.png) Figure 4: An LLM call generates $|\mathcal{Q}|=3$ questions conditioned on the proposed dimensions. Each question informs many dimensions simultaneously. No question uses medical jargon.

#### Step 4: Likelihood table construction.

For each triple $(q,u,j)\in\mathcal{Q}\times\mathcal{U}\times[p]$, a separate LLM call returns a label $\ell_{q,u,j,k,y}\in\mathcal{L}$ for each cell $(v_{j,k},y)\in\Theta_j\times\mathcal{Y}_q$, yielding the *dimension-level likelihood matrix*:

$$L_{q,u,\theta_j}(y\mid v_{j,k})\;=\;\frac{\phi(\ell_{q,u,j,k,y})}{\sum_{y'\in\mathcal{Y}_q}\phi(\ell_{q,u,j,k,y'})},\quad y\in\mathcal{Y}_q.$$

Each call handles exactly one triple, keeping contexts short and avoiding the hallucination that arises from joint evaluation of many cells. All $|\mathcal{Q}|\cdot N\cdot p$ calls are dispatched concurrently. See [Figure 5](https://arxiv.org/html/2605.05386#S4.F5) for the running example.
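
One way to picture Step 4 is as a fan-out of independent coroutines, one per $(q,u,j)$ triple, followed by a row-wise normalization. The sketch below is illustrative only: `label_cells` is a hypothetical stand-in for the LLM call (it returns fixed labels instead of querying a model), the `PHI` values are assumed, and the question and dimension names are placeholders based on the running example.

```python
import asyncio

# Hypothetical label-to-probability map (the paper's values are in Appendix E).
PHI = {"likely": 0.6, "neutral": 0.3, "unlikely": 0.1}

async def label_cells(question, user, values, answers):
    """Stand-in for ONE LLM call on one (q, u, j) triple.

    A real call would prompt a model. This stub labels the cell where the
    value index equals the answer index as 'likely', all others 'unlikely'.
    """
    await asyncio.sleep(0)  # placeholder for the network round trip
    return {(v, y): "likely" if i == k else "unlikely"
            for i, v in enumerate(values) for k, y in enumerate(answers)}

def normalize_rows(labels, values, answers):
    """L(y | v) = phi(label[v, y]) / sum over y' of phi(label[v, y'])."""
    table = {}
    for v in values:
        total = sum(PHI[labels[(v, y)]] for y in answers)
        table[v] = {y: PHI[labels[(v, y)]] / total for y in answers}
    return table

async def build_tables(questions, users, dimensions):
    triples = [(q, u, j) for q in questions for u in users for j in dimensions]
    calls = [label_cells(q, u, dimensions[j], questions[q]) for q, u, j in triples]
    results = await asyncio.gather(*calls)  # all |Q| * N * p calls run concurrently
    return {(q, u, j): normalize_rows(lab, dimensions[j], questions[q])
            for (q, u, j), lab in zip(triples, results)}

questions = {"q1": ["yes", "no"], "q2": ["yes", "no"], "q3": ["yes", "no"]}
users = ["patient"]
dimensions = {"vascular_involvement": ["vascular", "non-vascular"],
              "trigger_pattern": ["episodic", "chronic", "acute"]}
tables = asyncio.run(build_tables(questions, users, dimensions))
print(len(tables))  # 6 tables: 3 questions * 1 user * 2 dimensions
```

Each table row is a distribution over the answers of one question, so the rows (not the columns) sum to one.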

![Refer to caption](https://arxiv.org/html/2605.05386v1/x5.png) Figure 5: Six parallel LLM calls fill the $|\mathcal{Q}|\times p=3\times 2$ likelihood matrices. Shown here are the two tables for $q_1$: vascular states are much more likely to answer *“yes, definitely”* ($L=0.72$) than non-vascular states ($L=0.08$).

### 4.2 Structured Belief State

Under the a priori *independence assumption* across dimensions, the prior over the joint state $\theta=(\theta_1,\ldots,\theta_p)\in\Theta:=\prod_{j=1}^{p}\Theta_j$ factorizes:

$$\pi_0(\theta)=\prod_{j=1}^{p}\pi^{(j)}(\theta_j),\qquad\theta=(\theta_1,\ldots,\theta_p)\in\Theta.$$

The belief state $\pi_t$ is stored as a log-probability tensor of shape $n_1\times\cdots\times n_p$, initialized from marginals and updated in log-space for numerical stability. The *state-level likelihood* for pair $(q,u)$ is obtained by combining dimension-level likelihoods. For $\theta\in\Theta$ and $y\in\mathcal{Y}_q$:

$$K_{q,u}(\theta,y)=\frac{\prod_{j=1}^{p}L_{q,u,\theta_j}(y\mid\theta_j)}{\sum_{y'\in\mathcal{Y}_q}\prod_{j=1}^{p}L_{q,u,\theta_j}(y'\mid\theta_j)}.$$

This is a modeling choice that enables efficient tensor computation.
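
As a rough sketch of the two formulas above, the joint prior is an outer product of the marginals and each row of $K_{q,u}$ is a renormalized product of dimension-level rows. The function names and all numbers below are illustrative assumptions; the paper stores the belief as a log-probability tensor, whereas this sketch uses a plain dictionary in probability space for readability.

```python
from itertools import product
from math import prod

def joint_prior(marginals):
    """pi_0(theta) = prod_j pi^(j)(theta_j), keyed by tuples of value indices."""
    ranges = [range(len(m)) for m in marginals]
    return {theta: prod(m[i] for m, i in zip(marginals, theta))
            for theta in product(*ranges)}

def state_likelihood(dim_tables, theta):
    """K_{q,u}(theta, .): multiply dimension-level rows, renormalize over answers.

    dim_tables[j][v][y] holds L_{q,u,theta_j}(y | v) for one fixed (q, u).
    """
    n_answers = len(dim_tables[0][0])
    raw = [prod(table[i][y] for table, i in zip(dim_tables, theta))
           for y in range(n_answers)]
    z = sum(raw)
    return [r / z for r in raw]

# Illustrative numbers: 2 x 3 = 6 joint states and a 3-answer question.
marginals = [[0.4, 0.6], [0.5, 0.3, 0.2]]
pi0 = joint_prior(marginals)
L1 = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]                    # dimension 1
L2 = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]   # dimension 2
K = {theta: state_likelihood([L1, L2], theta) for theta in pi0}
print(len(pi0))  # 6
```

The renormalization over answers is what keeps every row of `K` a proper distribution even though the product of per-dimension likelihoods is not one by itself.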

### 4.3 Information-Theoretic Question Selection

At round $t$, let $\pi_t(\theta)=\mathbb{P}(\theta=\theta\mid\mathcal{H}_t)$ denote the posterior over states given the interaction history $\mathcal{H}_t$. Define the *belief entropy*:

$$\mathbb{H}(\pi_t)\;=\;-\sum_{\theta\in\Theta}\pi_t(\theta)\log\pi_t(\theta).$$

(Notation: we use the semicolon to separate conditioning variables from fixed parameters. $f_D(A\mid B;\,C)$ denotes quantity $f$ indexed by $D$ (e.g., the round index $t$), where $A\mid B$ is read as “$A$ given $B$” in the probabilistic sense, and $C$ denotes a fixed parameter or context not treated as a random variable (e.g., the query-user pair $(q,u)$).)

For a candidate pair $(q,u)\in\mathcal{Q}\times[N]$, the *predictive distribution* over answers is:

$$p_t^{q,u}(y)\;=\;\mathbb{P}(Y=y\mid\mathcal{H}_t;\,q,u)\;=\;\sum_{\theta\in\Theta}\pi_t(\theta)\,K_{q,u}(\theta,y).$$

Upon observing answer $y$, the posterior becomes:

$$\pi_t^{q,u,y}(\theta)\;=\;\mathbb{P}(\theta=\theta\mid\mathcal{H}_t,\,Y=y;\,q,u)\;=\;\frac{K_{q,u}(\theta,y)\,\pi_t(\theta)}{\sum_{\theta'\in\Theta}K_{q,u}(\theta',y)\,\pi_t(\theta')}.$$

The *conditional entropy* after querying $(q,u)$ is:

$$\mathbb{H}(\theta\mid Y,\mathcal{H}_t;\,q,u)\;=\;\sum_{y\in\mathcal{Y}_q}p_t^{q,u}(y)\;\mathbb{H}(\pi_t^{q,u,y}),$$

and the *mutual information* between $\theta$ and the answer $Y$ is:

$$I_t(\theta;\,Y\mid\mathcal{H}_t;\,q,u)\;=\;\mathbb{H}(\pi_t)-\mathbb{H}(\theta\mid Y,\mathcal{H}_t;\,q,u).$$

The agent selects the unasked pair with highest MI:

$$(q_t^{*},u_t^{*})\;=\;\arg\max_{(q,u)\,\text{unasked}}I_t(\theta;\,Y\mid\mathcal{H}_t;\,q,u).$$
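
The selection rule can be checked on a toy case. The sketch below follows the definitions in this section literally (predictive distribution, per-answer posterior, conditional entropy, mutual information, argmax over unasked pairs). It assumes a flattened belief list and a dictionary of kernels keyed by hypothetical pair names; none of these names come from the paper.

```python
from math import log

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * log(x) for x in p if x > 0)

def mutual_information(belief, K):
    """I_t for one (q, u): belief[s] = pi_t(theta_s), K[s][y] = K_{q,u}(theta_s, y)."""
    mi = entropy(belief)
    for y in range(len(K[0])):
        p_y = sum(b * row[y] for b, row in zip(belief, K))   # predictive p_t^{q,u}(y)
        if p_y == 0:
            continue  # an impossible answer adds nothing to the expectation
        posterior = [b * row[y] / p_y for b, row in zip(belief, K)]
        mi -= p_y * entropy(posterior)
    return mi

def select_pair(belief, kernels, asked):
    """Return the unasked (q, u) pair with the highest mutual information."""
    scores = {pair: mutual_information(belief, K)
              for pair, K in kernels.items() if pair not in asked}
    return max(scores, key=scores.get)

# Toy case with two states: one perfectly informative and one useless question.
belief = [0.5, 0.5]
kernels = {
    ("q_informative", "u1"): [[1.0, 0.0], [0.0, 1.0]],
    ("q_useless", "u1"): [[0.5, 0.5], [0.5, 0.5]],
}
best = select_pair(belief, kernels, asked=set())
print(best)  # ('q_informative', 'u1')
```

In this toy case the informative question carries $\log 2$ nats, the full entropy of the belief, while the useless one carries none, so the greedy rule picks the former.
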
##### Theoretical Guarantee\.

Under the assumption that answers $\{Y_{q,u}\}$ are conditionally independent given $\theta$, BALAR’s greedy MI-maximizing selection within the Ask sub-loop with a fixed pair space $\mathcal{Q}\times\mathcal{U}$ satisfies:

$$G_k\;\geq\;\left(1-\tfrac{1}{e}\right)G^{*},$$

where $G_k$ is the expected information gain of the greedy policy after $k$ rounds and $G^{*}$ is that of the optimal $k$-budgeted adaptive policy. The full proof is given in Appendix [C](https://arxiv.org/html/2605.05386#A3).

### 4.4 Soft Bayesian Belief Update

User $u_t^{*}$ provides a free-form natural language answer $r_t$, which need not coincide with any element of $\mathcal{Y}_{q_t^{*}}$. An LLM maps $r_t$ to a label $\ell_y\in\mathcal{L}$ for each choice $y\in\mathcal{Y}_{q_t^{*}}$, inducing a probability vector $\hat{\omega}(r_t,q)\in\Delta^{|\mathcal{Y}_q|-1}$ over the choices. For $y\in\mathcal{Y}_{q_t^{*}}$:

$$\hat{\omega}_y(r_t,q)\;=\;\frac{\phi(\ell_y)}{\sum_{y'\in\mathcal{Y}_{q_t^{*}}}\phi(\ell_{y'})}.$$

The *effective per-state likelihood* under this soft observation is

$$\hat{L}(\theta\mid r_t,q,u)\;=\;\sum_{y\in\mathcal{Y}_{q_t^{*}}}\hat{\omega}_y(r_t,q)\,K_{q,u}(\theta,y).$$

The posterior then updates as:

$$\pi_{t+1}(\theta)\;=\;\frac{\hat{L}(\theta\mid r_t,q_t^{*},u_t^{*})\,\pi_t(\theta)}{\sum_{\theta'\in\Theta}\hat{L}(\theta'\mid r_t,q_t^{*},u_t^{*})\,\pi_t(\theta')},$$

which is computed in log-space for numerical stability. See [Figure 6](https://arxiv.org/html/2605.05386#S4.F6) for the running example.
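
A log-space version of this update might look like the following sketch. It assumes a flattened belief (one entry per joint state) and at least one state with positive effective likelihood. The kernel values and the two-state belief are made up for illustration, and only the soft weights $[0.91, 0.07, 0.02]$ are borrowed from the running example. The name `soft_update` is hypothetical.

```python
from math import exp, log

def soft_update(log_belief, K, omega):
    """One soft Bayesian update in log-space.

    log_belief[s] = log pi_t(theta_s); K[s][y] = K_{q,u}(theta_s, y);
    omega[y] = soft weight of discrete answer y inferred from the free-form reply.
    """
    log_post = []
    for lb, row in zip(log_belief, K):
        eff = sum(w * k for w, k in zip(omega, row))  # effective likelihood L-hat
        log_post.append(lb + log(eff) if eff > 0 else float("-inf"))
    m = max(log_post)
    log_z = m + log(sum(exp(v - m) for v in log_post))  # log-sum-exp normalizer
    return [v - log_z for v in log_post]

# Two illustrative states (say, a vascular and a non-vascular one).
log_belief = [log(0.5), log(0.5)]
K = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
omega = [0.91, 0.07, 0.02]
posterior = [exp(v) for v in soft_update(log_belief, K, omega)]
print([round(p, 3) for p in posterior])
```

With these numbers most of the mass moves to the first state, because its row of `K` agrees with the answer that `omega` favors.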

![Refer to caption](https://arxiv.org/html/2605.05386v1/x6.png) Figure 6: $q_1$ has the highest MI and is asked first. The patient replies in free-form natural language. The soft-map LLM assigns $\hat{\omega}=[0.91,0.07,0.02]$ over the discrete choices. A Bayesian update shifts mass toward vascular states: $(\text{vasc},\text{epis})$ rises from $19\%$ to $59\%$.

### 4.5 Dynamic State Expansion

When the existing question bank is exhausted or the best available MI is insufficient to close the entropy gap within the remaining budget, BALAR triggers an EXPAND step.

##### Entropy gap criterion\.

Let $\mathbb{H}_t=\mathbb{H}(\pi_t)$ be the current entropy, let $\alpha\in[0,1)$, and let $\mathbb{H}_\alpha$ be the target entropy:

$$\mathbb{H}_\alpha=-(1-\alpha)\log(1-\alpha)-\alpha\log\!\left(\tfrac{\alpha}{|\Theta|-1}\right),$$

the entropy of a distribution that assigns probability $1-\alpha$ to a single state and distributes the remaining mass $\alpha$ uniformly over the other $|\Theta|-1$ states. Equivalently, among all distributions satisfying $\max_{\theta\in\Theta}\pi(\theta)\geq 1-\alpha$, this choice achieves the largest possible entropy. The gap is $\Delta_t=\max(0,\mathbb{H}_t-\mathbb{H}_\alpha)$. The EXPAND condition is:

$$\Delta_t>\lambda\cdot I_t^{*}\cdot(T-t),$$

where $I_t^{*}=\max_{(q,u)\,\text{unasked}}I_t(\theta;Y\mid\mathcal{H}_t;q,u)$ and $T$ is the maximum number of rounds. This condition has a natural interpretation in terms of the minimum number of rounds required to close the entropy gap. Recall that for any question-user pair $(q,u)$, the mutual information is

$$I_t(\theta;\,Y\mid\mathcal{H}_t;\,q,u)=\mathbb{H}(\pi_t)-\mathbb{H}(\theta\mid Y,\mathcal{H}_t;\,q,u),$$

so each interaction round can reduce the entropy of $\pi_t$ by at most $I_t^{*}$, the information gain of the most informative unasked pair. More precisely, suppose the agent asks the sequence of maximally informative pairs at each remaining round. The cumulative reduction in entropy over $k$ rounds satisfies

$$\mathbb{H}(\pi_t)-\mathbb{H}(\pi_{t+k})=\sum_{i=0}^{k-1}I_{t+i}(\theta;\,Y\mid\mathcal{H}_{t+i};\,q_i^{*},u_i^{*})\;\leq\;k\,I_t^{*},$$

where the inequality follows from the fact that $I_t^{*}$ upper bounds the per-round information gain at time $t$. To reach the target entropy $\mathbb{H}_\alpha$ from the current entropy $\mathbb{H}(\pi_t)$, the agent must therefore close a gap of $\Delta_t=\max(0,\,\mathbb{H}(\pi_t)-\mathbb{H}_\alpha)$, which requires at least

$$k^{*}=\left\lceil\frac{\Delta_t}{I_t^{*}}\right\rceil$$

rounds even under optimal pair selection. The EXPAND condition $\Delta_t>\lambda\,I_t^{*}\,(T-t)$ implies that $k^{*}>\lambda\,(T-t)$, meaning the minimum number of rounds needed to reach the target entropy exceeds a fraction $\lambda$ of the remaining budget $T-t$. When this condition holds, the current state space $\Theta$ is insufficiently resolved relative to the remaining interaction horizon, and expanding $\Theta$ with new candidate dimensions is warranted. The threshold $\lambda\in(0,1)$ controls how conservatively the agent triggers expansion: because the right-hand side of the condition grows with $\lambda$, larger values require the gap to be more severe before expansion is triggered, while smaller values cause earlier and more frequent expansions.
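
The entropy gap test reduces to a few arithmetic steps, sketched below with natural logarithms. The function names and the numeric settings ($|\Theta|=6$, $\alpha=0.1$, $\lambda=0.5$, five rounds left) are illustrative assumptions, not values taken from the paper.

```python
from math import ceil, log

def target_entropy(alpha, n_states):
    """H_alpha: entropy with mass 1 - alpha on one state, alpha spread over the rest."""
    if alpha == 0:
        return 0.0  # a point mass has zero entropy
    return -(1 - alpha) * log(1 - alpha) - alpha * log(alpha / (n_states - 1))

def entropy_gap(h_t, alpha, n_states):
    """Delta_t = max(0, H_t - H_alpha)."""
    return max(0.0, h_t - target_entropy(alpha, n_states))

def should_expand(h_t, best_mi, alpha, n_states, lam, rounds_left):
    """EXPAND condition: Delta_t > lambda * I_t^* * (T - t)."""
    return entropy_gap(h_t, alpha, n_states) > lam * best_mi * rounds_left

def min_rounds(h_t, best_mi, alpha, n_states):
    """k^* = ceil(Delta_t / I_t^*), the fewest rounds that could close the gap."""
    return ceil(entropy_gap(h_t, alpha, n_states) / best_mi)

h_uniform = log(6)  # entropy of a uniform belief over 6 joint states
print(round(entropy_gap(h_uniform, 0.1, 6), 3))        # 1.306
print(min_rounds(h_uniform, 0.2, 0.1, 6))              # 7
print(should_expand(h_uniform, 0.2, 0.1, 6, 0.5, 5))   # True
print(should_expand(h_uniform, 0.6, 0.1, 6, 0.5, 5))   # False
```

With a weak best question (0.2 nats per round) the gap of about 1.31 nats cannot be closed in the rounds that remain, so the condition fires. With a stronger question (0.6 nats) it does not.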

##### EXPAND procedure\.

When triggered, BALAR:

1. Generates a new dimension $\theta_{p+1}$ via an LLM call conditioned on the current conversation history and all existing dimensions.
2. Elicits a prior $\pi^{(p+1)}$ for the new dimension, conditioned on the conversation.
3. Expands the belief state under independence: $\pi_t^{\text{new}}(\theta,\theta_{p+1})=\pi_t(\theta)\cdot\pi^{(p+1)}(\theta_{p+1})$.
4. Recomputes likelihood tables for all existing questions over the new dimension.
5. Generates up to $|\mathcal{Q}'|$ new clarifying questions targeting the new dimension $\theta_{p+1}$ and the $k$ existing dimensions with the highest marginal entropy, where $k$ is a configurable hyperparameter.
6. Computes likelihood tables for the new questions over all dimensions.
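
Step 3 of the procedure above is an outer product between the current belief and the new prior. The sketch below illustrates it on a dictionary-based belief. The dimension values (including the third dimension with values "present" and "absent") and all probabilities are hypothetical placeholders, and `expand_belief` is an assumed name.

```python
from math import log

def entropy(dist):
    """Shannon entropy (nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * log(p) for p in dist.values() if p > 0)

def expand_belief(belief, new_prior):
    """pi_new(theta, theta_{p+1}) = pi_t(theta) * pi^{(p+1)}(theta_{p+1})."""
    return {theta + (value,): p * w
            for theta, p in belief.items()
            for value, w in new_prior.items()}

# Current belief over two dimensions (illustrative numbers).
belief = {("vascular", "episodic"): 0.6, ("non-vascular", "chronic"): 0.4}
# Prior over a hypothetical third dimension proposed by EXPAND.
new_prior = {"present": 0.3, "absent": 0.7}

expanded = expand_belief(belief, new_prior)
print(len(expanded))  # 4 joint states after expansion
print(entropy(expanded) > entropy(belief))  # True
```

Because the new dimension enters independently, the entropy of the expanded belief equals the old entropy plus the entropy of the new prior, which is why expansion raises uncertainty before later questions reduce it.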

The Expand mechanism allows BALAR to refine its state representation during the interaction. The Expand action momentarily increases entropy by adding a new dimension before subsequent Ask rounds reduce it again. This interplay is analogous to exploration (expanding the hypothesis space) vs. exploitation (narrowing within the current space).

### 4.6 Convergence and Final Answer

##### Convergence criterion.

BALAR uses two convergence conditions depending on whether the task exposes a discrete answer set $\mathcal{A}$.

*Answer-probability convergence*. BALAR maintains *answer-level likelihoods* $P(a\mid\theta)$ for each possible answer $a\in\mathcal{A}$, estimated via parallel LLM calls at initialization. These are structured like the question-level likelihood tables: for each dimension $j$, a table $P(a\mid\theta_j)$ is computed, and the joint answer likelihood factorizes analogously. The answer probability at round $t$ is:

$$\hat{p}_t(a)=\sum_{\theta\in\Theta}\pi_t(\theta)\,P(a\mid\theta).$$

The loop terminates when $\max_{a\in\mathcal{A}}\hat{p}_t(a)\geq 1-\alpha$.

*Marginal-fraction convergence*. When the answer is a free-form explanation with no fixed answer set, BALAR instead declares convergence when a $\beta$-fraction of dimensions are individually concentrated:

$$\frac{1}{p}\sum_{j=1}^{p}\mathbf{1}\!\left[\max_{\theta_j\in\Theta_j}\pi_t^{(j)}(\theta_j)\geq 1-\alpha\right]\;\geq\;\beta,$$

where $\pi_t^{(j)}$ is the marginal of $\pi_t$ over dimension $\theta_j$. The parameter $\beta\in(0,1]$ controls how many dimensions must converge before termination. $\beta=1$ requires all dimensions to concentrate marginally. In both cases, the loop also terminates when the total round budget $T$ or the ASK-round budget $T_{\text{ask}}$ is exhausted.
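
Both stopping rules are simple reductions over the belief, as the following sketch suggests. It assumes a dictionary-based belief keyed by state tuples. The function names, the candidate answers, and all probabilities are invented for illustration, and the budget checks on $T$ and $T_{\text{ask}}$ are omitted.

```python
def answer_converged(belief, answer_lik, alpha):
    """True when max_a sum_theta pi_t(theta) * P(a | theta) >= 1 - alpha."""
    answers = next(iter(answer_lik.values())).keys()
    p_hat = {a: sum(p * answer_lik[theta][a] for theta, p in belief.items())
             for a in answers}
    return max(p_hat.values()) >= 1 - alpha

def marginal_fraction_converged(belief, n_dims, alpha, beta):
    """True when at least a beta-fraction of marginals have a mode >= 1 - alpha."""
    concentrated = 0
    for j in range(n_dims):
        marginal = {}
        for theta, p in belief.items():
            marginal[theta[j]] = marginal.get(theta[j], 0.0) + p
        if max(marginal.values()) >= 1 - alpha:
            concentrated += 1
    return concentrated / n_dims >= beta

belief = {("vasc", "epis"): 0.85, ("vasc", "chron"): 0.10, ("non", "epis"): 0.05}
answer_lik = {
    ("vasc", "epis"): {"migraine": 0.9, "tension": 0.1},
    ("vasc", "chron"): {"migraine": 0.6, "tension": 0.4},
    ("non", "epis"): {"migraine": 0.1, "tension": 0.9},
}
print(answer_converged(belief, answer_lik, alpha=0.2))                      # True
print(marginal_fraction_converged(belief, 2, alpha=0.08, beta=1.0))         # False
print(marginal_fraction_converged(belief, 2, alpha=0.08, beta=0.5))         # True
```

In the last two calls only the first dimension clears the $1-\alpha$ threshold, so the outcome depends on whether $\beta$ asks for all dimensions or only half of them.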

##### Final answer\.

The MAP state $\hat{\theta}=\arg\max_{\theta\in\Theta}\pi_T(\theta)$ is extracted. It is formatted as a structured disambiguation summary appended to the original prompt and conversation history $\mathcal{H}_T$, and a final LLM call produces the answer. See [Figure 7](https://arxiv.org/html/2605.05386#S4.F7) for the running example.

![Refer to caption](https://arxiv.org/html/2605.05386v1/x7.png)
Figure 7: The MAP state $\hat{\theta}=(\text{vascular},\text{episodic},\text{absent})$, combined with the conversation history $\mathcal{H}_T$, conditions a final LLM call that produces the diagnosis *Migraine without aura*.

The full algorithm is summarized in [Algorithm 1](https://arxiv.org/html/2605.05386#algorithm1). Implementation details including atomic reasoning, parallelism, and the verifier are discussed in Appendix [A](https://arxiv.org/html/2605.05386#A1). LLM call complexity is analyzed in Appendix [B](https://arxiv.org/html/2605.05386#A2).

**Input:** ambiguous prompt $\mathbf{p}$, context $\mathbf{c}$, users $\mathcal{U}$, config $(\alpha,\beta,T,T_{\text{ask}},\lambda,\Phi,\mathcal{L})$

**Output:** final answer $\hat{y}$

- $\{\theta_j,\Theta_j\}_{j=1}^{p}\leftarrow\textsc{ProposeDimensions}(\mathbf{p},\mathbf{c})$
- $\pi^{(j)}\leftarrow\textsc{ElicitPrior}(\mathbf{p},\mathbf{c},\theta_j)$ for all $j$ (parallel)
- $\mathcal{Q}\leftarrow\textsc{GenerateQuestions}(\mathbf{p},\mathbf{c},\{\theta_j\})$
- $L_{q,u,j}\leftarrow\textsc{EstimateLikelihood}(\mathbf{p},\mathbf{c},q,u,\theta_j)$ for all $(q,u,j)$ (parallel)
- $\pi_0\leftarrow\bigotimes_{j=1}^{p}\pi^{(j)}$, $K_{q,u}\leftarrow\textsc{BuildStateLikelihood}(L_{q,u,\cdot})$
- $t\leftarrow 1$, $n_{\text{asked}}\leftarrow 0$, $\mathrm{asked}\leftarrow\emptyset$
- **while** $t\leq T$ **and** $n_{\text{asked}}<T_{\text{ask}}$ **do**
  - **if** $\mathcal{A}$ exists **and** $\max_a\hat{p}_t(a)\geq 1-\alpha$ **then** **break**
  - **if** $\frac{1}{p}\sum_{j=1}^{p}\mathbf{1}[\max_v\pi^{(j)}_{t-1}(v)\geq 1-\alpha]\geq\beta$ **then** **break**
  - $I_{q,u}\leftarrow I_{t-1}(\theta;Y\mid\mathcal{H}_{t-1};q,u)$ for all $(q,u)\notin\mathrm{asked}$
  - **if** no unasked $(q,u)$ **then**
    - **if** state cap reached **then** **break**
    - $I_t^{*}\leftarrow 0$
  - **else**
    - $(q^{*},u^{*})\leftarrow\arg\max I_{q,u}$, $I_t^{*}\leftarrow I_{q^{*},u^{*}}$
  - $\Delta_t\leftarrow\max(0,\mathbb{H}(\pi_{t-1})-\mathbb{H}_{\alpha})$
  - **if** no unasked pair **or** $\Delta_t>\lambda I_t^{*}(T-t)$ **then**
    - **if** state cap reached **then** **break**
    - $\theta_{p+1},\Theta_{p+1}\leftarrow\textsc{NewDimension}(\mathbf{p},\mathbf{c},\mathcal{H}_{t-1})$
    - $\pi^{(p+1)}\leftarrow\textsc{ElicitPrior}(\mathbf{p},\mathbf{c},\theta_{p+1},\mathcal{H}_{t-1})$
    - $\pi_{t-1}\leftarrow\pi_{t-1}\otimes\pi^{(p+1)}$, $p\leftarrow p+1$
    - $L_{q,u,p+1}\leftarrow\textsc{EstimateLikelihood}(\cdot)$ for $q\in\mathcal{Q},u\in\mathcal{U}$
    - $\mathcal{Q}_{\text{new}}\leftarrow\textsc{GenerateExpandedQuestions}(\cdot)$
    - $L_{q,u,j}\leftarrow\textsc{EstimateLikelihood}(\cdot)$ for $q\in\mathcal{Q}_{\text{new}},u\in\mathcal{U},j\in[p]$
    - $\mathcal{Q}\leftarrow\mathcal{Q}\cup\mathcal{Q}_{\text{new}}$
    - $K_{q,u}\leftarrow\textsc{BuildStateLikelihood}(L_{q,u,\cdot})$
  - **else**
    - $r_t\leftarrow\textsc{GetUserAnswer}(u^{*},q^{*})$
    - $\hat{\omega}\leftarrow\textsc{SoftMap}(r_t,q^{*})$
    - $\hat{L}(\mathbf{\theta})\leftarrow\sum_{y\in\mathcal{Y}_{q^{*}}}\hat{\omega}_y K_{q^{*},u^{*}}(\mathbf{\theta},y)$
    - $\pi_t\leftarrow\hat{L}\cdot\pi_{t-1}/Z$
    - $\mathrm{asked}\leftarrow\mathrm{asked}\cup\{(q^{*},u^{*})\}$, $n_{\text{asked}}\leftarrow n_{\text{asked}}+1$
  - $t\leftarrow t+1$
- $\hat{\theta}\leftarrow\arg\max_{\theta}\pi_{t-1}(\theta)$
- $\hat{y}\leftarrow\textsc{FinalAnswer}(\mathbf{p},\mathbf{c},\mathcal{H}_{t-1},\hat{\theta})$
- **return** $\hat{y}$

Algorithm 1: Bayesian Agentic Loop for Active Reasoning (BALAR)
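The two numerical steps inside an Ask round of Algorithm 1, scoring candidate questions by mutual information and applying the soft-evidence Bayes update, can be sketched as follows. This is an illustrative sketch only: it uses the standard identity $I(\theta;Y)=H(Y)-\sum_\theta\pi(\theta)H(Y\mid\theta)$, flattens the joint state space into a list, and the names `mutual_information`, `ask_update`, and the toy likelihood tables are assumptions rather than the authors' code.

```python
import math


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def mutual_information(belief, K):
    """I(theta; Y) = H(Y) - sum_s belief[s] * H(Y | s), with K[s][y] = P(y | s)."""
    num_answers = len(K[0])
    p_y = [sum(belief[s] * K[s][y] for s in range(len(belief)))
           for y in range(num_answers)]
    conditional = sum(belief[s] * entropy(K[s]) for s in range(len(belief)))
    return entropy(p_y) - conditional


def ask_update(belief, K, omega):
    """Soft-evidence update: L_hat[s] = sum_y omega[y] * K[s][y], then renormalize."""
    l_hat = [sum(w * k for w, k in zip(omega, row)) for row in K]
    unnorm = [l * b for l, b in zip(l_hat, belief)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Two joint states, two candidate questions with binary answers (toy values).
belief = [0.5, 0.5]
candidates = {
    "q_informative": [[0.9, 0.1], [0.1, 0.9]],
    "q_uninformative": [[0.5, 0.5], [0.5, 0.5]],
}
best = max(candidates, key=lambda q: mutual_information(belief, candidates[q]))
# omega is the soft-mapped user reply: 80% "answer 0", 20% "answer 1".
posterior = ask_update(belief, candidates[best], omega=[0.8, 0.2])
```

The informative question is selected because the uninformative one has zero mutual information, and the soft reply shifts the belief toward the first state without collapsing it.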

## 5 Experimental Setup

### 5.1 Datasets

We evaluate on three structurally distinct benchmarks: AR-Bench-DC (detective reasoning), AR-Bench-SP (situation puzzles), and iCraft-MD (clinical diagnosis).

##### AR-Bench-DC (Zhou et al., [2025](https://arxiv.org/html/2605.05386#bib.bib1)).

100 detective cases, each with scene metadata, victim description, and up to 5 suspect profiles\. Each suspect holds private information \(story, task, alibi\)\. The agent must identify the true murderer by interrogating suspects\. Scored by exact match\.

##### AR-Bench-SP (Zhou et al., [2025](https://arxiv.org/html/2605.05386#bib.bib1)).

100 situation puzzles, where the agent must reconstruct the hidden explanation of a puzzling scenario by asking yes/no questions to a user who knows the solution. The original AR-Bench evaluation measures the final answer using a character-level F1 similarity with the ground-truth explanation (Zhou et al., [2025](https://arxiv.org/html/2605.05386#bib.bib1)). However, this metric is insensitive to semantic structure and word order. For instance, a random permutation of the same characters can still achieve a perfect score. We therefore do not use the character-level F1 metric and instead evaluate solutions using non-strict semantic equivalence (Kuhn et al., [2023](https://arxiv.org/html/2605.05386#bib.bib8)), which we explain in Appendix [F](https://arxiv.org/html/2605.05386#A6).
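The permutation issue can be reproduced with a small sketch. It assumes a bag-of-characters F1 (precision and recall over character counts); the exact AR-Bench implementation may differ, and the function name `char_f1` and the example sentence are hypothetical.

```python
from collections import Counter


def char_f1(prediction, reference):
    """Bag-of-characters F1: overlap of character counts, ignoring order."""
    pred, ref = Counter(prediction), Counter(reference)
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the man was a lighthouse keeper"
scrambled = "".join(sorted(reference))  # same characters, meaningless order
score = char_f1(scrambled, reference)
```

Under this assumed definition the scrambled string scores exactly 1.0, even though it carries no recoverable explanation.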

##### iCraft-MD (Li et al., [2024](https://arxiv.org/html/2605.05386#bib.bib5)).

140 patient cases from CRAFT\-MD\. Each case contains a clinical multiple\-choice question, patient demographic information, a chief complaint, and a list of atomic clinical facts private to the user\. Scored by exact match on the correct answer option\.

For AR-Bench-DC, each user is a suspect with its own private info (story + task). For the other two datasets, there is a single user (the patient or puzzle narrator). The maximum number of ASK rounds $T_{\text{ask}}$ is a swept hyperparameter (see [Appendix E](https://arxiv.org/html/2605.05386#A5)). The global round budget $T=100$ is set large enough to be non-binding.

### 5.2 Baselines

We compare against dataset\-specific interactive baselines implemented in our experimental pipeline\.

AR-Bench baselines. For the detective reasoning and situation puzzle benchmarks (AR-Bench-DC and AR-Bench-SP), we evaluate four baselines: Few-Shot (AR-Bench), which is the best-performing method reported in Zhou et al. ([2025](https://arxiv.org/html/2605.05386#bib.bib1)), UoT (Hu et al., [2024](https://arxiv.org/html/2605.05386#bib.bib3)), ToT (Yao et al., [2023](https://arxiv.org/html/2605.05386#bib.bib6)), and Proactive CoT (Deng et al., [2023](https://arxiv.org/html/2605.05386#bib.bib7)). All baselines are run with a maximum of 25 interaction turns, using the best hyperparameter settings reported in Zhou et al. ([2025](https://arxiv.org/html/2605.05386#bib.bib1)).

Medical dialogue baselines. For the medical benchmark iCraft-MD, we evaluate two baselines: Zero-Shot, where an LLM iteratively generates clarifying questions from the conversation history and produces a final answer after the interaction budget is exhausted, and MediQ Expert (Li et al., [2024](https://arxiv.org/html/2605.05386#bib.bib5)), a five-module pipeline comprising initial assessment, abstention, question generation, information integration, and decision making. Both baselines are run with a maximum of 25 interaction rounds. For MediQ Expert, we adopt the best-performing configuration reported in Li et al. ([2024](https://arxiv.org/html/2605.05386#bib.bib5)) (Scale + Rationale Generation + Self-Consistency (3)).

### 5.3 Models and User Simulator

We evaluate five open-weight LLMs spanning a range of parameter scales and training paradigms. Specifically, we use Qwen2.5-7B/14B/32B-Instruct to evaluate scaling with model size, QwQ-32B to assess the effect of reasoning-tuned models, and Llama-3.1-8B-Instruct as an instruction-tuned model from a separate family to evaluate generality beyond the Qwen family. All models are served locally via vLLM using bfloat16 precision and bitsandbytes 4-bit quantization. For the Qwen-family models (Qwen2.5-7B/14B/32B-Instruct, QwQ-32B), we additionally enable YaRN rope scaling (factor $4\times$, original context length 32768) to support a maximum context length of 131072 tokens for long interaction histories. This is not used for Llama-3.1-8B-Instruct. The user simulator is fixed to Qwen2.5-32B-Instruct across all experiments. It is prompted with the user's private information and must respond to the agent's question in natural language.

## 6 Results

### 6.1 Main Results

[Figure 8](https://arxiv.org/html/2605.05386#S6.F8) reports outcome scores for all agent models against the baselines. For each BALAR entry, we report the performance associated with the best configuration over $(\alpha,\beta,T_{\text{ask}},p,|\mathcal{Q}|)$. We report exact answer accuracy for AR-Bench-DC and iCraft-MD, and non-strict semantic equivalence for AR-Bench-SP (Kuhn et al., [2023](https://arxiv.org/html/2605.05386#bib.bib8)).

BALAR outperforms all baselines across all three benchmarks and both agent models, with one exception: on AR-Bench-SP with Llama-3.1-8B-Instruct, BALAR falls slightly behind ToT (26.0% vs. 31.0%) and UoT (26.0% vs. 29.0%). We attribute this to the lack of reliability of smaller models under BALAR's structured prompting. Under Qwen2.5-32B-Instruct, BALAR achieves relative gains of **14.6%** on AR-Bench-DC, **38.5%** on AR-Bench-SP, and **30.5%** on iCraft-MD over the strongest respective baselines. To contextualize this gain on iCraft-MD, we compare against an oracle that receives full patient information: BALAR with Qwen2.5-32B-Instruct, without access to any private information, achieves 73.6%, reaching 88% of the oracle upper bound of 83.6% (Appendix [H](https://arxiv.org/html/2605.05386#A8)). This suggests that BALAR's structured Bayesian formulation provides more robust performance than methods relying on prompt-driven or search-based heuristics, or task-tailored designs such as MediQ Expert.
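As a quick sanity check on the reported numbers, the sketch below computes the binomial standard error $\sqrt{p(1-p)/n}$ used for the error bars in Figure 8 and the ratio of BALAR's iCraft-MD accuracy to the oracle accuracy. It assumes $n=140$ (the number of iCraft-MD cases) and that the 88% figure corresponds to the ratio 73.6/83.6; the function name is hypothetical.

```python
import math


def binomial_standard_error(accuracy, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# iCraft-MD with Qwen2.5-32B-Instruct: 73.6% accuracy over 140 cases (assumed n).
se = binomial_standard_error(0.736, 140)

# Ratio of BALAR accuracy to the oracle upper bound.
oracle_fraction = 0.736 / 0.836
```

Under these assumptions the standard error is roughly 3.7 percentage points and the ratio is about 0.88.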

![Refer to caption](https://arxiv.org/html/2605.05386v1/x8.png) (a) Llama-3.1-8B-Instruct
![Refer to caption](https://arxiv.org/html/2605.05386v1/x9.png) (b) Qwen2.5-32B-Instruct

Figure 8: Main results (%). Outcome is *exact accuracy* for AR-Bench-DC and iCraft-MD, and *non-strict semantic equivalence* for AR-Bench-SP. $T_{\text{ask}}=25$ and standard errors $=\sqrt{p(1-p)/n}$.

### 6.2 Effect of Reasoning Mode

[Table 1(a)](https://arxiv.org/html/2605.05386#S6.T1.st1) reports the effect of replacing Qwen2.5-32B-Instruct with the reasoning-tuned QwQ-32B at $T_{\text{ask}}=10$. Switching models yields relative gains of 10.9% on AR-Bench-DC and 36.5% on AR-Bench-SP, while performance remains unchanged on iCraft-MD. This suggests that reasoning capacity matters more for deductive tasks than for clinical decision-making: structured hypothesis elimination benefits most from deeper reasoning, while clinical diagnosis appears less bottlenecked by reasoning depth than by information gathering.

### 6.3 Scaling with Model Size

To isolate the effect of model scale from hyperparameter choice, we fix the best hyperparameter configuration found for Qwen2.5-32B-Instruct and evaluate the *same configuration* on Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct. [Table 1(b)](https://arxiv.org/html/2605.05386#S6.T1.st2) shows that BALAR scales consistently with model size on iCraft-MD: accuracy improves from 50.7% (7B) to 62.1% (14B) to 73.6% (32B), suggesting that BALAR scales with model capacity rather than saturating at smaller scales.

Table 1: Effect of reasoning mode (left) and model scale (right) on BALAR performance. (a) Effect of Reasoning Mode. (b) Scaling analysis.

### 6.4 Ablation Studies

[Table 2](https://arxiv.org/html/2605.05386#S6.T2) isolates two key components of BALAR. For question selection, we compare: (i) random selection from the question bank, (ii) LLM-prompted selection (“which question should I ask next given the history?”), and (iii) MI maximization (ours). This isolates the value of the information-theoretic criterion from the rest of the framework. For dynamic expansion, we disable Expand and fix the state space and question bank to their initial values. Replacing MI maximization with random or LLM-prompted selection drops accuracy by 23.3% and 9.6% respectively, confirming the value of the information-theoretic criterion.

Table 2: Question-selection and expansion ablation on iCraft-MD / Qwen2.5-32B-Instruct. Config: $T_{\text{ask}}=10$, $\alpha=0.1$, $p=5$, $|\mathcal{Q}|=10$.

Disabling Expand reduces accuracy by 13.7%, highlighting that the initial state space can be insufficient and that BALAR benefits from dynamic expansion. Together, these results confirm that the gains reported in [Section 6.1](https://arxiv.org/html/2605.05386#S6.SS1) are driven by the combination of principled question selection and adaptive state refinement, rather than by the Bayesian scaffolding alone. We additionally study sensitivity to the choice of prior mapping in Appendix [I](https://arxiv.org/html/2605.05386#A9).

### 6.5 Information Gain from Questions

We analyze how much uncertainty is removed by successive Ask actions. [Figure 9](https://arxiv.org/html/2605.05386#S6.F9) (bottom) plots the cumulative entropy reduction as a function of the number of questions asked under Qwen2.5-32B-Instruct. Let $\pi_{t_k}^{\mathrm{old}}$ denote the belief state immediately before the $k$-th Ask update, and $\pi_{t_k}^{\mathrm{new}}$ the posterior after incorporating the observed response. The entropy change at round $k$ is $\Delta_k=H(\pi_{t_k}^{\mathrm{new}})-H(\pi_{t_k}^{\mathrm{old}})$. We report the cumulative information gain $-\sum_{i=1}^{k}\Delta_i$, indexed by the number of questions asked $k$, which measures the total uncertainty removed by questions alone. [Figure 9](https://arxiv.org/html/2605.05386#S6.F9) (bottom) shows that it grows consistently with questions asked across all datasets, confirming that MI-based selection extracts meaningful uncertainty reduction at every round. AR-Bench-SP plateaus around $k=20$, suggesting diminishing returns once key dimensions are resolved, while AR-Bench-DC and iCraft-MD remain approximately linear throughout.
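The cumulative information gain defined above can be computed from logged beliefs with a short sketch. The input format (a list of before/after belief pairs, one per Ask update) and the function names are assumptions for illustration; entropies are in nats.

```python
import math


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def cumulative_information_gain(ask_updates):
    """ask_updates: list of (belief_before, belief_after), one pair per Ask.

    Returns the running value of -sum_i Delta_i after each question.
    """
    gains, total = [], 0.0
    for before, after in ask_updates:
        delta = entropy(after) - entropy(before)  # negative when uncertainty drops
        total -= delta
        gains.append(total)
    return gains


# Two Ask updates that each halve the support of a four-state belief.
updates = [
    ([0.25, 0.25, 0.25, 0.25], [0.5, 0.5, 0.0, 0.0]),
    ([0.5, 0.5, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]),
]
gains = cumulative_information_gain(updates)
```

Storing explicit before/after pairs matters because an Expand round can change the belief between two consecutive Ask updates, so the belief before question $k$ need not equal the posterior after question $k-1$.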

[Figure 9](https://arxiv.org/html/2605.05386#S6.F9) (top) shows BALAR accuracy as a function of the number of interaction rounds $K$ under Qwen2.5-32B-Instruct, obtained by truncating the dialogue at round $K\in\{5,10,15,20,25\}$ and recomputing the final answer from the resulting partial history. BALAR improves monotonically with more rounds on AR-Bench-SP and iCraft-MD, demonstrating that additional questions consistently refine the belief toward the correct answer. On AR-Bench-DC, performance is less monotone, likely due to the multi-user setting. The distribution of Ask and Expand rounds per instance is analyzed in Appendix [J](https://arxiv.org/html/2605.05386#A10).

![Refer to caption](https://arxiv.org/html/2605.05386v1/x10.png)
Figure 9: Top: BALAR score (%) as a function of interaction rounds $K$, obtained by truncating dialogue at $K\in\{5,10,15,20,25\}$. Bottom: Cumulative entropy reduction induced by successive Ask actions. Solid lines show the mean across runs, shaded regions denote standard errors. All results under Qwen2.5-32B-Instruct. Fixed configs: DC ($T_{\text{ask}}=25$, $\alpha=0.3$, $p=5$, $|\mathcal{Q}|=10$), SP ($T_{\text{ask}}=25$, $\alpha=0.3$, $\beta=0.5$, $p=5$, $|\mathcal{Q}|=10$), iCraft-MD ($T_{\text{ask}}=25$, $\alpha=0.3$ (top) / $0.1$ (bottom), $p=1$, $|\mathcal{Q}|=2$).

## 7 Discussion

##### Sleep\-time compute\.

BALAR's initialization is decoupled from the interaction phase: dimensions, priors, questions, and likelihood tables are computed before the first user message. Following the *sleep-time compute* paradigm (Lin et al., [2025](https://arxiv.org/html/2605.05386#bib.bib11)), this pre-interaction computation means the per-turn latency is dominated by a single LLM call for the user simulator and the soft-map inference, making the interaction loop fast.

##### Hyperparameter selection\.

BALAR exposes several hyperparameters: the confidence threshold $\alpha$, the marginal convergence fraction $\beta$, the ASK budget $T_{\text{ask}}$, the initial number of dimensions $p$ and questions $|\mathcal{Q}|$, and the expand multiplier $\lambda$. The appropriate values of these parameters are inherently task-dependent. When labeled interaction data is available, for instance a held-out set of solved cases with known ground-truth answers, a standard grid search or random sweep over these hyperparameters can be used to select the configuration that maximizes validation accuracy. In this supervised regime, $\alpha$ controls the tradeoff between committing early (high $\alpha$, fewer questions asked, since the stopping threshold $1-\alpha$ is lower) and gathering more evidence before answering (low $\alpha$), and can be tuned directly against task accuracy.

When no labeled data is available, the threshold $\alpha$ should reflect the acceptable residual uncertainty: for high-stakes decisions (e.g. medical diagnosis), a small $\alpha$ (e.g. $0.1$) is appropriate, whereas for lower-stakes tasks a larger value (e.g. $0.3$) reduces unnecessary questioning. The budget $T_{\text{ask}}$ should be set to the maximum number of questions a user is willing to answer, which is typically a product constraint rather than a modeling choice. The initial state-space richness $(p,|\mathcal{Q}|)$ trades off initialization cost against the expressiveness of the prior representation. Starting with $(p=1,|\mathcal{Q}|=2)$ and relying on Expand is a safe default when the structure of ambiguity is unknown. The expand multiplier $\lambda$ controls how aggressively BALAR triggers expansion: $\lambda=1$ is a natural default that expands only when the entropy gap cannot be closed within the remaining budget under optimal questioning.
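The role of $\lambda$ can be made concrete with a sketch of the expansion test from Algorithm 1, $\Delta_t>\lambda I_t^{*}(T-t)$. The target entropy $\mathbb{H}_\alpha$ is passed in as a plain number because its definition lies outside this section; the function name and the example values are hypothetical.

```python
def should_expand(entropy_now, entropy_target, best_mi, t, T, lam=1.0, has_unasked=True):
    """Expand when no question is left, or when the entropy gap exceeds
    what lam * best_mi per round could remove in the remaining T - t rounds."""
    gap = max(0.0, entropy_now - entropy_target)
    return (not has_unasked) or gap > lam * best_mi * (T - t)


# Late in the budget: 10 rounds left at 0.1 nats each cannot close a 1.5-nat gap.
late = should_expand(entropy_now=2.0, entropy_target=0.5, best_mi=0.1, t=90, T=100)

# Early in the budget: 90 rounds left are more than enough, so keep asking.
early = should_expand(entropy_now=2.0, entropy_target=0.5, best_mi=0.1, t=10, T=100)

# A larger lam is more optimistic about future questions and delays expansion.
delayed = should_expand(entropy_now=2.0, entropy_target=0.5, best_mi=0.1, t=90, T=100, lam=2.0)
```

In this sketch a smaller $\lambda$ triggers expansion sooner, while a larger $\lambda$ trusts the existing question bank for longer.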

##### Limitations\.

The independence assumption across dimensions introduces approximation error, and the functional form $K_{q,u}(\theta,y)\propto\prod_j L_{q,u,\theta_j}(y\mid\theta_j)$ is a modeling choice. Likelihood tables estimated by label-to-probability maps may be miscalibrated. The method is most beneficial when the initial context is genuinely ambiguous. For well-specified prompts, the overhead of initialization is unnecessary.
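The factorized form of $K_{q,u}$ can be illustrated for a single $(q,u)$ pair. The sketch below multiplies per-dimension tables and normalizes over answers $y$ for each joint state; that normalization is one possible reading of the proportionality sign and is an assumption here, as are the function name and the toy tables.

```python
from itertools import product


def build_state_likelihood(tables):
    """tables[j][value][y] = P(y | theta_j = value) for one (q, u) pair.

    Returns K[theta][y] proportional to prod_j tables[j][theta_j][y],
    normalized over y for each joint state theta (assumed normalization).
    """
    dims = [list(t.keys()) for t in tables]
    num_answers = len(next(iter(tables[0].values())))
    K = {}
    for theta in product(*dims):
        row = [1.0] * num_answers
        for j, value in enumerate(theta):
            for y in range(num_answers):
                row[y] *= tables[j][value][y]
        z = sum(row)
        K[theta] = [r / z for r in row]
    return K


# Two dimensions with two values each, and a yes/no question (toy values).
tables = [
    {"vascular": [0.8, 0.2], "tension": [0.3, 0.7]},
    {"episodic": [0.6, 0.4], "constant": [0.5, 0.5]},
]
K = build_state_likelihood(tables)
```

Because each row is a product of independent per-dimension terms, any correlation between dimensions in how a user would answer is lost, which is the approximation error noted above.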

##### Future work\.

Promising extensions include: (1) online calibration of likelihood tables using observed answers, (2) supervised fine-tuning on BALAR trajectories from cases with known ground truth as in STaR (Zelikman et al., [2022](https://arxiv.org/html/2605.05386#bib.bib9)), and (3) exploring methods to better calibrate the model's posterior estimates.

## 8 Conclusion

We presented BALAR, a training\-free Bayesian outer loop that equips LLM agents with a principled mechanism to detect, track, and resolve prompt ambiguity through targeted multi\-turn dialogue\. The framework maintains a structured factored belief over a space of disambiguating dimensions, selects questions by mutual information maximization, and dynamically expands the state representation when the existing belief is insufficient\. BALAR is task\-agnostic, compatible with any instruction\-following LLM, and its sleep\-time initialization keeps per\-turn interaction latency low\. More broadly, it can be seen as an inference\-time belief\-state controller for LLM agents, providing a principled foundation for building more reliable and collaborative AI systems\.

## 9 Acknowledgement

We especially thank Michael Y\. Li for helpful comments, and the Dynamode group meeting for peer\-review feedback\. This work was supported in part by ONR Grant N00014\-22\-1\-2110, NSF Grant 2205084, and the Stanford Institute for Human\-Centered Artificial Intelligence \(HAI\)\. EBF is a Biohub, San Francisco, Investigator\.

## References

- C. Andukuri, J. Fränken, T. Gerstenberg, and N. D. Goodman (2024). Star-gate: teaching language models to ask clarifying questions. arXiv preprint arXiv:2403.19154.
- Y. Deng, L. Liao, L. Chen, H. Wang, W. Lei, and T. Chua (2023). Prompting and evaluating large language models for proactive dialogues: clarification, target-guided, and non-collaboration. arXiv preprint arXiv:2305.13626.
- D. Golovin and A. Krause (2011). Adaptive submodularity: theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research 42, pp. 427–486.
- E. J. Horvitz, D. E. Heckerman, B. N. Nathwani, and L. M. Fagan (1984). Diagnostic strategies in the hypothesis-directed pathfinder system. In Proceedings of the First Conference on Artificial Intelligence Applications, Denver, Colorado, pp. 630–636.
- Z. Hu, C. Liu, X. Feng, Y. Zhao, S. Ng, A. T. Luu, J. He, P. W. Koh, and B. Hooi (2024). Uncertainty of thoughts: uncertainty-aware planning enhances information seeking in large language models. arXiv preprint arXiv:2402.03271.
- L. Kuhn, Y. Gal, and S. Farquhar (2023). Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664.
- S. Li, V. Balachandran, S. Feng, J. Ilgen, E. Pierson, P. W. W. Koh, and Y. Tsvetkov (2024). Mediq: question-asking llms and a benchmark for reliable interactive clinical reasoning. Advances in Neural Information Processing Systems 37, pp. 28858–28888.
- K. Lin, C. Snell, Y. Wang, C. Packer, S. Wooders, I. Stoica, and J. E. Gonzalez (2025). Sleep-time compute: beyond inference scaling at test-time. arXiv preprint arXiv:2504.13171.
- S. Wu, M. Galley, B. Peng, H. Cheng, G. Li, Y. Dou, W. Cai, J. Zou, J. Leskovec, and J. Gao (2025). CollabLLM: from passive responders to active collaborators. In International Conference on Machine Learning (ICML).
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan (2023). Tree of thoughts: deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36, pp. 11809–11822.
- E. Zelikman, Y. Wu, J. Mu, and N. Goodman (2022). Star: bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems 35, pp. 15476–15488.
- Z. Zhou, X. Feng, Z. Zhu, J. Yao, S. Koyejo, and B. Han (2025). From passive to active reasoning: can large language models ask the right questions under incomplete information? In ICML.

## Appendix A Practical Considerations

##### Atomic reasoning\.

Most LLM calls in BALAR are kept as close to atomic as possible: each call typically handles a single dimension, question, user, or value\. This keeps prompts small and helps avoid the degradation that occurs when too many likelihood judgments are packed into one context window\. In particular, if too many likelihood entries are evaluated jointly, the model tends to return nearly uniform likelihoods, which in turn drives mutual information toward zero throughout the loop\.

There is, however, one deliberate exception: likelihood estimation is performed at the level of a *single* $(\text{user},\text{question},\text{dimension})$ triple, producing the full table over that dimension's values and the question's answer choices in one call. In principle, one could push this further and score each individual cell atomically, but this would be substantially more expensive in the number of parallel LLM calls. We therefore adopt this as a practical trade-off between granularity and computational cost.

##### Verifier\.

BALAR uses two layers of output checking\. First, every LLM response is required to conform to a structured output schema \(a Pydantic model\)\. If the response does not parse into the required schema, the call is retried, up to a fixed maximum number of attempts\. This handles formatting and structural failures such as malformed JSON, missing required fields, or invalid types\.
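The parse-and-retry pattern described above can be sketched with the standard library alone. The paper uses a Pydantic model for the schema; the hand-written validator below is only a stand-in for it, and `call_llm`, the field names `reason` and `label`, and the retry limit are hypothetical.

```python
import json


def parse_likelihood_label(raw):
    """Stand-in for a schema model: require {"reason": str, "label": str}."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("reason"), str) or not isinstance(data.get("label"), str):
        raise ValueError("missing or mistyped 'reason'/'label' field")
    return data


def call_with_retries(call_llm, prompt, max_attempts=3):
    """Query the model until the response parses, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            return parse_likelihood_label(raw)
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"no valid response after {max_attempts} attempts: {last_error}")


# Fake LLM that fails twice (malformed JSON, then a missing field) before succeeding.
responses = iter([
    "not json",
    '{"label": "likely"}',
    '{"reason": "fits history", "label": "likely"}',
])
result = call_with_retries(lambda prompt: next(responses), "rate this hypothesis")
```

A real deployment would swap the validator for the actual schema class and surface the final error to the caller rather than retrying indefinitely.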

An optional verifier LLM provides a second layer of checking\. Unlike schema validation, which is purely syntactic and structural, the verifier is used to assess whether the response is logically coherent and semantically valid for the task\. For example, it can flag issues such as inconsistent reasoning, or other task\-level errors that still satisfy the formal schema\. When the verifier rejects an output, it returns corrective feedback, and the model is queried again with that feedback appended to the prompt\.

A lightweight version of this verification principle, which is active in all current experiments, is that every prompt explicitly asks the model to generate a short `reason` before committing to its output label or value. This chain-of-thought prior to the structured output encourages the model to surface its reasoning explicitly, making logical inconsistencies easier to detect and reducing labeling errors even without a full verification pass.

##### Parallelism\.

BALAR is implemented using asynchronous execution. Most LLM calls are written as `async` functions and are dispatched concurrently using `asyncio.gather`, with a semaphore controlling the maximum number of simultaneous API requests. This design is used not only during initialization (Steps 1–4), but also throughout the interaction loop for tasks such as likelihood-table construction, prior estimation, and expansion-related computations. As a result, large batches of independent LLM calls can run in parallel, so the dominant cost becomes the latency of the slowest request rather than the sum of all calls. This allows the system to maintain interactive latency even when many likelihood evaluations are required.
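A minimal sketch of this dispatch pattern, using only `asyncio.gather` and `asyncio.Semaphore` from the standard library, is shown below. The helper `bounded_gather`, the stub `fake_llm_call`, and the concurrency limit are assumptions for illustration and not the authors' code.

```python
import asyncio


async def bounded_gather(coroutine_factories, max_concurrent):
    """Run independent async calls concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(factory):
        async with semaphore:
            return await factory()

    # gather preserves the order of its inputs in the returned list.
    return await asyncio.gather(*(run(f) for f in coroutine_factories))


async def fake_llm_call(triple):
    await asyncio.sleep(0.01)  # stands in for network latency
    return f"table for {triple}"


# One call per (question, user, dimension) triple, as in likelihood estimation.
triples = [(q, u, j) for q in range(3) for u in range(2) for j in range(2)]
results = asyncio.run(
    bounded_gather([lambda t=t: fake_llm_call(t) for t in triples], max_concurrent=4)
)
```

Passing factories rather than already-created coroutines keeps each call from starting before it acquires the semaphore.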

##### State space size\.

The joint state space $|\Theta|=\prod_j n_j$ grows exponentially with the number of dimensions. To keep inference tractable, BALAR enforces a configurable upper bound on the total number of states. Before triggering an Expand step, the algorithm checks whether adding another dimension would keep the total state space within this cap. If the expansion would exceed the limit, no further dimensions are introduced.
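The cap check amounts to a product comparison, sketched below. The function name `can_expand` and the cap value are hypothetical; the paper only states that the bound is configurable.

```python
import math


def can_expand(values_per_dim, new_dim_size, state_cap):
    """True if adding a dimension with new_dim_size values keeps |Theta| <= state_cap."""
    current = math.prod(values_per_dim)
    return current * new_dim_size <= state_cap


# Three ternary dimensions give 27 states; a fourth gives 81, a fifth gives 243.
fourth_ok = can_expand([3, 3, 3], 3, state_cap=100)
fifth_ok = can_expand([3, 3, 3, 3], 3, state_cap=100)
```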

## Appendix B LLM Calls Complexity

We characterize the number of LLM calls made by BALAR and how parallelism is exploited to keep wall-clock latency low. Let $p$ denote the number of dimensions, $n$ the maximum number of values per dimension, $|\mathcal{Q}|$ the number of questions, $|\mathcal{U}|$ the number of users, $|\mathcal{A}|$ the number of possible answers (when defined), and $|\mathcal{Q}'|$ the number of new questions generated per Expand step.

##### Sleep\-time initialization \(Steps 1–4\)\.

Step 1 issues a single LLM call to propose all $p$ dimensions jointly. Step 2 issues one call per dimension value to assign a prior label, for a total of $\sum_j n_j\leq p\cdot n$ calls. These are dispatched concurrently via `asyncio.gather`. Step 3 issues a single call to generate all $|\mathcal{Q}|$ initial questions. Step 4 issues one call per $(q,u,\theta_j)$ triple to fill the likelihood table for that triple, for a total of $|\mathcal{Q}|\cdot|\mathcal{U}|\cdot p$ calls, all dispatched concurrently. When possible answers $\mathcal{A}$ are available, an additional $p$ parallel calls estimate the answer-level likelihood tables (one per dimension). The total number of initialization calls is therefore:

$$C_{\text{init}}=1+p\cdot n+1+|\mathcal{Q}|\cdot|\mathcal{U}|\cdot p+p\;=\;\mathcal{O}(|\mathcal{Q}|\cdot|\mathcal{U}|\cdot p),$$

all parallelizable except for the sequential dependency between Steps 1–2–3–4 (each step uses outputs of the previous one).
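The count can be checked numerically with a small helper. The function below simply evaluates the expression for $C_{\text{init}}$, using $p\cdot n$ as the upper bound on prior calls; the function name and the example configuration are illustrative.

```python
def init_calls(p, n, num_questions, num_users, has_answer_set):
    """Upper bound on initialization LLM calls, term by term as in C_init."""
    propose_dimensions = 1
    priors = p * n                       # one call per dimension value (upper bound)
    generate_questions = 1
    likelihoods = num_questions * num_users * p
    answer_tables = p if has_answer_set else 0
    return propose_dimensions + priors + generate_questions + likelihoods + answer_tables


# Example: p=5 dimensions, n=4 values each, 10 questions, 1 user, answer set defined.
total = init_calls(p=5, n=4, num_questions=10, num_users=1, has_answer_set=True)
```

For this configuration the bound is 77 calls, of which 50 are likelihood-table calls.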

##### ASK step\.

Each Ask round issues exactly 2 sequential LLM calls: one to the user simulator to obtain the natural language answer $r_{t}$, and one to soft-map $r_{t}$ to a probability vector $\hat{\omega}$ over the discrete choices. The MI computation over all candidate $(q, u)$ pairs is a single batched tensor operation (no LLM call). Over $T_{\text{ask}}$ rounds, the total is $2 T_{\text{ask}}$ sequential calls.
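To make the LLM-free part of an Ask round concrete, here is a minimal sketch of MI-based pair selection using $I(\theta; Y) = \mathbb{H}(Y) - \mathbb{H}(Y \mid \theta)$. It is written in plain Python loops for readability, whereas the paper describes a batched tensor operation. The flat state indexing, the dictionary layout of the likelihood tables, and all function names are assumptions made for illustration.

```python
import math


def entropy(dist):
    """Shannon entropy (nats) of a discrete distribution given as a list."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def mutual_information(belief, lik):
    """I(theta; Y) for one (q, u) pair.

    belief -- belief[s] = current probability of state s
    lik    -- lik[s][y] = P(Y = y | theta = s) for this pair
    """
    num_answers = len(lik[0])
    marginal = [
        sum(b * row[y] for b, row in zip(belief, lik)) for y in range(num_answers)
    ]
    conditional = sum(b * entropy(row) for b, row in zip(belief, lik))
    return entropy(marginal) - conditional


def select_pair(belief, tables, asked=()):
    """Pick the not-yet-asked (q, u) pair with maximal mutual information."""
    scores = {
        pair: mutual_information(belief, lik)
        for pair, lik in tables.items()
        if pair not in asked
    }
    return max(scores, key=scores.get)


tables = {
    ("q0", "u0"): [[0.5, 0.5], [0.5, 0.5]],  # answer independent of the state
    ("q1", "u0"): [[0.8, 0.2], [0.2, 0.8]],  # answer depends on the state
}
print(select_pair([0.5, 0.5], tables))
```

The uninformative question has zero mutual information under any belief, so the informative pair is selected.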

##### EXPAND step\.

Each Expand round issues the following calls:

1. One call to propose the new dimension $\theta_{p+1}$.
2. $n_{p+1}$ parallel calls to elicit its prior (one per value).
3. One call (optional) for the answer-level likelihood of the new dimension (when $\mathcal{A}$ is defined).
4. $|\mathcal{Q}| \cdot |\mathcal{U}|$ parallel calls to compute likelihood tables for all old questions over the new dimension.
5. One call to generate up to $|\mathcal{Q}'|$ new questions.
6. $|\mathcal{Q}'| \cdot |\mathcal{U}| \cdot (p+1)$ parallel calls to compute likelihood tables for the new questions over all dimensions.

The dominant cost per Expand round is $\mathcal{O}\big((|\mathcal{Q}| + |\mathcal{Q}'|(p+1)) \cdot |\mathcal{U}|\big)$ LLM calls, with the calls within each group executed concurrently.
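The per-round tally for Expand can be sketched the same way. The function and argument names are hypothetical, and the example sizes are illustrative rather than taken from the paper; the count assumes the full $|\mathcal{Q}'|$ new questions are generated.

```python
def expand_call_count(new_dim_size, num_old_questions, num_new_questions,
                      num_users, p, has_answers=False):
    """Count LLM calls in one Expand round, following the six groups in the text."""
    calls = 1                                        # 1. propose theta_{p+1}
    calls += new_dim_size                            # 2. prior, one call per value
    calls += 1 if has_answers else 0                 # 3. optional answer-level likelihood
    calls += num_old_questions * num_users           # 4. old questions x new dimension
    calls += 1                                       # 5. generate new questions
    calls += num_new_questions * num_users * (p + 1)  # 6. new questions x all dimensions
    return calls


# Hypothetical sizes: 4 new values, 10 old and 5 new questions, 3 users, p = 5.
print(expand_call_count(4, 10, 5, 3, p=5))
```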

##### Final answer\.

A single LLM call produces the final answer conditioned on the MAP state and conversation history\.

## Appendix C Near-Optimality of Greedy MI Maximization

We prove that BALAR’s greedy question-selection recovers at least a $(1 - 1/e)$ fraction of the information gain of the best adaptive policy with a fixed pair space $\mathcal{Q} \times \mathcal{U}$. This result can also be viewed as a special case of Golovin and Krause [[2011](https://arxiv.org/html/2605.05386#bib.bib12)]. In particular, it follows from adaptive monotonicity (Definition 2) and adaptive submodularity (Definition 3) together with Theorem 5 of Golovin and Krause [[2011](https://arxiv.org/html/2605.05386#bib.bib12)].

##### Setup\.

Let $\theta \in \Theta$ be the latent state with prior $\pi_{0}$. Each *(question, user) pair* $(q, u) \in \mathcal{Q} \times \mathcal{U}$ yields an answer $Y_{q,u} \in \mathcal{Y}_{q}$ when queried. A history $\mathcal{H}_{t} = \{(q_{i}, u_{i}, Y_{q_{i},u_{i}})\}_{i=1}^{t}$ records the pairs asked and answers received up to round $t$. We write $(q, u) \notin \mathcal{H}_{t}$ to mean the pair has not yet been asked. We make the following assumption throughout.

###### Assumption C\.1 \(Conditional independence\)\.

The answers $\{Y_{q,u}\}_{(q,u) \in \mathcal{Q} \times \mathcal{U}}$ are conditionally independent given $\theta$.

The *greedy policy* $\pi^{g}$ selects at each round

$$
(q_{t}^{*}, u_{t}^{*}) \;\in\; \arg\max_{(q,u) \,\notin\, \mathcal{H}_{t}} \; I_{t}(\theta;\, Y_{q,u} \mid \mathcal{H}_{t};\, q, u),
$$

which is exactly BALAR’s Ask selection rule of [Section 4.3](https://arxiv.org/html/2605.05386#S4.SS3). The *optimal adaptive policy* $\pi^{*}$ is any $k$-budgeted policy maximising the expected total information gain. Define

$$
G_{t} \;:=\; \mathbb{E}\!\left[I\!\left(\theta;\, \{Y_{q_{i},u_{i}}\}_{i=1}^{t}\right)\right], \qquad G^{*} \;:=\; \mathbb{E}\!\left[I\!\left(\theta;\, \{Y_{q,u}\}_{(q,u) \in \mathcal{H}_{k}^{\pi^{*}}}\right)\right],
$$

where the expectations are over the randomness of the respective policies.

###### Theorem C\.2\.

Under Assumption [C.1](https://arxiv.org/html/2605.05386#A3.Thmtheorem1), BALAR’s greedy MI-maximising $(q, u)$-selection satisfies

$$
G_{k} \;\geq\; \left(1 - \tfrac{1}{e}\right) G^{*},
$$

recovering at least a $(1 - 1/e) \approx 63\%$ fraction of the cumulative information gain of the optimal $k$-budgeted adaptive policy.

##### Proof of Theorem[C\.2](https://arxiv.org/html/2605.05386#A3.Thmtheorem2)\.

We first show that the MI of any fixed pair is non\-increasing as the history grows\.

###### Lemma C\.3\.

Under Assumption [C.1](https://arxiv.org/html/2605.05386#A3.Thmtheorem1), for any $\mathcal{H}_{t} \subseteq \mathcal{H}_{t'}$ and any $(q, u) \notin \mathcal{H}_{t'}$,

$$
I(\theta;\, Y_{q,u} \mid \mathcal{H}_{t'}) \;\leq\; I(\theta;\, Y_{q,u} \mid \mathcal{H}_{t}).
$$

###### Proof\.

Write:

$$
I(\theta;\, Y_{q,u} \mid \mathcal{H}_{t}) \;=\; \mathbb{H}(Y_{q,u} \mid \mathcal{H}_{t}) - \mathbb{H}(Y_{q,u} \mid \theta,\, \mathcal{H}_{t}).
$$

By Assumption [C.1](https://arxiv.org/html/2605.05386#A3.Thmtheorem1), $Y_{q,u} \perp\!\!\!\perp \mathcal{H}_{t} \mid \theta$, so

$$
\mathbb{H}(Y_{q,u} \mid \theta,\, \mathcal{H}_{t}) \;=\; \mathbb{H}(Y_{q,u} \mid \theta),
$$

and therefore

$$
I(\theta;\, Y_{q,u} \mid \mathcal{H}_{t}) \;=\; \mathbb{H}(Y_{q,u} \mid \mathcal{H}_{t}) - \mathbb{H}(Y_{q,u} \mid \theta). \tag{1}
$$

Since $\mathcal{H}_{t} \subseteq \mathcal{H}_{t'}$, conditioning on $\mathcal{H}_{t'}$ provides at least as much information as conditioning on $\mathcal{H}_{t}$, so $\mathbb{H}(Y_{q,u} \mid \mathcal{H}_{t'}) \leq \mathbb{H}(Y_{q,u} \mid \mathcal{H}_{t})$. Applying equation [1](https://arxiv.org/html/2605.05386#A3.E1) to both sides gives the result. ∎

Let $\mathcal{H}_{k}^{\pi^{*}}(\mathcal{H}_{t}^{g})$ denote the (random) set of at most $k$ pairs that $\pi^{*}$ would ask if started from the greedy history $\mathcal{H}_{t}^{g}$. By the chain rule for mutual information and Assumption [C.1](https://arxiv.org/html/2605.05386#A3.Thmtheorem1),

$$
I\!\left(\theta;\, \{Y_{q,u}\}_{(q,u) \in \mathcal{H}_{k}^{\pi^{*}}(\mathcal{H}_{t}^{g})} \,\Big|\, \mathcal{H}_{t}^{g}\right) \;=\; \sum_{(q,u) \,\in\, \mathcal{H}_{k}^{\pi^{*}}(\mathcal{H}_{t}^{g})} I\!\left(\theta;\, Y_{q,u} \mid \mathcal{H}_{t}^{g},\, \{Y_{q',u'}\}_{(q',u') \prec (q,u)}\right) \tag{2}
$$

so, applying Lemma C.3 to each term of the sum,

$$
I\!\left(\theta;\, \{Y_{q,u}\}_{(q,u) \in \mathcal{H}_{k}^{\pi^{*}}(\mathcal{H}_{t}^{g})} \,\Big|\, \mathcal{H}_{t}^{g}\right) \;\leq\; \sum_{(q,u) \,\in\, \mathcal{H}_{k}^{\pi^{*}}(\mathcal{H}_{t}^{g})} I(\theta;\, Y_{q,u} \mid \mathcal{H}_{t}^{g}), \tag{3}
$$

where $(q', u') \prec (q, u)$ denotes pairs asked before $(q, u)$ by $\pi^{*}$. Since $|\mathcal{H}_{k}^{\pi^{*}}(\mathcal{H}_{t}^{g})| \leq k$, we can further bound equation [3](https://arxiv.org/html/2605.05386#A3.E3) by

$$
\sum_{(q,u) \,\in\, \mathcal{H}_{k}^{\pi^{*}}(\mathcal{H}_{t}^{g})} I(\theta;\, Y_{q,u} \mid \mathcal{H}_{t}^{g}) \;\leq\; k \cdot \max_{(q,u) \,\notin\, \mathcal{H}_{t}^{g}} \; I(\theta;\, Y_{q,u} \mid \mathcal{H}_{t}^{g}). \tag{4}
$$

But,

$$
G^{*} - G_{t} \;\leq\; \mathbb{E}\!\left[I\!\left(\theta;\, \{Y_{q,u}\}_{(q,u) \in \mathcal{H}_{k}^{\pi^{*}}(\mathcal{H}_{t}^{g})} \,\Big|\, \mathcal{H}_{t}^{g}\right)\right].
$$

Combining with equation [4](https://arxiv.org/html/2605.05386#A3.E4) and the greedy selection rule:

$$
G^{*} - G_{t} \;\leq\; k\; \mathbb{E}\!\left[\max_{(q,u) \,\notin\, \mathcal{H}_{t}^{g}} I(\theta;\, Y_{q,u} \mid \mathcal{H}_{t}^{g})\right] \;\leq\; k\,(G_{t+1} - G_{t}), \tag{5}
$$

where the last inequality holds because the greedy policy selects the maximally informative pair at round $t+1$.

Rearranging equation [5](https://arxiv.org/html/2605.05386#A3.E5):

$$
G^{*} \;\leq\; k\, G_{t+1} - (k-1)\, G_{t},
$$

which gives

$$
G^{*} - G_{t+1} \;\leq\; \left(1 - \tfrac{1}{k}\right)(G^{*} - G_{t}).
$$

So with $G_{0} = 0$:

$$
G^{*} - G_{k} \;\leq\; \left(1 - \tfrac{1}{k}\right)^{k} G^{*} \;\leq\; e^{-1}\, G^{*}.
$$

The result of Theorem [C.2](https://arxiv.org/html/2605.05386#A3.Thmtheorem2) then follows.
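The last step relies on $(1 - 1/k)^{k} \leq e^{-1}$ for every budget $k \geq 1$. The short numeric check below unrolls the worst-case recursion on the gap $G^{*} - G_{t}$, normalized so that $G^{*} = 1$. It is only a sanity check of the inequality, not part of the proof, and the function name is hypothetical.

```python
import math


def worst_case_gap_fraction(k):
    """Unroll gap_{t+1} = (1 - 1/k) * gap_t for k rounds, starting from gap_0 = 1."""
    gap = 1.0
    for _ in range(k):
        gap *= 1.0 - 1.0 / k
    return gap


for k in (1, 2, 5, 25, 1000):
    gap = worst_case_gap_fraction(k)
    # The remaining gap never exceeds 1/e, so greedy keeps at least 1 - 1/e.
    assert gap <= math.exp(-1)
    print(k, round(1.0 - gap, 4))
```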

##### Scope\.

This guarantee applies to the Ask sub-loop with a fixed pair space $\mathcal{Q} \times \mathcal{U}$. The Expand action dynamically enlarges $\mathcal{Q}$ and thereby the pair space, and no analogous bound is claimed for rounds that trigger Expand.

## Appendix D Likelihood labels

We use a discrete label set $\mathcal{L} = \{\texttt{likely}, \texttt{neutral}, \texttt{unlikely}\}$ to elicit priors and likelihoods from the LLM. Each label corresponds to a qualitative judgment grounded in the available context:

- `likely`: the value or answer is explicitly stated, strongly implied, or represents the most natural assumption given the prompt and context.
- `neutral`: the value or answer is plausible but not supported or contradicted by specific evidence; there is insufficient information to prefer it over alternatives.
- `unlikely`: the value or answer is contradicted by the prompt or context, or would require assumptions inconsistent with the provided information.

These qualitative labels are mapped to probabilities via a fixed function $\phi$, enabling consistent numerical priors and likelihoods across all LLM calls.

## Appendix E Hyperparameters

We perform a grid search over the hyperparameters listed in [Table 3](https://arxiv.org/html/2605.05386#A5.T3). Fixed parameters are shared across all datasets. Swept parameters are varied independently. The label set is $\mathcal{L} = \{\texttt{likely}, \texttt{neutral}, \texttt{unlikely}\}$ with map $\phi(\texttt{likely}) = 0.8$, $\phi(\texttt{neutral}) = 0.5$, $\phi(\texttt{unlikely}) = 0.2$. This mapping is not tuned for performance, but chosen as simple and heuristically reasonable. The ablation in [Table 7](https://arxiv.org/html/2605.05386#A9.T7) shows it performs well in practice.
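As an illustration of how the label map $\phi$ might turn elicited labels into a prior over one dimension's values, consider the sketch below. The $\phi$ values are those stated above; the normalization step and the function name are assumptions, since this appendix does not spell out how the mapped scores are combined.

```python
# Label-to-probability map phi, with the values stated in the text.
PHI = {"likely": 0.8, "neutral": 0.5, "unlikely": 0.2}


def prior_from_labels(labels):
    """Map one elicited label per value to a normalized prior (assumed scheme)."""
    weights = [PHI[label] for label in labels]
    total = sum(weights)
    return [w / total for w in weights]


# One label per candidate value of a single dimension.
prior = prior_from_labels(["likely", "neutral", "unlikely", "unlikely"])
print([round(p, 3) for p in prior])
```

Because no label maps to 0 or 1, every value keeps nonzero prior mass, so later evidence can still overturn an initial `unlikely` judgment.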

Table 3: Hyperparameter settings.
## Appendix F Semantic Equivalence Metric

For tasks where no single canonical ground\-truth answer exists but multiple phrasings can express the same underlying meaning, such as AR\-Bench\-SP, metrics such as character\-level F1 similarity are inadequate: a random permutation of the ground\-truth characters can achieve a perfect score while a faithful paraphrase may receive a low one\.

Following Kuhn et al. [[2023](https://arxiv.org/html/2605.05386#bib.bib8)], let $a$ denote the agent’s predicted answer and $a^{*}$ the ground-truth solution, both conditioned on the same prompt $\mathbf{p}$. We use a judge LLM $\mathcal{J}$ (Qwen2.5-32B-Instruct) to assess the directional entailment relation

$$
\mathcal{J}(a_{1}, a_{2} \mid \mathbf{p}) \;\in\; \{\texttt{entailment},\, \texttt{neutral},\, \texttt{contradiction}\},
$$

which evaluates whether $a_{1}$ semantically entails $a_{2}$ in the context of $\mathbf{p}$. We query $\mathcal{J}$ in both directions, obtaining

$$
e_{\rightarrow} = \mathcal{J}(a, a^{*} \mid \mathbf{p}), \qquad e_{\leftarrow} = \mathcal{J}(a^{*}, a \mid \mathbf{p}).
$$

We declare $a$ and $a^{*}$ *semantically equivalent* (non-strict) when neither direction is a contradiction and the pair is not jointly neutral:

$$
\mathrm{Eq}(a, a^{*}) \;=\; \mathbf{1}\!\left[\left(\texttt{contradiction} \notin \{e_{\rightarrow}, e_{\leftarrow}\}\right) \;\wedge\; \left((e_{\rightarrow}, e_{\leftarrow}) \neq (\texttt{neutral}, \texttt{neutral})\right)\right].
$$

This criterion accepts a prediction when entailment holds in at least one direction and neither direction is a contradiction. It reflects the nature of such tasks, where multiple correct phrasings of the hidden explanation are possible and a partial match (one direction entailment, the other neutral) often corresponds to a correct answer that omits minor details.
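The decision rule can be written directly as a small predicate over the two judge verdicts. This sketch covers only the combination logic; obtaining $e_{\rightarrow}$ and $e_{\leftarrow}$ from the judge LLM is outside its scope, and the function name is hypothetical.

```python
VERDICTS = {"entailment", "neutral", "contradiction"}


def semantically_equivalent(e_forward, e_backward):
    """Non-strict equivalence: no contradiction and not jointly neutral."""
    if e_forward not in VERDICTS or e_backward not in VERDICTS:
        raise ValueError("verdicts must be entailment, neutral, or contradiction")
    no_contradiction = "contradiction" not in (e_forward, e_backward)
    not_jointly_neutral = (e_forward, e_backward) != ("neutral", "neutral")
    return no_contradiction and not_jointly_neutral


print(semantically_equivalent("entailment", "neutral"))        # partial match
print(semantically_equivalent("neutral", "neutral"))           # jointly neutral
print(semantically_equivalent("entailment", "contradiction"))  # contradicted
```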

## Appendix G Main Results

We report full numerical results for all agent models and baselines in [Table 4](https://arxiv.org/html/2605.05386#A7.T4), complementing the bar charts in [Figure 8](https://arxiv.org/html/2605.05386#S6.F8).

Table 4: Main results (%). Columns correspond to agent models. Outcome is *exact accuracy* for AR-Bench-DC and iCraft-MD, and *non-strict semantic equivalence* for AR-Bench-SP. $T_{\text{ask}} = 25$, and standard errors are computed as $\sqrt{p(1-p)/n}$.
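For reference, the standard-error formula in the caption is the usual binomial one, with $p$ the observed accuracy as a proportion and $n$ the number of evaluated instances. A minimal sketch follows; the sample size in the example is hypothetical and not taken from the paper.

```python
import math


def standard_error(p, n):
    """Binomial standard error sqrt(p * (1 - p) / n) for an accuracy p over n cases."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("p must lie in [0, 1] and n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical example: 50% accuracy measured on 100 instances.
print(standard_error(0.5, 100))
```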
## Appendix H Oracle

##### Oracle\.

The LLM is given the prompt, public context, and all users’ private information, providing an upper bound on performance\.

##### Oracle dimensions\.

The oracle returns a minimal set of ambiguity dimensions and their ground-truth values. [Table 5](https://arxiv.org/html/2605.05386#A8.T5) reports the average number of dimensions ($\pm$ SE), measuring intrinsic disambiguation complexity.

Table 5: Oracle Dimensions. Entries are the average number of oracle dimensions required to resolve a case, reported as mean $\pm$ SE over instances.
##### Oracle accuracy\.

We report oracle accuracy for iCraft-MD in [Table 6](https://arxiv.org/html/2605.05386#A8.T6). For AR-Bench-DC and AR-Bench-SP, full private information deterministically fixes the answer, so oracle accuracy is trivially $100\%$ and omitted.

Table 6: Oracle accuracy (%) on iCraft-MD: upper bound when the agent has access to all patient information.

[Table 5](https://arxiv.org/html/2605.05386#A8.T5) shows that resolving an ambiguous prompt requires on average $3$–$5$ disambiguating dimensions, with larger models proposing fewer, more precise dimensions. On iCraft-MD, oracle accuracy ranges from $65.9\%$ to $83.6\%$ in [Table 6](https://arxiv.org/html/2605.05386#A8.T6), reflecting irreducible ambiguity in the underlying clinical questions. BALAR with Qwen2.5-32B-Instruct achieves $73.6\%$ without access to private information, closing $88\%$ of the gap to the oracle upper bound.

## Appendix I Prior Sensitivity

Table 7: Prior sensitivity ablation on iCraft-MD / Qwen2.5-32B-Instruct. Config: $T_{\text{ask}} = 25$, $\alpha = 0.1$, $p = 1$, $|\mathcal{Q}| = 2$.

We study the effect of different prior mappings. [Table 7](https://arxiv.org/html/2605.05386#A9.T7) shows that BALAR is robust to moderate changes in the prior mapping: flatter priors incur only a $3.9\%$ relative drop. However, overly sharp priors degrade performance substantially ($29.1\%$ relative drop), as they over-concentrate the initial belief.

## Appendix J Round Distribution

We analyze how BALAR allocates its interaction budget across Ask and Expand actions. [Figure 10](https://arxiv.org/html/2605.05386#A10.F10) shows the distribution of the number of Ask and Expand rounds per instance under Qwen2.5-32B-Instruct. On AR-Bench-DC, nearly all runs exhaust the full Ask budget with few Expand calls, suggesting the initial dimensions suffice. On AR-Bench-SP, Expand is called $1$–$5$ times per run. On iCraft-MD, Expand is called $5$–$8$ times consistently, indicating that clinical cases regularly require state refinement.

![Refer to caption](https://arxiv.org/html/2605.05386v1/x11.png)

Figure 10: Distribution of Ask and Expand rounds for BALAR under Qwen2.5-32B-Instruct on each dataset. Each panel shows normalized histograms over runs, so bar heights correspond to proportions. The dashed vertical line indicates the selected $T_{\mathrm{ask}}$ budget. Fixed configs: DC ($T_{\text{ask}} = 25$, $\alpha = 0.3$, $p = 5$, $|\mathcal{Q}| = 10$), SP ($T_{\text{ask}} = 25$, $\alpha = 0.3$, $\beta = 0.5$, $p = 5$, $|\mathcal{Q}| = 10$), iCraft-MD ($T_{\text{ask}} = 25$, $\alpha = 0.1$, $p = 1$, $|\mathcal{Q}| = 2$).
## Appendix K Prompts

All BALAR prompts consist of a *system prompt* and a *user prompt*. Each subsection below covers one prompt role and contains, for each of the three domains (AR-Bench-DC, AR-Bench-SP, iCraft-MD), the system prompt followed by the user prompt(s). Every prompt requests a short `reason` field before the structured output, implementing the lightweight chain-of-thought verification described in [Appendix A](https://arxiv.org/html/2605.05386#A1).

### K\.1 User Simulator

#### AR\-Bench\-DC

##### System prompt\.

You are simulating a suspect being interrogated in a murder investigation\. You must stay in character and answer based only on your character’s knowledge and story\. If you are the murderer, you will try to deflect suspicion while remaining consistent with your story\. If you are innocent, answer honestly based on what your character knows\.

##### With discrete choices\.

<USER\_CONTEXT\>

\{user\_context\}

</USER\_CONTEXT\>

<QUESTION\>

\{question\}

</QUESTION\>

<CHOICES\_WITH\_IDS\>

\{choices\_with\_ids\}

</CHOICES\_WITH\_IDS\>

Task:

You are a suspect being interrogated by a detective\. Answer <QUESTION\> in a way that is consistent with your character described in <USER\_CONTEXT\>\. If you are the murderer, you may deflect or be evasive, but you must remain consistent with your story\.

What to generate:

\- reason: a short one\-sentence explanation of why you chose the answer you did\.

\- answer\_id: the id of the choice you are selecting as your answer to the question\.

Constraints:

Use ONLY information that is supported by <USER\_CONTEXT\>\.

answer\_id must be one of the ids provided in <CHOICES\_WITH\_IDS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"answer\_id":string

\}\}
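The doubled braces around the JSON schema suggest that these templates are rendered with Python-style `str.format`, where `{{` and `}}` produce literal braces and `{question}` is a placeholder. That is an inference from the listing, not something the paper states. Under that assumption, the sketch below renders an abbreviated stand-in for the choice-based prompt and validates the simulator's reply against the offered ids. `TEMPLATE`, `render_prompt`, and `parse_choice_reply` are hypothetical names.

```python
import json

# Abbreviated stand-in for the choice-based user prompt (not the full template).
TEMPLATE = (
    "<QUESTION>\n{question}\n</QUESTION>\n"
    "<CHOICES_WITH_IDS>\n{choices_with_ids}\n</CHOICES_WITH_IDS>\n"
    "Return STRICT JSON only with the following schema:\n"
    '{{\n"reason": string,\n"answer_id": string\n}}'
)


def render_prompt(question, choices):
    """Fill the placeholders; doubled braces become literal single braces."""
    listing = "\n".join(f"{cid}: {text}" for cid, text in choices.items())
    return TEMPLATE.format(question=question, choices_with_ids=listing)


def parse_choice_reply(raw, valid_ids):
    """Parse the simulator's JSON reply and enforce the prompt's constraints."""
    reply = json.loads(raw)
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    if not isinstance(reply.get("reason"), str):
        raise ValueError("missing or non-string 'reason'")
    answer_id = reply.get("answer_id")
    if not isinstance(answer_id, str) or answer_id not in valid_ids:
        raise ValueError("answer_id is not one of the offered ids")
    return answer_id


choices = {"A": "Yes", "B": "No"}
print(render_prompt("Were you at home that evening?", choices))
print(parse_choice_reply('{"reason": "Stated in my story.", "answer_id": "B"}', choices))
```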

##### Without discrete choices \(free\-form\)\.

<USER\_CONTEXT\>

\{user\_context\}

</USER\_CONTEXT\>

<QUESTION\>

\{question\}

</QUESTION\>

Task:

You are a suspect being interrogated by a detective\. Answer <QUESTION\> in a way that is consistent with your character described in <USER\_CONTEXT\>\. If you are the murderer, you may deflect or be evasive, but you must remain consistent with your story\.

What to generate:

\- reason: a short one\-sentence explanation of why you chose the answer you did\.

\- answer: your answer to the question\.

Constraints:

Use ONLY information that is supported by <USER\_CONTEXT\>\.

answer must be a natural language answer to the question\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"answer":string

\}\}

#### AR\-Bench\-SP

##### System prompt\.

You are the host of a thinking puzzle \(situation puzzle\)\. You know the hidden explanation behind the puzzle\. Answer the solver’s questions truthfully based on the solution\. Never reveal the full solution directly; only confirm or deny specific aspects when asked\.

##### With discrete choices\.

<USER\_CONTEXT\>

\{user\_context\}

</USER\_CONTEXT\>

<QUESTION\>

\{question\}

</QUESTION\>

<CHOICES\_WITH\_IDS\>

\{choices\_with\_ids\}

</CHOICES\_WITH\_IDS\>

Task:

You are the host of a thinking puzzle\. A solver is asking you <QUESTION\> to try to figure out the hidden explanation\. Answer based on your knowledge of the puzzle solution described in <USER\_CONTEXT\>\. Give a truthful answer \-\-\- do not mislead, but do not give extra information\.

What to generate:

\- reason: a short one\-sentence explanation of why you chose the answer you did\.

\- answer\_id: the id of the choice you are selecting as your answer to the question\.

Constraints:

Use ONLY information that is supported by <USER\_CONTEXT\>\.

answer\_id must be one of the ids provided in <CHOICES\_WITH\_IDS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"answer\_id":string

\}\}

##### Without discrete choices \(free\-form\)\.

<USER\_CONTEXT\>

\{user\_context\}

</USER\_CONTEXT\>

<QUESTION\>

\{question\}

</QUESTION\>

Task:

You are the host of a thinking puzzle\. A solver is asking you <QUESTION\> to try to figure out the hidden explanation\. Answer based on your knowledge of the puzzle solution described in <USER\_CONTEXT\>\. Give a truthful answer \-\-\- do not mislead, but do not volunteer extra information\.

What to generate:

\- reason: a short one\-sentence explanation of why you chose the answer you did\.

\- answer: your answer to the question\.

Constraints:

Use ONLY information that is supported by <USER\_CONTEXT\>\.

answer must be a natural language answer to the question\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"answer":string

\}\}

#### iCraft\-MD

##### System prompt\.

You are simulating a patient being examined by a physician\. Answer questions based ONLY on your medical record and symptoms\. Use facts from your medical record\. Do not give information that was not specifically asked about\.

##### With discrete choices\.

<USER\_CONTEXT\>

\{user\_context\}

</USER\_CONTEXT\>

<QUESTION\>

\{question\}

</QUESTION\>

<CHOICES\_WITH\_IDS\>

\{choices\_with\_ids\}

</CHOICES\_WITH\_IDS\>

Task:

You are a patient being examined by a physician\. Answer <QUESTION\> in a way that is consistent with your medical record and symptoms described in <USER\_CONTEXT\>\. Use facts from your medical record\. Do not give information that was not specifically asked about\.

What to generate:

\- reason: a short one\-sentence explanation of why you chose the answer you did\.

\- answer\_id: the id of the choice you are selecting as your answer to the question\.

Constraints:

Use ONLY information that is supported by <USER\_CONTEXT\>\.

answer\_id must be one of the ids provided in <CHOICES\_WITH\_IDS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"answer\_id":string

\}\}

##### Without discrete choices \(free\-form\)\.

<USER\_CONTEXT\>

\{user\_context\}

</USER\_CONTEXT\>

<QUESTION\>

\{question\}

</QUESTION\>

Task:

You are a patient being examined by a physician\. Answer <QUESTION\> in a way that is consistent with your medical record and symptoms described in <USER\_CONTEXT\>\. Use facts from your medical record\. Do not give information that was not specifically asked about\.

What to generate:

\- reason: a short one\-sentence explanation of why you chose the answer you did\.

\- answer: your answer to the question\.

Constraints:

Use ONLY information that is supported by <USER\_CONTEXT\>\.

answer must be a natural language answer to the question\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"answer":string

\}\}

### K\.2 Step 1: Dimension Proposal

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective analyzing a murder case\. Your goal is to identify the key investigative dimensions that must be resolved to determine who the real murderer is\.

##### User prompt\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

Task:

Identify the key investigative dimensions in the <CASE\_QUESTION\> that must be resolved to determine who the real murderer is\.

Definition:

An investigative dimension is a specific aspect of the murder case \(e\.g\., motive, alibi, access to the murder weapon, relationship to the victim\) where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\. Once the dimension’s value is known, the case moves toward identifying a single suspect as the murderer\.

What to generate:

\- Produce a minimal, non\-overlapping set of investigative dimensions\.

\- Produce exactly \{num\_initial\_dims\} investigative dimensions\.

\- Each dimension must correspond to a distinct, investigative uncertainty\.

\- If <CASE\_BACKGROUND\> already resolves a dimension, do not include it\.

\- If <CASE\_BACKGROUND\> proposes some investigative dimensions, use them\.

For each investigative dimension, provide:

\- reason: a short one\-sentence explanation of why this dimension is critical for identifying the murderer\.

\- name: a short, specific label \(e\.g\., "Motive", "Alibi at time of death", "Access to murder weapon", etc\.\)

\- values: a list of plausible values \(e\.g\., one per suspect or per scenario\), no larger than \{max\_num\_values\_per\_dim\}, that this dimension could take in the context of the case\.

Constraints:

\- Use ONLY the information provided in <CASE\_QUESTION\> and <CASE\_BACKGROUND\>\.

\- Do NOT answer the <CASE\_QUESTION\> itself\. Focus ONLY on identifying investigative dimensions\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\> or <CASE\_BACKGROUND\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"dimensions":\[

\{\{

"reason":string,

"name":string,

"values":\[string,\.\.\.\]

\}\},

\.\.\.

\]

\}\}
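A reply to this prompt has to satisfy the two numeric constraints stated in the template: exactly `num_initial_dims` dimensions, and at most `max_num_values_per_dim` values each. The sketch below shows one way a caller might check them. The function name and the decision to raise `ValueError` (rather than, say, retry the call) are assumptions, not details from the paper.

```python
import json


def parse_dimensions(raw, num_initial_dims, max_num_values_per_dim):
    """Parse a dimension-proposal reply into {name: [values]} and check its shape."""
    data = json.loads(raw)
    dims = data.get("dimensions") if isinstance(data, dict) else None
    if not isinstance(dims, list) or len(dims) != num_initial_dims:
        raise ValueError(f"expected exactly {num_initial_dims} dimensions")
    parsed = {}
    for dim in dims:
        name, values = dim["name"], dim["values"]
        if not 1 <= len(values) <= max_num_values_per_dim:
            raise ValueError(f"dimension {name!r} has an invalid number of values")
        parsed[name] = list(values)
    return parsed


# Hypothetical model reply for a two-dimension request.
reply = json.dumps({
    "dimensions": [
        {"reason": "Establishes who could act.", "name": "Alibi",
         "values": ["verified", "unverified"]},
        {"reason": "Explains why.", "name": "Motive",
         "values": ["money", "revenge", "none"]},
    ]
})
print(parse_dimensions(reply, num_initial_dims=2, max_num_values_per_dim=4))
```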

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle solver\. Your goal is to identify the hidden dimensions of the puzzle \-\-\- the unstated aspects of the scenario whose true values would explain the strange situation presented\.

##### User prompt\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

Task:

Identify the hidden dimensions of the thinking puzzle in <PUZZLE\> that must be uncovered to explain the strange scenario\.

Definition:

A puzzle dimension is a hidden aspect of the scenario \-\-\- such as a non\-obvious word meaning, an unstated context, a surprising identity, or an unusual causal mechanism \-\-\- where knowing its true value would explain the puzzle\. Once the dimension’s value is known, the puzzle moves toward a single coherent explanation\.

What to generate:

\- Produce a minimal, non\-overlapping set of puzzle dimensions\.

\- Produce exactly \{num\_initial\_dims\} puzzle dimensions\.

\- Each dimension must correspond to a distinct hidden aspect of the puzzle\.

\- If <PUZZLE\_CONTEXT\> already resolves a dimension, do not include it\.

\- If <PUZZLE\_CONTEXT\> proposes some puzzle dimensions, use them\.

For each puzzle dimension, provide:

\- reason: a short one\-sentence explanation of why this dimension is a key unknown in the puzzle\.

\- name: a short, specific label

\- values: a list of plausible interpretations, no larger than \{max\_num\_values\_per\_dim\}, that this dimension could take\.

Constraints:

\- Use ONLY the information provided in <PUZZLE\> and <PUZZLE\_CONTEXT\>\.

\- Do NOT solve the <PUZZLE\> itself\. Focus ONLY on identifying hidden dimensions\.

\- Do NOT rewrite or restate the <PUZZLE\> or <PUZZLE\_CONTEXT\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"dimensions":\[

\{\{

"reason":string,

"name":string,

"values":\[string,\.\.\.\]

\}\},

\.\.\.

\]

\}\}

#### iCraft\-MD

##### System prompt\.

You are an experienced physician performing a diagnosis\. Your goal is to identify the key clinical dimensions that must be clarified to arrive at the correct diagnosis or clinical decision\.

##### User prompt\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

Task:

Identify the key clinical dimensions in the <CLINICAL\_QUESTION\> that must be clarified to arrive at the correct diagnosis or clinical decision\.

Definition:

A clinical dimension is a specific clinical factor \(e\.g\., symptom characterization, lab finding, risk factor, past medical history\) where different values would point toward different diagnoses or clinical decisions\. Once the dimension’s value is known, the diagnosis narrows toward a single correct answer\.

What to generate:

\- Produce a minimal, non\-overlapping set of clinical dimensions\.

\- Produce exactly \{num\_initial\_dims\} clinical dimensions\.

\- Each dimension must correspond to a distinct clinical uncertainty\.

\- If <PATIENT\_INFORMATION\> already resolves a dimension, do not include it\.

\- If <PATIENT\_INFORMATION\> proposes some clinical dimensions, use them\.

For each clinical dimension, provide:

\- reason: a short one\-sentence explanation of why this clinical factor is discriminating between diagnoses\.

\- name: a short, specific clinical label

\- values: a list of clinically plausible values, no larger than \{max\_num\_values\_per\_dim\}, that this dimension could take\.

Constraints:

\- Use ONLY the information provided in <CLINICAL\_QUESTION\> and <PATIENT\_INFORMATION\>\.

\- Do NOT answer the <CLINICAL\_QUESTION\> itself\. Focus ONLY on identifying clinical dimensions\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\> or <PATIENT\_INFORMATION\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"dimensions":\[

\{\{

"reason":string,

"name":string,

"values":\[string,\.\.\.\]

\}\},

\.\.\.

\]

\}\}

### K\.3 Step 2: Prior Elicitation

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective forming initial hypotheses about a murder case based on the available evidence and case background\.

##### User prompt\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUE\>

\{dimension\_value\}

</DIMENSION\_VALUE\>

Task:

Given <CASE\_QUESTION\> and <CASE\_BACKGROUND\>, judge how likely the investigative dimension <DIMENSION\_NAME\> takes on the value <DIMENSION\_VALUE\>\.

What to generate:

\- reason: a short one\-sentence explanation of why the <DIMENSION\_NAME\> is likely, unlikely, or neutral to take on the value <DIMENSION\_VALUE\>\.

\- label: one of "likely", "unlikely", or "neutral" according to the following definitions:

\- likely: <DIMENSION\_VALUE\> is explicitly stated, strongly implied, or is the most natural assumption given the evidence in the <CASE\_QUESTION\> and <CASE\_BACKGROUND\>\.

\- neutral: <DIMENSION\_VALUE\> is plausible but not implied or supported by specific evidence in the <CASE\_QUESTION\> or <CASE\_BACKGROUND\>\.

\- unlikely: <DIMENSION\_VALUE\> is contradicted by the <CASE\_QUESTION\> or <CASE\_BACKGROUND\>, or would require assumptions that are inconsistent with the available evidence\.

Constraints:

\- Use ONLY the information provided in <CASE\_QUESTION\> and <CASE\_BACKGROUND\>\.

\- Do NOT answer the <CASE\_QUESTION\> itself\. Focus ONLY on judging the likelihood of the dimension value\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\> or <CASE\_BACKGROUND\>\.

\- label must be one of "likely", "unlikely", or "neutral"\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"label":string

\}\}

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle solver forming initial hypotheses about the hidden aspects of a puzzle based on the scenario description\.

##### User prompt\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUE\>

\{dimension\_value\}

</DIMENSION\_VALUE\>

Task:

Given <PUZZLE\> and <PUZZLE\_CONTEXT\>, judge how likely the puzzle dimension <DIMENSION\_NAME\> takes on the value <DIMENSION\_VALUE\>\.

What to generate:

\- reason: a short one\-sentence explanation of why the <DIMENSION\_NAME\> is likely, unlikely, or neutral to take on the value <DIMENSION\_VALUE\>\.

\- label: one of "likely", "unlikely", or "neutral" according to the following definitions:

\- likely: <DIMENSION\_VALUE\> is explicitly suggested by, strongly implied by, or is the most natural interpretation given the clues in the <PUZZLE\> and <PUZZLE\_CONTEXT\>\.

\- neutral: <DIMENSION\_VALUE\> is a plausible interpretation but not implied or supported by specific clues in the <PUZZLE\> or <PUZZLE\_CONTEXT\>\.

\- unlikely: <DIMENSION\_VALUE\> is contradicted by the <PUZZLE\> or <PUZZLE\_CONTEXT\>, or would require assumptions that are inconsistent with the scenario\.

Constraints:

\- Use ONLY the information provided in <PUZZLE\> and <PUZZLE\_CONTEXT\>\.

\- Do NOT solve the <PUZZLE\> itself\. Focus ONLY on judging the likelihood of the dimension value\.

\- Do NOT rewrite or restate the <PUZZLE\> or <PUZZLE\_CONTEXT\>\.

\- label must be one of "likely", "unlikely", or "neutral"\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"label":string

\}\}
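The doubled braces around the JSON schema suggest that these templates are filled with Python's `str.format`, where `{{` and `}}` render as literal braces and single braces mark placeholders. The paper does not name its templating mechanism, so this is an assumption. A minimal sketch, using an abbreviated stand-in for the prompt above and a hypothetical `render_prompt` helper:

```python
# Abbreviated, hypothetical stand-in for the user prompt above.
# Single braces mark placeholders; doubled braces render as literal braces.
TEMPLATE = """<PUZZLE>
{ambiguous_prompt}
</PUZZLE>
<DIMENSION_NAME>
{dimension_name}
</DIMENSION_NAME>
<DIMENSION_VALUE>
{dimension_value}
</DIMENSION_VALUE>
Return STRICT JSON only with the following schema:
{{
"reason": string,
"label": string
}}"""


def render_prompt(ambiguous_prompt: str, dimension_name: str, dimension_value: str) -> str:
    """Fill the placeholders; str.format turns '{{' and '}}' into '{' and '}'."""
    return TEMPLATE.format(
        ambiguous_prompt=ambiguous_prompt,
        dimension_name=dimension_name,
        dimension_value=dimension_value,
    )


# Made-up inputs, for illustration only.
print(render_prompt("A man orders soup and leaves.", "reason_for_leaving", "he recognized the taste"))
```

Substituted values are inserted verbatim, so braces inside a puzzle text are not reinterpreted as placeholders.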

#### iCraft\-MD

##### System prompt\.

You are an experienced physician forming initial clinical hypotheses based on the available patient information\. Use your clinical knowledge to assess the likelihood of different clinical findings\.

##### User prompt\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUE\>

\{dimension\_value\}

</DIMENSION\_VALUE\>

Task:

Given <CLINICAL\_QUESTION\> and <PATIENT\_INFORMATION\>, judge how likely the clinical dimension <DIMENSION\_NAME\> takes on the value <DIMENSION\_VALUE\>\.

What to generate:

\- reason: a short one\-sentence explanation of why the <DIMENSION\_NAME\> is likely, unlikely, or neutral to take on the value <DIMENSION\_VALUE\>\.

\- label: one of "likely", "unlikely", or "neutral" according to the following definitions:

\- likely: <DIMENSION\_VALUE\> is explicitly stated, strongly implied, or is the most natural clinical assumption given the patient information in <CLINICAL\_QUESTION\> and <PATIENT\_INFORMATION\>\.

\- neutral: <DIMENSION\_VALUE\> is clinically plausible but not implied or supported by specific evidence in the <CLINICAL\_QUESTION\> or <PATIENT\_INFORMATION\>\.

\- unlikely: <DIMENSION\_VALUE\> is contradicted by the <CLINICAL\_QUESTION\> or <PATIENT\_INFORMATION\>, or would require assumptions that are inconsistent with the patient’s presentation\.

Constraints:

\- Use ONLY the information provided in <CLINICAL\_QUESTION\> and <PATIENT\_INFORMATION\>\.

\- Do NOT answer the <CLINICAL\_QUESTION\> itself\. Focus ONLY on judging the likelihood of the dimension value\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\> or <PATIENT\_INFORMATION\>\.

\- label must be one of "likely", "unlikely", or "neutral"\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"label":string

\}\}
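Each call to the prompts above returns one qualitative label per dimension value. One way to turn those labels into an initial belief is to map each label to a positive weight and normalize across the values of a dimension. The sketch below assumes that approach; the numeric weights and the helper names `parse_label` and `prior_from_labels` are hypothetical and are not taken from the paper.

```python
import json

# Hypothetical label weights; the paper's numeric mapping is not shown in this section.
LABEL_WEIGHTS = {"likely": 3.0, "neutral": 1.0, "unlikely": 0.2}


def parse_label(raw: str) -> str:
    """Parse one STRICT JSON response and return its validated label."""
    label = json.loads(raw)["label"]
    if label not in LABEL_WEIGHTS:
        raise ValueError(f"unexpected label: {label!r}")
    return label


def prior_from_labels(labels: dict[str, str]) -> dict[str, float]:
    """Map {dimension_value: label} to a normalized prior over the values."""
    weights = {value: LABEL_WEIGHTS[label] for value, label in labels.items()}
    total = sum(weights.values())
    return {value: weight / total for value, weight in weights.items()}


# Made-up responses for two values of one dimension.
labels = {
    "viral infection": parse_label('{"reason": "fever is reported", "label": "likely"}'),
    "allergic reaction": parse_label('{"reason": "no exposure mentioned", "label": "neutral"}'),
}
print(prior_from_labels(labels))
```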

### K\.4 Step 3 — Question Generation

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective preparing interrogation questions for suspects in a murder investigation\. Your questions should be designed to reveal inconsistencies, uncover motives, and verify alibis\.

##### User prompt\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<INVESTIGATIVE\_DIMENSIONS\>

\{dimensions\_with\_values\}

</INVESTIGATIVE\_DIMENSIONS\>

Task:

Given <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, and <INVESTIGATIVE\_DIMENSIONS\>, generate exactly \{num\_initial\_questions\} interrogation questions to ask the suspects that would help identify the real murderer\. Each question should target one or more <INVESTIGATIVE\_DIMENSIONS\> and have multiple\-choice answers\.

Definition:

<INVESTIGATIVE\_DIMENSIONS\> is a list of investigative dimensions, where each dimension has a name and a list of possible values it could take\. An investigative dimension is a specific aspect of the murder case where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\.

What to generate:

For each of the \{num\_initial\_questions\} questions, provide:

\- reason: a short one\-sentence explanation of why this question would help identify the murderer\.

\- question: the text of the interrogation question\.

\- choices: a list of multiple\-choice answer options for the question, no larger than \{max\_choices\_per\_question\}\.

Constraints:

\- Use ONLY the information provided in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, and <INVESTIGATIVE\_DIMENSIONS\>\.

\- Do NOT answer the <CASE\_QUESTION\> itself\. Focus ONLY on generating interrogation questions\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\> or <CASE\_BACKGROUND\>\.

\- Each question must be designed to elicit information about one or more of the dimensions in <INVESTIGATIVE\_DIMENSIONS\>\.

\- Each question must have multiple\-choice answers\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"questions":\[

\{\{

"reason":string,

"question":string,

"choices":\[string,\.\.\.\]

\}\},

\.\.\.

\]

\}\}

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle solver\. Generate clarifying questions to ask the puzzle host that will help you uncover the hidden explanation\. Good puzzle questions test specific hypotheses about what is really going on in the scenario\.

##### User prompt\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<PUZZLE\_DIMENSIONS\>

\{dimensions\_with\_values\}

</PUZZLE\_DIMENSIONS\>

Task:

Given <PUZZLE\>, <PUZZLE\_CONTEXT\>, and <PUZZLE\_DIMENSIONS\>, generate exactly \{num\_initial\_questions\} clarifying questions to ask the puzzle host that would help uncover the hidden explanation\. Each question should target one or more <PUZZLE\_DIMENSIONS\> and have "yes"/"no" answers\.

Definition:

<PUZZLE\_DIMENSIONS\> is a list of puzzle dimensions, where each dimension has a name and a list of possible values it could take\. A puzzle dimension is a hidden aspect of the scenario where knowing its true value would explain the puzzle\.

What to generate:

For each of the \{num\_initial\_questions\} questions, provide:

\- reason: a short one\-sentence explanation of why this question would help solve the puzzle\.

\- question: the text of the clarifying question to ask the puzzle host\.

\- choices: \["yes", "no"\] as the \{max\_choices\_per\_question\} multiple\-choice answer options for the question\.

Constraints:

\- Use ONLY the information provided in <PUZZLE\>, <PUZZLE\_CONTEXT\>, and <PUZZLE\_DIMENSIONS\>\.

\- Do NOT solve the <PUZZLE\> itself\. Focus ONLY on generating clarifying questions\.

\- Do NOT rewrite or restate the <PUZZLE\> or <PUZZLE\_CONTEXT\>\.

\- Each question must be designed to elicit information about one or more of the dimensions in <PUZZLE\_DIMENSIONS\>\.

\- Each question must have "yes"/"no" answers\.

\- Keep each question short: at most 20 words\.

\- Keep each reason short: at most 15 words\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"questions":\[

\{\{

"reason":string,

"question":string,

"choices":\["yes","no"\]

\}\},

\.\.\.

\]

\}\}

#### iCraft\-MD

##### System prompt\.

You are an experienced physician conducting a patient interview to gather clinical information for a diagnosis\. Your questions should be targeted, clinically relevant, and designed to discriminate between competing diagnoses\.

##### User prompt\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<CLINICAL\_DIMENSIONS\>

\{dimensions\_with\_values\}

</CLINICAL\_DIMENSIONS\>

Task:

Given <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, and <CLINICAL\_DIMENSIONS\>, generate exactly \{num\_initial\_questions\} clinical questions to ask the patient that would help arrive at the correct diagnosis\. Each question should target one or more <CLINICAL\_DIMENSIONS\> and have multiple\-choice answers\.

Definition:

<CLINICAL\_DIMENSIONS\> is a list of clinical dimensions, where each dimension has a name and a list of possible values it could take\. A clinical dimension is a specific clinical factor where different values would point toward different diagnoses\.

What to generate:

For each of the \{num\_initial\_questions\} questions, provide:

\- reason: a short one\-sentence explanation of why this question would help narrow the diagnosis\.

\- question: the text of the clinical question to ask the patient\.

\- choices: a list of multiple\-choice answer options for the question, no larger than \{max\_choices\_per\_question\}\.

Constraints:

\- Use ONLY the information provided in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, and <CLINICAL\_DIMENSIONS\>\.

\- Do NOT answer the <CLINICAL\_QUESTION\> itself\. Focus ONLY on generating clinical questions\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\> or <PATIENT\_INFORMATION\>\.

\- Each question must be designed to elicit information about one or more of the dimensions in <CLINICAL\_DIMENSIONS\>\.

\- Each question must have multiple\-choice answers\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"questions":\[

\{\{

"reason":string,

"question":string,

"choices":\[string,\.\.\.\]

\}\},

\.\.\.

\]

\}\}
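The question-generation prompts fix the number of questions and cap the number of choices, so a caller would plausibly check both before using the output. The sketch below shows one such check; the helper name `parse_questions` and the requirement of at least two choices per question are assumptions for illustration, not details from the paper.

```python
import json


def parse_questions(raw: str, num_questions: int, max_choices: int) -> list[dict]:
    """Parse the STRICT JSON output and enforce the constraints stated in the prompt."""
    questions = json.loads(raw)["questions"]
    if len(questions) != num_questions:
        raise ValueError(f"expected {num_questions} questions, got {len(questions)}")
    for q in questions:
        missing = {"reason", "question", "choices"} - q.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        # Assumption: a usable question needs at least two answer options.
        if not 2 <= len(q["choices"]) <= max_choices:
            raise ValueError(f"bad number of choices: {len(q['choices'])}")
    return questions


# Made-up model output, for illustration only.
raw = json.dumps({
    "questions": [
        {
            "reason": "Separates infectious from allergic causes.",
            "question": "Have you had a fever in the past week?",
            "choices": ["yes", "no", "not sure"],
        }
    ]
})
print(parse_questions(raw, num_questions=1, max_choices=4))
```

A failed check could trigger a retry of the generation call; whether the authors do this is not stated here.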

### K\.5 Step 4 — Likelihood Table Construction

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective evaluating how a suspect would likely respond to an interrogation question under different assumptions about the case\. Consider that guilty suspects may deflect, lie, or give evasive answers, while innocent suspects will answer based on their genuine knowledge\.

##### Without conversation history\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<SUSPECT\_INFO\>

\{user\_info\}

</SUSPECT\_INFO\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUES\_WITH\_IDS\>

\{dimension\_values\_with\_ids\}

</DIMENSION\_VALUES\_WITH\_IDS\>

<QUESTION\>

\{question\_text\}

</QUESTION\>

<QUESTION\_CHOICES\_WITH\_IDS\>

\{question\_choices\_with\_ids\}

</QUESTION\_CHOICES\_WITH\_IDS\>

Definition:

An investigative dimension is a specific aspect of the murder case where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\.

<DIMENSION\_VALUES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a possible value that the <DIMENSION\_NAME\> could take\.

<QUESTION\_CHOICES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a multiple\-choice answer option for the question\.

Let values\[i\] be the i\-th element of <DIMENSION\_VALUES\_WITH\_IDS\>\.

Let choices\[j\] be the j\-th element of <QUESTION\_CHOICES\_WITH\_IDS\>\.

Task \(row\-major order\):

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- Assume the true state of the case is <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- Impersonate the suspect described in <SUSPECT\_INFO\>\.

\- Judge how likely it is that this suspect would answer the question <QUESTION\> with choices\[j\]\["text"\] under that assumption\. Consider that a guilty suspect may try to deflect or mislead\.

What to generate:

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- question\_choice\_id: the id of the question choice being evaluated, i\.e\. choices\[j\]\["id"\]

\- dimension\_value\_id: the id of the dimension value being evaluated, i\.e\. values\[i\]\["id"\]

\- reason: a short one\-sentence explanation of why choices\[j\] is labeled likely/neutral/unlikely and why the other two labels were not chosen\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the suspect acts according to <SUSPECT\_INFO\>, the suspect is expected to give choices\[j\]\["text"\] for <QUESTION\>\.

\- "neutral": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the suspect acts according to <SUSPECT\_INFO\>, choices\[j\]\["text"\] is plausible but not specifically supported; there is insufficient evidence to say that the suspect would or would not prefer it over other choices\.

\- "unlikely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the suspect acts according to <SUSPECT\_INFO\>, the suspect is not expected to give choices\[j\]\["text"\] for <QUESTION\>\.

Constraints:

\- Use ONLY the information provided in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, and <SUSPECT\_INFO\>\.

\- Do NOT answer the <CASE\_QUESTION\> itself\. Focus ONLY on judging the likelihood of the question choices under different assumptions about the dimension value\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, or <SUSPECT\_INFO\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every combination of dimension value and question choice\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"evaluations":\[

\[

\{\{

"question\_choice\_id":string,

"dimension\_value\_id":string,

"reason":string,

"label":string

\}\},

\.\.\. // one object for each question choice

\],

\.\.\. // one array for each dimension value

\]

\}\}

The "evaluations" field must contain exactly \{num\_dimension\_values\} arrays \(one per dimension value\)\.

Each inner array must contain exactly \{num\_question\_choices\} objects \(one per question choice\)\.

##### With conversation history\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<SUSPECT\_INFO\>

\{user\_info\}

</SUSPECT\_INFO\>

<INTERROGATION\_LOG\>

\{conversation\_log\}

</INTERROGATION\_LOG\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUES\_WITH\_IDS\>

\{dimension\_values\_with\_ids\}

</DIMENSION\_VALUES\_WITH\_IDS\>

<QUESTION\>

\{question\_text\}

</QUESTION\>

<QUESTION\_CHOICES\_WITH\_IDS\>

\{question\_choices\_with\_ids\}

</QUESTION\_CHOICES\_WITH\_IDS\>

Definition:

An investigative dimension is a specific aspect of the murder case where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\.

<DIMENSION\_VALUES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a possible value that the <DIMENSION\_NAME\> could take\.

<QUESTION\_CHOICES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a multiple\-choice answer option for the question\.

<INTERROGATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the interrogation between the detective and the suspects up to this point\. This information may provide additional context and reveal inconsistencies\.

Let values\[i\] be the i\-th element of <DIMENSION\_VALUES\_WITH\_IDS\>\.

Let choices\[j\] be the j\-th element of <QUESTION\_CHOICES\_WITH\_IDS\>\.

Task \(row\-major order\):

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- Assume the true state of the case is <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- Impersonate the suspect described in <SUSPECT\_INFO\>\.

\- Judge how likely it is that this suspect would answer the question <QUESTION\> with choices\[j\]\["text"\] under that assumption\. Consider that a guilty suspect may try to deflect or mislead\.

What to generate:

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- question\_choice\_id: the id of the question choice being evaluated, i\.e\. choices\[j\]\["id"\]

\- dimension\_value\_id: the id of the dimension value being evaluated, i\.e\. values\[i\]\["id"\]

\- reason: a short one\-sentence explanation of why choices\[j\] is labeled likely/neutral/unlikely and why the other two labels were not chosen\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the suspect acts according to <SUSPECT\_INFO\>, the suspect is expected to give choices\[j\]\["text"\] for <QUESTION\>\.

\- "neutral": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the suspect acts according to <SUSPECT\_INFO\>, choices\[j\]\["text"\] is plausible but not specifically supported; there is insufficient evidence to say that the suspect would or would not prefer it over other choices\.

\- "unlikely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the suspect acts according to <SUSPECT\_INFO\>, the suspect is not expected to give choices\[j\]\["text"\] for <QUESTION\>\.

Constraints:

\- Use ONLY the information provided in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <SUSPECT\_INFO\>, and <INTERROGATION\_LOG\>\.

\- Do NOT answer the <CASE\_QUESTION\> itself\. Focus ONLY on judging the likelihood of the question choices under different assumptions about the dimension value\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <SUSPECT\_INFO\>, or <INTERROGATION\_LOG\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every combination of dimension value and question choice\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"evaluations":\[

\[

\{\{

"question\_choice\_id":string,

"dimension\_value\_id":string,

"reason":string,

"label":string

\}\},

\.\.\. // one object for each question choice

\],

\.\.\. // one array for each dimension value

\]

\}\}

The "evaluations" field must contain exactly \{num\_dimension\_values\} arrays \(one per dimension value\)\.

Each inner array must contain exactly \{num\_question\_choices\} objects \(one per question choice\)\.
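The row-major "evaluations" grid requested above has one row per dimension value and one column per question choice, so it can be read as an unnormalized likelihood table. The sketch below checks the grid's shape and ids, then normalizes each row into a distribution over answer choices given a dimension value. The label weights and the helper name `likelihood_table` are hypothetical; the paper's numeric mapping is not given in this section.

```python
# Hypothetical label weights; not taken from the paper.
LABEL_WEIGHTS = {"likely": 3.0, "neutral": 1.0, "unlikely": 0.2}


def likelihood_table(evaluations, value_ids, choice_ids):
    """Return table[i][j] ~ P(choice j | dimension value i) from a row-major label grid."""
    if len(evaluations) != len(value_ids):
        raise ValueError("expected one row per dimension value")
    table = []
    for row, value_id in zip(evaluations, value_ids):
        if len(row) != len(choice_ids):
            raise ValueError("expected one cell per question choice")
        weights = []
        for cell, choice_id in zip(row, choice_ids):
            # Each cell must refer to the (value, choice) pair at its grid position.
            if cell["dimension_value_id"] != value_id or cell["question_choice_id"] != choice_id:
                raise ValueError("cell ids do not match row-major order")
            weights.append(LABEL_WEIGHTS[cell["label"]])
        total = sum(weights)
        table.append([w / total for w in weights])
    return table


# Made-up 2 x 2 grid: two dimension values, two answer choices.
grid = [
    [
        {"question_choice_id": "c0", "dimension_value_id": "v0", "reason": "", "label": "likely"},
        {"question_choice_id": "c1", "dimension_value_id": "v0", "reason": "", "label": "unlikely"},
    ],
    [
        {"question_choice_id": "c0", "dimension_value_id": "v1", "reason": "", "label": "neutral"},
        {"question_choice_id": "c1", "dimension_value_id": "v1", "reason": "", "label": "neutral"},
    ],
]
print(likelihood_table(grid, ["v0", "v1"], ["c0", "c1"]))
```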

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle analyst\. Evaluate how a puzzle host who knows the hidden explanation would likely respond to a solver’s question under different assumptions about the puzzle’s hidden aspects\. The host answers truthfully, without giving extra information\.

##### Without conversation history\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<HOST\_INFO\>

\{user\_info\}

</HOST\_INFO\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUES\_WITH\_IDS\>

\{dimension\_values\_with\_ids\}

</DIMENSION\_VALUES\_WITH\_IDS\>

<QUESTION\>

\{question\_text\}

</QUESTION\>

<QUESTION\_CHOICES\_WITH\_IDS\>

\{question\_choices\_with\_ids\}

</QUESTION\_CHOICES\_WITH\_IDS\>

Definition:

A puzzle dimension is a hidden aspect of the scenario where knowing its true value would explain the puzzle\.

<DIMENSION\_VALUES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a possible value that the <DIMENSION\_NAME\> could take\.

<QUESTION\_CHOICES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a multiple\-choice answer option for the question\.

Let values\[i\] be the i\-th element of <DIMENSION\_VALUES\_WITH\_IDS\>\.

Let choices\[j\] be the j\-th element of <QUESTION\_CHOICES\_WITH\_IDS\>\.

Task \(row\-major order\):

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- Assume the hidden explanation is such that <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- Impersonate the puzzle host described in <HOST\_INFO\>, who knows the hidden explanation\.

\- Judge how likely it is that this host would answer the question <QUESTION\> with choices\[j\]\["text"\] under that assumption\. The host answers truthfully\.

What to generate:

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- question\_choice\_id: the id of the question choice being evaluated, i\.e\. choices\[j\]\["id"\]

\- dimension\_value\_id: the id of the dimension value being evaluated, i\.e\. values\[i\]\["id"\]

\- reason: a short one\-sentence explanation of why choices\[j\] is labeled likely/neutral/unlikely and why the other two labels were not chosen\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the host acts according to <HOST\_INFO\>, the host is expected to give choices\[j\]\["text"\] for <QUESTION\>\.

\- "neutral": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the host acts according to <HOST\_INFO\>, choices\[j\]\["text"\] is plausible but not specifically supported; there is insufficient evidence to say that the host would or would not prefer it over other choices\.

\- "unlikely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the host acts according to <HOST\_INFO\>, the host is not expected to give choices\[j\]\["text"\] for <QUESTION\>\.

Constraints:

\- Use ONLY the information provided in <PUZZLE\>, <PUZZLE\_CONTEXT\>, and <HOST\_INFO\>\.

\- Do NOT solve the <PUZZLE\> itself\. Focus ONLY on judging the likelihood of the question choices under different assumptions about the dimension value\.

\- Do NOT rewrite or restate the <PUZZLE\>, <PUZZLE\_CONTEXT\>, or <HOST\_INFO\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every combination of dimension value and question choice\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"evaluations":\[

\[

\{\{

"question\_choice\_id":string,

"dimension\_value\_id":string,

"reason":string,

"label":string

\}\},

\.\.\. // one object for each question choice

\],

\.\.\. // one array for each dimension value

\]

\}\}

The "evaluations" field must contain exactly \{num\_dimension\_values\} arrays \(one per dimension value\)\.

Each inner array must contain exactly \{num\_question\_choices\} objects \(one per question choice\)\.

##### With conversation history\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<HOST\_INFO\>

\{user\_info\}

</HOST\_INFO\>

<CONVERSATION\_LOG\>

\{conversation\_log\}

</CONVERSATION\_LOG\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUES\_WITH\_IDS\>

\{dimension\_values\_with\_ids\}

</DIMENSION\_VALUES\_WITH\_IDS\>

<QUESTION\>

\{question\_text\}

</QUESTION\>

<QUESTION\_CHOICES\_WITH\_IDS\>

\{question\_choices\_with\_ids\}

</QUESTION\_CHOICES\_WITH\_IDS\>

Definition:

A puzzle dimension is a hidden aspect of the scenario where knowing its true value would explain the puzzle\.

<DIMENSION\_VALUES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a possible value that the <DIMENSION\_NAME\> could take\.

<QUESTION\_CHOICES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a multiple\-choice answer option for the question\.

<CONVERSATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the conversation between the solver and the host up to this point\. This information may provide additional clues\.

Let values\[i\] be the i\-th element of <DIMENSION\_VALUES\_WITH\_IDS\>\.

Let choices\[j\] be the j\-th element of <QUESTION\_CHOICES\_WITH\_IDS\>\.

Task \(row\-major order\):

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- Assume the hidden explanation is such that <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- Impersonate the puzzle host described in <HOST\_INFO\>, who knows the hidden explanation\.

\- Judge how likely it is that this host would answer the question <QUESTION\> with choices\[j\]\["text"\] under that assumption\. The host answers truthfully\.

What to generate:

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- question\_choice\_id: the id of the question choice being evaluated, i\.e\. choices\[j\]\["id"\]

\- dimension\_value\_id: the id of the dimension value being evaluated, i\.e\. values\[i\]\["id"\]

\- reason: a short one\-sentence explanation of why choices\[j\] is labeled likely/neutral/unlikely and why the other two labels were not chosen\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the host acts according to <HOST\_INFO\>, the host is expected to give choices\[j\]\["text"\] for <QUESTION\>\.

\- "neutral": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the host acts according to <HOST\_INFO\>, choices\[j\]\["text"\] is plausible but not specifically supported; there is insufficient evidence to say that the host would or would not prefer it over other choices\.

\- "unlikely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the host acts according to <HOST\_INFO\>, the host is not expected to give choices\[j\]\["text"\] for <QUESTION\>\.

Constraints:

\- Use ONLY the information provided in <PUZZLE\>, <PUZZLE\_CONTEXT\>, <HOST\_INFO\>, and <CONVERSATION\_LOG\>\.

\- Do NOT solve the <PUZZLE\> itself\. Focus ONLY on judging the likelihood of the question choices under different assumptions about the dimension value\.

\- Do NOT rewrite or restate the <PUZZLE\>, <PUZZLE\_CONTEXT\>, <HOST\_INFO\>, or <CONVERSATION\_LOG\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every combination of dimension value and question choice\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"evaluations":\[

\[

\{\{

"question\_choice\_id":string,

"dimension\_value\_id":string,

"reason":string,

"label":string

\}\},

\.\.\. // one object for each question choice

\],

\.\.\. // one array for each dimension value

\]

\}\}

The "evaluations" field must contain exactly \{num\_dimension\_values\} arrays \(one per dimension value\)\.

Each inner array must contain exactly \{num\_question\_choices\} objects \(one per question choice\)\.

#### iCraft\-MD

##### System prompt\.

You are an experienced physician evaluating how a patient would likely respond to a clinical question under different assumptions about their underlying condition\.

##### Without conversation history\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<PATIENT\_PROFILE\>

\{user\_info\}

</PATIENT\_PROFILE\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUES\_WITH\_IDS\>

\{dimension\_values\_with\_ids\}

</DIMENSION\_VALUES\_WITH\_IDS\>

<QUESTION\>

\{question\_text\}

</QUESTION\>

<QUESTION\_CHOICES\_WITH\_IDS\>

\{question\_choices\_with\_ids\}

</QUESTION\_CHOICES\_WITH\_IDS\>

Definition:

A clinical dimension is a specific clinical factor where different values would point toward different diagnoses or clinical decisions\.

<DIMENSION\_VALUES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a possible value that the <DIMENSION\_NAME\> could take\.

<QUESTION\_CHOICES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a multiple\-choice answer option for the question\.

Let values\[i\] be the i\-th element of <DIMENSION\_VALUES\_WITH\_IDS\>\.

Let choices\[j\] be the j\-th element of <QUESTION\_CHOICES\_WITH\_IDS\>\.

Task \(row\-major order\):

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- Assume the patient’s true clinical state is <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- Impersonate the patient described in <PATIENT\_PROFILE\>\.

\- Judge how likely it is that this patient would answer the question <QUESTION\> with choices\[j\]\["text"\] under that assumption\.

What to generate:

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- question\_choice\_id: the id of the question choice being evaluated, i\.e\. choices\[j\]\["id"\]

\- dimension\_value\_id: the id of the dimension value being evaluated, i\.e\. values\[i\]\["id"\]

\- reason: a short one\-sentence explanation of why choices\[j\] is labeled likely/neutral/unlikely and why the other two labels were not chosen\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the patient acts according to <PATIENT\_PROFILE\>, the patient is expected to give choices\[j\]\["text"\] for <QUESTION\>\.

\- "neutral": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the patient acts according to <PATIENT\_PROFILE\>, choices\[j\]\["text"\] is plausible but not specifically supported; there is insufficient evidence to say that the patient would or would not prefer it over other choices\.

\- "unlikely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the patient acts according to <PATIENT\_PROFILE\>, the patient is not expected to give choices\[j\]\["text"\] for <QUESTION\>\.

Constraints:

\- Use ONLY the information provided in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, and <PATIENT\_PROFILE\>\.

\- Do NOT answer the <CLINICAL\_QUESTION\> itself\. Focus ONLY on judging the likelihood of the question choices under different assumptions about the dimension value\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, or <PATIENT\_PROFILE\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every combination of dimension value and question choice\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"evaluations":\[

\[

\{\{

"question\_choice\_id":string,

"dimension\_value\_id":string,

"reason":string,

"label":string

\}\},

\.\.\. // one object for each question choice

\],

\.\.\. // one array for each dimension value

\]

\}\}

The "evaluations" field must contain exactly \{num\_dimension\_values\} arrays \(one per dimension value\)\.

Each inner array must contain exactly \{num\_question\_choices\} objects \(one per question choice\)\.

##### With conversation history\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<PATIENT\_PROFILE\>

\{user\_info\}

</PATIENT\_PROFILE\>

<CLINICAL\_INTERVIEW\_LOG\>

\{conversation\_log\}

</CLINICAL\_INTERVIEW\_LOG\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUES\_WITH\_IDS\>

\{dimension\_values\_with\_ids\}

</DIMENSION\_VALUES\_WITH\_IDS\>

<QUESTION\>

\{question\_text\}

</QUESTION\>

<QUESTION\_CHOICES\_WITH\_IDS\>

\{question\_choices\_with\_ids\}

</QUESTION\_CHOICES\_WITH\_IDS\>

Definition:

A clinical dimension is a specific clinical factor where different values would point toward different diagnoses or clinical decisions\.

<DIMENSION\_VALUES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a possible value that the <DIMENSION\_NAME\> could take\.

<QUESTION\_CHOICES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a multiple\-choice answer option for the question\.

<CLINICAL\_INTERVIEW\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the clinical interview between the physician and the patient up to this point\. This information may reveal additional symptoms or clinical details\.

Let values\[i\] be the i\-th element of <DIMENSION\_VALUES\_WITH\_IDS\>\.

Let choices\[j\] be the j\-th element of <QUESTION\_CHOICES\_WITH\_IDS\>\.

Task \(row\-major order\):

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- Assume the patient’s true clinical state is <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- Impersonate the patient described in <PATIENT\_PROFILE\>\.

\- Judge how likely it is that this patient would answer the question <QUESTION\> with choices\[j\]\["text"\] under that assumption\.

What to generate:

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(choices\)\-1:

\- question\_choice\_id: the id of the question choice being evaluated, i\.e\. choices\[j\]\["id"\]

\- dimension\_value\_id: the id of the dimension value being evaluated, i\.e\. values\[i\]\["id"\]

\- reason: a short one\-sentence explanation of why choices\[j\] is labeled likely/neutral/unlikely and why the other two labels were not chosen\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the patient acts according to <PATIENT\_PROFILE\>, the patient is expected to give choices\[j\]\["text"\] for <QUESTION\>\.

\- "neutral": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the patient acts according to <PATIENT\_PROFILE\>, choices\[j\]\["text"\] is plausible but not specifically supported; there is insufficient evidence to say that the patient would or would not prefer it over other choices\.

\- "unlikely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the patient acts according to <PATIENT\_PROFILE\>, the patient is not expected to give choices\[j\]\["text"\] for <QUESTION\>\.

Constraints:

\- Use ONLY the information provided in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <PATIENT\_PROFILE\>, and <CLINICAL\_INTERVIEW\_LOG\>\.

\- Do NOT answer the <CLINICAL\_QUESTION\> itself\. Focus ONLY on judging the likelihood of the question choices under different assumptions about the dimension value\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <PATIENT\_PROFILE\>, or <CLINICAL\_INTERVIEW\_LOG\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every combination of dimension value and question choice\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"evaluations":\[

\[

\{\{

"question\_choice\_id":string,

"dimension\_value\_id":string,

"reason":string,

"label":string

\}\},

\.\.\. // one object for each question choice

\],

\.\.\. // one array for each dimension value

\]

\}\}

The "evaluations" field must contain exactly \{num\_dimension\_values\} arrays \(one per dimension value\)\.

Each inner array must contain exactly \{num\_question\_choices\} objects \(one per question choice\)\.
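The abstract states that BALAR selects clarifying questions by maximizing expected mutual information. Given a belief over the values of one dimension and a likelihood table of the kind these prompts elicit, the mutual information between the latent value and the answer can be computed directly. The sketch below assumes a single categorical dimension and a row-normalized table; the function names are hypothetical, and the paper's structured belief may combine several dimensions differently.

```python
import math


def expected_information_gain(belief, likelihood):
    """Mutual information I(S; A) in nats, with belief[i] = P(s_i), likelihood[i][j] = P(a_j | s_i)."""
    num_choices = len(likelihood[0])
    # Marginal probability of each answer under the current belief.
    marginal = [
        sum(p_s * row[j] for p_s, row in zip(belief, likelihood))
        for j in range(num_choices)
    ]
    gain = 0.0
    for p_s, row in zip(belief, likelihood):
        for p_a_given_s, p_a in zip(row, marginal):
            if p_s > 0 and p_a_given_s > 0:  # zero-probability terms contribute nothing
                gain += p_s * p_a_given_s * math.log(p_a_given_s / p_a)
    return gain


def best_question(belief, tables):
    """Return the index of the likelihood table with the largest expected gain."""
    return max(range(len(tables)), key=lambda k: expected_information_gain(belief, tables[k]))


belief = [0.5, 0.5]
informative = [[0.9, 0.1], [0.2, 0.8]]    # answers depend on the latent value
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # answers carry no information
print(best_question(belief, [uninformative, informative]))
```

A question whose answer distribution is the same under every dimension value scores zero, so it is never preferred over a discriminating one.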

### K\.6 Soft\-Map Scoring

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective analyzing a suspect’s response to an interrogation question\. Your task is to judge how well the suspect’s answer maps to each of the predefined answer choices\.

##### User prompt\.

<QUESTION\>

\{question\}

</QUESTION\>

<CHOICES\_WITH\_IDS\>

\{choices\_with\_ids\}

</CHOICES\_WITH\_IDS\>

<SUSPECT\_ANSWER\>

\{user\_answer\}

</SUSPECT\_ANSWER\>

Definition:

<CHOICES\_WITH\_IDS\> is a list of dicts with "id" and "value" fields\. Each dict corresponds to a multiple\-choice answer option for the question\.

Let choices\[i\] be the i\-th element of <CHOICES\_WITH\_IDS\>\.

Task:

Judge how well the <SUSPECT\_ANSWER\> maps to each of the choices in <CHOICES\_WITH\_IDS\> for the question <QUESTION\>\.

What to generate:

For i = 0\.\.len\(choices\)\-1:

\- choice\_id: the id of the question choice being evaluated, i\.e\. choices\[i\]\["id"\]

\- reason: a short one\-sentence explanation of choices\[i\]\["value"\] is likely/neutral/unlikely given the <SUSPECT\_ANSWER\>\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": choices\[i\]\["value"\] aligns well with the <SUSPECT\_ANSWER\> and fits it better than most other choices\.

\- "neutral": choices\[i\]\["value"\] is neither clearly supported nor clearly contradicted by the <SUSPECT\_ANSWER\>\.

\- "unlikely": choices\[i\]\["value"\] fits the <SUSPECT\_ANSWER\> worse than other choices, or conflicts with the meaning of the <SUSPECT\_ANSWER\>\.

Constraints:

\- Use ONLY the information provided in <QUESTION\>, <CHOICES\_WITH\_IDS\>, and <SUSPECT\_ANSWER\>\.

\- Do NOT answer the <QUESTION\> itself\. Focus ONLY on judging how well the <SUSPECT\_ANSWER\> maps to the provided choices\.

\- Do NOT rewrite or restate the <QUESTION\> or <SUSPECT\_ANSWER\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every choice in <CHOICES\_WITH\_IDS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"scores":\[

\{\{

"choice\_id":string,

"reason":string,

"label":string

\}\},

\.\.\.

\]

\}\}
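One way to turn the labelled scores above into a soft assignment over the answer choices is to map each label to a weight and normalize. This is a sketch under stated assumptions: the numeric weights in `LABEL_WEIGHTS` and the helper name `soft_map` are hypothetical and are not taken from this section.

```python
import json

# Hypothetical label weights; the values actually used are not given in this section.
LABEL_WEIGHTS = {"likely": 3.0, "neutral": 1.0, "unlikely": 0.25}


def soft_map(raw, choice_ids):
    """Convert a {"scores": [...]} reply into a normalized weight per choice id."""
    scores = json.loads(raw)["scores"]
    label_by_id = {s["choice_id"]: s["label"] for s in scores}
    missing = [c for c in choice_ids if c not in label_by_id]
    if missing:
        # The prompt requires an entry for every choice.
        raise ValueError("reply is missing choices: %s" % missing)
    weights = {c: LABEL_WEIGHTS[label_by_id[c]] for c in choice_ids}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```

The resulting dictionary sums to one, so a free-text answer contributes graded rather than all-or-nothing evidence for each predefined choice.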

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle analyst\. Your task is to judge how well a puzzle host’s response maps to each of the predefined answer choices\. The host gives truthful answers \("yes" or "no"\)\.

##### User prompt\.

<QUESTION\>

\{question\}

</QUESTION\>

<CHOICES\_WITH\_IDS\>

\{choices\_with\_ids\}

</CHOICES\_WITH\_IDS\>

<HOST\_ANSWER\>

\{user\_answer\}

</HOST\_ANSWER\>

Definition:

<CHOICES\_WITH\_IDS\> is a list of dicts with "id" and "value" fields\. Each dict corresponds to a multiple\-choice answer option for the question\.

Let choices\[i\] be the i\-th element of <CHOICES\_WITH\_IDS\>\.

Task:

Judge how well the <HOST\_ANSWER\> maps to each of the choices in <CHOICES\_WITH\_IDS\> for the question <QUESTION\>\.

What to generate:

For i=0\.\.len\(choices\)\-1:

\- choice\_id: the id of the question choice being evaluated, i\.e\. choices\[i\]\["id"\]

\- reason: a short one\-sentence explanation of choices\[i\]\["value"\] is likely/neutral/unlikely given the <HOST\_ANSWER\>\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": choices\[i\]\["value"\] aligns well with the <HOST\_ANSWER\> and fits it better than most other choices\.

\- "neutral": choices\[i\]\["value"\] is neither clearly supported nor clearly contradicted by the <HOST\_ANSWER\>\.

\- "unlikely": choices\[i\]\["value"\] fits the <HOST\_ANSWER\> worse than other choices, or conflicts with the meaning of the <HOST\_ANSWER\>\.

Constraints:

\- Use ONLY the information provided in <QUESTION\>, <CHOICES\_WITH\_IDS\>, and <HOST\_ANSWER\>\.

\- Do NOT answer the <QUESTION\> itself\. Focus ONLY on judging how well the <HOST\_ANSWER\> maps to the provided choices\.

\- Do NOT rewrite or restate the <QUESTION\> or <HOST\_ANSWER\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every choice in <CHOICES\_WITH\_IDS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"scores":\[

\{\{

"choice\_id":string,

"reason":string,

"label":string

\}\},

\.\.\.

\]

\}\}
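The doubled braces in the schemas are consistent with templates rendered through Python's `str.format`, where `{{` and `}}` produce literal braces and single-brace names such as `{question}` are substituted. The source does not state how the templates are rendered, so the sketch below is an assumption; the abbreviated `TEMPLATE` string and the sample question are illustrative only.

```python
# Abbreviated, hypothetical template in the style of the prompts above.
TEMPLATE = (
    "<QUESTION>\n{question}\n</QUESTION>\n"
    "Return STRICT JSON only with the following schema:\n"
    '{{\n  "scores": [\n'
    '    {{"choice_id": string, "reason": string, "label": string}}\n'
    "  ]\n}}"
)

# str.format fills {question} and collapses each doubled brace to a single one.
rendered = TEMPLATE.format(question="Is the man a sailor?")
print(rendered)
```

Under this reading, the model sees ordinary single-brace JSON in the rendered prompt.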

#### iCraft\-MD

##### System prompt\.

You are an experienced physician interpreting a patient’s response to a clinical question\. Your task is to judge how well the patient’s answer maps to each of the predefined answer choices\.

##### User prompt\.

<QUESTION\>

\{question\}

</QUESTION\>

<CHOICES\_WITH\_IDS\>

\{choices\_with\_ids\}

</CHOICES\_WITH\_IDS\>

<PATIENT\_ANSWER\>

\{user\_answer\}

</PATIENT\_ANSWER\>

Definition:

<CHOICES\_WITH\_IDS\> is a list of dicts with "id" and "value" fields\. Each dict corresponds to a multiple\-choice answer option for the question\.

Let choices\[i\] be the i\-th element of <CHOICES\_WITH\_IDS\>\.

Task:

Judge how well the <PATIENT\_ANSWER\> maps to each of the choices in <CHOICES\_WITH\_IDS\> for the question <QUESTION\>\.

What to generate:

For i=0\.\.len\(choices\)\-1:

\- choice\_id: the id of the question choice being evaluated, i\.e\. choices\[i\]\["id"\]

\- reason: a short one\-sentence explanation of choices\[i\]\["value"\] is likely/neutral/unlikely given the <PATIENT\_ANSWER\>\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": choices\[i\]\["value"\] aligns well with the <PATIENT\_ANSWER\> and fits it better than most other choices\.

\- "neutral": choices\[i\]\["value"\] is neither clearly supported nor clearly contradicted by the <PATIENT\_ANSWER\>\.

\- "unlikely": choices\[i\]\["value"\] fits the <PATIENT\_ANSWER\> worse than other choices, or conflicts with the meaning of the <PATIENT\_ANSWER\>\.

Constraints:

\- Use ONLY the information provided in <QUESTION\>, <CHOICES\_WITH\_IDS\>, and <PATIENT\_ANSWER\>\.

\- Do NOT answer the <QUESTION\> itself\. Focus ONLY on judging how well the <PATIENT\_ANSWER\> maps to the provided choices\.

\- Do NOT rewrite or restate the <QUESTION\> or <PATIENT\_ANSWER\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every choice in <CHOICES\_WITH\_IDS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"scores":\[

\{\{

"choice\_id":string,

"reason":string,

"label":string

\}\},

\.\.\.

\]

\}\}

### K\.7 Expand: New Dimension Proposal

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective who needs to explore a new line of investigation in a murder case\. The current investigative dimensions have not been sufficient to identify the murderer, so you must identify a new aspect of the case to investigate\.

##### User prompt\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<PAST\_INVESTIGATIVE\_DIMENSIONS\>

\{past\_dimensions\}

</PAST\_INVESTIGATIVE\_DIMENSIONS\>

<INTERROGATION\_LOG\>

\{conversation\_log\}

</INTERROGATION\_LOG\>

Definition:

An investigative dimension is a specific aspect of the murder case where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\.

<PAST\_INVESTIGATIVE\_DIMENSIONS\> is a list of dicts with "name" fields, representing the investigative dimensions that have already been explored in the investigation\.

<INTERROGATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the interrogation between the detective and the suspects up to this point\.

Task:

Identify a new investigative dimension in the murder case described by <CASE\_QUESTION\> that has not been previously identified in <PAST\_INVESTIGATIVE\_DIMENSIONS\>\. Use insights from the <INTERROGATION\_LOG\> to guide your choice\.

What to generate:

\- reason: a short one\-sentence explanation of why this dimension is a critical new line of investigation\.

\- name: a short, specific label for this investigative dimension \(e\.g\., "Forensic evidence", "Financial motive", "Witness credibility", etc\.\)\.

\- values: a list of plausible values, no larger than \{max\_num\_values\_per\_dim\}, that this dimension could take in the context of the case\.

Constraints:

\- Use ONLY the information provided in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <PAST\_INVESTIGATIVE\_DIMENSIONS\>, and <INTERROGATION\_LOG\>\.

\- Do NOT answer the <CASE\_QUESTION\> itself\. Focus ONLY on identifying a new investigative dimension\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <PAST\_INVESTIGATIVE\_DIMENSIONS\>, or <INTERROGATION\_LOG\>\.

\- The generated dimension name must not be the same as any of the names in <PAST\_INVESTIGATIVE\_DIMENSIONS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"name":string,

"values":\[string,\.\.\.\]

\}\}
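The two hard constraints in this prompt, a name not already in the past dimensions and at most `max_num_values_per_dim` values, lend themselves to a post-hoc check. The sketch below is illustrative: `validate_dimension` is a hypothetical helper, and comparing names case-insensitively is an assumption that is stricter than the prompt's wording.

```python
import json


def validate_dimension(raw, past_dimensions, max_num_values_per_dim):
    """Check a proposed dimension against the prompt's constraints (hypothetical helper)."""
    proposal = json.loads(raw)
    name = proposal["name"].strip()
    # Assumption: names are compared case-insensitively to catch near-duplicates.
    past_names = {d["name"].strip().lower() for d in past_dimensions}
    if name.lower() in past_names:
        raise ValueError("dimension %r was already explored" % name)
    values = list(proposal["values"])
    if not values or len(values) > max_num_values_per_dim:
        raise ValueError("values must be non-empty and within the size limit")
    return {"name": name, "values": values}
```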

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle solver\. The puzzle dimensions explored so far have not been sufficient to explain the puzzle, so you must identify a new hidden aspect of the scenario to investigate\.

##### User prompt\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<PAST\_PUZZLE\_DIMENSIONS\>

\{past\_dimensions\}

</PAST\_PUZZLE\_DIMENSIONS\>

<CONVERSATION\_LOG\>

\{conversation\_log\}

</CONVERSATION\_LOG\>

Definition:

A puzzle dimension is a hidden aspect of the scenario where knowing its true value would explain the puzzle\.

<PAST\_PUZZLE\_DIMENSIONS\> is a list of dicts with "name" fields, representing the puzzle dimensions that have already been explored\.

<CONVERSATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the conversation between the solver and the host up to this point\.

Task:

Identify a new puzzle dimension in <PUZZLE\> that has not been previously identified in <PAST\_PUZZLE\_DIMENSIONS\>\. Use insights from the <CONVERSATION\_LOG\> to guide your choice\.

What to generate:

\- reason: a short one\-sentence explanation of why this dimension is a key unknown in the puzzle\.

\- name: a short, specific label for this puzzle dimension\.

\- values: a list of plausible interpretations, no larger than \{max\_num\_values\_per\_dim\}, that this dimension could take\.

Constraints:

\- Use ONLY the information provided in <PUZZLE\>, <PUZZLE\_CONTEXT\>, <PAST\_PUZZLE\_DIMENSIONS\>, and <CONVERSATION\_LOG\>\.

\- Do NOT solve the <PUZZLE\> itself\. Focus ONLY on identifying a new puzzle dimension\.

\- Do NOT rewrite or restate the <PUZZLE\>, <PUZZLE\_CONTEXT\>, <PAST\_PUZZLE\_DIMENSIONS\>, or <CONVERSATION\_LOG\>\.

\- The generated dimension name must not be the same as any of the names in <PAST\_PUZZLE\_DIMENSIONS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"name":string,

"values":\[string,\.\.\.\]

\}\}

#### iCraft\-MD

##### System prompt\.

You are an experienced physician who needs to explore a new line of clinical inquiry\. The clinical dimensions investigated so far have not been sufficient to arrive at a definitive diagnosis, so you must identify a new clinical factor to assess\.

##### User prompt\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<PAST\_CLINICAL\_DIMENSIONS\>

\{past\_dimensions\}

</PAST\_CLINICAL\_DIMENSIONS\>

<CLINICAL\_INTERVIEW\_LOG\>

\{conversation\_log\}

</CLINICAL\_INTERVIEW\_LOG\>

Definition:

A clinical dimension is a specific clinical factor where different values would point toward different diagnoses or clinical decisions\.

<PAST\_CLINICAL\_DIMENSIONS\> is a list of dicts with "name" fields, representing the clinical dimensions that have already been assessed in the patient interview\.

<CLINICAL\_INTERVIEW\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the clinical interview between the physician and the patient up to this point\.

Task:

Identify a new clinical dimension relevant to the <CLINICAL\_QUESTION\> that has not been previously identified in <PAST\_CLINICAL\_DIMENSIONS\>\. Use insights from the <CLINICAL\_INTERVIEW\_LOG\> to guide your choice \-\-\- the patient’s answers may reveal the need to investigate additional clinical factors\.

What to generate:

\- reason: a short one\-sentence explanation of why this clinical factor is important for narrowing the diagnosis\.

\- name: a short, specific clinical label for this dimension\.

\- values: a list of clinically plausible values, no larger than \{max\_num\_values\_per\_dim\}, that this dimension could take\.

Constraints:

\- Use ONLY the information provided in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <PAST\_CLINICAL\_DIMENSIONS\>, and <CLINICAL\_INTERVIEW\_LOG\>\.

\- Do NOT answer the <CLINICAL\_QUESTION\> itself\. Focus ONLY on identifying a new clinical dimension\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <PAST\_CLINICAL\_DIMENSIONS\>, or <CLINICAL\_INTERVIEW\_LOG\>\.

\- The generated dimension name must not be the same as any of the names in <PAST\_CLINICAL\_DIMENSIONS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"name":string,

"values":\[string,\.\.\.\]

\}\}

### K\.8 Expand: Prior Elicitation

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective forming a hypothesis about a newly identified aspect of a murder case, taking into account both the case background and what has been revealed during the interrogation so far\.

##### User prompt\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<INTERROGATION\_LOG\>

\{conversation\_log\}

</INTERROGATION\_LOG\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUE\>

\{dimension\_value\}

</DIMENSION\_VALUE\>

Definition:

An investigative dimension is a specific aspect of the murder case where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\.

<INTERROGATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the interrogation between the detective and the suspects up to this point\.

Task:

Given <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, and a specific investigative dimension defined by <DIMENSION\_NAME\> and <DIMENSION\_VALUE\>, judge how likely it is that the <DIMENSION\_NAME\> takes on the value <DIMENSION\_VALUE\>\.

What to generate:

\- reason: a short one\-sentence explanation of why the <DIMENSION\_NAME\> is likely, unlikely, or neutral to take on the value <DIMENSION\_VALUE\>\.

\- label: one of "likely", "unlikely", or "neutral" according to the following definitions:

\- likely: <DIMENSION\_VALUE\> is explicitly stated, strongly implied, or is the most natural assumption given the evidence in the <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, and <INTERROGATION\_LOG\>\.

\- neutral: <DIMENSION\_VALUE\> is plausible but not implied or supported by specific evidence in the <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, or <INTERROGATION\_LOG\>\.

\- unlikely: <DIMENSION\_VALUE\> is contradicted by the <CASE\_QUESTION\>, <CASE\_BACKGROUND\> or <INTERROGATION\_LOG\>, or would require assumptions that are inconsistent with the available evidence\.

Constraints:

\- Use ONLY the information provided in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, and <INTERROGATION\_LOG\>\.

\- Do NOT answer the <CASE\_QUESTION\> itself\. Focus ONLY on judging the likelihood of the dimension value\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, or <INTERROGATION\_LOG\>\.

\- label must be one of "likely", "unlikely", or "neutral"\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"label":string

\}\}

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle solver forming a hypothesis about a newly identified hidden aspect of the puzzle, taking into account both the puzzle scenario and what the host has revealed so far\.

##### User prompt\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<CONVERSATION\_LOG\>

\{conversation\_log\}

</CONVERSATION\_LOG\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUE\>

\{dimension\_value\}

</DIMENSION\_VALUE\>

Definition:

A puzzle dimension is a hidden aspect of the scenario where knowing its true value would explain the puzzle\.

<CONVERSATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the conversation between the solver and the host up to this point\.

Task:

Given <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, and a specific puzzle dimension defined by <DIMENSION\_NAME\> and <DIMENSION\_VALUE\>, judge how likely it is that the <DIMENSION\_NAME\> takes on the value <DIMENSION\_VALUE\>\.

What to generate:

\- reason: a short one\-sentence explanation of why the <DIMENSION\_NAME\> is likely, unlikely, or neutral to take on the value <DIMENSION\_VALUE\>\.

\- label: one of "likely", "unlikely", or "neutral" according to the following definitions:

\- likely: <DIMENSION\_VALUE\> is explicitly suggested by, strongly implied by, or is the most natural interpretation given the clues in the <PUZZLE\>, <PUZZLE\_CONTEXT\>, and <CONVERSATION\_LOG\>\.

\- neutral: <DIMENSION\_VALUE\> is plausible but not implied or supported by specific clues in the <PUZZLE\>, <PUZZLE\_CONTEXT\>, or <CONVERSATION\_LOG\>\.

\- unlikely: <DIMENSION\_VALUE\> is contradicted by the <PUZZLE\>, <PUZZLE\_CONTEXT\> or <CONVERSATION\_LOG\>, or would require assumptions that are inconsistent with the scenario\.

Constraints:

\- Use ONLY the information provided in <PUZZLE\>, <PUZZLE\_CONTEXT\>, and <CONVERSATION\_LOG\>\.

\- Do NOT solve the <PUZZLE\> itself\. Focus ONLY on judging the likelihood of the dimension value\.

\- Do NOT rewrite or restate the <PUZZLE\>, <PUZZLE\_CONTEXT\>, or <CONVERSATION\_LOG\>\.

\- label must be one of "likely", "unlikely", or "neutral"\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"label":string

\}\}

#### iCraft\-MD

##### System prompt\.

You are an experienced physician forming a clinical hypothesis about a newly identified clinical factor, taking into account the patient’s baseline information and what has been revealed during the clinical interview so far\.

##### User prompt\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<CLINICAL\_INTERVIEW\_LOG\>

\{conversation\_log\}

</CLINICAL\_INTERVIEW\_LOG\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUE\>

\{dimension\_value\}

</DIMENSION\_VALUE\>

Definition:

A clinical dimension is a specific clinical factor where different values would point toward different diagnoses or clinical decisions\.

<CLINICAL\_INTERVIEW\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the clinical interview between the physician and the patient up to this point\.

Task:

Given <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, and a specific clinical dimension defined by <DIMENSION\_NAME\> and <DIMENSION\_VALUE\>, judge how likely it is that the <DIMENSION\_NAME\> takes on the value <DIMENSION\_VALUE\>\.

What to generate:

\- reason: a short one\-sentence explanation of why the <DIMENSION\_NAME\> is likely, unlikely, or neutral to take on the value <DIMENSION\_VALUE\>\.

\- label: one of "likely", "unlikely", or "neutral" according to the following definitions:

\- likely: <DIMENSION\_VALUE\> is explicitly stated, strongly implied, or is the most natural clinical assumption given the patient information in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, and <CLINICAL\_INTERVIEW\_LOG\>\.

\- neutral: <DIMENSION\_VALUE\> is clinically plausible but not implied or supported by specific evidence in the <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, or <CLINICAL\_INTERVIEW\_LOG\>\.

\- unlikely: <DIMENSION\_VALUE\> is contradicted by the <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\> or <CLINICAL\_INTERVIEW\_LOG\>, or would require assumptions that are inconsistent with the patient’s presentation\.

Constraints:

\- Use ONLY the information provided in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, and <CLINICAL\_INTERVIEW\_LOG\>\.

\- Do NOT answer the <CLINICAL\_QUESTION\> itself\. Focus ONLY on judging the likelihood of the dimension value\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, or <CLINICAL\_INTERVIEW\_LOG\>\.

\- label must be one of "likely", "unlikely", or "neutral"\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"label":string

\}\}

### K\.9 Expand: Question Generation

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective preparing new interrogation questions based on a newly discovered line of investigation and unresolved aspects of the murder case\.

##### User prompt\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<INTERROGATION\_LOG\>

\{conversation\_log\}

</INTERROGATION\_LOG\>

<NEW\_INVESTIGATIVE\_DIMENSION\>

\{new\_dimension\_with\_values\}

</NEW\_INVESTIGATIVE\_DIMENSION\>

<UNRESOLVED\_INVESTIGATIVE\_DIMENSIONS\>

\{high\_uncertainty\_dimensions\_with\_values\}

</UNRESOLVED\_INVESTIGATIVE\_DIMENSIONS\>

Definition:

An investigative dimension is a specific aspect of the murder case where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\.

<INTERROGATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the interrogation between the detective and the suspects up to this point\.

<NEW\_INVESTIGATIVE\_DIMENSION\> is a dict with "name" and "values" fields, representing the newly identified line of investigation along with its possible values\.

<UNRESOLVED\_INVESTIGATIVE\_DIMENSIONS\> is a list of dicts with "name" and "values" fields, representing the investigative dimensions that currently have the highest uncertainty\. They do not include the new dimension in <NEW\_INVESTIGATIVE\_DIMENSION\>\.

Task:

Given <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, a newly identified investigative dimension in <NEW\_INVESTIGATIVE\_DIMENSION\>, and the most uncertain dimensions in <UNRESOLVED\_INVESTIGATIVE\_DIMENSIONS\>, generate interrogation questions that would help identify the murderer by targeting the new dimension and/or the unresolved dimensions\.

What to generate:

Generate at most \{max\_new\_questions\_per\_round\} interrogation questions\. For each question, provide:

\- reason: a short one\-sentence explanation of why this question would help identify the murderer\.

\- question: the text of the interrogation question\.

\- choices: a list of multiple\-choice answer options for the question, no larger than \{max\_choices\_per\_question\}\.

Constraints:

\- Use ONLY the information provided in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, <NEW\_INVESTIGATIVE\_DIMENSION\>, and <UNRESOLVED\_INVESTIGATIVE\_DIMENSIONS\>\.

\- Do NOT answer the <CASE\_QUESTION\> itself\. Focus ONLY on generating interrogation questions\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, <NEW\_INVESTIGATIVE\_DIMENSION\>, or <UNRESOLVED\_INVESTIGATIVE\_DIMENSIONS\>\.

\- Each question must be designed to elicit information about the new dimension in <NEW\_INVESTIGATIVE\_DIMENSION\> and/or the unresolved dimensions in <UNRESOLVED\_INVESTIGATIVE\_DIMENSIONS\>\.

\- Each question must have multiple\-choice answers\.

\- Generate at most \{max\_new\_questions\_per\_round\} questions\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"questions":\[

\{\{

"reason":string,

"question":string,

"choices":\[string,\.\.\.\]

\}\},

\.\.\.

\]

\}\}
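The prompt above receives the dimensions "that currently have the highest uncertainty". One plausible way to select them, sketched here, is to rank dimensions by the Shannon entropy of their current marginal belief. This is an assumption: the section does not specify the uncertainty measure, and the names `entropy`, `top_uncertain`, and the example beliefs are hypothetical.

```python
import math


def entropy(belief):
    """Shannon entropy (in bits) of a {value: probability} marginal."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def top_uncertain(beliefs, k, exclude=None):
    """Return the k dimension names with the highest entropy, skipping `exclude`."""
    names = [name for name in beliefs if name != exclude]
    names.sort(key=lambda name: entropy(beliefs[name]), reverse=True)
    return names[:k]


# Hypothetical marginals over three investigative dimensions.
beliefs = {
    "Motive": {"money": 0.5, "revenge": 0.5},
    "Weapon": {"knife": 0.9, "poison": 0.1},
    "Opportunity": {"alone": 0.7, "accompanied": 0.3},
}
unresolved = top_uncertain(beliefs, k=2, exclude="Weapon")
```

Passing the newly proposed dimension as `exclude` mirrors the prompt's statement that the unresolved list does not include the new dimension.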

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle solver generating new questions to ask the puzzle host\. Focus on testing specific hypotheses about the hidden explanation, especially targeting newly discovered or still\-uncertain aspects of the puzzle\.

##### User prompt\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<CONVERSATION\_LOG\>

\{conversation\_log\}

</CONVERSATION\_LOG\>

<NEW\_PUZZLE\_DIMENSION\>

\{new\_dimension\_with\_values\}

</NEW\_PUZZLE\_DIMENSION\>

<UNCERTAIN\_PUZZLE\_DIMENSIONS\>

\{high\_uncertainty\_dimensions\_with\_values\}

</UNCERTAIN\_PUZZLE\_DIMENSIONS\>

Definition:

A puzzle dimension is a hidden aspect of the scenario where knowing its true value would explain the puzzle\.

<CONVERSATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the conversation between the solver and the host up to this point\.

<NEW\_PUZZLE\_DIMENSION\> is a dict with "name" and "values" fields, representing the newly identified hidden aspect of the puzzle along with its possible interpretations\.

<UNCERTAIN\_PUZZLE\_DIMENSIONS\> is a list of dicts with "name" and "values" fields, representing the puzzle dimensions that currently have the highest uncertainty\. They do not include the new dimension in <NEW\_PUZZLE\_DIMENSION\>\.

Task:

Given <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, a newly identified puzzle dimension in <NEW\_PUZZLE\_DIMENSION\>, and the most uncertain dimensions in <UNCERTAIN\_PUZZLE\_DIMENSIONS\>, generate questions to ask the puzzle host that would help uncover the hidden explanation by targeting the new dimension and/or the uncertain dimensions\.

What to generate:

Generate at most \{max\_new\_questions\_per\_round\} clarifying questions\. For each question, provide:

\- reason: a short one\-sentence explanation of why this question would help solve the puzzle\.

\- question: the text of the clarifying question to ask the puzzle host\.

\- choices: \["yes", "no"\] as the \{max\_choices\_per\_question\} multiple\-choice answer options for the question\.

Constraints:

\- Use ONLY the information provided in <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, <NEW\_PUZZLE\_DIMENSION\>, and <UNCERTAIN\_PUZZLE\_DIMENSIONS\>\.

\- Do NOT solve the <PUZZLE\> itself\. Focus ONLY on generating clarifying questions\.

\- Do NOT rewrite or restate the <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, <NEW\_PUZZLE\_DIMENSION\>, or <UNCERTAIN\_PUZZLE\_DIMENSIONS\>\.

\- Each question must be designed to elicit information about the new dimension in <NEW\_PUZZLE\_DIMENSION\> and/or the uncertain dimensions in <UNCERTAIN\_PUZZLE\_DIMENSIONS\>\.

\- Each question must have multiple\-choice answers\.

\- Generate at most \{max\_new\_questions\_per\_round\} questions\.

\- Keep each question short: at most 20 words\.

\- Keep each reason short: at most 15 words\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"questions":\[

\{\{

"reason":string,

"question":string,

"choices":\["yes","no"\]

\}\},

\.\.\.

\]

\}\}
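The AR-Bench-SP variant adds length limits (20 words per question, 15 per reason) and fixes the choices to yes/no. A reply can be screened against these limits as sketched below. The helper name `validate_yes_no_questions` and the whitespace-based word count are assumptions; the source does not say whether or how replies are checked.

```python
import json


def validate_yes_no_questions(raw, max_new_questions_per_round,
                              max_question_words=20, max_reason_words=15):
    """Screen generated questions against the prompt's limits (hypothetical helper)."""
    questions = json.loads(raw)["questions"]
    if len(questions) > max_new_questions_per_round:
        raise ValueError("too many questions")
    for q in questions:
        if q["choices"] != ["yes", "no"]:
            raise ValueError("choices must be exactly ['yes', 'no']")
        # Assumption: words are counted by splitting on whitespace.
        if len(q["question"].split()) > max_question_words:
            raise ValueError("question exceeds the word limit")
        if len(q["reason"].split()) > max_reason_words:
            raise ValueError("reason exceeds the word limit")
    return questions
```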

#### iCraft\-MD

##### System prompt\.

You are an experienced physician preparing additional clinical questions for a patient interview based on a newly identified clinical factor and unresolved aspects of the diagnosis\.

##### User prompt\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<CLINICAL\_INTERVIEW\_LOG\>

\{conversation\_log\}

</CLINICAL\_INTERVIEW\_LOG\>

<NEW\_CLINICAL\_DIMENSION\>

\{new\_dimension\_with\_values\}

</NEW\_CLINICAL\_DIMENSION\>

<UNRESOLVED\_CLINICAL\_DIMENSIONS\>

\{high\_uncertainty\_dimensions\_with\_values\}

</UNRESOLVED\_CLINICAL\_DIMENSIONS\>

Definition:

A clinical dimension is a specific clinical factor where different values would point toward different diagnoses or clinical decisions\.

<CLINICAL\_INTERVIEW\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the history of the clinical interview between the physician and the patient up to this point\.

<NEW\_CLINICAL\_DIMENSION\> is a dict with "name" and "values" fields, representing the newly identified clinical factor along with its possible values\.

<UNRESOLVED\_CLINICAL\_DIMENSIONS\> is a list of dicts with "name" and "values" fields, representing the clinical dimensions that currently have the highest diagnostic uncertainty\. They do not include the new dimension in <NEW\_CLINICAL\_DIMENSION\>\.

Task:

Given <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, a newly identified clinical dimension in <NEW\_CLINICAL\_DIMENSION\>, and the most uncertain dimensions in <UNRESOLVED\_CLINICAL\_DIMENSIONS\>, generate clinical questions to ask the patient that would help narrow the diagnosis by targeting the new dimension and/or the unresolved dimensions\.

What to generate:

Generate at most \{max\_new\_questions\_per\_round\} clinical questions\. For each question, provide:

\- reason: a short one\-sentence explanation of why this question would help narrow the diagnosis\.

\- question: the text of the clinical question to ask the patient\.

\- choices: a list of multiple\-choice answer options for the question, no larger than \{max\_choices\_per\_question\}\.

Constraints:

\- Use ONLY the information provided in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, <NEW\_CLINICAL\_DIMENSION\>, and <UNRESOLVED\_CLINICAL\_DIMENSIONS\>\.

\- Do NOT answer the <CLINICAL\_QUESTION\> itself\. Focus ONLY on generating clinical questions\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, <NEW\_CLINICAL\_DIMENSION\>, or <UNRESOLVED\_CLINICAL\_DIMENSIONS\>\.

\- Each question must be designed to elicit information about the new dimension in <NEW\_CLINICAL\_DIMENSION\> and/or the unresolved dimensions in <UNRESOLVED\_CLINICAL\_DIMENSIONS\>\.

\- Each question must have multiple\-choice answers\.

\- Generate at most \{max\_new\_questions\_per\_round\} questions\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"questions":\[

\{\{

"reason":string,

"question":string,

"choices":\[string,\.\.\.\]

\}\},

\.\.\.

\]

\}\}

### K\.10 Final Answer

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective concluding a murder investigation\. Based on all the evidence gathered from interrogating the suspects and your analysis of the case, you must now identify the real murderer\.

##### Without discrete choices\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<INTERROGATION\_LOG\>

\{conversation\_log\}

</INTERROGATION\_LOG\>

<INVESTIGATION\_CONCLUSION\>

\{map\_state\}

</INVESTIGATION\_CONCLUSION\>

Definition:

<INTERROGATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the full history of the interrogation between the detective and the suspects\.

An investigative dimension is a specific aspect of the murder case where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\.

<INVESTIGATION\_CONCLUSION\> is a structured representation of the detective’s current understanding of the case, where each investigative dimension is mapped to its most likely value\. This represents the detective’s best assessment of the true state of the case based on the investigation so far\.

Task:

Given <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, and <INVESTIGATION\_CONCLUSION\>, identify the real murderer\.

What to generate:

\- reason: a short one\-sentence explanation of why the identified suspect is the real murderer given the evidence in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, and <INVESTIGATION\_CONCLUSION\>\.

\- final\_answer: the name of the suspect you are identifying as the real murderer\.

Constraints:

\- The final answer must be consistent with the evidence in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, and <INVESTIGATION\_CONCLUSION\>\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, or <INVESTIGATION\_CONCLUSION\>\.

\- answer must be a natural language answer to the <CASE\_QUESTION\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"final\_answer":string

\}\}

##### With discrete choices\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<INTERROGATION\_LOG\>

\{conversation\_log\}

</INTERROGATION\_LOG\>

<INVESTIGATION\_CONCLUSION\>

\{map\_state\}

</INVESTIGATION\_CONCLUSION\>

<SUSPECTS\_WITH\_IDS\>

\{possible\_answers\_with\_ids\}

</SUSPECTS\_WITH\_IDS\>

Definition:

<INTERROGATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the full history of the interrogation between the detective and the suspects\.

An investigative dimension is a specific aspect of the murder case where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\.

<INVESTIGATION\_CONCLUSION\> is a structured representation of the detective’s current understanding of the case, where each investigative dimension is mapped to its most likely value\. This represents the detective’s best assessment of the true state of the case based on the investigation so far\.

<SUSPECTS\_WITH\_IDS\> is a list of dicts with "id" and "value" fields\. Each dict corresponds to a suspect who could be the murderer\.

Let possible\_answers\[i\] be the i\-th element of <SUSPECTS\_WITH\_IDS\>\.

Task:

Given <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, and <INVESTIGATION\_CONCLUSION\>, identify which suspect from <SUSPECTS\_WITH\_IDS\> is the real murderer\.

What to generate:

\- reason: a short one\-sentence explanation of why the identified suspect is the real murderer given the evidence in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, and <INVESTIGATION\_CONCLUSION\>\.

\- final\_answer\_id: the id of the suspect in <SUSPECTS\_WITH\_IDS\> that you are identifying as the real murderer\.

Constraints:

\- The final answer must be consistent with the evidence in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, and <INVESTIGATION\_CONCLUSION\>\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, <INTERROGATION\_LOG\>, or <INVESTIGATION\_CONCLUSION\>\.

\- final\_answer\_id must be one of the ids \(i\.e possible\_answers\[i\]\["id"\]\) provided in <SUSPECTS\_WITH\_IDS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"final\_answer\_id":string

\}\}
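With discrete choices, the returned `final_answer_id` has to be mapped back to one of the supplied suspects. The sketch below shows one way to do this and to reject ids that were never offered. `resolve_final_answer` is a hypothetical helper, and the suspect names are made up for illustration.

```python
import json


def resolve_final_answer(raw, possible_answers):
    """Map final_answer_id back to its value, rejecting unknown ids (hypothetical helper)."""
    reply = json.loads(raw)
    value_by_id = {a["id"]: a["value"] for a in possible_answers}
    answer_id = reply["final_answer_id"]
    if answer_id not in value_by_id:
        raise ValueError("unknown final_answer_id: %r" % answer_id)
    return value_by_id[answer_id]


# Illustrative suspects in the "id"/"value" format described in the prompt.
suspects = [{"id": "s0", "value": "Alice Grey"}, {"id": "s1", "value": "Bob Stone"}]
murderer = resolve_final_answer('{"reason": "r", "final_answer_id": "s1"}', suspects)
```

Returning ids rather than free-text names avoids having to match name spellings when the answer is scored.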

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle solver\. Based on all the clues gathered from the puzzle host’s responses and your analysis of the puzzle dimensions, you must now provide the hidden explanation of the puzzle\.

##### Without discrete choices\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<CONVERSATION\_LOG\>

\{conversation\_log\}

</CONVERSATION\_LOG\>

<PUZZLE\_SOLUTION\_STATE\>

\{map\_state\}

</PUZZLE\_SOLUTION\_STATE\>

Definition:

<CONVERSATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the full history of the conversation between the solver and the puzzle host\.

A puzzle dimension is a hidden aspect of the scenario where knowing its true value would explain the puzzle\.

<PUZZLE\_SOLUTION\_STATE\> is a structured representation of the solver’s current understanding of the puzzle, where each puzzle dimension is mapped to its most likely value\. This represents the solver’s best guess of the hidden explanation based on the clues gathered so far\.

Task:

Given <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, and <PUZZLE\_SOLUTION\_STATE\>, provide the hidden explanation of the puzzle\.

What to generate:

\- reason: a short one\-sentence explanation of why this is the correct explanation given the clues in <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, and <PUZZLE\_SOLUTION\_STATE\>\.

\- final\_answer: the hidden explanation of the puzzle\.

Constraints:

\- The final answer must be consistent with the clues in <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, and <PUZZLE\_SOLUTION\_STATE\>\.

\- Do NOT rewrite or restate the <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, or <PUZZLE\_SOLUTION\_STATE\>\.

\- answer must be a natural language explanation of the puzzle\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"final\_answer":string

\}\}
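The doubled braces around the JSON schema suggest that these prompts are stored as Python\-style format strings, where `{{` and `}}` render as literal braces and single\-brace names such as `{ambiguous_prompt}` are placeholders. This is an assumption about the implementation, which the text does not confirm. A minimal sketch of that rendering step, using a shortened hypothetical template and a hypothetical `render` helper:

```python
# Shortened, hypothetical template: only the placeholder name and the doubled
# braces are taken from the prompt above; the rest is trimmed for illustration.
TEMPLATE = """<PUZZLE>
{ambiguous_prompt}
</PUZZLE>

Return STRICT JSON only with the following schema:
{{
"reason": string,
"final_answer": string
}}"""


def render(template: str, **fields: str) -> str:
    """Fill single-brace placeholders; '{{' and '}}' become literal braces."""
    return template.format(**fields)


prompt = render(TEMPLATE, ambiguous_prompt="A man asks a bartender for water.")
print(prompt)
```

Under this assumption, a field missing from the call raises `KeyError`, which catches an incompletely filled prompt before it reaches the model.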

##### With discrete choices\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<CONVERSATION\_LOG\>

\{conversation\_log\}

</CONVERSATION\_LOG\>

<PUZZLE\_SOLUTION\_STATE\>

\{map\_state\}

</PUZZLE\_SOLUTION\_STATE\>

<POSSIBLE\_EXPLANATIONS\_WITH\_IDS\>

\{possible\_answers\_with\_ids\}

</POSSIBLE\_EXPLANATIONS\_WITH\_IDS\>

Definition:

<CONVERSATION\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the full history of the conversation between the solver and the puzzle host\.

A puzzle dimension is a hidden aspect of the scenario where knowing its true value would explain the puzzle\.

<PUZZLE\_SOLUTION\_STATE\> is a structured representation of the solver’s current understanding of the puzzle, where each puzzle dimension is mapped to its most likely value\. This represents the solver’s best guess of the hidden explanation based on the clues gathered so far\.

<POSSIBLE\_EXPLANATIONS\_WITH\_IDS\> is a list of dicts with "id" and "value" fields\. Each dict corresponds to a possible explanation of the puzzle\.

Let possible\_answers\[i\] be the i\-th element of <POSSIBLE\_EXPLANATIONS\_WITH\_IDS\>\.

Task:

Given <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, and <PUZZLE\_SOLUTION\_STATE\>, select the correct explanation from <POSSIBLE\_EXPLANATIONS\_WITH\_IDS\>\.

What to generate:

\- reason: a short one\-sentence explanation of why this is the correct explanation given the clues in <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, and <PUZZLE\_SOLUTION\_STATE\>\.

\- final\_answer\_id: the id of the explanation in <POSSIBLE\_EXPLANATIONS\_WITH\_IDS\> that you are selecting as the hidden explanation of the puzzle\.

Constraints:

\- The final answer must be consistent with the clues in <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, and <PUZZLE\_SOLUTION\_STATE\>\.

\- Do NOT rewrite or restate the <PUZZLE\>, <PUZZLE\_CONTEXT\>, <CONVERSATION\_LOG\>, or <PUZZLE\_SOLUTION\_STATE\>\.

\- final\_answer\_id must be one of the ids \(i\.e possible\_answers\[i\]\["id"\]\) provided in <POSSIBLE\_EXPLANATIONS\_WITH\_IDS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"final\_answer\_id":string

\}\}
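Because the prompt asks for strict JSON and restricts `final_answer_id` to the provided ids, a caller can check both constraints before accepting a reply. The sketch below is one way to do that; the helper name `parse_final_answer` and the error handling are illustrative assumptions, not the authors' code.

```python
import json


def parse_final_answer(raw: str, possible_answers: list[dict]) -> dict:
    """Parse a strict-JSON reply and check it against the schema above.

    `possible_answers` is the list of {"id", "value"} dicts that was placed
    in the *_WITH_IDS block of the prompt.
    """
    reply = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(reply, dict) or set(reply) != {"reason", "final_answer_id"}:
        raise ValueError("reply must contain exactly 'reason' and 'final_answer_id'")
    valid_ids = {str(answer["id"]) for answer in possible_answers}
    if reply["final_answer_id"] not in valid_ids:
        raise ValueError(f"unknown final_answer_id: {reply['final_answer_id']!r}")
    return reply


# Hypothetical candidate explanations, for illustration only.
candidates = [{"id": "0", "value": "He had hiccups."},
              {"id": "1", "value": "He was thirsty."}]
parsed = parse_final_answer(
    '{"reason": "The scare cured him.", "final_answer_id": "0"}', candidates
)
```

Every failure mode surfaces as `ValueError`, so a caller could catch that single exception and re\-query the model.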

#### iCraft\-MD

##### System prompt\.

You are an experienced physician concluding a clinical assessment\. Based on all the clinical information gathered from the patient interview and your analysis of the clinical dimensions, you must now provide your diagnosis or clinical decision\.

##### Without discrete choices\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<CLINICAL\_INTERVIEW\_LOG\>

\{conversation\_log\}

</CLINICAL\_INTERVIEW\_LOG\>

<CLINICAL\_ASSESSMENT\>

\{map\_state\}

</CLINICAL\_ASSESSMENT\>

Definition:

<CLINICAL\_INTERVIEW\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the full history of the clinical interview between the physician and the patient\.

A clinical dimension is a specific clinical factor where different values would point toward different diagnoses or clinical decisions\.

<CLINICAL\_ASSESSMENT\> is a structured representation of the physician’s current understanding of the patient’s condition, where each clinical dimension is mapped to its most likely value\. This represents the physician’s best assessment of the true clinical state based on the interview so far\.

Task:

Given <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, and <CLINICAL\_ASSESSMENT\>, provide your diagnosis or clinical decision\.

What to generate:

\- reason: a short one\-sentence explanation of why this is the correct diagnosis or clinical decision given the clinical evidence in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, and <CLINICAL\_ASSESSMENT\>\.

\- final\_answer: your diagnosis or clinical decision\.

Constraints:

\- The final answer must be consistent with the clinical evidence in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, and <CLINICAL\_ASSESSMENT\>\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, or <CLINICAL\_ASSESSMENT\>\.

\- answer must be a natural language answer to the <CLINICAL\_QUESTION\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"final\_answer":string

\}\}

##### With discrete choices\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<CLINICAL\_INTERVIEW\_LOG\>

\{conversation\_log\}

</CLINICAL\_INTERVIEW\_LOG\>

<CLINICAL\_ASSESSMENT\>

\{map\_state\}

</CLINICAL\_ASSESSMENT\>

<DIAGNOSTIC\_OPTIONS\_WITH\_IDS\>

\{possible\_answers\_with\_ids\}

</DIAGNOSTIC\_OPTIONS\_WITH\_IDS\>

Definition:

<CLINICAL\_INTERVIEW\_LOG\> is a list of dicts with "question\_text", "user\_name", and "user\_answer" fields, representing the full history of the clinical interview between the physician and the patient\.

A clinical dimension is a specific clinical factor where different values would point toward different diagnoses or clinical decisions\.

<CLINICAL\_ASSESSMENT\> is a structured representation of the physician’s current understanding of the patient’s condition, where each clinical dimension is mapped to its most likely value\. This represents the physician’s best assessment of the true clinical state based on the interview so far\.

<DIAGNOSTIC\_OPTIONS\_WITH\_IDS\> is a list of dicts with "id" and "value" fields\. Each dict corresponds to a possible diagnosis or clinical decision\.

Let possible\_answers\[i\] be the i\-th element of <DIAGNOSTIC\_OPTIONS\_WITH\_IDS\>\.

Task:

Given <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, and <CLINICAL\_ASSESSMENT\>, select the correct diagnosis or clinical decision from <DIAGNOSTIC\_OPTIONS\_WITH\_IDS\>\.

What to generate:

\- reason: a short one\-sentence explanation of why this is the correct diagnosis or clinical decision given the clinical evidence in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, and <CLINICAL\_ASSESSMENT\>\.

\- final\_answer\_id: the id of the choice in <DIAGNOSTIC\_OPTIONS\_WITH\_IDS\> that you are selecting as your diagnosis or clinical decision\.

Constraints:

\- The final answer must be consistent with the clinical evidence in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, and <CLINICAL\_ASSESSMENT\>\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, <CLINICAL\_INTERVIEW\_LOG\>, or <CLINICAL\_ASSESSMENT\>\.

\- final\_answer\_id must be one of the ids \(i\.e possible\_answers\[i\]\["id"\]\) provided in <DIAGNOSTIC\_OPTIONS\_WITH\_IDS\>\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"reason":string,

"final\_answer\_id":string

\}\}

### K\.11 Answer Likelihood

#### AR\-Bench\-DC

##### System prompt\.

You are an experienced detective evaluating how likely each suspect is to be the real murderer under different assumptions about the state of the case\.

##### User prompt\.

<CASE\_QUESTION\>

\{ambiguous\_prompt\}

</CASE\_QUESTION\>

<CASE\_BACKGROUND\>

\{meta\_context\}

</CASE\_BACKGROUND\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUES\_WITH\_IDS\>

\{dimension\_values\_with\_ids\}

</DIMENSION\_VALUES\_WITH\_IDS\>

<SUSPECTS\_WITH\_IDS\>

\{possible\_answers\_with\_ids\}

</SUSPECTS\_WITH\_IDS\>

Definition:

An investigative dimension is a specific aspect of the murder case where multiple suspects could plausibly be implicated, and resolving it would narrow down the true murderer\.

<DIMENSION\_VALUES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a possible value that the <DIMENSION\_NAME\> could take\.

<SUSPECTS\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a suspect who could be the real murderer\.

Let values\[i\] be the i\-th element of <DIMENSION\_VALUES\_WITH\_IDS\>\.

Let answers\[j\] be the j\-th element of <SUSPECTS\_WITH\_IDS\>\.

Task \(row\-major order\):

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(answers\)\-1:

\- Assume the true state of the case is <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- Judge how likely it is that answers\[j\]\["text"\] is the real murderer given that assumption and the evidence in <CASE\_BACKGROUND\>\.

What to generate:

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(answers\)\-1:

\- answer\_id: the id of the suspect being evaluated, i\.e\. answers\[j\]\["id"\]

\- dimension\_value\_id: the id of the dimension value being evaluated, i\.e\. values\[i\]\["id"\]

\- reason: a short one\-sentence explanation of why answers\[j\] is likely/neutral/unlikely to be the real murderer given <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the evidence in <CASE\_BACKGROUND\>, answers\[j\]\["text"\] is the expected real murderer\.

\- "neutral": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the evidence in <CASE\_BACKGROUND\>, answers\[j\]\["text"\] is a plausible suspect but not specifically implicated; there is insufficient evidence to say this suspect is more or less guilty than others\.

\- "unlikely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the evidence in <CASE\_BACKGROUND\>, answers\[j\]\["text"\] is not expected to be the real murderer\.

Constraints:

\- Use ONLY the information provided in <CASE\_QUESTION\>, <CASE\_BACKGROUND\>, and the assumed dimension value\.

\- Do NOT answer the <CASE\_QUESTION\> itself\. Focus ONLY on judging how likely each suspect is to be the real murderer under different assumptions about the investigative dimension\.

\- Do NOT rewrite or restate the <CASE\_QUESTION\> or <CASE\_BACKGROUND\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every combination of dimension value and suspect\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"evaluations":\[

\[

\{\{

"answer\_id":string,

"dimension\_value\_id":string,

"reason":string,

"label":string

\}\},

\.\.\. // one object for each suspect

\],

\.\.\. // one array for each dimension value

\]

\}\}

The "evaluations" field must contain exactly \{num\_dimension\_values\} arrays \(one per dimension value\)\.

Each inner array must contain exactly \{num\_possible\_answers\} objects \(one per suspect\)\.
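The reply to this prompt is a row\-major grid of qualitative labels, one row per dimension value and one cell per candidate answer. The sketch below shows one way such a grid could be checked and turned into numeric conditional weights. The numeric values in `LABEL_WEIGHTS`, the row normalization, and the helper names are illustrative assumptions; the prompt itself defines only the three labels and the required shape.

```python
import json

# Assumed numeric weights for the three labels; not taken from the paper.
LABEL_WEIGHTS = {"likely": 0.8, "neutral": 0.5, "unlikely": 0.2}


def likelihood_table(raw: str, value_ids: list[str], answer_ids: list[str]) -> list[list[float]]:
    """Check the row-major grid and return one normalized row per dimension value."""
    rows = json.loads(raw)["evaluations"]
    if len(rows) != len(value_ids):
        raise ValueError("expected one array per dimension value")
    table = []
    for value_id, row in zip(value_ids, rows):
        if len(row) != len(answer_ids):
            raise ValueError("expected one object per candidate answer")
        weights = []
        for answer_id, cell in zip(answer_ids, row):
            if cell["dimension_value_id"] != value_id or cell["answer_id"] != answer_id:
                raise ValueError("cells are not in row-major order")
            if cell["label"] not in LABEL_WEIGHTS:
                raise ValueError(f"unknown label: {cell['label']!r}")
            weights.append(LABEL_WEIGHTS[cell["label"]])
        total = sum(weights)
        table.append([w / total for w in weights])  # each row sums to 1
    return table


def answer_marginal(belief: list[float], table: list[list[float]]) -> list[float]:
    """Weight each row by the belief in its dimension value and sum per answer."""
    return [sum(b * row[j] for b, row in zip(belief, table)) for j in range(len(table[0]))]


# Hypothetical 2x2 reply: two dimension values (v0, v1), two suspects (a0, a1).
reply = json.dumps({"evaluations": [
    [{"answer_id": "a0", "dimension_value_id": "v0", "reason": "r", "label": "likely"},
     {"answer_id": "a1", "dimension_value_id": "v0", "reason": "r", "label": "unlikely"}],
    [{"answer_id": "a0", "dimension_value_id": "v1", "reason": "r", "label": "neutral"},
     {"answer_id": "a1", "dimension_value_id": "v1", "reason": "r", "label": "neutral"}],
]})
table = likelihood_table(reply, ["v0", "v1"], ["a0", "a1"])
marginal = answer_marginal([0.5, 0.5], table)
```

Requesting the grid in a fixed row\-major order, with the ids echoed back in every cell, lets the caller detect skipped or reordered cells instead of silently misaligning labels.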

#### AR\-Bench\-SP

##### System prompt\.

You are an expert thinking puzzle analyst\. Evaluate how likely each candidate explanation is to be the correct hidden explanation of the puzzle under different assumptions about the puzzle’s hidden dimensions\.

##### User prompt\.

<PUZZLE\>

\{ambiguous\_prompt\}

</PUZZLE\>

<PUZZLE\_CONTEXT\>

\{meta\_context\}

</PUZZLE\_CONTEXT\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUES\_WITH\_IDS\>

\{dimension\_values\_with\_ids\}

</DIMENSION\_VALUES\_WITH\_IDS\>

<POSSIBLE\_EXPLANATIONS\_WITH\_IDS\>

\{possible\_answers\_with\_ids\}

</POSSIBLE\_EXPLANATIONS\_WITH\_IDS\>

Definition:

A puzzle dimension is a hidden aspect of the scenario where knowing its true value would explain the puzzle\.

<DIMENSION\_VALUES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a possible value that the <DIMENSION\_NAME\> could take\.

<POSSIBLE\_EXPLANATIONS\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a candidate hidden explanation of the <PUZZLE\>\.

Let values\[i\] be the i\-th element of <DIMENSION\_VALUES\_WITH\_IDS\>\.

Let answers\[j\] be the j\-th element of <POSSIBLE\_EXPLANATIONS\_WITH\_IDS\>\.

Task \(row\-major order\):

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(answers\)\-1:

\- Assume the hidden explanation is such that <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- Judge how likely it is that answers\[j\]\["text"\] is the correct hidden explanation of the <PUZZLE\> under that assumption\.

What to generate:

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(answers\)\-1:

\- answer\_id: the id of the candidate explanation being evaluated, i\.e\. answers\[j\]\["id"\]

\- dimension\_value\_id: the id of the dimension value being evaluated, i\.e\. values\[i\]\["id"\]

\- reason: a short one\-sentence explanation of why answers\[j\] is likely/neutral/unlikely to be the correct hidden explanation given <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the clues in <PUZZLE\_CONTEXT\>, answers\[j\]\["text"\] is the expected correct hidden explanation of the <PUZZLE\>\.

\- "neutral": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the clues in <PUZZLE\_CONTEXT\>, answers\[j\]\["text"\] is a plausible explanation but not specifically supported; there is insufficient evidence to say it is more or less correct than other explanations\.

\- "unlikely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the clues in <PUZZLE\_CONTEXT\>, answers\[j\]\["text"\] is not expected to be the correct hidden explanation of the <PUZZLE\>\.

Constraints:

\- Use ONLY the information provided in <PUZZLE\>, <PUZZLE\_CONTEXT\>, and the assumed dimension value\.

\- Do NOT solve the <PUZZLE\> itself\. Focus ONLY on judging how likely each candidate explanation is to be correct under different assumptions about the puzzle dimension\.

\- Do NOT rewrite or restate the <PUZZLE\> or <PUZZLE\_CONTEXT\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every combination of dimension value and candidate explanation\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"evaluations":\[

\[

\{\{

"answer\_id":string,

"dimension\_value\_id":string,

"reason":string,

"label":string

\}\},

\.\.\. // one object for each candidate explanation

\],

\.\.\. // one array for each dimension value

\]

\}\}

The "evaluations" field must contain exactly \{num\_dimension\_values\} arrays \(one per dimension value\)\.

Each inner array must contain exactly \{num\_possible\_answers\} objects \(one per candidate explanation\)\.

#### iCraft\-MD

##### System prompt\.

You are an experienced physician evaluating how likely each candidate diagnosis or clinical decision is to be correct under different assumptions about the patient’s clinical state\.

##### User prompt\.

<CLINICAL\_QUESTION\>

\{ambiguous\_prompt\}

</CLINICAL\_QUESTION\>

<PATIENT\_INFORMATION\>

\{meta\_context\}

</PATIENT\_INFORMATION\>

<DIMENSION\_NAME\>

\{dimension\_name\}

</DIMENSION\_NAME\>

<DIMENSION\_VALUES\_WITH\_IDS\>

\{dimension\_values\_with\_ids\}

</DIMENSION\_VALUES\_WITH\_IDS\>

<DIAGNOSTIC\_OPTIONS\_WITH\_IDS\>

\{possible\_answers\_with\_ids\}

</DIAGNOSTIC\_OPTIONS\_WITH\_IDS\>

Definition:

A clinical dimension is a specific clinical factor where different values would point toward different diagnoses or clinical decisions\.

<DIMENSION\_VALUES\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a possible value that the <DIMENSION\_NAME\> could take\.

<DIAGNOSTIC\_OPTIONS\_WITH\_IDS\> is a list of dicts with "id" and "text" fields\. Each dict corresponds to a candidate diagnosis or clinical decision for the <CLINICAL\_QUESTION\>\.

Let values\[i\] be the i\-th element of <DIMENSION\_VALUES\_WITH\_IDS\>\.

Let answers\[j\] be the j\-th element of <DIAGNOSTIC\_OPTIONS\_WITH\_IDS\>\.

Task \(row\-major order\):

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(answers\)\-1:

\- Assume the patient’s true clinical state is <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- Judge how likely it is that answers\[j\]\["text"\] is the correct diagnosis or clinical decision for <CLINICAL\_QUESTION\> under that assumption\.

What to generate:

For i = 0\.\.len\(values\)\-1:

For j = 0\.\.len\(answers\)\-1:

\- answer\_id: the id of the candidate diagnosis being evaluated, i\.e\. answers\[j\]\["id"\]

\- dimension\_value\_id: the id of the dimension value being evaluated, i\.e\. values\[i\]\["id"\]

\- reason: a short one\-sentence explanation of why answers\[j\] is likely/neutral/unlikely to be the correct diagnosis given <DIMENSION\_NAME\> = values\[i\]\["text"\]\.

\- label: one of "likely", "neutral", or "unlikely" according to the following definitions:

\- "likely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the patient information in <PATIENT\_INFORMATION\>, answers\[j\]\["text"\] is the expected correct diagnosis or clinical decision for <CLINICAL\_QUESTION\>\.

\- "neutral": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the patient information in <PATIENT\_INFORMATION\>, answers\[j\]\["text"\] is a plausible diagnosis but not specifically supported; there is insufficient clinical evidence to say it is more or less correct than other options\.

\- "unlikely": Given <DIMENSION\_NAME\> = values\[i\]\["text"\] and the patient information in <PATIENT\_INFORMATION\>, answers\[j\]\["text"\] is not expected to be the correct diagnosis or clinical decision for <CLINICAL\_QUESTION\>\.

Constraints:

\- Use ONLY the information provided in <CLINICAL\_QUESTION\>, <PATIENT\_INFORMATION\>, and the assumed dimension value\.

\- Do NOT answer the <CLINICAL\_QUESTION\> itself\. Focus ONLY on judging how likely each candidate diagnosis is to be correct under different assumptions about the clinical dimension\.

\- Do NOT rewrite or restate the <CLINICAL\_QUESTION\> or <PATIENT\_INFORMATION\>\.

\- label must be one of "likely", "neutral", or "unlikely"\.

\- The output must include an entry for every combination of dimension value and candidate diagnosis\.

Output format:

Return STRICT JSON only with the following schema:

\{\{

"evaluations":\[

\[

\{\{

"answer\_id":string,

"dimension\_value\_id":string,

"reason":string,

"label":string

\}\},

\.\.\. // one object for each candidate diagnosis

\],

\.\.\. // one array for each dimension value

\]

\}\}

The "evaluations" field must contain exactly \{num\_dimension\_values\} arrays \(one per dimension value\)\.

Each inner array must contain exactly \{num\_possible\_answers\} objects \(one per candidate diagnosis\)\.
