Zero-Shot Goal Recognition with Large Language Models
Summary
This paper presents the first systematic zero-shot evaluation of frontier large language models as goal recognizers on classical PDDL planning benchmarks, finding that some models scale with evidence while others rely on world-knowledge priors regardless of observation accumulation.
# Zero-Shot Goal Recognition with Large Language Models
Source: [https://arxiv.org/html/2605.15333](https://arxiv.org/html/2605.15333)
Kin Max Piamolini Gusmão (PUCRS, Porto Alegre, Brazil, kin.gusmao@edu.pucrs.br), Nathan Gavenski (King’s College London, London, England, nathan.schneider_gavenski@kcl.ac.uk), Nir Oren (University of Aberdeen, Aberdeen, Scotland, n.oren@abdn.ac.uk), Felipe Meneguzzi (University of Aberdeen / PUCRS, Aberdeen, Scotland, felipe.meneguzzi@abdn.ac.uk)
###### Abstract
Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.
## 1 Introduction
Large language models \(LLMs\) have recently been adopted as foundational components in a wide range of applications, including those that require reasoning\. Recent benchmarking shows that these models, including reasoning models, have reached near\-parity with classical planners on well\-known planning domains\[[21](https://arxiv.org/html/2605.15333#bib.bib155),[3](https://arxiv.org/html/2605.15333#bib.bib156)\]\. This competence, however, rests on a brittle foundation since performance collapses when domain and object names are obfuscated, revealing that LLMs exploit world knowledge rather than performing genuine symbolic reasoning\[[20](https://arxiv.org/html/2605.15333#bib.bib151),[19](https://arxiv.org/html/2605.15333#bib.bib152)\]\.
Goal and plan recognition are related but fundamentally different tasks [[13](https://arxiv.org/html/2605.15333#bib.bib73),[14](https://arxiv.org/html/2605.15333#bib.bib15),[8](https://arxiv.org/html/2605.15333#bib.bib138)]. Given a set of candidate goals and a partial sequence of an agent’s observed actions, the task is to infer which goal the agent is pursuing (goal recognition) or which plan it is following to achieve that goal (plan recognition). Unlike planning, goal recognition is abductive: the recogniser does not construct a valid action sequence but instead asks which hypothesis best explains the evidence so far. Evaluating partial evidence against a fixed hypothesis set aligns naturally with how LLMs leverage the world knowledge encoded in their training, making goal recognition a principled benchmark for probing that foundational planning knowledge. A model that relies solely on world-knowledge priors will produce stable predictions regardless of how much evidence accumulates, revealing the floor set by domain familiarity; a model that scales with observations demonstrates capabilities beyond that floor.
This paper investigates whether LLM competence on known planning domains transfers to goal recognition. We run the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on classical PDDL benchmarks, comparing them against the landmark-based approach of Pereira et al. [[12](https://arxiv.org/html/2605.15333#bib.bib34)]. We evaluate on four domains chosen for their contrasting characteristics (Section [4](https://arxiv.org/html/2605.15333#S4)): blocks-world, campus, driverlog, and dock-worker robots. Our main contributions are:

(i) The first systematic zero-shot evaluation of frontier LLMs on classical PDDL goal recognition benchmarks, comparing against landmark-based goal recognition [[12](https://arxiv.org/html/2605.15333#bib.bib34)].

(ii) Identification of three qualitatively distinct LLM behaviour patterns: GPT-4o achieves competitive recall at low observation counts but never improves with more evidence; GPT-OSS and GPT-5.4 scale with observations, approaching landmark-based accuracy at full observations; and Qwen 3.5 fails almost entirely on all but the simplest domain.

(iii) Evidence that high spread is a shared failure mode for LLMs at sparse observations, and that accumulating evidence is not a solution for all models.

(iv) An extension of the mystery-domain findings of Valmeekam et al. [[20](https://arxiv.org/html/2605.15333#bib.bib151),[19](https://arxiv.org/html/2605.15333#bib.bib152)]. Even on familiar domains with standard names, goal recognition shows that frontier LLM performance deteriorates as plans and the hypothesis space grow larger, suggesting that the bottleneck is not whether LLMs know the domain but whether they can integrate a growing sequence of observations into a coherent prediction.

(v) A qualitative analysis of model reasoning traces that identifies shared failure modes (confabulation, overconfidence at sparse observations, position bias, and errors at high observability) and derives concrete interventions to guide the development of LLM-based goal recognisers.
## 2 Background
Goal recognition is the problem of inferring the goal of an observed agent from a partial sequence of its actions. This task is intimately related to planning, which consists of determining how to achieve a given goal $G$ from a known initial state $\mathcal{I}$. Rather than searching for a path from $\mathcal{I}$ towards $G$, however, goal recognition solves the inverse problem of inferring the true goal $G^{*}$ of the observed agent from its observed actions. Ramírez and Geffner [[13](https://arxiv.org/html/2605.15333#bib.bib73),[14](https://arxiv.org/html/2605.15333#bib.bib15)] formalise this problem in terms of classical planning, and Mirsky et al. [[8](https://arxiv.org/html/2605.15333#bib.bib138)] survey its breadth.
###### Definition 1 (Goal Recognition Problem).
A *goal recognition problem* is a tuple $\mathcal{P}_{GR} = \langle \Xi, \mathcal{I}, \mathcal{G}, \mathcal{O} \rangle$, where $\Xi$ is a planning domain, $\mathcal{I}$ is the initial state, $\mathcal{G} = \{G_1, \ldots, G_n\}$ is a finite set of candidate goal hypotheses, and $\mathcal{O} = \langle a_1, \ldots, a_k \rangle$ is an ordered sequence of observed actions. The true goal $G^{*} \in \mathcal{G}$ is unknown.
Given a goal recognition problem, the task is to return a ranking or probability distribution over $\mathcal{G}$ given $\mathcal{O}$, identifying $G^{*}$ as the most likely candidate. In practice, many approaches return a probability distribution over all $G \in \mathcal{G}$ rather than a strict ordering, and allow for ties. We note that the notion of *partial observability* in goal recognition differs from that of partially observable Markov decision processes [[7](https://arxiv.org/html/2605.15333#bib.bib126)]. The recogniser never observes the world state directly; it receives only an ordered subset of the actions the agent executes, and must infer the goal from that incomplete trace.
###### Definition 2 (Observation).
Let $\pi = \langle a_1, \ldots, a_m \rangle$ be the full plan executed by the observed agent. An *observation sequence* is a tuple $\mathcal{O} = \langle a_{i_1}, \ldots, a_{i_k} \rangle$ where $1 \leq i_1 < i_2 < \cdots < i_k \leq m$, that is, an ordered subsequence of $\pi$ that preserves action order but may omit actions at any position. The *observation percentage* is $100 \cdot k/m$.
Because $\mathcal{O}$ is a subsequence rather than a prefix, the recogniser has no information about actions at unobserved positions. Observations consist of ground actions rather than state snapshots; this is the standard assumption in classical goal recognition [[13](https://arxiv.org/html/2605.15333#bib.bib73),[7](https://arxiv.org/html/2605.15333#bib.bib126)]. We assume all observed actions are accurate, i.e., that the partial observation is an ordered subset of $\pi$. Noisy observations, in which some observed actions are replaced by incorrect ones, form a distinct problem setting that we explicitly exclude. At low observation percentages the evidence is sparse and many candidate goals remain consistent with $\mathcal{O}$; as the recogniser receives more actions, disambiguation becomes easier.
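Definition 2 can be checked mechanically. The following is a minimal sketch, assuming actions are represented as plain strings; the function names `is_observation_of` and `observation_percentage` are illustrative and do not come from the benchmark code.

```python
from typing import Sequence


def is_observation_of(obs: Sequence[str], plan: Sequence[str]) -> bool:
    """Return True if obs is an order-preserving subsequence of plan."""
    remaining = iter(plan)
    # `action in remaining` advances the iterator until it finds a match,
    # so later observations can only match later plan positions.
    return all(action in remaining for action in obs)


def observation_percentage(obs: Sequence[str], plan: Sequence[str]) -> float:
    """Observation percentage 100 * k / m from Definition 2."""
    return 100.0 * len(obs) / len(plan)


# Hypothetical blocks-world style plan, for illustration only.
plan = ["unstack a b", "putdown a", "pickup c", "stack c a"]
print(is_observation_of(["unstack a b", "stack c a"], plan))
print(observation_percentage(["unstack a b", "stack c a"], plan))
```

Note that a sequence with the same actions in a different order is rejected, which is what separates a subsequence from an arbitrary subset.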
Pereira et al. [[12](https://arxiv.org/html/2605.15333#bib.bib34)] develop an efficient approach to goal recognition based on planning *landmarks*. A landmark for a goal $G$ is an action that any agent pursuing $G$ must eventually execute, regardless of which specific plan it follows; the set of all such actions is $L(G)$. Computing $L(G)$ is expensive, in the worst case as hard as planning itself [[4](https://arxiv.org/html/2605.15333#bib.bib1)], but it can be approximated in polynomial time via delete-relaxed landmark extraction and done offline once per problem. The resulting landmark sets encode knowledge about each goal analogously to how an LLM’s parameters encode world knowledge, i.e., a costly upfront computation that makes subsequent inference cheap.
Given an observation sequence $\mathcal{O}$, recognition then proceeds by counting which landmarks each candidate goal has already achieved. The landmark score of $G$ is the fraction of its landmarks observed so far:

$$\mathrm{score}(G, \mathcal{O}) = \frac{|\{l \in L(G) : l \in \mathcal{O}\}|}{|L(G)|}$$

The most-likely goal set is $\hat{\mathcal{G}} = \operatorname{argmax}_{G \in \mathcal{G}} \mathrm{score}(G, \mathcal{O})$. Because landmark achievement accumulates monotonically as $\mathcal{O}$ grows, $\mathrm{score}(G^{*}, \mathcal{O})$ tends to increase relative to false goals with each additional observation, so recognition accuracy improves steadily as the observation percentage rises. This monotonic scaling behaviour is the key property against which we compare LLM-based recognisers in Section [5](https://arxiv.org/html/2605.15333#S5).
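The scoring rule above can be sketched in a few lines. This is an illustration of the simplified description given in this section (action landmarks, fraction observed), not the implementation of Pereira et al.; the landmark sets are assumed to be precomputed, and the names `landmark_score` and `most_likely_goals` are hypothetical.

```python
from fractions import Fraction


def landmark_score(landmarks: set, observations: list) -> Fraction:
    """Fraction of a goal's landmarks that appear in the observations."""
    if not landmarks:
        return Fraction(0)  # assumption: a goal without landmarks scores zero
    observed = set(observations)
    return Fraction(len(landmarks & observed), len(landmarks))


def most_likely_goals(landmark_sets: dict, observations: list) -> set:
    """Return every goal tied at the maximum landmark score."""
    scores = {g: landmark_score(ls, observations) for g, ls in landmark_sets.items()}
    best = max(scores.values())
    return {g for g, s in scores.items() if s == best}


# Hypothetical landmark sets for two candidate goals.
landmark_sets = {
    "G1": {"pickup a", "stack a b"},
    "G2": {"pickup c", "stack c d"},
}
print(most_likely_goals(landmark_sets, ["pickup a"]))
print(most_likely_goals(landmark_sets, []))
```

Exact fractions are used so that ties between goals are detected without floating-point tolerance. With no observations every goal scores zero and the whole candidate set is returned, which is the high-spread situation that the precision metric penalises.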
## 3 Zero-Shot Goal Recognition with Large Language Models
A recogniser requires world knowledge of planning domains for zero\-shot goal recognition, as well as the ability to update its goal assessments as new observations arrive, a demand that pattern matching alone cannot satisfy\. This section introduces our approach to zero\-shot goal recognition using LLMs\. We first describe our prompting strategy, then situate goal recognition within the complexity landscape of planning to motivate it as a natural benchmark for the foundational planning knowledge of LLMs\.
### 3.1 Prompting Strategy
We treat zero\-shot goal recognition as a structured prompting task\. We provide no fine\-tuning, few\-shot examples, or domain\-specific context\. The evaluation is strictly zero\-shot to isolate the contribution of the LLM’s world knowledge\.
The prompt has four parts. The first part assigns the model a role as an expert in AI planning and describes each input it will receive: (i) the PDDL domain with its action schemas (names, typed parameters, preconditions, and effects); (ii) the problem template with the list of objects, the initial state, and a placeholder for the goal; (iii) the candidate goal set $\mathcal{G}$, where each $G \in \mathcal{G}$ is a conjunction of facts; and (iv) the observation sequence $\mathcal{O}$ as an ordered list of grounded actions. The second part states the task: for each $G \in \mathcal{G}$, assign a likelihood score in $[0,1]$ such that the scores sum to one, then return the set $\hat{\mathcal{G}}$ of goals tied at the highest score, together with a free-text explanation of the reasoning. The third part inserts the concrete problem instance. The fourth part provides a rigid output format template, listing each hypothesis with its score, the most-likely goal set $\hat{\mathcal{G}}$, and the reasoning trace, and instructs the model to follow it exactly, without additional text or formatting. This explicit format constraint allows us to parse the model’s output programmatically without post-hoc cleaning. Figure [2](https://arxiv.org/html/2605.15333#A1.F2) in Appendix [A](https://arxiv.org/html/2605.15333#A1) shows the full template.
The normalised scoring requirement follows directly from the formalisation of goal recognition in Section [2](https://arxiv.org/html/2605.15333#S2). The standard output of a recogniser is a probability distribution over $\mathcal{G}$ [[13](https://arxiv.org/html/2605.15333#bib.bib73),[14](https://arxiv.org/html/2605.15333#bib.bib15)], and requiring scores to sum to one operationalises this within the prompt. The free-text explanation serves one purpose: it provides interpretable evidence of the model’s decision process for the qualitative analysis in Section [6](https://arxiv.org/html/2605.15333#S6). Because the output format places the reasoning block *after* the scores and the predicted goal set, the explanation is a post-hoc rationale rather than a chain-of-thought that guides the likelihood assignments [[22](https://arxiv.org/html/2605.15333#bib.bib158)]. We adopt the standard correctness criterion of Ramírez and Geffner [[13](https://arxiv.org/html/2605.15333#bib.bib73),[14](https://arxiv.org/html/2605.15333#bib.bib15)] and Pereira et al. [[12](https://arxiv.org/html/2605.15333#bib.bib34)], whereby a recogniser is correct on a problem when the true goal $G^{*}$ is contained in the returned set $\hat{\mathcal{G}}$. This is the same criterion used by the landmark-based baseline against which we compare.
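To make the scoring requirement and the correctness criterion concrete, the sketch below assumes the model's output has already been parsed into a mapping from goal labels to scores. The function names and the tolerance values are illustrative assumptions, not part of the evaluation code.

```python
import math


def is_normalised(scores: dict, tol: float = 1e-6) -> bool:
    """Check that every score lies in [0, 1] and that the scores sum to one."""
    in_range = all(0.0 <= s <= 1.0 for s in scores.values())
    return in_range and math.isclose(sum(scores.values()), 1.0, abs_tol=tol)


def tied_at_highest(scores: dict, tol: float = 1e-9) -> set:
    """Goals whose score equals the maximum, up to a small tolerance."""
    best = max(scores.values())
    return {g for g, s in scores.items() if math.isclose(s, best, abs_tol=tol)}


def is_correct(returned: set, true_goal: str) -> bool:
    """A recogniser is correct when the true goal is in the returned set."""
    return true_goal in returned


# Hypothetical parsed scores for three candidate goals.
scores = {"G1": 0.4, "G2": 0.4, "G3": 0.2}
predicted = tied_at_highest(scores)
print(is_normalised(scores), predicted, is_correct(predicted, "G2"))
```

Because membership in the returned set is all that correctness requires, a tie that includes the true goal counts as correct here, which is why spread and precision are reported alongside accuracy.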
### 3.2 Goal Recognition as an LLM Benchmark
Ramírez and Geffner [[13](https://arxiv.org/html/2605.15333#bib.bib73),[14](https://arxiv.org/html/2605.15333#bib.bib15)] formalise goal recognition as a compilation to planning: for each candidate goal $G \in \mathcal{G}$, two planning problems are solved to determine whether the observations $\mathcal{O}$ are consistent with an optimal plan for $G$. Since classical propositional planning is PSPACE-complete [[1](https://arxiv.org/html/2605.15333#bib.bib2)] and this compilation preserves that hardness [[9](https://arxiv.org/html/2605.15333#bib.bib3)], goal recognition via planning is no easier than planning itself. Pereira et al. [[12](https://arxiv.org/html/2605.15333#bib.bib34)] avoid the compilation entirely, and separate the complexity of understanding the domain dynamics from that of incorporating observation evidence. First, their approach extracts landmarks for each goal hypothesis ($L(G)$). Then, it scores each candidate goal by comparing observations with the precomputed landmarks. This yields a complexity of $O(|\mathcal{G}| \times |L(G)| \times |\mathcal{O}|)$, where $|L(G)|$ is bounded by the number of ground actions in the domain. LLM-based recognition requires a single forward pass over the prompt, with cost dominated by the size of the input rather than the planning state space.
The key structural distinction is between *generative* and *discriminative* tasks. Planning asks a model to *produce* a valid action sequence from scratch, requiring consistent long-horizon causal reasoning. Goal recognition asks a model to *evaluate* whether an observed sequence $\mathcal{O}$ is consistent with each candidate goal $G \in \mathcal{G}$, a discriminative task. Given the same PDDL representation, goal recognition never requires generating actions, since the model already observed the agent’s behaviour, and the model’s role is to interpret it. This distinction suggests that LLMs should be structurally better suited to goal recognition than to planning, since evaluating consistency with world knowledge is closer to what LLMs reliably do well [[20](https://arxiv.org/html/2605.15333#bib.bib151),[19](https://arxiv.org/html/2605.15333#bib.bib152)].
Beyond its practical interest, goal recognition constitutes a principled benchmark for the foundational planning knowledge encoded in LLMs\. A model with strong domain knowledge will have prior beliefs about which goals are likely that suffice at low observability, where little evidence is available to discriminate between hypotheses\. A model relying solely on such priors, however, will produce approximately constant predictions regardless of how much evidence accumulates; correct recognition at higher observability additionally requires updating those beliefs incrementally as each new action is observed, a capability that world knowledge alone cannot provide\. We therefore hypothesise that LLMs will perform competitively against landmark\-based recognition at low observation rates, where foundational knowledge priors suffice, but that only models capable of integrating new evidence will improve further\. The resulting divergence \(a world\-knowledge floor at sparse observations and an evidence\-dependent ceiling thereafter\) is the central signal our experiments are designed to measure, and is what distinguishes world\-knowledge retrieval from genuine evidential reasoning\.
## 4 Experimental Setup
To understand the capabilities of LLMs in solving goal recognition problems, we compare them against a state-of-the-art symbolic approach, namely the landmark-based method by Pereira et al. [[12](https://arxiv.org/html/2605.15333#bib.bib34)]. For the LLMs, we select two commercial and two open-weight models. Among commercial models, we choose OpenAI’s GPT-4o, for offering a balanced trade-off between cost and performance, and GPT-5.4, the frontier OpenAI model at the time of our experiments, for representing the upper bound of available proprietary capability. Among open-weight models, we choose GPT-OSS-20B, the open-weight variant of the GPT line ([OpenAI’s GPT-OSS](https://openai.com/index/introducing-gpt-oss/); available via [Ollama](https://ollama.com/library/gpt-oss)), enabling a direct comparison between open and closed models within the same training lineage, and Qwen 3.5-9B ([Alibaba’s Qwen 3.5-9B](https://qwen.ai/blog?id=qwen3.5); available via [Ollama](https://ollama.com/library/qwen3.5:9b)), for representing a competitive open-weight model from a distinct model family and organisation, deployable locally without API dependency. We run the experiments on a machine equipped with a 2 GHz Intel Xeon Gold 6138 CPU with 32 GB of RAM and an NVIDIA RTX 2080 Ti GPU with 11 GB of VRAM.
We draw problems from the goal and plan recognition benchmark of Pereira and Meneguzzi [[11](https://arxiv.org/html/2605.15333#bib.bib9)], which comprises 15 PDDL domains. Because running commercial LLM APIs at scale is costly, we evaluate on four domains selected for their varied characteristics: *Blocks World*, *Campus*, *Driverlog*, and *Dock-Worker Robots* (*DWR*). *Blocks World* is a classic combinatorial domain with a vocabulary well-represented in LLM training data; failure here cannot be attributed to an unfamiliar vocabulary, making it the primary test of evidence-integration ability. *Campus* is a simple control domain with short plans and a small goal space; it serves as a minimum competence check, since any approach that fails here is operationally useless. *Driverlog* is a moderately difficult domain with medium-length plans, a large goal space, and real-world logistics vocabulary (drivers, trucks, packages, locations), testing whether LLM world knowledge transfers to a combinatorially harder setting. *DWR* is a difficult domain for both classical search and LLM-based recognisers, with very long plans, large goal spaces, and many interdependent objects to track across steps.
For each domain, the problems are divided into five observability degrees: 10%, 30%, 50%, 70%, and 100% of the plan length. Recall that we define observability in goal recognition as the percentage of the plan’s actions that are observed by the recogniser. The lower the observability degree, therefore, the harder the problem, since more goals may remain consistent with the observed actions. Pereira and Meneguzzi [[11](https://arxiv.org/html/2605.15333#bib.bib9)] generate each problem by randomly retaining actions from the full plan with probability equal to the observability degree, running this sampling up to three times per plan and discarding duplicate observation sets; at 100% observability the recogniser always receives the full plan, so each plan yields exactly one problem, resulting in fewer problems at full observability than at partial observability. The domains also vary in the average number of goal hypotheses per problem and the average number of landmarks per hypothesis, both of which directly affect the landmark-based baseline.
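The sampling procedure described above can be sketched as follows. This is an assumption-laden illustration rather than the benchmark generator itself: the function names, the fixed seed, and the handling of empty samples (kept here) are choices made for this example.

```python
import random


def sample_observation(plan, degree, rng):
    """Keep each action independently with probability `degree`, preserving order."""
    if degree >= 1.0:
        return tuple(plan)  # full observability always yields the full plan
    return tuple(action for action in plan if rng.random() < degree)


def generate_problems(plan, degree, attempts=3, seed=0):
    """Sample up to `attempts` observation sequences and drop duplicates."""
    rng = random.Random(seed)
    unique = []
    for _ in range(attempts):
        obs = sample_observation(plan, degree, rng)
        if obs not in unique:
            unique.append(obs)
    return unique


# Hypothetical five-step plan.
plan = ["a1", "a2", "a3", "a4", "a5"]
print(len(generate_problems(plan, 1.0)))  # one problem at full observability
print(generate_problems(plan, 0.5))
```

At a degree of 1.0 all attempts produce the same sequence, so deduplication leaves a single problem per plan. This is consistent with the smaller problem counts reported at 100% observability.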
In this work, we use four main metrics: (i) spread in goals ($S{\in}\mathcal{G}$), (ii) accuracy (Acc), (iii) precision (Prec), and (iv) recognition time. Spread is the average cardinality of the set of goal hypotheses that the recogniser returns as most likely. Accuracy is the percentage of problems on which the recogniser identifies the intended goal, i.e., on which the true goal $G^{*}$ is included in the set of returned hypotheses. Since a recogniser that always returns the entire candidate goal set achieves perfect accuracy, we also report the average precision per problem, i.e., the average fraction of returned hypotheses that are correct, obtained by dividing the number of correctly recognised goals (1 if the returned set contains $G^{*}$, 0 otherwise) by the number of returned hypotheses. Finally, we report the recognition time in seconds as a measure of efficiency for the different models and the baseline.
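As a worked illustration of these definitions, the sketch below computes spread, accuracy, and precision from a list of (returned set, true goal) pairs. The function name `evaluate` and the convention that an empty returned set contributes zero precision are assumptions made for this example.

```python
def evaluate(results):
    """Compute spread, accuracy (%), and precision (%) over a list of problems.

    `results` holds one (returned_goals, true_goal) pair per problem.
    """
    n = len(results)
    spread = sum(len(returned) for returned, _ in results) / n
    hits = [true_goal in returned for returned, true_goal in results]
    accuracy = 100.0 * sum(hits) / n
    # Per-problem precision: 1 / |returned| when correct, 0 otherwise.
    # An empty returned set can never be correct, so it contributes 0.
    precision = 100.0 * sum(
        1.0 / len(returned) if hit else 0.0
        for (returned, _), hit in zip(results, hits)
    ) / n
    return {"spread": spread, "accuracy": accuracy, "precision": precision}


# Four hypothetical problems: a clean hit, a tied hit, an empty answer, a miss.
results = [
    ({"G1"}, "G1"),
    ({"G1", "G2"}, "G2"),
    (set(), "G1"),
    ({"G3"}, "G1"),
]
print(evaluate(results))
# {'spread': 1.0, 'accuracy': 50.0, 'precision': 37.5}
```

The example shows how empty answers pull the average spread below the size of any successful answer, and how a tied hit counts fully towards accuracy but only partially towards precision.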
For the LLM\-based recognisers, we also report the token usage and cost in USD \(for the commercial models\) as additional metrics of interest\. We report not only the total number of tokens \(TT\) used, but also the breakdown between input \(PT\) and output \(CT\) tokens\. The idea behind these metrics is understanding which model is more efficient in terms of token usage versus recognition performance\. We provide an overview of these metrics in Section[5](https://arxiv.org/html/2605.15333#S5), and the complete information on token usage and cost in Table[3](https://arxiv.org/html/2605.15333#A2.T3)in Appendix[B](https://arxiv.org/html/2605.15333#A2)\.
To augment the quantitative analysis, we perform a qualitative analysis of the reasoning reported by the different models to better understand their behaviour. In it, we examine how LLMs approach the problem, identify patterns in the reasoning that could explain the differences in performance between models, and highlight failure points common to all models. Due to space constraints, all examples for the qualitative analysis appear in Appendix [C](https://arxiv.org/html/2605.15333#A3). Finally, we outline a potential agenda for further improving LLMs on goal recognition tasks.
## 5 Results
Table 1: Results for Landmark-Based, Qwen 3.5, GPT-OSS, GPT-4o, and GPT-5.4 goal recognition on all problems. ALL DOMAINS shows the average across all problems in the four domains. Column prefixes: LM = Landmark-Based, Qwen = Qwen 3.5, OSS = GPT-OSS, 4o = GPT-4o, 5.4 = GPT-5.4. S is the spread $S{\in}\mathcal{G}$, Acc and Prec are percentages, and Time is in seconds.

| Domain | $\lvert\mathcal{L}\rvert$ | $\mathcal{O}$ % | $\lvert P\rvert$ | $\lvert\mathcal{G}\rvert$ | $\lvert\mathcal{O}\rvert$ | LM S | LM Acc | LM Prec | LM Time | Qwen S | Qwen Acc | Qwen Prec | Qwen Time | OSS S | OSS Acc | OSS Prec | OSS Time | 4o S | 4o Acc | 4o Prec | 4o Time | 5.4 S | 5.4 Acc | 5.4 Prec | 5.4 Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLOCKSWORLD | 16.6 | 10 | 246 | 20.3 | 1.8 | 1.3 | 44.31 | 38.04 | 2.9 | 0.0 | 0.41 | 0.41 | 517.6 | 6.6 | 63.10 | 20.23 | 201.4 | 5.3 | 43.50 | 14.05 | 12.1 | 6.4 | 53.25 | 11.63 | 16.7 |
| | | 30 | 246 | 20.3 | 4.9 | 1.2 | 56.10 | 51.02 | 2.8 | 0.0 | 0.41 | 0.41 | 536.2 | 4.6 | 63.41 | 32.51 | 137.5 | 2.6 | 30.49 | 13.66 | 12.1 | 2.5 | 36.59 | 18.80 | 17.0 |
| | | 50 | 246 | 20.3 | 7.7 | 1.1 | 63.41 | 58.88 | 2.2 | 0.0 | 1.63 | 1.42 | 675.7 | 3.5 | 69.51 | 43.71 | 126.1 | 2.2 | 33.33 | 19.50 | 10.9 | 2.0 | 45.93 | 27.28 | 17.6 |
| | | 70 | 246 | 20.3 | 11.1 | 1.2 | 83.74 | 78.32 | 1.9 | 0.0 | 2.03 | 2.03 | 538.7 | 2.5 | 80.49 | 63.21 | 135.8 | 2.6 | 34.15 | 18.15 | 10.8 | 1.6 | 54.07 | 39.72 | 18.0 |
| | | 100 | 92 | 20.3 | 14.5 | 1.4 | 100.00 | 85.69 | 1.7 | 0.3 | 22.83 | 21.74 | 671.8 | 1.4 | 98.91 | 94.68 | 121.2 | 2.0 | 41.30 | 26.18 | 10.8 | 1.5 | 77.17 | 59.87 | 16.0 |
| CAMPUS | 9.7 | 10 | 15 | 2.0 | 1.0 | 1.1 | 86.67 | 80.00 | 2.8 | 1.0 | 80.00 | 80.00 | 204.5 | 1.4 | 93.33 | 73.33 | 27.1 | 1.9 | 100.00 | 53.33 | 5.7 | 1.5 | 93.33 | 70.00 | 6.6 |
| | | 30 | 15 | 2.0 | 2.0 | 1.1 | 86.67 | 80.00 | 2.9 | 0.1 | 0.00 | 0.00 | 407.5 | 1.1 | 86.67 | 86.67 | 32.1 | 1.7 | 100.00 | 66.67 | 4.9 | 1.5 | 93.33 | 70.00 | 8.4 |
| | | 50 | 15 | 2.0 | 3.0 | 1.1 | 93.33 | 86.67 | 2.8 | 0.2 | 6.67 | 6.67 | 475.4 | 1.1 | 86.67 | 80.00 | 29.0 | 1.4 | 100.00 | 80.00 | 7.2 | 1.3 | 100.00 | 83.33 | 7.9 |
| | | 70 | 15 | 2.0 | 4.5 | 1.0 | 100.00 | 100.00 | 2.9 | 1.0 | 100.00 | 100.00 | 202.8 | 1.0 | 86.67 | 86.67 | 28.8 | 1.3 | 100.00 | 86.67 | 5.7 | 1.1 | 100.00 | 96.67 | 9.7 |
| | | 100 | 15 | 2.0 | 5.4 | 1.0 | 100.00 | 100.00 | 2.8 | 0.0 | 0.00 | 0.00 | 466.4 | 1.1 | 86.67 | 83.33 | 25.9 | 1.1 | 100.00 | 96.67 | 6.1 | 1.0 | 100.00 | 100.00 | 9.0 |
| DRIVERLOG | 10.8 | 10 | 84 | 7.1 | 2.6 | 1.3 | 45.24 | 40.58 | 4.4 | 0.0 | 0.00 | 0.00 | 538.8 | 2.8 | 65.48 | 36.65 | 57.6 | 2.7 | 45.24 | 16.92 | 9.3 | 2.3 | 44.05 | 20.26 | 15.0 |
| | | 30 | 84 | 7.1 | 7.0 | 1.2 | 59.52 | 51.98 | 4.4 | 0.0 | 0.00 | 0.00 | 514.4 | 2.0 | 70.24 | 48.38 | 79.9 | 1.8 | 30.95 | 17.75 | 9.4 | 1.3 | 41.67 | 31.79 | 15.9 |
| | | 50 | 84 | 7.1 | 11.2 | 1.3 | 77.38 | 67.36 | 4.4 | 0.4 | 22.62 | 20.83 | 588.3 | 1.4 | 86.90 | 75.58 | 74.2 | 1.5 | 44.05 | 28.27 | 9.4 | 1.1 | 53.57 | 47.42 | 16.5 |
| | | 70 | 84 | 7.1 | 15.6 | 1.2 | 92.86 | 84.13 | 4.3 | 1.0 | 80.95 | 80.36 | 346.4 | 1.2 | 95.24 | 90.67 | 75.2 | 1.7 | 47.62 | 30.99 | 9.6 | 1.2 | 73.81 | 63.29 | 16.3 |
| | | 100 | 28 | 7.1 | 21.7 | 1.2 | 100.00 | 90.48 | 4.1 | 0.0 | 0.00 | 0.00 | 498.8 | 1.0 | 100.00 | 100.00 | 50.0 | 1.3 | 46.43 | 35.71 | 8.5 | 1.2 | 96.43 | 83.93 | 14.0 |
| DWR | 32.0 | 10 | 84 | 7.3 | 5.7 | 1.2 | 38.10 | 35.12 | 4.4 | 0.0 | 0.00 | 0.00 | 455.8 | 4.0 | 73.81 | 26.76 | 77.6 | 2.3 | 35.71 | 16.26 | 8.4 | 2.2 | 25.00 | 9.07 | 16.2 |
| | | 30 | 84 | 7.3 | 16.0 | 1.1 | 64.29 | 61.31 | 4.4 | 0.7 | 34.52 | 30.06 | 645.2 | 2.5 | 78.57 | 50.10 | 105.5 | 1.5 | 25.00 | 17.26 | 9.3 | 1.4 | 23.81 | 15.16 | 16.8 |
| | | 50 | 84 | 7.3 | 26.2 | 1.1 | 72.62 | 70.24 | 4.5 | 0.1 | 5.95 | 5.95 | 527.9 | 2.0 | 85.71 | 59.47 | 110.9 | 1.8 | 26.19 | 15.97 | 8.3 | 1.1 | 26.19 | 21.73 | 17.0 |
| | | 70 | 84 | 7.3 | 36.9 | 1.0 | 90.48 | 88.10 | 3.8 | 0.1 | 3.57 | 3.57 | 567.7 | 1.5 | 90.48 | 78.08 | 131.6 | 1.5 | 26.19 | 19.35 | 8.5 | 1.1 | 30.95 | 26.79 | 17.1 |
| | | 100 | 28 | 7.3 | 51.9 | 1.0 | 92.86 | 92.86 | 3.6 | 0.8 | 67.86 | 67.86 | 524.8 | 1.0 | 92.86 | 92.86 | 105.6 | 1.9 | 25.00 | 13.07 | 7.8 | 1.0 | 53.57 | 51.79 | 14.7 |
| ALL DOMAINS | 17.3 | 10 | 429 | 14.5 | 2.7 | 1.3 | 44.76 | 39.43 | 3.5 | 0.1 | 3.03 | 3.03 | 498.7 | 4.9 | 67.30 | 27.59 | 133.6 | 4.1 | 44.29 | 16.42 | 10.6 | 4.6 | 47.32 | 14.86 | 15.9 |
| | | 30 | 429 | 14.5 | 7.4 | 1.2 | 59.44 | 54.23 | 3.5 | 0.1 | 6.99 | 6.12 | 548.8 | 3.5 | 68.53 | 40.96 | 116.3 | 2.2 | 31.93 | 17.02 | 10.7 | 2.0 | 37.06 | 22.42 | 16.5 |
| | | 50 | 429 | 14.5 | 11.8 | 1.1 | 69.00 | 63.73 | 3.1 | 0.1 | 6.76 | 6.29 | 622.7 | 2.7 | 76.69 | 54.30 | 109.6 | 2.0 | 36.36 | 22.64 | 10.0 | 1.7 | 45.45 | 32.09 | 16.9 |
| | | 70 | 429 | 14.5 | 16.8 | 1.1 | 87.41 | 82.13 | 2.8 | 0.3 | 21.21 | 21.10 | 495.0 | 2.0 | 85.55 | 72.32 | 119.4 | 2.2 | 37.53 | 23.29 | 9.9 | 1.4 | 55.01 | 43.79 | 17.2 |
| | | 100 | 163 | 14.1 | 21.3 | 1.2 | 98.77 | 89.06 | 2.5 | 0.3 | 24.54 | 23.93 | 597.9 | 1.3 | 96.93 | 94.23 | 97.5 | 1.8 | 44.79 | 32.05 | 9.4 | 1.3 | 78.53 | 66.31 | 14.8 |
Table [1](https://arxiv.org/html/2605.15333#S5.T1) summarises accuracy, precision, spread, and time in seconds for all five recognisers across the four evaluation domains. The central result is that the recognisers divide into three qualitatively distinct groups. GPT-OSS leads all LLMs at every observation percentage, achieving 67.3% accuracy at 10% observations across all domains, surpassing landmark-based in accuracy (44.8%), and scaling to 96.9% at full observations, nearly matching landmark-based (98.8%). GPT-5.4 shows partial scaling, rising from 47.3% to 78.5% in accuracy, but falls short of GPT-OSS throughout. GPT-4o, by contrast, remains effectively flat: 44.3% in accuracy at 10% observations and 44.8% at 100%, a net change of about half a percentage point despite a tenfold increase in evidence, with a dip to 31.9% at 30% observations in between. Qwen 3.5 fails almost entirely on all but the simplest domain, reaching only 24.5% at full observations across all domains.
The accuracy gap between GPT-4o and GPT-OSS is clearest on *Blocks World*, where landmark-based and GPT-4o are nearly tied at 10% observations (44.3% and 43.5% respectively), but diverge completely thereafter: landmark-based reaches 100% at full observations, GPT-OSS reaches 98.9%, and GPT-4o never exceeds 43.5%. On *DWR*, the hardest domain, GPT-OSS reaches 92.9% at full observations, matching landmark-based exactly, while GPT-4o remains near 25% regardless of evidence. *Campus*, the simplest domain, is the exception: GPT-4o achieves 100% accuracy at all observability degrees, suggesting that world knowledge alone suffices when the goal space is small ($|\mathcal{G}| = 2$) and the vocabulary familiar. Qwen 3.5 fails even on *Campus* at 30% and 100% observations, indicating that local inference at this scale is insufficient for reliable output formatting, not merely for reasoning.




Figure 1: Average accuracy, precision, and spread at different observation percentages across all domains\.

Accuracy alone is an incomplete picture because a recogniser returning all candidate goals achieves perfect recall at the cost of zero discriminative power\. Spread \(the number of returned hypotheses\) and precision track this tradeoff directly\. We use the spread to compute the precision on each problem, dividing the number of correctly identified goals \(which can be at most one\) by the number of returned hypotheses\. At 10% observations, GPT\-OSS and GPT\-4o start with similarly high spread \(4\.9 and 4\.1 respectively, versus 1\.3 for landmark\-based\), reflecting that both LLMs hedge broadly when evidence is sparse\. Even with a higher spread, GPT\-OSS achieves 27\.6% precision at 10% observations due to its higher accuracy, compared to 16\.4% for GPT\-4o and 39\.4% for landmark\-based\. As observations accumulate, GPT\-OSS resolves its uncertainty: spread falls to 1\.3 at full observations, approaching landmark\-based \(1\.2\), and precision rises to 94\.2%, the best overall at full observations\. GPT\-4o’s precision gain at higher observability is due solely to its decreasing spread: the spread falls only to 1\.8 while accuracy stays flat, so precision reaches just 32\.1% at full observations, compared to 89\.1% for landmark\-based and 94\.2% for GPT\-OSS\. The precision gap between GPT\-OSS and GPT\-4o at full observations is thus as striking as the accuracy gap: both can retrieve the correct goal, but only GPT\-OSS commits to it\. GPT\-5\.4, on the other hand, shows a more balanced improvement in both accuracy and spread, leading to a steady increase in precision from 14\.9% at 10% observations to 66\.3% at 100%\. As with its accuracy, Qwen 3\.5’s precision is low across all observation percentages, reaching only 23\.9% at full observations\. Qwen 3\.5’s average spread is also mostly below 1, because it often fails completely and returns no candidate goals at all\. Figure [1](https://arxiv.org/html/2605.15333#S5.F1) shows these trends more clearly\. We discuss the reasoning inconsistencies that lead to this kind of anomaly in Section [6](https://arxiv.org/html/2605.15333#S6)\.
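The relationship between accuracy, spread, and precision described above can be made concrete with a minimal sketch. The `Recognition` container and the function names are hypothetical and not taken from the paper's code; treating an empty output as precision 0 is also an assumption, chosen to match the observation that empty outputs pull the average spread below 1.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    """One recognition problem (hypothetical container, not from the paper)."""
    returned: list[str]  # hypotheses tied for the highest score; may be empty
    true_goal: str


def accuracy(r: Recognition) -> float:
    """1.0 if the true goal is among the returned hypotheses, else 0.0."""
    return 1.0 if r.true_goal in r.returned else 0.0


def spread(r: Recognition) -> int:
    """Number of returned hypotheses."""
    return len(r.returned)


def precision(r: Recognition) -> float:
    """Correctly identified goals (at most one) divided by the spread.

    Assumption: an empty output counts as precision 0.0.
    """
    if not r.returned:
        return 0.0
    return accuracy(r) / spread(r)


def averages(results: list[Recognition]) -> tuple[float, float, float]:
    """Average accuracy, precision, and spread over a set of problems."""
    n = len(results)
    return (
        sum(accuracy(r) for r in results) / n,
        sum(precision(r) for r in results) / n,
        sum(spread(r) for r in results) / n,
    )


# Illustrative inputs only: a broad hedge, a committed answer, and an empty output.
results = [
    Recognition(returned=["g1", "g2", "g3", "g4"], true_goal="g1"),
    Recognition(returned=["g2"], true_goal="g2"),
    Recognition(returned=[], true_goal="g3"),
]
avg_accuracy, avg_precision, avg_spread = averages(results)
print(avg_accuracy, avg_precision, avg_spread)
```

In this toy input the first problem is counted as accurate but contributes only 0.25 precision because four hypotheses were returned, while the empty output lowers all three averages.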
In terms of recognition time, the landmark\-based baseline is by far the most efficient, with its average time ranging from 1\.7 to 4\.5 seconds across all domains and observation percentages\. Among the LLM\-based recognisers, GPT\-4o is the fastest, with the average recognition time ranging from 4\.9 to 12\.1 seconds, followed by GPT\-5\.4\. Qwen 3\.5 and GPT\-OSS are the slowest, because they are non\-commercial models running on local hardware\.
Regarding token usage, Table [2](https://arxiv.org/html/2605.15333#S5.T2) shows that GPT\-4o is overall the most efficient model in terms of completion tokens, while GPT\-5\.4 consumes fewer prompt tokens\. GPT\-4o is also the most token\-efficient model in total, with an average of around 3,000 tokens per problem, compared to around 3,300 for GPT\-5\.4\. GPT\-OSS and Qwen 3\.5 are the least efficient, with around 11,500 total tokens for GPT\-OSS and 18,700 for Qwen 3\.5\. For the commercial models, the average cost per problem stays between 0\.01 and 0\.02 USD\. We report the complete information on the cost of commercial models in terms of the average number of input and output tokens, as well as the average cost per recognition, in Table [3](https://arxiv.org/html/2605.15333#A2.T3) in Appendix [B](https://arxiv.org/html/2605.15333#A2)\.
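Table 2: Token usage and cost in USD \(avg\. per problem\) for LLMs on all domains\. PT, CT, and TT denote prompt, completion, and total tokens\.

| Domain | 𝒪 % | \|P\| | \|𝒢\| | \|𝒪\| | Qwen 3\.5 PT | Qwen 3\.5 CT | Qwen 3\.5 TT | GPT\-OSS PT | GPT\-OSS CT | GPT\-OSS TT | GPT\-4o PT | GPT\-4o CT | GPT\-4o TT | GPT\-4o Cost | GPT\-5\.4 PT | GPT\-5\.4 CT | GPT\-5\.4 TT | GPT\-5\.4 Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ALL DOMAINS | 10 | 429 | 14\.5 | 2\.7 | 2135\.9 | 13656\.4 | 15792\.3 | 2003\.3 | 6824\.3 | 8827\.6 | 2031\.0 | 956\.0 | 2987\.0 | 0\.01 | 2030\.0 | 1216\.4 | 3246\.4 | 0\.02 |
| ALL DOMAINS | 30 | 429 | 14\.5 | 7\.4 | 2202\.2 | 16258\.8 | 18460\.9 | 2155\.0 | 10301\.8 | 12456\.7 | 2089\.8 | 918\.5 | 3008\.3 | 0\.01 | 2088\.8 | 1220\.2 | 3309\.0 | 0\.02 |
| ALL DOMAINS | 50 | 429 | 14\.5 | 11\.8 | 2283\.6 | 18912\.6 | 21196\.2 | 2211\.3 | 10626\.4 | 12837\.6 | 2146\.1 | 903\.5 | 3049\.6 | 0\.01 | 2145\.1 | 1241\.6 | 3386\.6 | 0\.02 |
| ALL DOMAINS | 70 | 429 | 14\.5 | 16\.8 | 2354\.5 | 15529\.1 | 17883\.7 | 2273\.6 | 11220\.5 | 13494\.0 | 2208\.4 | 914\.2 | 3122\.6 | 0\.01 | 2207\.4 | 1232\.5 | 3439\.9 | 0\.02 |
| ALL DOMAINS | 100 | 163 | 14\.1 | 21\.3 | 2395\.4 | 18072\.7 | 20468\.0 | 2331\.3 | 7836\.0 | 10167\.4 | 2266\.3 | 882\.9 | 3149\.1 | 0\.01 | 2265\.3 | 1127\.8 | 3393\.0 | 0\.02 |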
## 6 Discussion
Given the quantitative results, we now analyse the reasoning process of LLMs when performing goal recognition, and provide examples where we believe improvements could allow LLMs to perform better\. We select examples from the *Blocks World* domain, as it is commonly used in the goal recognition literature and is adopted for the planning experiments of [Corrêa et al\.](https://arxiv.org/html/2605.15333#bib.bib156) \[[2025](https://arxiv.org/html/2605.15333#bib.bib156)\]\. The figures containing the reasoning excerpts are collected in Appendix [C](https://arxiv.org/html/2605.15333#A3) for optional review\.
#### GPT\-4o
GPT\-4o’s reasoning is consistently inconsistent with its own score assignments, a pattern that holds across all observability levels\. At low observability, the model tends to anchor on the few observed actions and check which hypotheses include the affected blocks, but does so overconfidently: in Figure [3](https://arxiv.org/html/2605.15333#A3.F3), it selects a single hypothesis despite multiple others being equally compatible with the observation, and its justification contains factual errors about the preconditions\. This overconfidence is compounded by *confabulation*, where the model produces plausible\-sounding justifications that directly contradict its own scores: in Figure [4](https://arxiv.org/html/2605.15333#A3.F4) it claims no hypothesis mentions the relevant block, when several do; in Figure [5](https://arxiv.org/html/2605.15333#A3.F5) it states it favoured hypotheses involving block P, yet the correct hypothesis does not prominently include P; and in Figure [6](https://arxiv.org/html/2605.15333#A3.F6) the reasoning identifies discriminating facts and then collapses to a uniform distribution without explanation\. At higher observability, the failure mode shifts but the underlying problem persists: in Figure [10](https://arxiv.org/html/2605.15333#A3.F10) the reasoning names block P as relevant but the scores ignore it; in Figure [8](https://arxiv.org/html/2605.15333#A3.F8) the model identifies a relation between blocks A and J that is not supported by the observations; and in Figure [9](https://arxiv.org/html/2605.15333#A3.F9) the conclusion abruptly introduces blocks M and O, neither present in the observations nor in the preceding reasoning, to justify assigning maximum score to an incorrect hypothesis\. Across all cases, evidence that is named in the reasoning is not enforced in the scores, and evidence that drives the scores is not grounded in the reasoning\.
#### GPT\-5\.4
GPT\-5\.4 exhibits more sophisticated planning reasoning than GPT\-4o, but this comes with its own failure modes\. Unlike GPT\-4o, it identifies precondition violations arising from partial observability and reasons around them rather than ignoring them: in Figure [14](https://arxiv.org/html/2605.15333#A3.F14), it correctly infers that two of the three observed actions jointly establish \(ON P A\), identifies the unique hypothesis containing that fact, and selects it correctly\. It also reasons about hypotheses as goals that the observed sequence may be partially progressing toward, which is a principled strategy under partial observability, as illustrated in Figure [16](https://arxiv.org/html/2605.15333#A3.F16), where it narrows to two candidates and selects the one most consistent with the observation as a partial plan\. However, this same tendency to commit early becomes a liability when evidence is insufficient or the problem is complex: under full observability it selects the wrong hypothesis in Figure [16](https://arxiv.org/html/2605.15333#A3.F16) by misreading a completed sequence as an incomplete one, it derives an entirely incorrect 10\-block final stack in Figure [15](https://arxiv.org/html/2605.15333#A3.F15) despite having all actions available, and in Figure [17](https://arxiv.org/html/2605.15333#A3.F17) it overcommits to a single hypothesis at 30% observability despite initially narrowing correctly to a set of four candidates that included the correct goal\. The net effect is higher precision when the reasoning is right, but lower accuracy when it is not\.
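To illustrate what identifying a precondition violation under partial observability involves, the following minimal sketch replays an observation sequence over a set of facts and records every observed action whose preconditions do not hold in the tracked state. It is a sketch under stated assumptions, not a description of how any of the evaluated models reason: the `Action` container, the `track_state` function, and the Blocks World facts are all hypothetical.

```python
from typing import NamedTuple


class Action(NamedTuple):
    """A ground STRIPS-style action (hypothetical container, not from the paper)."""
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset


def track_state(initial, observations):
    """Replay observed actions and flag unmet preconditions.

    An unmet precondition signals that at least one action between the
    previous observation and this one was not observed.
    """
    state = set(initial)
    gaps = []
    for act in observations:
        missing = act.pre - state
        if missing:
            gaps.append((act.name, missing))
        # Apply the effects anyway, so later observations are still tracked.
        state = (state - act.delete) | act.add
    return state, gaps


# Illustrative Blocks World fragment: the pick-up of p was not observed.
initial = {"(ontable p)", "(clear p)", "(ontable a)", "(clear a)", "(handempty)"}
stack_p_a = Action(
    name="stack p a",
    pre=frozenset({"(holding p)", "(clear a)"}),
    add=frozenset({"(on p a)", "(clear p)", "(handempty)"}),
    delete=frozenset({"(holding p)", "(clear a)"}),
)
state, gaps = track_state(initial, [stack_p_a])
print(gaps)
print(sorted(state))
```

Note that the tracked state still contains `(ontable p)`, because the unobserved pick-up action never removed it. A state reconstructed from partial observations is therefore only an approximation of the true state.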
#### GPT\-OSS
GPT\-OSS recurrently applies a fact\-matching strategy: in most problems, it tries to determine the facts that hold in the state resulting from executing the observed actions\. In Figure [11](https://arxiv.org/html/2605.15333#A3.F11), the model identifies the correct hypothesis by reasoning about which facts hold in the state after applying the single observed action, and assigns each hypothesis a score proportional to its number of matched facts\. This yields more consistent reasoning, but it still frequently leads to incorrect inferences, as in the example from Figure [12](https://arxiv.org/html/2605.15333#A3.F12)\. There, the model does its best with the provided information and correctly identifies the hypothesis that best matches the observed actions, but because information is missing due to partial observability, it interprets an intermediate state as a final one\. Since its reasoning explicitly acknowledges the missing information, stating that it is ignoring the violated preconditions, it could in principle distribute the score over a larger number of matching hypotheses\. In another problem, by contrast, the same model states that because the observations contain precondition violations it cannot infer the intended goal, and hence assigns a uniform score to all hypotheses \(Figure [13](https://arxiv.org/html/2605.15333#A3.F13)\)\. This showcases an inconsistency in the model’s behaviour, which is expected to some degree given the nature of LLMs\. It is nonetheless interesting that the model handles the same issue differently across problems, considering that the prompt does not specify that the observation sequence can be a non\-contiguous partial subset of the full observation sequence\.
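The fact\-matching behaviour described above can be approximated by the following minimal sketch, which progresses the initial state through the observed effects and scores each hypothesis in proportion to the number of its facts that hold in the resulting state. This is an illustrative reconstruction, not the model's actual procedure: the function name, the uniform fallback when nothing matches, and the Blocks World facts are assumptions.

```python
def fact_matching_scores(initial_state, observed_effects, hypotheses):
    """Score hypotheses by the number of goal facts holding after the observations.

    observed_effects: list of (add, delete) fact sets, one pair per observed action.
    hypotheses: mapping from a hypothesis name to its set of goal facts.
    Preconditions are deliberately not checked, mirroring a recogniser that
    ignores violations caused by unobserved actions.
    """
    state = set(initial_state)
    for add, delete in observed_effects:
        state = (state - delete) | add
    matches = {name: len(goal & state) for name, goal in hypotheses.items()}
    total = sum(matches.values())
    if total == 0:
        # Assumption: fall back to a uniform distribution when nothing matches.
        return {name: 1.0 / len(hypotheses) for name in hypotheses}
    return {name: count / total for name, count in matches.items()}


# Illustrative Blocks World fragment with two candidate goals.
initial = {"(ontable a)", "(ontable p)", "(clear a)", "(clear p)", "(handempty)"}
pick_up_p = ({"(holding p)"}, {"(ontable p)", "(clear p)", "(handempty)"})
stack_p_a = ({"(on p a)", "(clear p)", "(handempty)"}, {"(holding p)", "(clear a)"})
hypotheses = {
    "h1": {"(on p a)", "(ontable a)"},
    "h2": {"(on a p)", "(ontable p)"},
}

no_evidence = fact_matching_scores(initial, [], hypotheses)
full_evidence = fact_matching_scores(initial, [pick_up_p, stack_p_a], hypotheses)
print(no_evidence)
print(full_evidence)
```

With no observations both hypotheses match one fact of the initial state and receive equal scores. Because the score depends only on the current state, an intermediate state reached under partial observability can be mistaken for a final one, which is the failure described for Figure 12.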
#### Qwen 3\.5
Of the four evaluated models, Qwen 3\.5 was the worst in terms of reasoning inconsistencies and hallucinations\. For many problems across all domains and observability degrees, the model did not provide any response at all, returning a completely empty output even while consuming tokens\. In other cases, it correctly identifies a precondition violation under partial observability, but then simply states that the sequence of actions is impossible and refuses to provide any further response, as in the example from Figure [18](https://arxiv.org/html/2605.15333#A3.F18)\. In one problem, Qwen 3\.5 provides unsound reasoning, showing explicit confusion and even mixing up block names; it also fails to deliver a response in the required format and selects a hypothesis that is not in the hypothesis list \(Figure [19](https://arxiv.org/html/2605.15333#A3.F19)\)\. In another problem, with 30% observability, the model fails completely and generates a nonsensical response consisting only of “You put x on table\.”, as shown in Figure [20](https://arxiv.org/html/2605.15333#A3.F20)\. A similar case is shown in Figure [21](https://arxiv.org/html/2605.15333#A3.F21), where under full observability the model’s entire response is “\[:answer A\]”\. In both cases, the model used around 12,000 completion tokens, meaning there was reasoning effort but still no sound response\. Still, there are cases where Qwen 3\.5 reasons sensibly and narrows down to the single correct hypothesis, as in the example from Figure [22](https://arxiv.org/html/2605.15333#A3.F22)\. There, even under 50% observability, the model correctly identifies the state resulting from the observed actions and narrows down to two plausible hypotheses\. It then identifies an inconsistency between one of these hypotheses and the part of the initial state left unchanged by the observations, and hence correctly selects the other hypothesis as the most likely one\.
#### Common Patterns
Despite their differences in capability and failure mode, the models examined here share a consistent set of weaknesses that together point toward a productive research agenda\. All models struggle to translate stated reasoning into consistent score assignments, all exhibit overconfidence when observations are sparse, and none reliably improve their predictions as additional evidence accumulates, which is the defining discriminative demand of goal recognition\. All models also display position bias consistent with ‘lost\-in\-the\-middle’ \[[6](https://arxiv.org/html/2605.15333#bib.bib160)\]: attention degrades for hypotheses placed in the middle of the candidate list, with GPT\-4o exhibiting a pronounced tendency to favour hypotheses at the first position\.

Yet the broader picture is not discouraging\. The models collectively demonstrate that world knowledge of planning domains transfers meaningfully to zero\-shot goal recognition: even without fine\-tuning, all models perform non\-trivially at low observability, and the progression from GPT\-4o to GPT\-5\.4 shows that stronger general reasoning capabilities do address specific failure modes, notably precondition awareness and partial\-plan reasoning\. This suggests that the bottleneck is not knowledge but the ability to integrate evidence incrementally\.

Several targeted interventions could plausibly close the remaining gap\. Restructuring the output format to elicit reasoning before scores would allow deliberate comparison of candidate goals to genuinely constrain the prediction\. Training models to maintain an explicit running state representation updated action by action would directly address the state\-tracking failures observed at higher observability\. Improving calibration so that models express uncertainty proportional to the available evidence would reduce the precision\-accuracy trade\-off that characterises GPT\-5\.4’s behaviour\. A systematic prompt sensitivity analysis varying hypothesis ordering and position would further quantify the extent of position bias and inform more robust evaluation protocols \[[18](https://arxiv.org/html/2605.15333#bib.bib159)\]\.

More broadly, goal recognition provides an unusually clean benchmark for the foundational planning knowledge of LLMs: the ground truth is unambiguous, the evidence is structured and incremental, and the failure modes are precisely measurable\. The plateau at low observability reveals the floor set by world\-knowledge priors; the divergence in scaling behaviour across models reveals how far that foundational knowledge alone can carry recognition and where evidence integration becomes the limiting factor\.
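One way to set up the hypothesis\-ordering analysis suggested above is sketched below: the candidate list is shuffled repeatedly, a recogniser is queried on each ordering, and the list position of the selected hypothesis is tallied. This is a minimal sketch under stated assumptions. The `recognise` callable is a hypothetical stand\-in for a call to an LLM\-based recogniser, and the two lambdas are toy recognisers used only to show what the histogram looks like at the extremes.

```python
import random
from collections import Counter


def position_histogram(hypotheses, recognise, trials, seed=0):
    """Tally the list position of the selected hypothesis over shuffled orderings.

    recognise: hypothetical callable that takes the ordered hypothesis list
    and returns the single hypothesis it selects.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        order = list(hypotheses)
        rng.shuffle(order)
        chosen = recognise(order)
        counts[order.index(chosen)] += 1
    return counts


hyps = ["g1", "g2", "g3", "g4", "g5"]

# Toy recogniser that always selects whatever is listed first.
first_biased = position_histogram(hyps, lambda order: order[0], trials=200)

# Toy recogniser that selects the same goal regardless of where it is listed.
content_driven = position_histogram(hyps, lambda order: "g3", trials=200)

print(first_biased)
print(content_driven)
```

A recogniser driven by content rather than position spreads its selections over the positions roughly in proportion to where the shuffling placed its preferred goal, whereas a position\-biased one concentrates them on a few indices.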
## 7 Related Work
The relationship between LLMs and automated planning is one of the most actively studied questions in recent AI research\. Early systematic evaluation using the PlanBench benchmark shows that LLMs fail to reliably solve classical planning problems; on standard IPC domains zero\-shot performance is near chance, and it collapses entirely when domain and object names are obfuscated to remove semantic cues \[[20](https://arxiv.org/html/2605.15333#bib.bib151), [19](https://arxiv.org/html/2605.15333#bib.bib152)\]\. Kambhampati et al\. \[[5](https://arxiv.org/html/2605.15333#bib.bib153)\] argue against using LLMs as standalone planners and propose the LLM\-Modulo framework, which embeds LLMs within a loop controlled by external formal verifiers\. A complementary finding is that LLMs cannot reliably verify their own planning outputs \[[16](https://arxiv.org/html/2605.15333#bib.bib154)\]\. More recent benchmarking finds that frontier LLMs and reasoning models have closed much of the gap with classical planners on standard IPC domains \[[21](https://arxiv.org/html/2605.15333#bib.bib155), [3](https://arxiv.org/html/2605.15333#bib.bib156)\]; obfuscated variants continue to expose substantial performance drops, confirming that this progress rests on world\-knowledge exploitation rather than symbolic search\.
Chiari et al\. \[[2](https://arxiv.org/html/2605.15333#bib.bib129)\] introduce GRNet, which recasts goal recognition as a deep learning classification task: a neural network trained on plan traces from a given domain learns to predict the goal from a partial observation sequence\. Serina et al\. \[[15](https://arxiv.org/html/2605.15333#bib.bib97)\] extend this line to online settings, where the recogniser updates its prediction incrementally as new observations arrive\. These approaches achieve high accuracy and fast inference but require domain\-specific training data, making them unsuitable for zero\-shot evaluation across novel domains\. A distinct line of work addresses goal recognition under noisy or incomplete observations using neuro\-symbolic methods \[[10](https://arxiv.org/html/2605.15333#bib.bib33), [17](https://arxiv.org/html/2605.15333#bib.bib11)\]; that setting differs from ours, in which observations are assumed noise\-free but partial\. Our contribution is complementary to both lines: we ask whether world knowledge encoded in pre\-trained LLMs suffices for zero\-shot goal recognition, without any domain\-specific training\.
## 8 Conclusion
This paper provides the first systematic evaluation of LLMs on the task of goal recognition over PDDL benchmarks\. Given the substantial challenge that planning poses to LLMs \[[19](https://arxiv.org/html/2605.15333#bib.bib152)\] and the recently perceived progress \[[3](https://arxiv.org/html/2605.15333#bib.bib156)\], these benchmarks pose a closely related but qualitatively different challenge to language models and to how they employ world knowledge\. Our experiments show that even models that approximate the performance of symbolic algorithms for planning \(GPT\-5\.4\) collapse when performing goal recognition in the same domain\. These failures occur even without any obfuscation, and the models fall into three qualitatively distinct profiles\. GPT\-OSS leads all models at every observation percentage, scaling from 67\.3% to 96\.9% accuracy and approaching the landmark\-based baseline \(98\.8%\) at full observations\. GPT\-5\.4 shows partial scaling \(47\.3% to 78\.5%\)\. GPT\-4o remains flat at 44% regardless of evidence\. Qwen 3\.5 fails almost entirely due to output\-format failures at this model scale\.
The key discriminator between profiles is not domain knowledge but the ability to integrate additional evidence: GPT\-4o matches the landmark\-based baseline at 10% observations, where world\-knowledge priors suffice, then diverges completely as evidence accumulates\. This is the predicted failure mode from Section [3\.2](https://arxiv.org/html/2605.15333#S3.SS2): world\-knowledge retrieval masquerading as goal recognition\. The case of GPT\-4o indicates that more domain knowledge can act as a disadvantage when a model does not reason over the observations\. Our prompt places likelihood scores before the free\-text reasoning block as a deliberate methodological control\. It requires each model to commit to a distribution over goals using only its domain knowledge and the observation evidence, preventing the reasoning trace from post\-hoc steering the scores toward a rationalised answer\. GPT\-4o’s flat accuracy curve therefore reflects a genuine failure to integrate evidence rather than an artefact of the model being steered toward prior\-anchored scores by its own reasoning\. Whether reversing the order, reasoning first and scoring after, would improve evidence integration for models like GPT\-4o is an open question and a natural next step\.
Goal recognition provides a principled benchmark for LLM reasoning: the accuracy floor at sparse observations measures the domain familiarity encoded in training, while the scaling slope over increasing observations reveals whether a model integrates evidence or merely retrieves priors\. The contrast between GPT\-4o \(flat slope, high floor\) and GPT\-OSS \(rising slope\) makes this separation empirically clean\. Several directions remain open: \(i\) a mystery\-domain ablation to test whether GPT\-OSS’s scaling survives name obfuscation\[[20](https://arxiv.org/html/2605.15333#bib.bib151)\]; \(ii\) chain\-of\-thought prompting with reasoning placed before scores, to test whether evidence integration improves; and \(iii\) evaluation under noisy observations, where some observed actions are incorrect, a setting in which LLMs’ approximate reasoning may provide an advantage over classical methods that assume noiseless input\.
## References
- \[1\] T\. Bylander \(1994\) The computational complexity of propositional STRIPS planning\. Artificial Intelligence 69 \(1\-2\), pp\. 165–204\.
- \[2\] M\. Chiari, A\. E\. Gerevini, F\. Percassi, L\. Putelli, I\. Serina, and M\. Olivato \(2023\) Goal recognition as a deep learning task: the GRNet approach\. In Proceedings of the Thirty\-Third International Conference on Automated Planning and Scheduling, S\. Koenig, R\. Stern, and M\. Vallati \(Eds\.\), pp\. 560–568\. External Links: [Document](https://dx.doi.org/10.1609/ICAPS.V33I1.27237)
- \[3\] A\. B\. Corrêa, A\. G\. Pereira, and J\. Seipp \(2025\) The 2025 planning performance of frontier large language models\. CoRR abs/2511\.09378\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2511.09378), 2511\.09378
- \[4\] J\. Hoffmann, J\. Porteous, and L\. Sebastia \(2004\) Ordered landmarks in planning\. Journal of Artificial Intelligence Research 22 \(1\), pp\. 215–278\. External Links: [Document](https://dx.doi.org/10.1613/jair.1492)
- \[5\] S\. Kambhampati, K\. Valmeekam, L\. Guan, K\. Stechly, M\. Verma, S\. Bhambri, L\. Saldyt, and A\. Murthy \(2024\) LLMs can’t plan, but can help planning in LLM\-modulo frameworks\. In Proceedings of the Forty\-First International Conference on Machine Learning\. External Links: [Link](https://arxiv.org/abs/2402.01817)
- \[6\] N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\) Lost in the middle: how language models use long contexts\. Transactions of the Association for Computational Linguistics 12, pp\. 157–173\. External Links: [Link](https://aclanthology.org/2024.tacl-1.9/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00638)
- \[7\] F\. Meneguzzi and R\. F\. Pereira \(2021\) A survey on goal recognition as planning\. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Z\. Zhou \(Ed\.\), pp\. 4524–4532\. External Links: [Document](https://dx.doi.org/10.24963/ijcai.2021/616)
- \[8\] R\. Mirsky, S\. Keren, and C\. Geib \(2021\) Introduction to symbolic plan and goal recognition\. Vol\. 16, Morgan & Claypool Publishers\.
- \[9\] B\. Nebel \(2000\) On the compilability and expressive power of propositional planning formalisms\. Journal of Artificial Intelligence Research 12, pp\. 271–315\.
- \[10\] R\. F\. Pereira, F\. Fuggitti, F\. Meneguzzi, and G\. De Giacomo \(2023\-12\) Temporally extended goal recognition in fully observable non\-deterministic domain models: temporally extended goal recognition in FOND planning\. Applied Intelligence\. External Links: ISSN 1573\-7497, [Document](https://dx.doi.org/10.1007/s10489-023-05087-1)
- \[11\] R\. F\. Pereira and F\. Meneguzzi \(2017\) Goal and plan recognition datasets using classical planning domains\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.825878)
- \[12\] R\. F\. Pereira, N\. Oren, and F\. Meneguzzi \(2020\) Landmark\-based approaches for goal recognition as planning\. Artificial Intelligence 279, pp\. 103217\. External Links: [Document](https://dx.doi.org/10.1016/j.artint.2019.103217), [Link](https://linkinghub.elsevier.com/retrieve/pii/S0004370219300013)
- \[13\] M\. Ramírez and H\. Geffner \(2009\) Plan recognition as planning\. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, C\. Boutilier \(Ed\.\), pp\. 1778–1783\. External Links: [Link](http://ijcai.org/Proceedings/09/Papers/296.pdf)
- \[14\] M\. Ramírez and H\. Geffner \(2010\) Probabilistic plan recognition using off\-the\-shelf classical planners\. In AAAI Conference on Artificial Intelligence, pp\. 1121–1126\. External Links: [Link](http://dblp.org/rec/conf/aaai/RamirezG10)
- \[15\] L\. Serina, M\. Chiari, A\. E\. Gerevini, L\. Putelli, and I\. Serina \(2025\) Towards efficient online goal recognition through deep learning\. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pp\. 1895–1903\. External Links: [Link](https://iris.unibs.it/bitstream/11379/628225/1/3709347.3743826.pdf), ISBN 9798400714269
- \[16\] K\. Stechly, K\. Valmeekam, and S\. Kambhampati \(2025\) On the self\-verification limitations of large language models on reasoning and planning tasks\. In Proceedings of the Thirteenth International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/2402.11686)
- \[17\] D\. Tesch, L\. Amado, and F\. Meneguzzi \(2024\) Real\-time goal recognition using approximations in Euclidean space\. In 27th European Conference on Artificial Intelligence \(ECAI\)\. External Links: [Link](http://www.meneguzzi.eu/felipe/pubs/ecai-vector-gr-2024.pdf)
- \[18\] K\. Valmeekam, M\. Marquez, A\. Olmo, S\. Sreedharan, and S\. Kambhampati \(2023\) PlanBench: an extensible benchmark for evaluating large language models on planning and reasoning about change\. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA\.
- \[19\] K\. Valmeekam, M\. Marquez, S\. Sreedharan, and S\. Kambhampati \(2023\) On the planning abilities of large language models – a critical investigation\. In Advances in Neural Information Processing Systems, Vol\. 36\. External Links: [Link](https://arxiv.org/abs/2302.06706)
- \[20\] K\. Valmeekam, A\. Olmo, S\. Sreedharan, and S\. Kambhampati \(2022\) Large language models still can’t plan \(a benchmark for LLMs on planning and reasoning about change\)\. In NeurIPS 2022 Workshop on Foundation Models for Decision Making\. External Links: [Link](https://arxiv.org/abs/2206.10498)
- \[21\] K\. Valmeekam, K\. Stechly, A\. Gundawar, and S\. Kambhampati \(2025\) A systematic evaluation of the planning and scheduling abilities of the reasoning model o1\. Transactions on Machine Learning Research\. External Links: ISSN 2835\-8856, [Link](https://openreview.net/forum?id=FkKBxp0FhR)
- \[22\] J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA\. External Links: ISBN 9781713871088
## Appendix A Prompt Template
You are an expert at goal recognition in AI planning using planning domain descriptions provided in PDDL format\.
You will be provided with the following:
\- Planning domain description in PDDL\. This contains the definition of the possible types of objects and the description of the actions that can be performed in the environment\. Each action is comprised by the following:
\- A name
\- A list of parameters \(with their types\)
\- A list of precondition represented as a logical AND of facts, meaning facts that must hold in the state for the action to be applicable
\- A list of effects represented as a logical AND of facts, meaning facts that will be added to the state or removed from the state if they are preceded by the reserved word "not" after applying the action to the state
\- Problem template, which is the PDDL description of the corresponding planning problem with the initial state and a placeholder for the goal state, comprising the following:
\- A list of objects in the problem with their types
\- An initial state represented as a simple list of facts separated by newline characters that hold in the initial state of the problem
\- A placeholder for the goal state which would be represented as a logical AND of facts, meaning facts that must hold in the final and goal state of the problem
\- A list of goal hypotheses where one of them is the correct goal\. In this list each goal is a list of facts\. Each fact in a goal is separated by comma "," characters\. Each goal hypothesis in the list is separated by newline characters\.
\- A list of observed actions that were performed by the agent\. This is a simple list of executed actions separated by newline characters\.
Your task is to analyze the problem provided to you and determine which goals among the goal hypotheses are the most likely to be the correct ones, given the observed actions of an agent\.
For each goal hypothesis, assign a likelihood score \(real number between 0 and 1\) that represents how likely it is that the observed agent is pursuing that goal, based on the actions observed\. This score MUST be normalized, so the sum of the likelihood scores of all goal hypotheses must be equal to 1\. After assigning a likelihood score to each goal hypothesis, determine the set of goals which are tied with the highest likelihood score and provide an explanation of the reasoning process that led you to assign those likelihood scores and determine that set of most likely goals\.
The output should be the list of goals along with their likelihood scores, the set of goals which are tied with the highest likelihood score, and after that an explanation of the reasoning process that led to the final answer\. The format of the output should be exactly as the one specified below the "\#\# Output Format" section of this prompt, without any additional text or formatting\.
\#\# Domain
\{problem\.domain\}
\#\# Problem Template
\{problem\.template\}
\#\# Goal Hypotheses
\{goals\_text\}
\#\# Observed Actions \(in order\)
\{obs\_text\}
\#\# Output Format \(follow this format exactly, without any additional text or formatting\):
Hyps:
Hyp: <goal hypothesis\> \| Score: <likelihood score between 0 and 1\>
Most Likely Goals:
\- <goal hypothesis 1\>
\- <goal hypothesis 2\>
\.\.\.
Reasoning:
<explanation of the reasoning process that led to the final answer\>
Figure 2: Prompt template used for zero\-shot goal recognition\. Placeholders \{problem\.domain\}, \{problem\.template\}, \{goals\_text\}, and \{obs\_text\} are filled at runtime with the PDDL domain, problem template, goal hypotheses, and observed action sequence respectively\.
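Because the prompt demands an exact output format, responses have to be parsed and validated before they can be scored. The following is a minimal sketch of such a parser, not the authors' implementation: the function name `parse_response`, the regular expressions, and the tolerance `tol` are assumptions. The whitespace inside the section labels is matched loosely, since the exact spacing of the original template is not certain.

```python
import re

# Assumed line shape: "Hyp: <goal hypothesis> | Score: <number>"
HYP_RE = re.compile(r"^Hyp:\s*(?P<goal>.+?)\s*\|\s*Score:\s*(?P<score>\d*\.?\d+)\s*$")
HEADERS = {
    "hyps": re.compile(r"^Hyps:$"),
    "most_likely": re.compile(r"^Most\s*Likely\s*Goals:$"),
    "reasoning": re.compile(r"^Reasoning:$"),
}


def parse_response(text, tol=1e-6):
    """Parse a response into (scores, most_likely_goals, reasoning).

    Raises ValueError when the response does not follow the required format.
    """
    scores, most_likely, reasoning = {}, [], []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        header = None
        # Once inside the free-text reasoning block, stop looking for headers.
        if section != "reasoning":
            header = next((name for name, rx in HEADERS.items() if rx.match(line)), None)
        if header:
            section = header
        elif section == "hyps" and line:
            match = HYP_RE.match(line)
            if match is None:
                raise ValueError(f"malformed hypothesis line: {line!r}")
            scores[match["goal"]] = float(match["score"])
        elif section == "most_likely" and line.startswith("-"):
            most_likely.append(line[1:].strip())
        elif section == "reasoning":
            reasoning.append(raw)
    if not scores:
        raise ValueError("no hypothesis scores found")
    if abs(sum(scores.values()) - 1.0) > tol:
        raise ValueError("likelihood scores do not sum to 1")
    unknown = [goal for goal in most_likely if goal not in scores]
    if unknown:
        raise ValueError(f"selected goals not in the hypothesis list: {unknown}")
    return scores, most_likely, "\n".join(reasoning).strip()


# Illustrative well-formed response (hypothetical content).
sample = """Hyps:
Hyp: (on p a), (ontable a) | Score: 0.7
Hyp: (on a p), (ontable p) | Score: 0.3
Most Likely Goals:
- (on p a), (ontable a)
Reasoning:
The observed stack action establishes (on p a).
"""
scores, most_likely, reasoning = parse_response(sample)
print(scores, most_likely, reasoning)
```

A strict parser of this kind would reject responses such as an empty output or one that selects a goal outside the hypothesis list, which are the format failures reported for Qwen 3.5.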
## Appendix B Token Usage and Cost
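Table 3: Token usage and cost in USD \(avg\. per problem\) for LLMs on all problems\.

| Domain | 𝒪 % | \|P\| | \|𝒢\| | \|𝒪\| | Qwen3\.5 PT | Qwen3\.5 CT | Qwen3\.5 TT | GPT\-OSS PT | GPT\-OSS CT | GPT\-OSS TT | GPT\-4o PT | GPT\-4o CT | GPT\-4o TT | GPT\-4o Cost | GPT\-5\.4 PT | GPT\-5\.4 CT | GPT\-5\.4 TT | GPT\-5\.4 Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Blocks World | 10 | 246 | 20\.3 | 1\.8 | 1862\.6 | 14603\.6 | 16466\.2 | 1569\.6 | 7441\.7 | 9011\.3 | 1766\.5 | 1107\.5 | 2874\.0 | 0\.02 | 1765\.5 | 1363\.9 | 3129\.4 | 0\.02 |
| Blocks World | 30 | 246 | 20\.3 | 4\.9 | 1903\.4 | 15119\.8 | 17023\.2 | 1865\.0 | 11664\.3 | 13529\.3 | 1796\.0 | 1060\.3 | 2856\.3 | 0\.02 | 1795\.0 | 1329\.0 | 3124\.0 | 0\.02 |
| Blocks World | 50 | 246 | 20\.3 | 7\.7 | 1933\.6 | 20369\.4 | 22303\.0 | 1890\.7 | 12266\.6 | 14157\.2 | 1821\.7 | 1042\.2 | 2863\.9 | 0\.01 | 1820\.7 | 1357\.7 | 3178\.4 | 0\.02 |
| Blocks World | 70 | 246 | 20\.3 | 11\.1 | 1972\.0 | 14926\.9 | 16898\.9 | 1923\.3 | 12590\.8 | 14514\.1 | 1854\.3 | 1065\.7 | 2920\.0 | 0\.02 | 1853\.3 | 1335\.5 | 3188\.8 | 0\.02 |
| Blocks World | 100 | 92 | 20\.3 | 14\.5 | 1995\.2 | 18649\.6 | 20644\.8 | 1940\.1 | 8975\.4 | 10915\.5 | 1871\.1 | 1035\.4 | 2906\.5 | 0\.02 | 1870\.1 | 1244\.7 | 3114\.8 | 0\.02 |
| Campus | 10 | 15 | 2\.0 | 1\.0 | 2606\.9 | 10672\.1 | 13279\.1 | 2492\.5 | 2474\.7 | 4967\.2 | 2432\.5 | 350\.1 | 2782\.7 | 0\.01 | 2431\.5 | 469\.2 | 2900\.7 | 0\.01 |
| Campus | 30 | 15 | 2\.0 | 2\.0 | 2621\.7 | 10285\.3 | 12907\.0 | 2505\.7 | 2994\.6 | 5500\.3 | 2445\.7 | 373\.8 | 2819\.5 | 0\.01 | 2444\.7 | 538\.3 | 2983\.1 | 0\.01 |
| Campus | 50 | 15 | 2\.0 | 3\.0 | 2633\.9 | 12413\.7 | 15047\.5 | 2516\.9 | 2644\.5 | 5161\.4 | 2456\.9 | 376\.4 | 2833\.3 | 0\.01 | 2455\.9 | 579\.2 | 3035\.1 | 0\.01 |
| Campus | 70 | 15 | 2\.0 | 4\.5 | 2655\.6 | 10565\.6 | 13221\.2 | 2537\.1 | 2651\.4 | 5188\.5 | 2477\.1 | 367\.5 | 2844\.7 | 0\.01 | 2476\.1 | 606\.4 | 3082\.5 | 0\.02 |
| Campus | 100 | 15 | 2\.0 | 5\.4 | 2667\.3 | 11792\.2 | 14459\.5 | 2548\.9 | 2389\.1 | 4937\.9 | 2488\.9 | 432\.3 | 2921\.1 | 0\.01 | 2487\.9 | 607\.4 | 3095\.3 | 0\.02 |
| Driverlog | 10 | 84 | 7\.1 | 2\.6 | 2768\.2 | 13239\.7 | 16007\.9 | 2726\.0 | 5542\.3 | 8268\.3 | 2666\.0 | 850\.3 | 3516\.3 | 0\.02 | 2665\.0 | 1118\.9 | 3783\.9 | 0\.02 |
| Driverlog | 30 | 84 | 7\.1 | 7\.0 | 2825\.6 | 13143\.2 | 15968\.8 | 2795\.2 | 7671\.5 | 10466\.8 | 2735\.2 | 824\.5 | 3559\.7 | 0\.02 | 2734\.2 | 1161\.8 | 3896\.0 | 0\.02 |
| Driverlog | 50 | 84 | 7\.1 | 11\.2 | 2989\.7 | 17850\.1 | 20839\.8 | 2862\.6 | 7111\.2 | 9973\.8 | 2802\.6 | 803\.1 | 3605\.8 | 0\.02 | 2801\.6 | 1164\.8 | 3966\.4 | 0\.02 |
| Driverlog | 70 | 84 | 7\.1 | 15\.6 | 3068\.1 | 18628\.5 | 21696\.7 | 2933\.7 | 7235\.7 | 10169\.4 | 2873\.7 | 813\.8 | 3687\.5 | 0\.02 | 2872\.7 | 1166\.8 | 4039\.5 | 0\.02 |
| Driverlog | 100 | 28 | 7\.1 | 21\.7 | 3014\.4 | 12628\.0 | 15642\.4 | 3030\.4 | 4676\.5 | 7706\.9 | 2970\.4 | 800\.1 | 3770\.5 | 0\.02 | 2969\.4 | 1115\.2 | 4084\.6 | 0\.02 |
| DWR | 10 | 84 | 7\.3 | 5\.7 | 2219\.6 | 11832\.1 | 14051\.7 | 2158\.8 | 7508\.4 | 9667\.2 | 2098\.8 | 726\.3 | 2825\.1 | 0\.01 | 2097\.8 | 1015\.3 | 3113\.1 | 0\.02 |
| DWR | 30 | 84 | 7\.3 | 16\.0 | 2378\.7 | 23776\.6 | 26155\.2 | 2301\.2 | 10246\.7 | 12547\.9 | 2241\.2 | 694\.5 | 2935\.7 | 0\.01 | 2240\.2 | 1082\.0 | 3322\.2 | 0\.02 |
| DWR | 50 | 84 | 7\.3 | 26\.2 | 2539\.8 | 16869\.6 | 19409\.4 | 2444\.1 | 10763\.6 | 13207\.7 | 2384\.1 | 692\.0 | 3076\.1 | 0\.01 | 2383\.1 | 1096\.5 | 3479\.6 | 0\.02 |
| DWR | 70 | 84 | 7\.3 | 36\.9 | 2707\.5 | 15079\.9 | 17787\.4 | 2592\.2 | 12722\.2 | 15314\.4 | 2532\.2 | 668\.6 | 3200\.8 | 0\.01 | 2531\.2 | 1108\.4 | 3639\.5 | 0\.02 |
| DWR | 100 | 28 | 7\.3 | 51\.9 | 2945\.4 | 24986\.2 | 27931\.5 | 2801\.3 | 10169\.8 | 12971\.0 | 2741\.3 | 705\.9 | 3447\.2 | 0\.01 | 2740\.3 | 1034\.8 | 3775\.1 | 0\.02 |
| All Domains | 10 | 429 | 14\.5 | 2\.7 | 2135\.9 | 13656\.4 | 15792\.3 | 2003\.3 | 6824\.3 | 8827\.6 | 2031\.0 | 956\.0 | 2987\.0 | 0\.01 | 2030\.0 | 1216\.4 | 3246\.4 | 0\.02 |
| All Domains | 30 | 429 | 14\.5 | 7\.4 | 2202\.2 | 16258\.8 | 18460\.9 | 2155\.0 | 10301\.8 | 12456\.7 | 2089\.8 | 918\.5 | 3008\.3 | 0\.01 | 2088\.8 | 1220\.2 | 3309\.0 | 0\.02 |
| All Domains | 50 | 429 | 14\.5 | 11\.8 | 2283\.6 | 18912\.6 | 21196\.2 | 2211\.3 | 10626\.4 | 12837\.6 | 2146\.1 | 903\.5 | 3049\.6 | 0\.01 | 2145\.1 | 1241\.6 | 3386\.6 | 0\.02 |
| All Domains | 70 | 429 | 14\.5 | 16\.8 | 2354\.5 | 15529\.1 | 17883\.7 | 2273\.6 | 11220\.5 | 13494\.0 | 2208\.4 | 914\.2 | 3122\.6 | 0\.01 | 2207\.4 | 1232\.5 | 3439\.9 | 0\.02 |
| All Domains | 100 | 163 | 14\.1 | 21\.3 | 2395\.4 | 18072\.7 | 20468\.0 | 2331\.3 | 7836\.0 | 10167\.4 | 2266\.3 | 882\.9 | 3149\.1 | 0\.01 | 2265\.3 | 1127\.8 | 3393\.0 | 0\.02 |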
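As a reading aid for Table 3, the short sketch below takes the All Domains row at 10% observability and checks how the three token columns relate. It assumes that PT, CT, and TT stand for prompt, completion, and total tokens; that reading is not stated in the table itself, but it is consistent with PT + CT matching TT up to rounding in every row. The function name `completion_share` is illustrative.

```python
# (PT, CT, TT) per model, copied from the "All Domains", 10% observability row of Table 3.
# Assumption: PT = prompt tokens, CT = completion tokens, TT = total tokens.
ROW_ALL_DOMAINS_10 = {
    "Qwen3.5": (2135.9, 13656.4, 15792.3),
    "GPT-OSS": (2003.3, 6824.3, 8827.6),
    "GPT-4o": (2031.0, 956.0, 2987.0),
    "GPT-5.4": (2030.0, 1216.4, 3246.4),
}


def completion_share(pt: float, ct: float) -> float:
    """Fraction of all tokens that were generated by the model."""
    return ct / (pt + ct)


for model, (pt, ct, tt) in ROW_ALL_DOMAINS_10.items():
    # The reported total should equal prompt + completion up to rounding.
    assert abs((pt + ct) - tt) <= 0.15, model
    print(f"{model}: completion share = {completion_share(pt, ct):.2f}")
```

Under this reading, prompt sizes are nearly identical across models, so the large differences in total tokens come almost entirely from how much text each model generates.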
## Appendix C Reasoning Excerpts
The single observed action is \(PICK\-UP O\), which means that the agent’s immediate intent was to grasp the block ’O’ from the table\. This operation directly affects the precondition \(ONTABLE O\), removing it and causing the holding condition for ’O’ to be true while reducing the hand being empty status to false\. Evaluating the hypotheses, the most likely goal should involve subsequent actions that make sense in the wake of the current action, in this case removing the restrictions upon \(CLEAR M\), \(ONTABLE W\), \(ON M O\), \(ON O W\) needed for further stacking or ordering, particularly those involving ’O’\. The hypothesis \(CLEAR M\), \(ONTABLE W\), \(ON M O\), \(ON O W\) suggests that the action of picking up ’O’ from the table is a first logical step, supporting future moves to potentially create this arrangement, including placing ’M’ on ’O’ or ’W’, indicating it’s a highly plausible sequence of intent, hence scoring highest\.

Figure 3: GPT\-4o’s reasoning for a problem with 10% observability in the Blocks World domain\. The model correctly identifies the most likely goal hypothesis based on the single observed action\.

The observed action, \(PUT\-DOWN D\), directly affects the placement of block D\. However, none of the goal hypotheses explicitly mention the block D being on the table or in any specific configuration that would uniquely indicate the pursuit of such a goal given only this step so far\. Since no unique distinguishing action was observed, we assume equal likelihoods for each goal hypothesis considering there is just one observed action and no specific additional evidence favoring any single hypothesis over others\. Therefore, all hypotheses are equally probable under the conditions of the given data\.

Figure 4: GPT\-4o’s reasoning for a problem with 10% observability in the Blocks World domain\. The model reasons that the single observed action does not provide enough evidence to discriminate among the candidate goals, leading it to assign equal likelihoods to all hypotheses\. This results in a high spread and low precision, even though the correct intended goal is included in the returned set\.

The observed action is "PICK\-UP P", which indicates that the agent’s first move was to pick up block P from its initial position on the table\. This move directly affects hypotheses where block P is involved in transitioning from its initial state\. Hypotheses that reasonably involve moving P or requiring P as a part of the stack tend to have higher scores\. The action does not strongly discriminate the specific arrangement of blocks post\-initiation but highlights the immediate importance of P\. Hypotheses that do not involve P or the area directly around P were scored lower\. Consequently, the goals most compatible with this initial action involve P in a stack or clear position, suggesting various sequential possibilities, indicated by the variety of top\-scoring hypotheses\. Normalizing these based on relativity to P’s starting state resulted in the final scores\.

Figure 5: GPT\-4o’s reasoning for a problem with 10% observability in the Blocks World domain\. The model does narrow down to 8 selected hypotheses from 21 available\. However, it is wrong in this case, as the true goal does not prominently involve block P\.

The initial state and the observed action suggest that the agent’s first move was to stack block O onto block R\. This action is executed from a state where O is clear and on the table, and R is also clear\. After performing STACK O R, both blocks O and R will no longer be clear due to being part of a stack, and the agent’s hand will be empty again\.

From the goal hypotheses, any goal that includes \(ON O R\) or has an arrangement that could have O directly on R could possibly have been targeted by this action\. However, as the only observed action is STACK O R, it doesn’t provide enough information to distinctly prefer one goal over another as it could be a step towards many different configurations described in the hypotheses\.

Thus, every goal hypothesis receives an equal likelihood score, reflecting the uncertainty given the lack of more distinctive actions or evidence to suggest a particular goal over others\.

Figure 6: GPT\-4o’s reasoning for a problem with 10% observability in the Blocks World domain\. The model reasons about a possible way to select a smaller set of hypotheses but ends up selecting all of them\.

The observed actions are \(PICK\-UP A\), \(STACK A S\), and \(STACK C A\)\. The sequence of actions begins with picking up block A and placing it on block S, followed by placing block C on top of block A\. This indicates a build\-up with C at the top of the stack involving A and S\. Among the goal hypotheses, \(CLEAR C\), \(ONTABLE H\), \(ON C A\), \(ON A S\), \(ON S H\) directly aligns with this sequence of observed actions since it results in a stack of C on A, and A on S, consistent with the observed agent behavior\. While there may be other hypotheses that could eventually end up with A, S, and C in some form, none align specifically with the stacking pattern of the observed actions as closely as this hypothesis\. Other hypotheses have lower scores due to their lack of direct correlation with the specific order and purpose of the observed stack operations\.

Figure 7: GPT\-4o’s reasoning for a problem with 30% observability in the Blocks World domain\. The model correctly identifies the most likely goal hypothesis based on the observed sequence of actions, which clearly indicates a stacking pattern involving blocks A, S, and C\.

The observed actions indicate an intent to rearrange the blocks in a manner that correlates most closely with two of the hypotheses\. The actions include several unstacking and stacking operations where blocks are rearranged to achieve a very specific configuration pattern\. Notably:

1\. The initial unstacking of block B from G and subsequent stacking onto J suggests an effort to create a sub\-structure involving \(ON A J\), reflective of moving towards a goal involving stacks initiated by A/J\.
2\. Actions such as stacking K on F and G suggest an intent to form a sequence that can end with \(ON F E\), \(ON E G\), corresponding directly with hypothesis patterns where F is involved in the base transitions of the structure\.
3\. Moreover, the repeated adjustments involving D and J imply significant reordering involving these blocks, in line with the corresponding hypotheses\.
4\. Finally, observed tasks like stacking A on J directly align with a sequence that follows the thematic transition in two of the hypothesized stack orders, indicating a prioritized sequence in these constructs\.

The two hypotheses that arise as most likely are those with a focus on initial D/J/A to H projects transitioning into wider stack compositions, represented in the final, observed movements and configurations occurring drastically at the functional goal end\.

Figure 8: GPT\-4o’s reasoning for a problem with 70% observability in the Blocks World domain\. The model fails to reason on a more complex problem with a larger number of objects involved and narrows down to 2 wrong hypotheses\.

To evaluate which goal hypothesis is most likely, we examine how the sequence of actions aligns with each goal state’s required configuration\. The sequence of actions observed is such that the final goal state must have block T on block H, block A on block W, and block R on the table\.

1\. The UNSTACK T W and STACK T M operations suggest T is involved in the final configuration\.
2\. Subsequently, UNSTACK R A, PUT\-DOWN R, PICK\-UP A, and STACK A W indicate that R must be on the table while A is on W\.
3\. Finally, STACK T H places T on H, making H a supporting block, implying H must support other blocks\.

Given the executed actions, the only hypothesis producing such a state is \(CLEAR M\), \(ONTABLE W\), \(ON M O\), \(ON O W\)\. This configuration would have M and O stacked together, with M clear and W clear, and the rest of the blocks not violating any constraints from actions\. The entire configuration needed to execute the observed actions fits perfectly with this particular goal hypothesis\. Hence, this goal hypothesis is assigned a likelihood score of 1\.0 and is selected as the sole most likely goal\.

Figure 9: GPT\-4o’s reasoning for a problem with 70% observability in the Blocks World domain\. The model wrongly selects the goal hypothesis \(CLEAR M\), \(ONTABLE W\), \(ON M O\), \(ON O W\) as the most likely one, based on the observed sequence of actions\. However, the true goal does not prominently involve block M or O, leading to an incorrect prediction based on a hallucination\.

The observed actions \(UNSTACK R P\), \(PICK\-UP P\), and \(PICK\-UP R\) indicate a focus on manipulating blocks R and P\. The unstacking of R from P and then picking up these blocks suggests that the goal involves clearing the top of block R\. Among the goal hypotheses, the goal \(CLEAR R\), \(ONTABLE W\), \(ON R A\), \(ON A W\) implies such a manipulation, as it requires R to be moved, possibly involving unstacking operations and ensuring R is clear\. This hypothesis aligns with the observed actions, which seemed to clear R and reposition it, making it the most likely goal\. Other hypotheses receive lower scores due to less alignment with the observed actions, either involving different block arrangements or predicting different sequencing of block manipulations\.

Figure 10: GPT\-4o’s reasoning for a problem with 30% observability in the Blocks World domain\. The model fails to correctly identify the most likely goal due to limited observability even though there were hypotheses involving both R and P\.

The observed action is a single STACK D E\. To be applicable, the agent must have been holding block D and have block E clear\. This implies that prior to the action D was on some block and had been unstacked, while E was clear\. After the action, the resulting facts are: CLEAR D, ON D E, HANDEMPTY, and NOT CLEAR E\. For each goal hypothesis we counted how many of its required facts are satisfied in this resulting state\. The counts were:

Hypothesis 19: 3 facts satisfied \(CLEAR W, ONTABLE E, ON D E\)

All other hypotheses satisfied at most 2 facts, some only 1 or 0\.

Because the action directly contributes to satisfying ON D E \(present only in hypotheses 12 and 19\) and also contributes to CLEAR W and ONTABLE E \(present in some\), hypothesis 19 attains the highest number of satisfied facts\. We use a simple proportional likelihood model: the likelihood of each hypothesis is the number of satisfied facts divided by the total number of satisfied facts across all hypotheses, ensuring the scores sum to 1\. This gives hypothesis 19 a normalized score of 3/35 = 0\.0857, which is higher than any other hypothesis\. Hence hypothesis 19 is the most likely goal\.

Figure 11: GPT\-OSS’s reasoning for a problem with 10% observability in the Blocks World domain\. The model correctly applies its fact\-matching strategy to identify the most likely goal\.

I evaluated each goal hypothesis by checking how many of its facts are present in the state after executing the observed actions \(ignoring precondition conflicts\)\. For each goal I computed a match ratio = \(\# of matched facts\) / \(total facts in the hypothesis\)\. These ratios were then normalized so that all scores sum to 1\. The goal with the highest ratio was the one where all its required facts were satisfied in the final state: \(CLEAR A\), \(ONTABLE H\), \(ON A S\), \(ON S H\)\. Since it achieved a perfect match \(4/4\) this hypothesis received the highest likelihood score of 0\.1091, making it the single most likely goal among those provided\. The remaining hypotheses received lower scores proportionally to how many of their goal facts matched the final state\.

Figure 12: GPT\-OSS’s reasoning for a problem with 10% observability in the Blocks World domain\. The model fails to identify the correct goal because it considers an intermediate state instead of the final one, leading it to select an incorrect hypothesis\.

All 21 goal hypotheses are treated as equally admissible because the observed action sequence is inconsistent with the domain pre\-conditions, making it impossible to determine which goal the agent actually pursues\. Since none of the goals can be ruled out based on the observations, each hypothesis receives an equal likelihood of 1/21 = 0\.047619, and therefore all goals tie as the most likely\.

Figure 13: GPT\-OSS’s reasoning for a problem with 30% observability in the Blocks World domain\. The model assigns a uniform likelihood to all hypotheses due to the observed action sequence being inconsistent with the domain preconditions, resulting in a failure to discriminate among the candidate goals\.

Initial state: A is on C, D is on A, R is on P, and O, E, W are on the table and clear; hand is empty\.

Observed actions:

1\. UNSTACK A C \- applicable, agent holds A, C becomes clear\.
2\. PICK\-UP P \- not applicable immediately after action 1 in strict Blocks World semantics, because after unstacking A the hand is not empty\.

Still, the two key structural intentions revealed are:

\- the agent removes A from C, freeing A from below;
\- the agent then manipulates P and finally

3\. STACK P A \- which places P on A\.

The final observed relation ON P A is highly diagnostic\. Among all hypotheses, only hypothesis 20 contains \(ON P A\)\. It also requires \(CLEAR P\) and \(ONTABLE W\) and \(ON A W\)\. After stacking P on A, P is indeed clear, matching that goal component\. This makes hypothesis 20 uniquely aligned with the observation\.

Figure 14: GPT\-5\.4’s reasoning for a problem with 30% observability in the Blocks World domain\. The model correctly infers the agent’s goal despite the observed action sequence violating the domain preconditions, demonstrating a more robust reasoning process compared to GPT\-OSS\.

Applying the observed actions from the initial state yields the following final tower structure over the relevant blocks: \(ON J G\), \(ON H J\), \(ON E H\), \(ON A E\), \(ON I A\), \(ON F I\), \(ON D F\), \(ON B D\), \(ON C B\), with G on the table\. This exactly corresponds to the chain C\-B\-D\-F\-I\-A\-E\-H\-J\-G when read from top to bottom\.

The observations strongly show the agent systematically dismantling the initial towers and rebuilding a single tower\.\.\. Several actions are redundant undo/redo actions, such as stacking and unstacking e on g, j on e, h on j, and d on f, but they do not change the final state reached after the full observation sequence\. The model concludes uniquely that hypothesis 5 describes this final structure\.

Figure 15: GPT\-5\.4’s reasoning for a problem with full observability in the Blocks World domain\. The model fails to identify the correct goal, missing the complete ordering sequence by switching one block out of place, showcasing its difficulty in handling complex problems with many objects and long observation sequences, similar to GPT\-OSS\.

Initial state: U, T, R, K, C are clear and on the table; S is on A, A is on H, and H is on the table; hand is empty\.

Observed actions: 1\. PICK\-UP U, 2\. STACK U T, 3\. PICK\-UP C, 4\. STACK C U

These actions build the tower C on U on T, yielding a state where ON U T and ON C U are true, T is on the table, and C is clear\.

The observations match hypothesis 8 exactly: \(CLEAR C\), \(ONTABLE T\), \(ON C U\), \(ON U T\)\. After the four observed actions, every fact in that goal already holds, so this is a very strong candidate\.

Hypothesis 20 is also strongly supported because it contains the achieved substructure \(CLEAR C\), \(ONTABLE T\), \(ON C U\), but requires extending the tower to ON U R and ON R T\. This is compatible with the observation pattern: the agent may be assembling part of a larger final tower from the top downward/subtower\-first\.\.\.

Thus only hypotheses 8 and 20 receive nonzero probability\. Hypothesis 20 is assigned the highest likelihood because the observations look like a purposeful partial construction toward a larger stack rather than necessarily stopping immediately once hypothesis 8 is reached\.

Figure 16: GPT\-5\.4’s reasoning for a problem with full observability in the Blocks World domain\. The model fails to identify the correct goal, assuming that the observations are on the path to reach the selected hypotheses instead of considering one that is already achieved\.

Initial state: O, E, W are clear and on the table; R is on P and clear; D is on A, A is on C, and C is on the table; hand is empty\.

Observed action 1 is \(STACK O W\)\. For this to be applicable, the agent must already be holding O and W must be clear\. W is clear initially, so the observation strongly indicates that immediately before the observed trace the agent had picked up O\. After this action, an achieved relation is \(ON O W\)\. This fact appears only in hypotheses 5, 6, 14, and 16, so those goals are supported by the first observation\. All other hypotheses are inconsistent with the only positive structural fact created by the first observed action\.

Observed action 2 is \(UNSTACK R P\)\. This action is applicable in the initial state after action 1, because R is on P, clear, and the hand is empty after stacking O on W\. Its effect removes \(ON R P\), makes the agent hold R, and makes P clear\. This is useful when the goal needs R moved elsewhere, or when P must become clear\.

Among the goals supported by action 1:

\- Hypothesis 5: needs \(ON R O\) and \(ON O W\)\. Since O is already on W, unstacking R from P is a sensible next step toward stacking R on O\.
\- Hypothesis 6: needs \(ON R O\) and \(ON O W\)\. Same support as hypothesis 5\.
\- Hypothesis 16: needs \(ON C O\), \(ON O W\), \(ON W E\), \(ON E R\)\. This is a longer tower involving O on W, but unstacking R from P can also be relevant because R is the intended base for E\.
\- Hypothesis 14: needs \(ON P O\), \(ON O W\), \(ON W E\), \(ON E R\)\. Here the second action is especially informative: unstacking R from P directly clears P, which is necessary if P is later to be stacked onto O\. It also starts relocating R so E can be placed on R\. Thus both observations align well with this hypothesis\.

There is one additional hypothesis strongly supported by the second observation:

\- Hypothesis 20: \(CLEAR P\), \(ONTABLE W\), \(ON P A\), \(ON A W\)\. Unstacking R from P achieves \(CLEAR P\) directly, which is one of the goal facts\. Also, to build A on W and P on A, W must remain on the table and be clear enough to receive A; the earlier stack O on W is not itself a goal fact, but it may be an intermediate step that would later need to be undone, making this somewhat less cleanly aligned than the O\-on\-W goals\. Still, the explicit achievement of \(CLEAR P\) by the second observed action makes this hypothesis very plausible\.

Comparing supported hypotheses, hypothesis 20 is the strongest because the second observed action directly achieves one of its required goal facts, whereas for hypotheses 5, 6, 14, 16, and 21 the actions are mostly preparatory except that action 1 achieves \(ON O W\)\. Among those, hypotheses 5, 6, 16, and 21 receive some support, but less than hypothesis 20\. The remaining hypotheses are given zero because they are not supported by the observed key effects and often conflict with them, especially the creation of \(ON O W\) or the use of R and P in the second action\.

The scores are normalized to sum to 1, with the highest likelihood assigned uniquely to hypothesis 20\.

Figure 17: GPT\-5\.4’s reasoning for a problem with 30% observability in the Blocks World domain\. The model fails to identify the correct goal, overcommitting to a single hypothesis despite insufficient information\.

The sequence is invalid\. After \(pick\-up E\), the robot’s hand holds E, so handempty is false\. The action \(pick\-up C\) requires handempty to be true\. Therefore, you cannot pick up C while holding E\. This makes the sequence impossible to execute as described\.

Figure 18: Qwen 3\.5’s reasoning for a problem with 10% observability in the Blocks World domain\. The model identifies the precondition violation but refuses to continue from there\.

Hypothesis 1: ‘\(ONTABLE B\) \(ON S A\) \(CLEAR D\) \(ONTABLE D\) \(ON S T\)‘

Based on the provided text, there is significant confusion regarding the status of ‘\(ON S T\)‘ in Hypothesis 1 and whether ‘A‘ is ‘T‘\. The text "Wait ‘\(ON S T\)‘ is in Hypothesis 1\." suggests that Hypothesis 1 contains ‘\(ON S T\)‘\. Additionally, the text "Wait ‘\(ONTABLE D\)‘ is in Hypothesis 1\." and "If both ‘\(ONTABLE D\)‘ and ‘\(ON S T\)‘ are in Hyp 1\." implies that both are likely true\. The hypothesis ‘\(ONTABLE B\) \(ON S A\) \(CLEAR D\) \(ONTABLE D\) \(ON S T\)‘ contains both ‘\(ONTABLE D\)‘ and ‘\(ON S T\)‘\. The fact that the text explicitly states "Wait ‘\(ON S T\)‘ is in Hypothesis 1" and that this is a statement of fact within the provided logic, supports the inclusion of ‘\(ON S T\)‘ in Hypothesis 1\. The text "If ‘A‘ is ‘T‘" might be a red herring or a condition that is assumed\. Given the conflicting statements, the hypothesis that contains ‘\(ON S T\)‘ and ‘\(ONTABLE D\)‘ is Hypothesis 1, as described in the text\. Therefore, the most consistent hypothesis given the explicit "Wait" statements is Hypothesis 1\.

Hypothesis 1: ‘\(ONTABLE B\) \(ON S A\) \(CLEAR D\) \(ONTABLE D\) \(ON S T\)‘

Figure 19: Qwen 3\.5’s reasoning for a problem with 10% observability in the Blocks World domain\. The model produces unsound reasoning, writes output outside the required format, and selects a hypothesis that was not even among the provided goal hypotheses, showing a complete failure to understand the task and the provided information\.

You put x on table\.

Figure 20: Qwen 3\.5’s reasoning for a problem with 30% observability in the Blocks World domain\. The model fails completely and generates a nonsensical response\.

\[:answerA\]

Figure 21: Qwen 3\.5’s reasoning for a problem with 30% observability in the Blocks World domain\. The model fails completely and generates a nonsensical response\.

After the first two actions \(UNSTACK R P and PUT\-DOWN R\), the block R is placed on the table, and P becomes clear\. The next two actions \(PICK\-UP E and STACK E R\) result in E being stacked on R\. This establishes the state where \(ON E R\) holds and R is on the table \(ONTABLE R\)\.

Among the goal hypotheses, only Hypothesis 14 and Hypothesis 16 contain the facts \(ON E R\) and \(ONTABLE R\)\. Hypothesis 16 requires \(CLEAR C\) to be true and \(ON C O\) to hold\. However, block C is currently under block A \(since \(ON A C\) is in the initial state and A has not been moved\), implying \(CLEAR C\) is false\. Thus, Hypothesis 16 is inconsistent with the current state and requires moving C \(supported by A\), which is not implied by the actions\. Hypothesis 14 requires \(CLEAR P\) and \(ON P O\)\. \(CLEAR P\) is true after action 1\. Since Hypothesis 14 is the only hypothesis compatible with the derived stack E on R, it is the most likely goal\.

Figure 22: Qwen 3\.5’s reasoning for a problem with 50% observability in the Blocks World domain\. The model correctly identifies the most likely goal hypothesis based on the observed actions and state changes\.
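Many of the excerpts in this appendix turn on two mechanical steps: deciding whether an observed action is applicable (the precondition violations discussed in Figures 13, 14, and 18) and scoring hypotheses by how many of their facts hold in the reached state (the proportional model GPT\-OSS describes in Figure 11). The sketch below illustrates both under stated assumptions: it uses the standard four\-operator Blocks World semantics, represents facts as tuples, and falls back to a uniform distribution when no fact matches. The names `apply_action`, `run`, and `proportional_scores` are hypothetical, and this is neither the models' internal procedure nor the authors' evaluation code. The demo state and observations are taken from the excerpt shown in Figure 16.

```python
from typing import FrozenSet, Iterable, List, Optional, Sequence, Tuple

Fact = Tuple[str, ...]
State = FrozenSet[Fact]


def apply_action(state: State, action: Tuple[str, ...]) -> Optional[State]:
    """Return the successor state, or None if a precondition is violated."""
    name, *args = action
    if name == "PICK-UP":
        (x,) = args
        pre = {("CLEAR", x), ("ONTABLE", x), ("HANDEMPTY",)}
        add = {("HOLDING", x)}
    elif name == "PUT-DOWN":
        (x,) = args
        pre = {("HOLDING", x)}
        add = {("CLEAR", x), ("ONTABLE", x), ("HANDEMPTY",)}
    elif name == "STACK":
        x, y = args
        pre = {("HOLDING", x), ("CLEAR", y)}
        add = {("ON", x, y), ("CLEAR", x), ("HANDEMPTY",)}
    elif name == "UNSTACK":
        x, y = args
        pre = {("ON", x, y), ("CLEAR", x), ("HANDEMPTY",)}
        add = {("HOLDING", x), ("CLEAR", y)}
    else:
        raise ValueError(f"unknown action: {name}")
    if not pre <= state:
        return None
    # In this domain every operator deletes exactly its preconditions.
    return frozenset((state - pre) | add)


def run(state: State, actions: Iterable[Tuple[str, ...]]) -> Optional[State]:
    """Apply actions in order; None signals an inapplicable action."""
    current: Optional[State] = state
    for action in actions:
        current = apply_action(current, action)
        if current is None:
            return None
    return current


def proportional_scores(state: State, hypotheses: Sequence[Iterable[Fact]]) -> List[float]:
    """Satisfied facts per hypothesis, divided by the total over all hypotheses."""
    counts = [len(set(h) & state) for h in hypotheses]
    total = sum(counts)
    if total == 0:  # assumption: uniform fallback when nothing matches
        return [1.0 / len(hypotheses)] * len(hypotheses)
    return [c / total for c in counts]


# Demo: initial state and observations as described in the Figure 16 excerpt.
initial: State = frozenset(
    [("HANDEMPTY",), ("ON", "S", "A"), ("ON", "A", "H"), ("ONTABLE", "H")]
    + [("CLEAR", b) for b in "UTRKC"]
    + [("ONTABLE", b) for b in "UTRKC"]
)
observed = [("PICK-UP", "U"), ("STACK", "U", "T"), ("PICK-UP", "C"), ("STACK", "C", "U")]
reached = run(initial, observed)
```

Counting satisfied facts rewards hypotheses that are already achieved, which is the behaviour Figure 16 contrasts with GPT\-5\.4's preference for a larger, still incomplete tower.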
## NeurIPS Paper Checklist
1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: \[Yes\]
   - Justification: The claims made in the abstract and introduction are backed by the results in the paper, in Sections [5](https://arxiv.org/html/2605.15333#S5) and [6](https://arxiv.org/html/2605.15333#S6)\.
   - Guidelines:
     - The answer \[N/A\] means that the abstract and introduction do not include the claims made in the paper\.
     - The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations\. A \[No\] or \[N/A\] answer to this question will not be perceived well by the reviewers\.
     - The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings\.
     - It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper\.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: \[Yes\]
   - Justification: The paper discusses the limitations of the work in the conclusion section \(Section [8](https://arxiv.org/html/2605.15333#S8)\)\.
   - Guidelines:
     - The answer \[N/A\] means that the paper has no limitation while the answer \[No\] means that the paper has limitations, but those are not discussed in the paper\.
     - The authors are encouraged to create a separate “Limitations” section in their paper\.
     - The paper should point out any strong assumptions and how robust the results are to violations of these assumptions \(e\.g\., independence assumptions, noiseless settings, model well\-specification, asymptotic approximations only holding locally\)\. The authors should reflect on how these assumptions might be violated in practice and what the implications would be\.
     - The authors should reflect on the scope of the claims made, e\.g\., if the approach was only tested on a few datasets or with a few runs\. In general, empirical results often depend on implicit assumptions, which should be articulated\.
     - The authors should reflect on the factors that influence the performance of the approach\. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting\. Or a speech\-to\-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon\.
     - The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size\.
     - If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness\.
     - While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper\. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community\. Reviewers will be specifically instructed to not penalize honesty concerning limitations\.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
   - Answer: \[N/A\]
   - Justification: The paper does not include theoretical results\.
   - Guidelines:
     - The answer \[N/A\] means that the paper does not include theoretical results\.
     - All the theorems, formulas, and proofs in the paper should be numbered and cross\-referenced\.
     - All assumptions should be clearly stated or referenced in the statement of any theorems\.
     - The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition\.
     - Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material\.
     - Theorems and Lemmas that the proof relies upon should be properly referenced\.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
   - Answer: \[Yes\]
   - Justification: The paper provides detailed instructions and information necessary to reproduce the main experimental results\. The landmark\-based method used as a baseline is available in a public repository, and the prompt and models used in the LLM experiments are fully disclosed\.
   - Guidelines:
     - The answer \[N/A\] means that the paper does not include experiments\.
     - If the paper includes experiments, a \[No\] answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not\.
     - If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable\.
     - Depending on the contribution, reproducibility can be accomplished in various ways\. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model\. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model \(e\.g\., in the case of a large language model\), releasing of a model checkpoint, or other means that are appropriate to the research performed\.
     - While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution\. For example:
       - \(a\) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm\.
       - \(b\) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully\.
       - \(c\) If the contribution is a new model \(e\.g\., a large language model\), then there should either be a way to access this model for reproducing the results or a way to reproduce the model \(e\.g\., with an open\-source dataset or instructions for how to construct the dataset\)\.
       - \(d\) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility\. In the case of closed\-source models, it may be that access to the model is limited in some way \(e\.g\., to registered users\), but it should be possible for other researchers to have some path to reproducing or verifying the results\.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: \[Yes\]
   - Justification: The paper provides detailed instructions and information necessary to reproduce the main experimental results\. Moreover, the code is made available as a compressed file alongside the submission\. The code for the landmark\-based baseline approach is available in a public repository\.
   - Guidelines:
     - The answer \[N/A\] means that paper does not include experiments requiring code\.
     - While we encourage the release of code and data, we understand that this might not be possible, so \[No\] is an acceptable answer\. Papers cannot be rejected simply for not including code, unless this is central to the contribution \(e\.g\., for a new open\-source benchmark\)\.
     - The instructions should contain the exact command and environment needed to run to reproduce the results\. See the NeurIPS code and data submission guidelines \([https://neurips\.cc/public/guides/CodeSubmissionPolicy](https://neurips.cc/public/guides/CodeSubmissionPolicy)\) for more details\.
     - The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc\.
     - The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines\. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why\.
     - At submission time, to preserve anonymity, the authors should release anonymized versions \(if applicable\)\.
     - Providing as much information as possible in supplemental material \(appended to the paper\) is recommended, but including URLs to data and code is permitted\.
26. 6\.Experimental setting/details
27. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
28. Answer: \[Yes\]
29. Justification: No training is performed; the experiments use only off\-the\-shelf LLMs\. The prompt template is fully disclosed, and the benchmark dataset is disclosed and available in a public repository\.
30. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them\. - •The full details can be provided either with the code, in appendix, or as supplemental material\.
31. 7\.Experiment statistical significance
32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
33. Answer: \[No\]
34. Justification: The paper does not include multiple runs of the same experiment, so error bars are not applicable\.
35. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The authors should answer\[Yes\]if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper\. - •The factors of variability that the error bars are capturing should be clearly stated \(for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions\)\. - •The method for calculating the error bars should be explained \(closed form formula, call to a library function, bootstrap, etc\.\) - •The assumptions made should be given \(e\.g\., Normally distributed errors\)\. - •It should be clear whether the error bar is the standard deviation or the standard error of the mean\. - •It is OK to report 1\-sigma error bars, but one should state it\. The authors should preferably report a 2\-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified\. - •For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range \(e\.g\., negative error rates\)\. - •If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text\.
36. 8\.Experiments compute resources
37. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
38. Answer: \[Yes\]
39. Justification: The paper provides detailed information about the compute resources used for the experiments, including the type of CPU and GPU, memory, and operating system\.
40. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage\. - •The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute\. - •The paper should disclose whether the full research project required more compute than the experiments reported in the paper \(e\.g\., preliminary or failed experiments that didn’t make it into the paper\)\.
41. 9\.Code of ethics
42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
43. Answer: \[Yes\]
44. Justification: The work conducted in this research in no way violates the NeurIPS Code of Ethics\.
45. Guidelines: - •The answer\[N/A\]means that the authors have not reviewed the NeurIPS Code of Ethics\. - •If the authors answer\[No\], they should explain the special circumstances that require a deviation from the Code of Ethics\. - •The authors should make sure to preserve anonymity \(e\.g\., if there is a special consideration due to laws or regulations in their jurisdiction\)\.
46. 10\.Broader impacts
47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
48. Answer: \[N/A\]
49. Justification: No societal impacts are expected from the work performed in this research\.
50. Guidelines: - •The answer\[N/A\]means that there is no societal impact of the work performed\. - •If the authors answer\[N/A\]or\[No\], they should explain why their work has no societal impact or why the paper does not address societal impact\. - •Examples of negative societal impacts include potential malicious or unintended uses \(e\.g\., disinformation, generating fake profiles, surveillance\), fairness considerations \(e\.g\., deployment of technologies that could make decisions that unfairly impact specific groups\), privacy considerations, and security considerations\. - •The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments\. However, if there is a direct path to any negative applications, the authors should point it out\. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation\. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster\. - •The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from \(intentional or unintentional\) misuse of the technology\. - •If there are negative societal impacts, the authors could also discuss possible mitigation strategies \(e\.g\., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML\)\.
51. 11\.Safeguards
52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
53. Answer: \[N/A\]
54. Justification: The paper poses no such risks\. All data made available to models are benchmarking problems that are already publicly available\.
55. Guidelines: - •The answer\[N/A\]means that the paper poses no such risks\. - •Released models that have a high risk for misuse or dual\-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters\. - •Datasets that have been scraped from the Internet could pose safety risks\. The authors should describe how they avoided releasing unsafe images\. - •We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort\.
56. 12\.Licenses for existing assets
57. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
58. Answer: \[Yes\]
59. Justification: The paper explicitly mentions the models used and the companies that created them\.
60. Guidelines: - •The answer \[N/A\] means that the paper does not use existing assets\. - •The authors should cite the original paper that produced the code package or dataset\. - •The authors should state which version of the asset is used and, if possible, include a URL\. - •The name of the license \(e\.g\., CC\-BY 4\.0\) should be included for each asset\. - •For scraped data from a particular source \(e\.g\., website\), the copyright and terms of service of that source should be provided\. - •If assets are released, the license, copyright information, and terms of use in the package should be provided\. For popular datasets, [paperswithcode\.com/datasets](https://paperswithcode.com/datasets) has curated licenses for some datasets\. Their licensing guide can help determine the license of a dataset\. - •For existing datasets that are re\-packaged, both the original license and the license of the derived asset \(if it has changed\) should be provided\. - •If this information is not available online, the authors are encouraged to reach out to the asset’s creators\.
61. 13\.New assets
62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
63. Answer: \[N/A\]
64. Justification: The paper does not release new assets; it only benchmarks existing models on an existing dataset\.
65. Guidelines: - •The answer\[N/A\]means that the paper does not release new assets\. - •Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates\. This includes details about training, license, limitations, etc\. - •The paper should discuss whether and how consent was obtained from people whose asset is used\. - •At submission time, remember to anonymize your assets \(if applicable\)\. You can either create an anonymized URL or include an anonymized zip file\.
66. 14\.Crowdsourcing and research with human subjects
67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
68. Answer: \[N/A\]
69. Justification: The paper does not involve crowdsourcing nor research with human subjects\.
70. Guidelines: - •The answer\[N/A\]means that the paper does not involve crowdsourcing nor research with human subjects\. - •Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\. - •According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\.Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer: \[N/A\]
74. Justification: The paper does not involve research with human subjects\.
75. Guidelines: - •The answer\[N/A\]means that the paper does not involve crowdsourcing nor research with human subjects\. - •Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\. - •We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\. - •For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\.Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
78. Answer: \[Yes\]
79. Justification: The main goal of the paper is to benchmark LLMs in a specific problem setting, and this is disclosed throughout the paper\.
80. Guidelines: - •The answer \[N/A\] means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\. - •Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.