Spatial Reasoning via Modality Switching Between Language and Symbolic Representation

arXiv cs.AI Papers

Summary

This paper explores grounding multi-hop textual-spatial stories into geometry-aware modalities like grids, showing a 42% performance improvement when switching from language-only to grid-based reasoning, and introduces a switching metric for modality selection in LLMs.

arXiv:2606.31285v1 Announce Type: new Abstract: Human reasoning is inherently multimodal: when problems become difficult, we rarely think in words alone. We often externalize our reasoning by sketching diagrams or drawing grids to understand the underlying conceptual structure and avoid mistakes. Building on this premise, our research investigates: (a) whether grounding multi-hop textual-spatial stories into geometry-aware modalities, such as layouts or grids, improves reasoning compared to natural language-based inference; and (b) whether a model can decide when to rely on natural language reasoning and when to switch to a structured modality. We address these questions by introducing a switching metric based on trustworthiness and complexity signals, which estimates when grounding a spatial story into structure is likely to improve performance. This takes a first step toward principled modality selection in Large Language Model (LLM) reasoning. Across our settings, switching from natural language-based reasoning to a grid-based representation improves LLM performance by up to 42%, highlighting the importance of modality choice in shaping reasoning outcomes.

Cached at: 07/01/26, 05:37 AM

# Spatial Reasoning via Modality Switching Between Language and Symbolic Representation
Source: [https://arxiv.org/html/2606.31285](https://arxiv.org/html/2606.31285)
Shreya Rajpal, Tanawan Premsri, Parisa Kordjamshidi. Department of Computer Science and Engineering, Michigan State University, Michigan, USA. {rajpalsh, premsrit, kordjams}@msu.edu

###### Abstract

Human reasoning is inherently multimodal: when problems become difficult, we rarely think in words alone. We often externalize our reasoning by sketching diagrams or drawing grids to understand the underlying conceptual structure and avoid mistakes. Building on this premise, our research investigates: (a) whether grounding multi-hop textual-spatial stories into geometry-aware modalities, such as layouts or grids, improves reasoning compared to natural language-based inference; and (b) whether a model can decide when to rely on natural language reasoning and when to switch to a structured modality. We address these questions by introducing a switching metric based on trustworthiness and complexity signals, which estimates when grounding a spatial story into structure is likely to improve performance. This takes a first step toward principled modality selection in Large Language Model (LLM) reasoning. Across our settings, switching from natural language-based reasoning to a grid-based representation improves LLM performance by up to 42%, highlighting the importance of modality choice in shaping reasoning outcomes.


## 1 Introduction

Spatial reasoning is becoming increasingly important across many research domains, including autonomous driving, navigation, and robotics (Zhou et al., [2024](https://arxiv.org/html/2606.31285#bib.bib39); Zhang et al., [2024](https://arxiv.org/html/2606.31285#bib.bib37); Song et al., [2025](https://arxiv.org/html/2606.31285#bib.bib30)). However, LLMs still exhibit significant limitations in spatial reasoning, especially as spatial context becomes more complex (Liu et al., [2025a](https://arxiv.org/html/2606.31285#bib.bib13)). Such tasks require models to compose multiple relations while maintaining a consistent understanding of where entities are located relative to one another. In multi-hop settings, these relations become harder to track, leading to accumulated errors and inconsistent spatial interpretations (Shi et al., [2022](https://arxiv.org/html/2606.31285#bib.bib29); Premsri and Kordjamshidi, [2025](https://arxiv.org/html/2606.31285#bib.bib24)). Even strong LLMs struggle on benchmarks like StepGame (Shi et al., [2022](https://arxiv.org/html/2606.31285#bib.bib29)), SpartQA (Mirzaee et al., [2021](https://arxiv.org/html/2606.31285#bib.bib19)), and ReSQ (Mirzaee and Kordjamshidi, [2022](https://arxiv.org/html/2606.31285#bib.bib17)) without specialized prompting or step-by-step reasoning traces (Mirzaee and Kordjamshidi, [2023](https://arxiv.org/html/2606.31285#bib.bib18); Rizvi et al., [2024](https://arxiv.org/html/2606.31285#bib.bib27); Zhang et al., [2026](https://arxiv.org/html/2606.31285#bib.bib35)). However, these methods still rely largely on natural language reasoning, forcing models to infer geometry and layout implicitly from text.

This contrasts with human reasoning. For complex spatial descriptions, people often use sketches, diagrams, or grids to externalize the scene into a simplified structure rather than reasoning in words alone (Larkin and Simon, [1987](https://arxiv.org/html/2606.31285#bib.bib10); Rexigel et al., [2024](https://arxiv.org/html/2606.31285#bib.bib26)). This reflects the idea of schematization, which makes spatial structure easier to perceive and reason over (Talmy, [2003](https://arxiv.org/html/2606.31285#bib.bib31)). Similarly, Figure [1](https://arxiv.org/html/2606.31285#S1.F1) shows that natural language-based LLM reasoning must track relations implicitly across multiple steps, which can cause hallucinated relations or multi-hop errors (Rizvi et al., [2024](https://arxiv.org/html/2606.31285#bib.bib27)), while a grid makes the structure explicit. This motivates a central question: can explicit spatial representations serve as a more effective reasoning medium for language models than natural language alone?

![Refer to caption](https://arxiv.org/html/2606.31285v1/x1.png)

Figure 1: Natural language vs. grid-based reasoning in a multi-hop spatial setting.

To investigate this, we compare reasoning directly from natural language stories against different reasoning modalities. Here, *modality* refers to the text-based representation of a spatial story used for inference. These alternatives explicitly encode the scene’s spatial structure, including extracted relational triples and layout-based forms. (Code is available at: [https://github.com/HLR/Spatial-Modality-Switching](https://github.com/HLR/Spatial-Modality-Switching).) As illustrated in the example in Figure [1](https://arxiv.org/html/2606.31285#S1.F1), explicit structure (e.g., grid-based layouts) can offer a more effective medium for multi-hop spatial reasoning than text alone. Based on this intuition, we propose a grid-grounding framework that converts spatial stories into explicit 2D grids by encoding spatial relations between entities. The resulting grids are then fed back to the language model for multi-hop spatial question answering.

However, reasoning with structured representations is not always necessary; in some cases, natural language-based reasoning can be sufficient depending on model ability and problem complexity. To address this, we introduce a switching metric that estimates when a model should rely on natural language and when it should use structured spatial representations. The metric combines trustworthiness and complexity signals from each instance to enable principled modality selection rather than reasoning with a fixed modality.

Across multiple spatial reasoning benchmarks, our framework improves accuracy by large margins: up to 42% on StepGame (Li et al., [2024](https://arxiv.org/html/2606.31285#bib.bib11)), 8% on SpaRTUN (Mirzaee and Kordjamshidi, [2022](https://arxiv.org/html/2606.31285#bib.bib17)), and 5% on ReSQ (Mirzaee and Kordjamshidi, [2022](https://arxiv.org/html/2606.31285#bib.bib17)). These results highlight the importance of treating visualization and structured symbolic representations not just as outputs but as a reasoning medium.

In summary, our contributions are: (1) We investigate whether changing the reasoning modality from natural language to explicit spatial structure improves multi-hop spatial reasoning. We also introduce a grid-based grounding framework that converts spatial stories into structured 2D layouts for reasoning. (2) We propose an adaptive switching framework that uses trustworthiness and complexity signals to decide when a model should reason with natural language and when it should switch to a structured representation, taking a step toward principled modality selection in LLM reasoning.

## 2 Related Work

Recent studies show that language models struggle with multi-hop spatial reasoning across benchmarks such as StepGame (Shi et al., [2022](https://arxiv.org/html/2606.31285#bib.bib29); Li et al., [2024](https://arxiv.org/html/2606.31285#bib.bib11)), SpartQA (Mirzaee et al., [2021](https://arxiv.org/html/2606.31285#bib.bib19)), SpaRTUN (Mirzaee and Kordjamshidi, [2022](https://arxiv.org/html/2606.31285#bib.bib17)), ReSQ (Mirzaee and Kordjamshidi, [2022](https://arxiv.org/html/2606.31285#bib.bib17)), and SpaRP (Rizvi et al., [2024](https://arxiv.org/html/2606.31285#bib.bib27)). Even strong models such as GPT-4 remain brittle in spatial inference, often making elementary reasoning errors and failing to compose spatial relations reliably (Cohn, [2023](https://arxiv.org/html/2606.31285#bib.bib3); Cohn and Hernandez-Orallo, [2023](https://arxiv.org/html/2606.31285#bib.bib4); Rizvi et al., [2024](https://arxiv.org/html/2606.31285#bib.bib27)).

Prior work has addressed these failures through two broad directions. One line of work incorporates external symbolic or structured reasoning components to support spatial reasoning, such as combining LLM-based relation extraction with Answer Set Programming (Yang et al., [2023](https://arxiv.org/html/2606.31285#bib.bib33); Kaur et al., [2025](https://arxiv.org/html/2606.31285#bib.bib8); Wang and Sun, [2026](https://arxiv.org/html/2606.31285#bib.bib32)) or Prolog (Mirzaee and Kordjamshidi, [2023](https://arxiv.org/html/2606.31285#bib.bib18)), enforcing logical constraints during fine-tuning (Premsri and Kordjamshidi, [2025](https://arxiv.org/html/2606.31285#bib.bib24)), or using graph neural networks to model multi-step spatial dependencies (Li et al., [2023](https://arxiv.org/html/2606.31285#bib.bib12); Zhou et al., [2025](https://arxiv.org/html/2606.31285#bib.bib38)). These methods often require additional training or external solvers to improve spatial reasoning. In contrast, we study how the representation provided to the language model itself affects spatial reasoning at inference time.

A closer line of work studies whether spatial reasoning improves when intermediate reasoning is represented differently. Tree-of-Thought prompting explores multiple reasoning paths for multi-hop spatial inference (Li et al., [2024](https://arxiv.org/html/2606.31285#bib.bib11)), while Chain-of-Symbol (Hu et al., [2024](https://arxiv.org/html/2606.31285#bib.bib7)) and coordinate-based formulations (Liu et al., [2025b](https://arxiv.org/html/2606.31285#bib.bib14)) show that symbolic or quantitative representations can improve spatial reasoning. Our work aligns with this direction but compares multiple reasoning representations, including natural language, relational triples, and coordinate layouts, and introduces grid grounding as a geometry-preserving representation for LLM reasoning.

Finally, to use computationally intensive reasoning strategies more efficiently, recent work has explored adaptive inference approaches that allocate additional reasoning only when needed. Methods such as DOTS (Yue et al., [2025](https://arxiv.org/html/2606.31285#bib.bib34)), Route-to-Reason (Pan et al., [2025](https://arxiv.org/html/2606.31285#bib.bib23)), AdaptThink (Zhang et al., [2025](https://arxiv.org/html/2606.31285#bib.bib36)), Thinkless (Fang et al., [2026](https://arxiv.org/html/2606.31285#bib.bib6)), AdaCoT (Lou et al., [2025](https://arxiv.org/html/2606.31285#bib.bib16)), and RouteLLM (Ong et al., [2025](https://arxiv.org/html/2606.31285#bib.bib20)) select reasoning depth or route across models based on task difficulty, confidence, or expected utility. CodeSteer (Chen et al., [2025](https://arxiv.org/html/2606.31285#bib.bib2)) similarly guides LLMs between text and code generation for symbolic tasks. These works provide an important precedent for adaptive inference but primarily frame switching as routing between models, reasoning depths, or generation modes. In contrast, our work studies training-free switching at the level of spatial representation, using trustworthiness and complexity signals to decide when to reason in natural language and when to switch to an explicit geometry-preserving representation such as a grid.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2606.31285v1/x2.png)

Figure 2: Mapping the same story and question from Figure [1](https://arxiv.org/html/2606.31285#S1.F1) into four reasoning modalities: (i) relational triples, (ii) full grid, (iii) pruned grid, and (iv) coordinate-based representation. Each modality is used as a reasoning medium for the same question.

![Refer to caption](https://arxiv.org/html/2606.31285v1/x3.png)

Figure 3: Overview of the proposed pipeline and switching mechanism, using the same story and question input as Figure [1](https://arxiv.org/html/2606.31285#S1.F1).

##### Problem Definition:

We consider a spatial question answering task where, given a spatial narrative ($S$) and a question ($Q$), the model answers the question by inferring the spatial relations between two or more entities. The answer can be either in Yes/No form or as a multi-class label from a fixed set of spatial relations, such as *left* and *above*. This is a challenging problem because relations between entities can be implicit, and the model often needs to reason across multiple relations to infer the correct answer.

To address this problem, we hypothesize that, depending on the context, some modalities can simplify reasoning more effectively than natural language. Figure [1](https://arxiv.org/html/2606.31285#S1.F1) illustrates this with a multi-hop story where inferring the relation between $G$ and $N$ requires tracking several intermediate entities in text and can be complex, while the grid makes the relevant relation clearer. Therefore, we design a framework to analyze the input and model output and switch between text and structured modalities when needed. The overall pipeline is shown in Figure [3](https://arxiv.org/html/2606.31285#S3.F3). It has two main components: (i) assessing problem complexity and model trustworthiness, and (ii) converting the narrative into alternative reasoning modalities. The assessment module identifies cases where natural language-based reasoning is unreliable or the narrative is too complex and triggers a switch to a structured modality.

Building upon this pipeline, we examine relational triples, coordinate-based layouts, and grids as alternative representations for reasoning and study how switching from natural language-based reasoning to a suitable structured modality affects model performance. We next describe the reasoning modalities and switching policy.

### 3.1 Mapping to Multiple Reasoning Modalities

We transform each spatial narrative and question into multiple reasoning modalities, including qualitative and quantitative spatial representations. Qualitative representations capture symbolic relations, such as relative direction or containment, while quantitative representations make spatial arrangements explicit through coordinate-based layouts. Grid-based representations lie between these two by organizing entities in a discrete row-column layout. Each modality is then used independently for reasoning. Figure [2](https://arxiv.org/html/2606.31285#S3.F2) illustrates the modalities considered in our framework, discussed as follows.

1. Original Text: This is the baseline where the model reasons directly over the original story, question, and answer options in natural language form.

2. Relational Triples: For this modality, we use few-shot in-context learning to extract pairwise spatial relations from the story as triples of the form {head, relation, tail}. This process converts the narrative into a qualitative spatial representation, especially when a sentence contains multiple relations or requires coreference resolution. For example, from "a medium purple object is inside and touching block H," we extract {head: medium purple object, relation: tangential-proper-part, tail: block H}. These triples include directional relations, such as *left*, *right*, *above*, and *below*, as well as topological relations such as *disconnected*, *externally connected*, *partially overlapping*, *equal*, *tangential proper part*, *non-tangential proper part*, *tangential proper part inverse*, and *non-tangential proper part inverse*, when needed (Kordjamshidi et al., [2010](https://arxiv.org/html/2606.31285#bib.bib9)). As shown in Figure [2](https://arxiv.org/html/2606.31285#S3.F2)(i), the model receives the relational triples with the question and infers the final answer.

3. Grids: For this modality, we map extracted spatial relations into a two-dimensional grid that assigns relative positions to entities while satisfying directional and topological constraints. Topological relations are encoded with compact tags, such as #dc for disconnected. We generate the grid using a Python program with predefined relation templates that translate extracted relation triples into grid placements. The resulting *full grid* serves as a symbolic textual representation of the complete scene. We also construct the *pruned grid*, which retains only question-relevant entities identified by the model based on story and question, reducing noisy context and isolating the relations needed for the answer. Figure [2](https://arxiv.org/html/2606.31285#S3.F2)(ii-iii) shows both grid settings. We provide the model with either the full grid or the pruned grid, containing only question-relevant entities such as $N$ and $G$, together with the question for reasoning.

4. Coordinate-based Layout: This is a quantitative representation where we give the extracted spatial relations between entities as input to Prolog (Colmerauer and Roussel, [1996](https://arxiv.org/html/2606.31285#bib.bib5)). Prolog provides a numerical solution that obeys the constraints of the problem, yielding a coordinate assignment for all entities. For example, if $x_{E}$ and $y_{E}$ denote the horizontal and vertical coordinates of an entity $E$, then as shown in Figure [2](https://arxiv.org/html/2606.31285#S3.F2)(iv), right(G,S) is encoded as $x_{G}=x_{S}+1,\; y_{G}=y_{S}$, placing $G$ one unit to the right of $S$. Solving all constraints yields a relative coordinate layout, such as (S, 1, 0) and (G, 2, 0). The generated coordinates and question are then fed into the model to perform coordinate-based reasoning.
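To make the grid and coordinate modalities concrete, the following is a minimal sketch and not the authors' program: it replaces the Prolog solver with a simple breadth-first propagation of unit offsets, and the relation names, offset table, anchor choice (first tail at the origin), and function names are all assumptions made for illustration. It ignores topological tags, assumes a non-empty and connected set of triples, and does not resolve two entities landing in the same cell.

```python
from collections import deque

# Offsets (dx, dy) of the head relative to the tail; y grows upward.
# This table is an illustrative assumption, not the paper's relation templates.
OFFSETS = {
    "left": (-1, 0), "right": (1, 0), "above": (0, 1), "below": (0, -1),
    "upper-left": (-1, 1), "upper-right": (1, 1),
    "lower-left": (-1, -1), "lower-right": (1, -1),
}

def solve_layout(triples):
    """Assign integer coordinates by propagating offsets from an anchor entity."""
    neighbours = {}
    for head, relation, tail in triples:
        dx, dy = OFFSETS[relation]
        neighbours.setdefault(tail, []).append((head, dx, dy))
        neighbours.setdefault(head, []).append((tail, -dx, -dy))
    start = triples[0][2]  # anchor the first tail at the origin (assumption)
    coords = {start: (0, 0)}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        cx, cy = coords[current]
        for other, dx, dy in neighbours[current]:
            position = (cx + dx, cy + dy)
            if other not in coords:
                coords[other] = position
                queue.append(other)
            elif coords[other] != position:
                raise ValueError(f"inconsistent constraints for {other}")
    return coords

def prune(coords, keep):
    """Keep only question-relevant entities (the pruned-grid setting)."""
    return {name: pos for name, pos in coords.items() if name in keep}

def render_grid(coords, empty="."):
    """Render coordinates as text rows, top row first."""
    xs = [x for x, _ in coords.values()]
    ys = [y for _, y in coords.values()]
    cell = {pos: name for name, pos in coords.items()}
    rows = []
    for y in range(max(ys), min(ys) - 1, -1):
        row = [cell.get((x, y), empty) for x in range(min(xs), max(xs) + 1)]
        rows.append(" ".join(row))
    return "\n".join(rows)
```

For the triples `[("G", "right", "S"), ("N", "above", "G")]`, `solve_layout` places S at (0, 0), G at (1, 0), and N at (1, 1); `render_grid` then yields the two rows `. N` and `S G`. The absolute coordinates differ from the paper's example only by the choice of anchor.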

### 3.2 Switching Between Modalities

Natural language-based reasoning can be effective for some spatial narratives, but its reliability depends on the model’s spatial reasoning ability and the instance complexity (Rizvi et al., [2024](https://arxiv.org/html/2606.31285#bib.bib27); Cohn, [2023](https://arxiv.org/html/2606.31285#bib.bib3)). As shown in Figure [3](https://arxiv.org/html/2606.31285#S3.F3), complex stories may require composing several relations across intermediate entities to infer the relation between $G$ and $N$. Natural language-based reasoning must track these relations implicitly, while an effective structured representation, such as a grid, makes them explicit and supports more direct and potentially less error-prone inference.

Based on this, we guide modality choice using two factors: *trustworthiness*, which reflects the reliability of the model’s natural language-based spatial reasoning, and *complexity*, which captures the difficulty of the spatial problem. Following Shailya et al. ([2025](https://arxiv.org/html/2606.31285#bib.bib28)) and Agarwal et al. ([2024](https://arxiv.org/html/2606.31285#bib.bib1)), we decompose trustworthiness into *faithfulness* and *plausibility*, which we adapt to spatial reasoning as described next.

#### 3.2.1 Faithfulness

Faithfulness, denoted as $F$, measures whether the model’s answer is based on the supporting sentences it identifies from the spatial narrative and whether its reasoning is consistent with those sentences. For example, in the narrative "Y is positioned below L and to the right; D is upper right of L", a question about the relation of $Y$ to $L$ should rely only on the first sentence. If the model uses irrelevant evidence or derives an unsupported relation, its reasoning is not faithful. To compute $F$, we ask the model to provide supporting sentences along with its answer and evaluate them using two checks:

(i) *Sufficiency* tests whether the supporting sentences alone recover the original answer. We provide only the supporting sentences to the model and score 1 if the answer matches and 0 otherwise.

(ii) *Necessity* tests whether the stated supporting sentences are required to derive the answer. We remove them from the story and score 1 if the model responds with “don’t know,” and 0 otherwise.

$F$ is the average of sufficiency and necessity. An $F$ score of 1.0 indicates that the model consistently relies on the cited supporting sentences to produce its answer, reflecting faithful reasoning rather than arbitrary guessing.
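The two checks can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `ask` is a hypothetical callable that wraps the model and returns a normalized answer string, `UNKNOWN` is an assumed literal for the abstention response, and answers are compared by plain string equality.

```python
UNKNOWN = "don't know"  # assumed literal for the model's abstention

def faithfulness(ask, story_sentences, support_idx, question, original_answer):
    """Average of the sufficiency and necessity checks, each scored 0 or 1.

    ask(sentences, question) is a hypothetical wrapper around the model.
    support_idx holds the indices of the sentences the model cited.
    """
    cited = set(support_idx)
    support = [s for i, s in enumerate(story_sentences) if i in cited]
    remainder = [s for i, s in enumerate(story_sentences) if i not in cited]

    # Sufficiency: the cited sentences alone should recover the original answer.
    sufficiency = 1.0 if ask(support, question) == original_answer else 0.0
    # Necessity: without the cited sentences the model should abstain.
    necessity = 1.0 if ask(remainder, question) == UNKNOWN else 0.0
    return (sufficiency + necessity) / 2
```

With a stub model that answers only when the sentence about Y is present, citing that sentence gives a score of 1.0, while citing the unrelated sentence about D gives 0.0.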

#### 3.2.2 Plausibility

Plausibility, denoted as $P$, measures whether the model’s supporting sentences yield a logically consistent and stable interpretation under benign transformations. We decompose it into two factors: *paraphrase stability* and *flip consistency*.

(i) *Paraphrase Stability* ($PS$) measures whether the model preserves the same underlying spatial logic under lexical variability and linguistic difficulty. Similar to $F$, we use the supporting sentences generated by the model and create three variants using an LLM (GPT-5-mini (OpenAI, [2025a](https://arxiv.org/html/2606.31285#bib.bib21))): (a) *linguistically similar*, which paraphrases the sentence while retaining the original type of spatial expression, such as a clock-face direction (e.g., the supporting sentence "T is above I at 2 o’clock" becomes "T is at the center and I is at the 2 o’clock position"); (b) *canonical*, which rewrites relations into simplified qualitative relations from our predefined relation set (e.g., "T upper-right position to I"); and (c) *hinted*, which adds a hint to ease interpretation (e.g., hint: recognize the clockwise position here, 2 o’clock is usually upper-right from the center). We run the model on these variants and compute $PS$ as the frequency of the most common answer among the three responses. This score lies in $[0,1]$, where higher values indicate more stable reasoning across paraphrased forms.

(ii) *Flip Consistency* ($FC$) tests whether the model applies inverse spatial reasoning consistently with its original answer. Using an LLM, we generate a flipped question from the supporting sentences by reversing entity order for relation questions or negating the relation for yes/no questions. We compare the flipped answer with the expected inverse of the original response. $FC$ is 1 for a correct inverse, 0.5 for a partially correct answer in cases when the expected answer is multi-label, and 0 otherwise. Details are provided in Appendix [B](https://arxiv.org/html/2606.31285#A2).

$P$ is computed as the average of paraphrase stability and flip consistency. A $P$ score of 1.0 indicates that the model’s response remains stable under paraphrases and logically consistent under equivalent inversions.
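A small sketch of how the two plausibility components might be scored once the model's answers are available. The function names are hypothetical, and the rule for "partially correct" (any overlap with a multi-label expected answer) is an assumption, since the exact criterion is deferred to the appendix.

```python
from collections import Counter

def paraphrase_stability(answers):
    """Share of responses that agree with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def flip_consistency(flipped_labels, expected_labels):
    """Score the flipped answer against the expected inverse.

    Both arguments are sets of relation labels. Treating any overlap with a
    multi-label expected answer as "partially correct" is an assumption.
    """
    if flipped_labels == expected_labels:
        return 1.0
    if len(expected_labels) > 1 and flipped_labels & expected_labels:
        return 0.5
    return 0.0

def plausibility(ps, fc):
    """P is the average of paraphrase stability and flip consistency."""
    return (ps + fc) / 2
```

With three paraphrase variants, `paraphrase_stability` can only take the values 1/3, 2/3, or 1, so a single disagreeing variant already lowers $PS$ to about 0.67.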

#### 3.2.3 Complexity

Complexity, denoted as $C$, measures the difficulty of a given spatial reasoning instance. We quantify it using a weighted combination of seven factors that capture the linguistic and multi-hop reasoning demands of the problem. Given the story, support sentences, and question, we use an LLM (GPT-5-mini (OpenAI, [2025a](https://arxiv.org/html/2606.31285#bib.bib21))) to evaluate support sentences, identify entities, flag hard language patterns, and estimate coreference-based difficulty where applicable. Based on these, we derive seven complementary factors. (i) Support Burden ($SB$) measures the number of supporting sentences that the model identifies as required for reasoning. (ii) Chain Length ($CL$) captures the actual multi-hop depth of the instance independent of the current model. This anchors complexity to the actual problem size. (iii) Selection Difficulty ($SD$) measures the fraction of story sentences that are not part of the supporting sentences, capturing the difficulty of filtering irrelevant context faced by the model. (iv) Hard Language ($HL$) captures the difficulty of linguistically complex spatial expressions, such as clock-face directions, directional phrases, nested relations, and coreference-dependent mentions. The LLM first identifies such expressions and assigns each one a difficulty score $d_{i}\in[0,1]$. Then we compute

$$HL = 0.6\cdot\max_{i} d_{i} + 0.4\cdot\frac{1}{n}\sum_{i=1}^{n} d_{i}.$$

This weighting emphasizes the hardest expression, capturing the risk of one difficult phrase affecting the reasoning chain, while the average term accounts for overall linguistic difficulty. (v) Diagonal Burden ($DB$) measures the fraction of extracted relations that are diagonal, i.e., composed of two simultaneous axis-aligned directions like lower-left. (vi) Entity Load ($EL$) captures the burden of tracking multiple distinct entities. (vii) Coreference Difficulty ($CF$) estimates the burden of resolving entity references that require multi-step relational interpretation.

Since no instance-level complexity score is available, we estimate complexity using proxy signals. For count-based factors such as Support Burden, Chain Length, and Entity Load, we use the saturating function $\mathrm{sat}(x,c)=\frac{x}{x+c}$ to map raw counts to $[0,1]$. This normalization is needed because these factors are raw counts whose scale varies across datasets, and their difficulty contribution should increase sharply at lower values but saturate once the instance is already highly challenging. Here, $c$ is a dataset-specific reference value set to the median of the corresponding feature so that moderate counts receive intermediate scores, while larger counts approach 1 without increasing linearly. These components are further combined into a normalized scalar $C\in[0,1]$, with dataset-specific weights reported in Appendix [B](https://arxiv.org/html/2606.31285#A2).
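The Hard Language score and the saturating normalization can be sketched as below. This is a hedged illustration: the function names are hypothetical, returning 0 when no hard expression is found is an assumption (the formula is undefined for $n=0$), and the weights passed to `complexity` are placeholders, since the paper's dataset-specific weights are given only in its appendix. Dividing by the weight sum is likewise an assumed way to keep $C$ in $[0,1]$.

```python
def sat(x, c):
    """Saturating map x / (x + c) for a non-negative count x and reference c > 0."""
    return x / (x + c)

def hard_language(difficulties):
    """HL = 0.6 * max(d_i) + 0.4 * mean(d_i), with each d_i in [0, 1]."""
    if not difficulties:
        return 0.0  # assumption: no hard expressions means no HL burden
    mean = sum(difficulties) / len(difficulties)
    return 0.6 * max(difficulties) + 0.4 * mean

def complexity(factors, weights):
    """Weighted combination of factor scores in [0, 1].

    factors and weights are dicts keyed by factor name (e.g. "SB", "CL").
    Normalizing by the weight sum is an assumption made for this sketch.
    """
    total = sum(weights.values())
    return sum(weights[name] * factors[name] for name in weights) / total
```

As a worked case, a count equal to the median reference value scores `sat(3, 3) = 0.5`, and two expressions with difficulties 0.9 and 0.3 give $HL = 0.6\cdot 0.9 + 0.4\cdot 0.6 = 0.78$.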

Decision rule. The aim of the switching policy is to avoid structured reasoning when the instance is simple and the model’s natural language-based answer is reliable. Reliability is measured using trustworthiness $T=0.6F+0.4P$, where $F$ denotes faithfulness and $P$ denotes plausibility. The policy keeps natural language-based reasoning when complexity is low $(C<\tau_{c})$ and trustworthiness is high $(T>\tau_{t})$; otherwise, it switches to grid-based reasoning. The thresholds $\tau_{c}$ for complexity and $\tau_{t}$ for trustworthiness are tuned on the validation set for each dataset and model, with details in Appendix [B](https://arxiv.org/html/2606.31285#A2). For efficiency, we use short-circuit computation to avoid estimating signals that do not directly trigger the switching decision. More details are provided in Appendix [B.1](https://arxiv.org/html/2606.31285#A2.SS1).
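The decision rule can be written as a short sketch. The signals are passed as zero-argument callables so that a signal is only computed if it is needed; checking complexity before trustworthiness is an assumption about the short-circuit order, which the text above does not specify, and the function names and example thresholds are hypothetical.

```python
def trustworthiness(faithfulness, plausibility):
    """T = 0.6 * F + 0.4 * P."""
    return 0.6 * faithfulness + 0.4 * plausibility

def choose_modality(complexity_fn, trust_fn, tau_c, tau_t):
    """Return "text" only if C < tau_c and T > tau_t; otherwise "grid".

    complexity_fn and trust_fn are lazy: trust_fn is never called when the
    complexity check alone already triggers the switch (assumed ordering).
    """
    if complexity_fn() >= tau_c:
        return "grid"
    if trust_fn() <= tau_t:
        return "grid"
    return "text"
```

For example, with illustrative thresholds `tau_c=0.5` and `tau_t=0.7`, an instance with $C=0.2$, $F=1.0$, and $P=0.5$ gives $T=0.8$ and stays in natural language, while any instance with $C\geq 0.5$ switches to the grid without evaluating $T$.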

## 4 Experimental Results

Datasets: We evaluate on three textual spatial reasoning benchmarks. SpaRTUN includes synthetic spatial narratives with Yes/No (YN) and Find Relations (FR) questions (Mirzaee and Kordjamshidi, [2022](https://arxiv.org/html/2606.31285#bib.bib17)). StepGame tests controlled multi-hop FR reasoning with increasing relational depth and distractors (Li et al., [2024](https://arxiv.org/html/2606.31285#bib.bib11)). ReSQ contains YN questions over human-written real-world spatial descriptions (Mirzaee and Kordjamshidi, [2022](https://arxiv.org/html/2606.31285#bib.bib17)). Table [1](https://arxiv.org/html/2606.31285#S4.T1) summarizes the statistics of the datasets. Due to cost, we evaluate randomly sampled subsets with a fixed seed of 0, preserving the original test-set distribution.

Evaluation Metrics: We report exact-match accuracy for all datasets. For multi-label FR questions, a prediction is correct only if it matches the complete ground-truth relation set (Mirzaee and Kordjamshidi, [2022](https://arxiv.org/html/2606.31285#bib.bib17)). We report accuracy per hop-depth for StepGame. We evaluate across different model families, including LLaMA 3.1 (Llama Team, [2024](https://arxiv.org/html/2606.31285#bib.bib15)), Qwen3 (Qwen Team, [2025](https://arxiv.org/html/2606.31285#bib.bib25)), and GPT-5.1 (OpenAI, [2025b](https://arxiv.org/html/2606.31285#bib.bib22)).
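The exact-match criterion for multi-label answers and the per-hop breakdown can be sketched as follows. This is an assumed scoring helper, not the authors' evaluation script: labels are compared as unordered sets without any text normalization, and the record layout `(hop, predicted_labels, gold_labels)` is hypothetical.

```python
from collections import defaultdict

def exact_match(predicted, gold):
    """Correct only if the predicted label set equals the full gold set."""
    return set(predicted) == set(gold)

def accuracy_by_hop(records):
    """Accuracy (%) per hop depth.

    records is an iterable of (hop, predicted_labels, gold_labels) tuples;
    this layout is an assumption made for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for hop, predicted, gold in records:
        total[hop] += 1
        correct[hop] += exact_match(predicted, gold)
    return {hop: 100.0 * correct[hop] / total[hop] for hop in sorted(total)}
```

Under this criterion a prediction of `["left"]` against the gold set `["left", "above"]` counts as wrong, while label order does not matter.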

| Dataset | Eval. Samples | Test Samples | Max. Hop |
|---|---|---|---|
| SpaRTUN | 1,121 | 5,551 | 7 |
| StepGame | 5,000 | 100,000 | 10 |
| ReSQ | 610 | 610 | 2 |

Table 1: Dataset statistics for our evaluation. Eval. Samples denotes the number of test instances used in our experiments; Test Samples denotes the full dataset test-set size; and Max. Hop denotes the maximum multi-hop reasoning depth.

| Model / Setting | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA3.1-70B Text | 93 | 62 | 50 | 45 | 36 | 37 | 32 | 34 | 30 | 27 | 44.7 |
| LLaMA3.1-70B Relational Triples | 88 | 65 | 48 | 39 | 35 | 31 | 31 | 30 | 25 | 26 | 42.1 |
| LLaMA3.1-70B Coordinates Full | 85 | 79 | 80 | 76 | 74 | 68 | 71 | 68 | 69 | 67 | 73.7 |
| LLaMA3.1-70B Coordinates Pruned | 87 | 85 | 84 | 85 | 86 | 80 | 81 | 80 | 78 | 77 | 82.3 |
| LLaMA3.1-70B Full Grid | 88 | 87 | 80 | 73 | 76 | 70 | 77 | 74 | 73 | 70 | 76.7 |
| LLaMA3.1-70B Pruned Grid | 92 | 92 | 88 | 87 | 87 | 83 | 88 | 84 | 83 | 82 | 86.7 |
| LLaMA3.1-70B ToT-CoT | 87 | 80 | 81 | 83 | 78 | 75 | 79 | 76 | 75 | 72 | 78.8 |
| Qwen3-32B Text | 94.2 | 78.2 | 72.4 | 69.8 | 65.4 | 62.4 | 63.0 | 61.0 | 54.2 | 49.8 | 67.0 |
| Qwen3-32B Relational Triples | 89.2 | 77.4 | 68.6 | 63.4 | 58.2 | 53.6 | 56.2 | 51.6 | 42.8 | 45.2 | 60.6 |
| Qwen3-32B Coordinate Full | 87.6 | 80.4 | 77.6 | 71.8 | 77.8 | 72.4 | 74.4 | 71.0 | 69.6 | 73.0 | 75.6 |
| Qwen3-32B Coordinate Pruned | 85.8 | 83.6 | 82.8 | 81.2 | 84.4 | 80.4 | 82.4 | 80.6 | 76.8 | 77.8 | 81.6 |
| Qwen3-32B Full Grid | 91.4 | 87.8 | 87.8 | 84.6 | 86.0 | 82.4 | 84.0 | 82.0 | 80.6 | 79.8 | 84.6 |
| Qwen3-32B Pruned Grid | 91.8 | 90.2 | 86.8 | 83.6 | 85.6 | 81.8 | 83.6 | 80.4 | 78.4 | 78.4 | 84.1 |
| Qwen3-32B ToT-CoT | 94.6 | 83.0 | 75.2 | 66.4 | 66.2 | 62.4 | 62.4 | 56.6 | 59.2 | 54.2 | 68.0 |
| GPT-5.1 Text | 99 | 92 | 85 | 86 | 79 | 71 | 62 | 61 | 50 | 58 | 74.3 |
| GPT-5.1 Relational Triples | 100 | 93 | 77 | 72 | 57 | 55 | 61 | 49 | 42 | 39 | 64.5 |
| GPT-5.1 Coordinates Full | 94 | 92 | 86 | 86 | 88 | 89 | 90 | 80 | 87 | 78 | 87.0 |
| GPT-5.1 Coordinates Pruned | 94 | 92 | 87 | 86 | 88 | 89 | 90 | 80 | 87 | 79 | 87.2 |
| GPT-5.1 Full Grid | 92 | 92 | 88 | 85 | 86 | 86 | 92 | 85 | 89 | 83 | 87.8 |
| GPT-5.1 Pruned Grid | 92 | 92 | 88 | 85 | 86 | 88 | 92 | 85 | 89 | 84 | 88.1 |
| GPT-5.1 ToT-CoT | 98 | 95 | 92 | 91 | 87 | 89 | 86 | 77 | 75 | 70 | 86.0 |

Table 2: Accuracy (%) across $k$-hop levels on StepGame for text-, triple-, coordinate-, and grid-based representations, compared with the Tree-of-Thought Chain-of-Thought (ToT-CoT) prompting method (Li et al., [2024](https://arxiv.org/html/2606.31285#bib.bib11)). For all non-GPT-5.1 models, we evaluate $n=5{,}000$ examples, with 500 examples per hop level. For GPT-5.1, we evaluate $n=1{,}250$ examples for cost reasons, with 125 examples per hop level. Overall denotes the mean accuracy across $k=1$ to $10$. The best result is shown in bold and the second-best result is underlined within each model block.

### 4.1 Results and Discussion

RQ1\. Can mapping to grids simplify multi\-hop reasoning and improve performance?

Table [2](https://arxiv.org/html/2606.31285#S4.T2) demonstrates the results on StepGame. We observe that the grid-based representations yield the strongest overall performance, especially at higher hops, where text-only reasoning declines sharply. LLaMA3.1-70B benefits most from pruned grids, suggesting that reducing irrelevant spatial context helps non-reasoning models. Representations requiring more numerical manipulation, such as coordinate layouts, are more effective for stronger models such as Qwen3-32B and GPT-5.1 (Qwen Team, [2025](https://arxiv.org/html/2606.31285#bib.bib25); OpenAI, [2025b](https://arxiv.org/html/2606.31285#bib.bib22)). Appendix [C](https://arxiv.org/html/2606.31285#A3) further shows that this pattern holds for small language models, where grid-based representations improve overall accuracy by up to 25%.

Table [3](https://arxiv.org/html/2606.31285#S4.T3) further demonstrates that structured spatial representations improve reasoning on SpaRTUN, especially for Find Relations (FR) questions, where models must recover multiple object relations. Pruned grids also perform well on Yes/No (YN) questions, often achieving the best or second-best accuracy for each model. Reasoning models such as Qwen3-32B benefit substantially from grid-based reasoning, improving overall accuracy by up to 8%, due to more accurate relation extraction during grid construction. For GPT-5.1, text-only reasoning is strong, but pruned grids still improve performance by focusing on the most relevant spatial objects. Finally, Chain of Symbol (CoS) is also competitive with grid-based reasoning in several settings, suggesting that symbolic intermediate forms can help when questions require tracking multiple relations. Appendix [C](https://arxiv.org/html/2606.31285#A3) shows that for smaller models, grid-based reasoning remains competitive with text-based reasoning on SpaRTUN. We also analyze representation quality and grid construction behavior in Appendix [E](https://arxiv.org/html/2606.31285#A5).

Table 3: Accuracy (%) on SpaRTUN and ReSQ. SpaRTUN reports Yes/No (YN), Find Relations (FR), and overall accuracy; due to cost, we evaluate distribution-preserving subsets with $n=565$ questions for GPT-5.1 and $n=1,121$ for all other models. For ReSQ, we use the full test set. GPT-grid denotes GPT-5.1 grid construction, SM-grid denotes same-model grid construction, FG denotes full-grid, PG denotes pruned-grid, and Rel. Trip. denotes relational triples. Best and second-best results are bolded and underlined within each model block.

We evaluate ReSQ to test generalizability on realistic spatial descriptions and report results in Table [3](https://arxiv.org/html/2606.31285#S4.T3). Here, grids are constructed directly from extracted relations using LLMs. GPT-based construction performs best, while same-model construction remains competitive with the baselines, showing that open-source models can be used for grid construction with minimal accuracy loss. Since ReSQ contains only up to two hops of reasoning, the accuracy gains are smaller than those on StepGame. However, grid-based representations consistently outperform the baselines by up to 5%. Appendix [C](https://arxiv.org/html/2606.31285#A3) shows similar gains with grid-based reasoning on ReSQ for small language models. Overall, grids help most when reasoning requires composing many relations, while pruning reduces noisy context. Text-only reasoning often fails on multi-step composition, while grid-based reasoning can fail due to relation extraction or grid construction errors. Because the two modalities fail for different reasons, switching between them can be useful. We analyze relation extraction, grid construction, modality disagreement, and error types in Appendices [E](https://arxiv.org/html/2606.31285#A5) and [F](https://arxiv.org/html/2606.31285#A6).

![Refer to caption](https://arxiv.org/html/2606.31285v1/x4.png)

Figure 4: Trustworthiness and complexity predict when switching helps. Accuracy of Qwen3-32B on StepGame ($N=1000$) across binned trustworthiness scores (left) and complexity scores (right). Small $n$ on the x-axis denotes the number of items in each bin. Red lines show the switching thresholds, $\tau_t = 0.95$ and $\tau_c = 0.50$, and peach regions mark where the policy switches to grid-based reasoning.

RQ2. Is the switching policy based on trustworthiness and complexity effective?

Table [4](https://arxiv.org/html/2606.31285#S4.T4) demonstrates that across models, adaptive switching often matches or outperforms grid-only reasoning while switching to the grid modality for only a subset of instances. The oracle result provides an upper bound by selecting the correct route whenever either text-only or grid-based reasoning gives the correct answer. This shows that larger gains are possible if switching decisions can more accurately identify when each modality is useful. Figure [4](https://arxiv.org/html/2606.31285#S4.F4) further explains why trustworthiness and complexity signals help. Text-only accuracy declines in low-trust and high-complexity regions, while adaptive switching remains more stable by routing difficult cases to the grid modality. Additional analyses in Appendices [B.2.1](https://arxiv.org/html/2606.31285#A2.SS2.SSS1) and [F](https://arxiv.org/html/2606.31285#A6) show that the two signals are complementary: trustworthiness and complexity individually result in weaker routing accuracy in most settings, while the combined adaptive policy gives the strongest performance across models. Their correlations with adaptive switch accuracy also follow the expected pattern, with trust-related factors generally positive and complexity-related factors generally negative, indicating that the policy captures both unreliable answers and intrinsically difficult instances.
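To make the relationship between the compared accuracies concrete, the following minimal Python sketch computes text-only, grid-only, adaptive, and oracle accuracy from per-instance correctness flags. The record layout and the function name `route_accuracies` are illustrative assumptions, not the authors' evaluation code.

```python
def route_accuracies(records):
    """records: iterable of (text_ok, grid_ok, switched) booleans, one per instance."""
    records = list(records)
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    text = sum(t for t, _, _ in records) / n
    grid = sum(g for _, g, _ in records) / n
    # Adaptive: the grid answer is used only for instances the policy switched.
    adaptive = sum((g if s else t) for t, g, s in records) / n
    # Oracle: correct whenever at least one route is correct (upper bound).
    oracle = sum((t or g) for t, g, _ in records) / n
    return {"text": text, "grid": grid, "adaptive": adaptive, "oracle": oracle}


# Hypothetical toy data, not taken from the paper.
toy = [
    (True, True, False),   # both routes correct, policy stays on text
    (False, True, True),   # text fails, switch recovers the answer
    (True, False, True),   # unnecessary switch loses a correct text answer
    (False, False, True),  # neither route is correct
]
print(route_accuracies(toy))
```

By construction, the oracle value can never fall below the text-only, grid-only, or adaptive accuracy, which is why it serves as an upper bound for any routing policy over these two routes.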

Table 4: Accuracy and switching behavior on StepGame (SG) and SpaRTUN (ST). Switching results are reported on StepGame with $n=1000$ for LLaMA3.1-70B and Qwen3-32B, and $n=500$ for GPT-5.1 due to cost. For SpaRTUN, we use $n=891$ for LLaMA3.1-70B and Qwen3-32B, and $n=561$ for GPT-5.1, while preserving the original dataset distribution. Txt: text-only accuracy, G: grid-only accuracy, AS: adaptive switching accuracy, and Orc: oracle accuracy. S is the overall switch rate; T, C, and B denote the percentage of switched instances triggered by low trustworthiness, high complexity, and both signals, respectively.

Failure Modes. To better understand when switching helps, we analyze failures based on the original problem and model reasoning. Switching is most effective when a model fails to reason correctly in text on a multi-hop reasoning problem. In such cases, our switching mechanism can recover up to 82.4% of text-based reasoning failures. By contrast, failures involving linguistically difficult cases remain harder to recover because many residual errors arise during relation extraction in grid construction. A detailed discussion can be found in Appendix [F](https://arxiv.org/html/2606.31285#A6). Overall, these findings show that trustworthiness and complexity provide useful signals for modality switching.

RQ3. How can adaptive switching balance accuracy and computational cost?

Adaptive switching provides a practical way to trade off accuracy and cost by invoking grid-based reasoning only when text-only reasoning is unreliable or the instance is complex. Across both reasoning and non-reasoning models, switching preserves or slightly improves accuracy over fixed routes while avoiding unnecessary structured reasoning calls. The savings are most meaningful when grid construction is expensive, since the model can remain on the cheaper text path for easier or more trustworthy instances. On SpaRTUN, adaptive switching, including the switching cost itself, reduces average token cost by up to 12% relative to always using the grid pipeline while maintaining accuracy competitive with grid-only reasoning baselines. Table [4](https://arxiv.org/html/2606.31285#S4.T4) shows that, on StepGame, GPT-5.1 has the highest share of switches triggered by complexity alone among the compared models. Since complexity estimation is relatively costly for StepGame, this increases the total cost by about 11%. We provide the full token-cost breakdown, including trust-only, complexity-only, and combined switching variants, in Appendix [E.3](https://arxiv.org/html/2606.31285#A5.SS3). Overall, these results show that adaptive switching offers a useful cost–accuracy trade-off when structured reasoning is costly, especially when switch-decision triggers are more balanced across signals.
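The trade-off described above can be illustrated with a simple, assumed cost model: every instance pays for the text pass and the routing signals, and only switched instances additionally pay for the grid pipeline. The Python sketch below is not the paper's cost accounting (see Appendix E.3 for that); the function names and all token counts are hypothetical.

```python
def adaptive_cost(text_cost, signal_cost, grid_cost, switch_rate):
    """Expected tokens per instance under the assumed cost model."""
    if not 0.0 <= switch_rate <= 1.0:
        raise ValueError("switch_rate must lie in [0, 1]")
    return text_cost + signal_cost + switch_rate * grid_cost


def relative_change(text_cost, signal_cost, grid_cost, switch_rate):
    """Cost change relative to always running the grid pipeline (negative means savings)."""
    return adaptive_cost(text_cost, signal_cost, grid_cost, switch_rate) / grid_cost - 1.0


def break_even_switch_rate(text_cost, signal_cost, grid_cost):
    """Switch rate above which adaptive routing costs more than always using the grid."""
    return 1.0 - (text_cost + signal_cost) / grid_cost


# Hypothetical token counts chosen only for illustration.
for rate in (0.4, 0.6):
    print(rate, relative_change(200, 300, 1000, rate))
print(break_even_switch_rate(200, 300, 1000))
```

Under this model, expensive signals lower the break-even switch rate, which is consistent with the observation that costly complexity estimation can turn savings into overhead.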

## 5 Conclusion

We study whether grounding spatial narratives into structured representations helps language models reason more effectively. Across StepGame, SpaRTUN, and ReSQ, our grid-based grounding framework yields the largest and most consistent improvements over natural language-based reasoning, especially as reasoning depth and spatial complexity increase. We further investigate when structured grounding is most useful and build a switching policy that decides this based on model and problem instance, showing that adaptive use of structured reasoning leads to stronger performance and efficiency over baselines. Together, these findings position structured grounding as an effective medium for spatial reasoning and highlight adaptive modality selection as a key ingredient for reliable reasoning.

## Limitations

Our framework depends on reliable intermediate spatial structure construction\. Grid\-based reasoning is helpful only when relation extraction and grid construction accurately reflect the original narrative; otherwise, extraction errors can propagate and reduce downstream accuracy\. We partially mitigate this through line\-by\-line relation extraction, verification, and coreference\-aware processing, but future work could improve this component through spatial\-relation extraction fine\-tuning\.

Another limitation is that adaptive switching introduces computational overhead because trustworthiness and complexity signals must be computed before routing\. Although short\-circuit computation reduces unnecessary checks, the switching signal is not free and can sometimes offset the savings from avoiding grid construction\. Future work could develop cheaper routing signals or learned routing policies that better predict when structured reasoning will improve the answer\.

## 6 Acknowledgements

This project is partially supported by the Office of Naval Research \(ONR\) under grant N00014\-23\-1\-2417, the Michigan State University Distinguished Fellowship, and Lambda Labs\. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Office of Naval Research, Michigan State University, or Lambda Labs\.

## References

- Agarwal et al. (2024) Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. 2024. [Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models](https://arxiv.org/abs/2402.04614). *Preprint*, arXiv:2402.04614.
- Chen et al. (2025) Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, and Chuchu Fan. 2025. [Codesteer: Symbolic-augmented language models via code/text guidance](https://openreview.net/forum?id=ezna4V4zHs). In *Forty-second International Conference on Machine Learning*.
- Cohn (2023) Anthony G Cohn. 2023. [An evaluation of chatgpt-4’s qualitative spatial reasoning capabilities in rcc-8](https://arxiv.org/abs/2309.15577). *Preprint*, arXiv:2309.15577.
- Cohn and Hernandez-Orallo (2023) Anthony G Cohn and Jose Hernandez-Orallo. 2023. [Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of llms](https://arxiv.org/abs/2304.11164). *Preprint*, arXiv:2304.11164.
- Colmerauer and Roussel (1996) Alain Colmerauer and Philippe Roussel. 1996. [*The birth of Prolog*](https://doi.org/10.1145/234286.1057820), page 331–367. Association for Computing Machinery, New York, NY, USA.
- Fang et al. (2026) Gongfan Fang, Xinyin Ma, and Xinchao Wang. 2026. [Thinkless: LLM learns when to think](https://openreview.net/forum?id=ariVQf0KZx). In *The Thirty-ninth Annual Conference on Neural Information Processing Systems*.
- Hu et al. (2024) Hanxu Hu, Hongyuan Lu, Huajian Zhang, Yunze Song, Wai Lam, and Yue Zhang. 2024. [Chain-of-symbol prompting for spatial reasoning in large language models](https://openreview.net/forum?id=Hvq9RtSoHG). In *First Conference on Language Modeling*.
- Kaur et al. (2025) Navdeep Kaur, Lachlan McPheat, Alessandra Russo, Anthony G Cohn, and Pranava Madhyastha. 2025. [An empirical study of conformal prediction in llm with asp scaffolds for robust reasoning](https://arxiv.org/abs/2503.05439). *Preprint*, arXiv:2503.05439.
- Kordjamshidi et al. (2010) Parisa Kordjamshidi, Martijn Van Otterlo, and Marie-Francine Moens. 2010. [Spatial role labeling: Task definition and annotation scheme](https://aclanthology.org/L10-1584/). In *Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10)*, Valletta, Malta. European Language Resources Association (ELRA).
- Larkin and Simon (1987) Jill H. Larkin and Herbert A. Simon. 1987. [Why a diagram is (sometimes) worth ten thousand words](https://doi.org/10.1111/j.1551-6708.1987.tb00863.x). *Cognitive Science*, 11(1):65–100.
- Li et al. (2024) Fangjun Li, David C. Hogg, and Anthony G. Cohn. 2024. [Advancing spatial reasoning in large language models: an in-depth evaluation and enhancement using the stepgame benchmark](https://doi.org/10.1609/aaai.v38i17.29811). In *Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence*, AAAI’24/IAAI’24/EAAI’24. AAAI Press.
- Li et al. (2023) Shuaiyi Li, Yang Deng, and Wai Lam. 2023. [DepWiGNN: A depth-wise graph neural network for multi-hop spatial reasoning in text](https://doi.org/10.18653/v1/2023.findings-emnlp.428). In *Findings of the Association for Computational Linguistics: EMNLP 2023*, pages 6459–6471, Singapore. Association for Computational Linguistics.
- Liu et al. (2025a) Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. 2025a. [Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods](https://arxiv.org/abs/2511.15722). *Preprint*, arXiv:2511.15722.
- Liu et al. (2025b) Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, Helong Huang, Guangjian Tian, Weichao Qiu, Xingyue Quan, Jianye Hao, and Yuzheng Zhuang. 2025b. [Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning](https://arxiv.org/abs/2501.10074). *Preprint*, arXiv:2501.10074.
- Llama Team (2024) Llama Team. 2024. [The llama 3 herd of models](https://arxiv.org/abs/2407.21783). *Preprint*, arXiv:2407.21783.
- Lou et al. (2025) Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qingping Yang, and Shuangzhi Wu. 2025. [Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning](https://arxiv.org/abs/2505.11896). *Preprint*, arXiv:2505.11896.
- Mirzaee and Kordjamshidi (2022) Roshanak Mirzaee and Parisa Kordjamshidi. 2022. [Transfer learning with synthetic corpora for spatial role labeling and reasoning](https://doi.org/10.18653/v1/2022.emnlp-main.413). In *Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pages 6148–6165, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Mirzaee and Kordjamshidi (2023) Roshanak Mirzaee and Parisa Kordjamshidi. 2023. [Disentangling extraction and reasoning in multi-hop spatial reasoning](https://doi.org/10.18653/v1/2023.findings-emnlp.221). In *Findings of the Association for Computational Linguistics: EMNLP 2023*, pages 3379–3397, Singapore. Association for Computational Linguistics.
- Mirzaee et al. (2021) Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. 2021. [SPARTQA: A textual question answering benchmark for spatial reasoning](https://doi.org/10.18653/v1/2021.naacl-main.364). In *Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 4582–4598, Online. Association for Computational Linguistics.
- Ong et al. (2025) Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. 2025. [RouteLLM: Learning to route LLMs from preference data](https://openreview.net/forum?id=8sSqNntaMr). In *The Thirteenth International Conference on Learning Representations*.
- OpenAI (2025a) OpenAI. 2025a. [GPT-5 Model Documentation](https://platform.openai.com/docs/models/gpt-5). Accessed: 2025-12-26.
- OpenAI (2025b) OpenAI. 2025b. [GPT-5.1 Model Documentation](https://platform.openai.com/docs/models/gpt-5.1). Accessed: 2025-12-26.
- Pan et al. (2025) Zhihong Pan, Kai Zhang, Yuze Zhao, and Yupeng Han. 2025. [Route to reason: Adaptive routing for llm and reasoning strategy selection](https://arxiv.org/abs/2505.19435). *Preprint*, arXiv:2505.19435.
- Premsri and Kordjamshidi (2025) Tanawan Premsri and Parisa Kordjamshidi. 2025. [Neuro-symbolic training for reasoning over spatial language](https://doi.org/10.18653/v1/2025.findings-naacl.128). In *Findings of the Association for Computational Linguistics: NAACL 2025*, page 2395–2414. Association for Computational Linguistics.
- Qwen Team (2025) Qwen Team. 2025. [Qwen3 technical report](https://arxiv.org/abs/2505.09388). *Preprint*, arXiv:2505.09388.
- Rexigel et al. (2024) Eva Rexigel, Jochen Kuhn, Sebastian Becker-Genschow, and Sarah Malone. 2024. [The more the better? a systematic review and meta-analysis of the benefits of more than two external representations in stem education](https://doi.org/10.1007/s10648-024-09958-y). *Educational Psychology Review*, 36.
- Rizvi et al. (2024) Md Imbesat Rizvi, Xiaodan Zhu, and Iryna Gurevych. 2024. [SpaRC and SpaRP: Spatial reasoning characterization and path generation for understanding spatial reasoning capability of large language models](https://doi.org/10.18653/v1/2024.acl-long.261). In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 4750–4767, Bangkok, Thailand. Association for Computational Linguistics.
- Shailya et al. (2025) Krithi Shailya, Shreya Rajpal, Gokul S Krishnan, and Balaraman Ravindran. 2025. [Lext: Towards evaluating trustworthiness of natural language explanations](https://doi.org/10.1145/3715275.3732104). In *Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency*, FAccT ’25, page 1565–1587. ACM.
- Shi et al. (2022) Zhengxiang Shi, Qiang Zhang, and Aldo Lipani. 2022. [Stepgame: A new benchmark for robust multi-hop spatial reasoning in texts](https://doi.org/10.1609/aaai.v36i10.21383). In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 36, pages 11321–11329.
- Song et al. (2025) Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. 2025. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. Oral Presentation.
- Talmy (2003) L. Talmy. 2003. [*Toward a Cognitive Semantics: Volume 1: Concept Structuring Systems and Volume 2: Typology and Process in Concept Structuring*](https://books.google.com/books?id=g7IoanNUNksC). Number v. 2 in A Bradford book. MIT Press.
- Wang and Sun (2026) Rong Wang and Kun Sun. 2026. [DSPy-based neural-symbolic pipeline to enhance spatial reasoning in LLMs](https://doi.org/10.1016/j.neunet.2025.108022). *Neural Networks: The Official Journal of the International Neural Network Society*, 193:108022.
- Yang et al. (2023) Zhun Yang, Adam Ishay, and Joohyung Lee. 2023. [Coupling large language models with logic programming for robust and general reasoning from text](https://doi.org/10.18653/v1/2023.findings-acl.321). In *Findings of the Association for Computational Linguistics: ACL 2023*, pages 5186–5219, Toronto, Canada. Association for Computational Linguistics.
- Yue et al. (2025) Murong Yue, Wenlin Yao, Haitao Mi, Dian Yu, Ziyu Yao, and Dong Yu. 2025. [DOTS: Learning to reason dynamically in LLMs via optimal reasoning trajectories search](https://openreview.net/forum?id=tn2mjzjSyR). In *The Thirteenth International Conference on Learning Representations*.
- Zhang et al. (2026) Ge Zhang, Mohammad Ali Alomrani, Hongjian Gu, Jiaming Zhou, Yaochen Hu, Bin Wang, Qun Liu, Mark Coates, Yingxue Zhang, and Jianye HAO. 2026. [Extracting and following paths for robust relational reasoning with large language models](https://openreview.net/forum?id=EbELaNKmZK). *Transactions on Machine Learning Research*. Expert Certification.
- Zhang et al. (2025) Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. 2025. [AdaptThink: Reasoning models can learn when to think](https://doi.org/10.18653/v1/2025.emnlp-main.184). In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 3716–3730, Suzhou, China. Association for Computational Linguistics.
- Zhang et al. (2024) Mike Zhang, Kaixian Qu, Vaishakh Patil, Cesar Cadena, and Marco Hutter. 2024. [Tag map: A text-based map for spatial reasoning and navigation with large language models](https://openreview.net/forum?id=eU5E0oTtpS). In *8th Annual Conference on Robot Learning*.
- Zhou et al. (2025) Jiaming Zhou, Abbas Ghaddar, Ge Zhang, Liheng Ma, Yaochen Hu, Soumyasundar Pal, Bin Wang, Jianye HAO, Mark Coates, and Yingxue Zhang. 2025. [Enhancing logical reasoning in large language models through graph-based synthetic data](https://openreview.net/forum?id=Kqp4325eXm). In *The First Workshop on the Application of LLM Explainability to Reasoning and Planning*.
- Zhou et al. (2024) Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Žagar, Walter Zimmer, Hu Cao, and Alois Knoll. 2024. [Vision language models in autonomous driving: A survey and outlook](https://doi.org/10.1109/TIV.2024.3402136). *IEEE Transactions on Intelligent Vehicles*, PP:1–20.

## Appendix A Experimental Setup

All local experiments were conducted on a single NVIDIA GH200 GPU with 96GB memory. Unless otherwise stated, local inference used greedy decoding with temperature 0.0. GPT-5.1 (OpenAI, [2025b](https://arxiv.org/html/2606.31285#bib.bib22)) was accessed through the OpenAI Responses API with reasoning effort set to none. The switching thresholds are selected on the validation split and then fixed for test evaluation. To ensure reproducibility, all sampled subsets use a fixed random seed of 0, and the same evaluated instances are used across comparable settings. We use the Qwen3 family (Qwen Team, [2025](https://arxiv.org/html/2606.31285#bib.bib25)), including Qwen3-8B, Qwen3-14B, and Qwen3-32B, with temperature 0.7 to enable their explicit reasoning mode and LLaMA3.1 (Llama Team, [2024](https://arxiv.org/html/2606.31285#bib.bib15)), including LLaMA3.1-8B and LLaMA3.1-70B, with temperature 0.0. All prompts and evaluation scripts were identical across experimental conditions to ensure fair comparison between reasoning modalities; these are available in the code for reproducibility.

## Appendix B Switching Implementation

This section details how the adaptive switching policy is implemented, including the construction of the trustworthiness and complexity signals, the dataset\-specific weighting of complexity factors, and the threshold\-selection procedure\. We also describe the short\-circuit computation used to reduce switching overhead and analyze how the signals align with text\-only accuracy\.

Trust Signal. We define trustworthiness as $T = 0.6F + 0.4P$, giving slightly higher weight to faithfulness than plausibility. This choice reflects the role of the two signals in the switching decision. Faithfulness directly checks whether the model’s answer is grounded in the story evidence, so it should dominate the trust score: if the answer is not supported by the relevant spatial facts, then a high plausibility or stability score may only indicate that the model is consistently confident in an unsupported answer. Plausibility is still important because it tests whether the answer remains stable under benign transformations, but we treat it as a secondary signal that can reinforce, rather than override, evidence grounding. Empirically, we use a conservative gap between the two weights: changing either weight by $0.1$ would make the score closer to an equal weighting, while a larger gap would make plausibility too weak to affect borderline cases. Thus, the $0.6/0.4$ split encodes a minimal preference for faithfulness while still allowing plausibility to influence the final trust decision.
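As a small illustration of the weighting above, the Python sketch below evaluates $T = 0.6F + 0.4P$ for two contrasting cases. The function name `trust_score` and the assumption that $F$, like $P$, lies in $[0, 1]$ are ours; how $F$ and $P$ themselves are computed is not shown here.

```python
def trust_score(faithfulness, plausibility, w_f=0.6, w_p=0.4):
    """Weighted trustworthiness T = w_f * F + w_p * P (F, P assumed to lie in [0, 1])."""
    for name, value in (("faithfulness", faithfulness), ("plausibility", plausibility)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return w_f * faithfulness + w_p * plausibility


# Hypothetical scores: a stable but weakly grounded answer versus
# a well grounded but less stable one.
stable_but_ungrounded = trust_score(faithfulness=0.5, plausibility=1.0)
grounded_but_unstable = trust_score(faithfulness=1.0, plausibility=0.5)
print(stable_but_ungrounded, grounded_but_unstable)
```

With these weights, the grounded answer receives the higher score, matching the stated intent that plausibility reinforces rather than overrides evidence grounding.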

Dataset-specific $C$ configuration. The components of $C$ are selectively activated and weighted based on the dominant difficulty sources in each dataset. Table [5](https://arxiv.org/html/2606.31285#A2.T5) reports the weight configuration used in our runs. For StepGame, the weights emphasize structural multi-hop difficulty: support burden, chain length, selection difficulty, diagonal burden, and entity load are all active because the dataset contains controlled relation chains with distractor sentences and explicit directional composition. Selection Difficulty (SD) is therefore used for StepGame to capture how hard it is to identify the relevant supporting chain among distractor relations.

For SpaRTUN, the weights emphasize linguistic and grounding difficulty rather than distractor\-chain selection\. SpaRTUN contains richer spatial language, logical quantifiers, mixed topological and directional relations, and entity references that can require coreference\-style interpretation\. Therefore, hard language \(HL\) and coreference difficulty \(CF\) receive the largest weights, while support burden, chain length, and entity load are retained with smaller weights\. Selection Difficulty \(SD\) is not used for SpaRTUN because the main difficulty is less about selecting a hidden support chain from controlled distractors and more about interpreting the linguistic and relational structure of the story\.

Table 5: $C$ weight configurations per dataset. SB = Support Burden, SD = Selection Difficulty, CL = Chain Length, HL = Hard Language, DB = Diagonal Burden, EL = Entity Load, and CF = Coreference Difficulty.

Diagonal Burden (DB) is not considered for SpaRTUN because its relation inventory is based on RCC8 and directional primitives rather than explicit diagonal labels such as lower-left or upper-right. Coreference Difficulty (CF) is not considered for StepGame because entities are explicitly named and do not introduce reference ambiguity.

Saturating function caps. SB, CL, and EL use $\text{sat}(x, c) = x/(x + c)$, where $c$ is the half-saturation point, i.e., the value of $x$ at which the function equals $0.5$. For SB and CL on StepGame we set $c=5$, as difficulty increases sharply up to 5 hops and exhibits diminishing returns beyond that. For SpaRTUN, where SB uses the number of identified support sentences rather than $k_{\text{hop}}$, we set $c=3$ to reflect shorter support chains. For EL we set $c=6$ for StepGame and $c=10$ for SpaRTUN, aligned with the median entity count in each dataset.
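The saturating map can be sketched in a few lines of Python. The `CAPS` dictionary collects only the half-saturation constants stated above; its layout and the function name `sat` are illustrative choices.

```python
def sat(x, c):
    """Saturating map x / (x + c); equals 0.5 when x == c and approaches 1 as x grows."""
    if x < 0 or c <= 0:
        raise ValueError("x must be non-negative and c must be positive")
    return x / (x + c)


# Half-saturation constants reported in the text (dictionary layout is an assumption).
CAPS = {
    "StepGame": {"SB": 5, "CL": 5, "EL": 6},
    "SpaRTUN": {"SB": 3, "EL": 10},
}

# Chain-length contribution on StepGame for a few hop counts.
for hops in (1, 5, 10):
    print(hops, round(sat(hops, CAPS["StepGame"]["CL"]), 3))
```

Because the map is bounded by 1, very long chains or large entity counts cannot dominate the complexity score on their own.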

Threshold selection. For each instance, the switching rule keeps natural language-based reasoning only when the text answer is sufficiently trustworthy and the instance is not too complex, i.e., $T \geq \tau_t$ and $C < \tau_c$; otherwise, it switches to grid-based reasoning. We tune the thresholds $(\tau_t, \tau_c)$ on a held-out validation split and report results on a disjoint test split. The validation split is stratified by question type and hop count where applicable, so the threshold search preserves the dataset distribution.
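A minimal Python sketch of this rule follows. It also labels the switch source using the T/C/B convention of Table 4; the function name `decide` and the return format are illustrative assumptions.

```python
def decide(trust, complexity, tau_t, tau_c):
    """Return (route, source): stay on text only if T >= tau_t and C < tau_c.

    source is None for the text route, otherwise "T" (low trust only),
    "C" (high complexity only), or "B" (both signals).
    """
    low_trust = trust < tau_t
    high_complexity = complexity >= tau_c
    if not low_trust and not high_complexity:
        return "text", None
    if low_trust and high_complexity:
        return "grid", "B"
    return "grid", "T" if low_trust else "C"


# Thresholds from Figure 4 (Qwen3-32B on StepGame); the scores are hypothetical.
print(decide(trust=0.97, complexity=0.30, tau_t=0.95, tau_c=0.50))
print(decide(trust=0.80, complexity=0.30, tau_t=0.95, tau_c=0.50))
print(decide(trust=0.97, complexity=0.50, tau_t=0.95, tau_c=0.50))
```

Note the asymmetric boundaries: an instance with $T = \tau_t$ stays on text, whereas an instance with $C = \tau_c$ is switched.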

We perform a grid search over $\tau_t, \tau_c \in \{0.05, 0.10, \ldots, 0.95\}$. For each candidate pair, we compute adaptive accuracy, switch rate, and the distribution of switch sources: switches caused only by low trustworthiness, switches caused only by high complexity, and switches caused by both. Our selection follows four criteria. First, we prefer candidates whose adaptive accuracy matches or exceeds the grid-only validation accuracy; if none satisfy this, we keep the top-accuracy candidates. Second, we prefer thresholds where all three switch sources contribute meaningfully, requiring at least $10\%$ of switches from each of the trust-only, complexity-only, and both-signal triggers; if no candidate satisfies this balance condition, we skip this criterion. Third, among the remaining candidates, we prefer a lower switch rate to reduce unnecessary grid usage. Finally, ties are broken by higher adaptive accuracy, then lower switch rate, and then smaller $\tau_t + \tau_c$. This procedure favors thresholds that are accurate, efficient, and use both signals rather than allowing either trustworthiness or complexity to become vestigial.
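The selection procedure can be sketched as follows. This is one possible reading of the four criteria, not the authors' script: it assumes validation records of the form `(T, C, text_correct, grid_correct)`, interprets "top-accuracy candidates" as those tied at the maximum accuracy, and folds the final ordering into a single sort key.

```python
from itertools import product


def evaluate(records, tau_t, tau_c):
    """Adaptive accuracy, switch rate, and switch-source shares for one threshold pair."""
    correct = 0
    sources = {"trust": 0, "complexity": 0, "both": 0}
    for t, c, text_ok, grid_ok in records:
        low_trust, high_cx = t < tau_t, c >= tau_c
        if low_trust or high_cx:
            key = "both" if (low_trust and high_cx) else ("trust" if low_trust else "complexity")
            sources[key] += 1
            correct += grid_ok
        else:
            correct += text_ok
    n_switch = sum(sources.values())
    shares = {k: (v / n_switch if n_switch else 0.0) for k, v in sources.items()}
    return {"tau_t": tau_t, "tau_c": tau_c, "acc": correct / len(records),
            "switch_rate": n_switch / len(records), "shares": shares}


def select_thresholds(records, min_share=0.10):
    grid_vals = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    cands = [evaluate(records, tt, tc) for tt, tc in product(grid_vals, grid_vals)]
    grid_acc = sum(r[3] for r in records) / len(records)
    # Criterion 1: match or beat grid-only accuracy, else fall back to the best accuracy.
    pool = [c for c in cands if c["acc"] >= grid_acc]
    if not pool:
        best = max(c["acc"] for c in cands)
        pool = [c for c in cands if c["acc"] == best]
    # Criterion 2: every switch source contributes at least min_share (skipped if impossible).
    balanced = [c for c in pool if min(c["shares"].values()) >= min_share]
    if balanced:
        pool = balanced
    # Criteria 3 and 4: lower switch rate, then higher accuracy, then smaller tau_t + tau_c.
    return min(pool, key=lambda c: (c["switch_rate"], -c["acc"], c["tau_t"] + c["tau_c"]))


# Hypothetical validation records: (T, C, text_correct, grid_correct).
records = [(0.9, 0.2, True, False), (0.9, 0.2, True, True),
           (0.3, 0.2, False, True), (0.9, 0.8, False, True), (0.3, 0.8, False, True)]
chosen = select_thresholds(records)
print(chosen["tau_t"], chosen["tau_c"], chosen["acc"], chosen["switch_rate"])
```

On real validation data the records would come from running both routes on every validation instance, which is why threshold tuning is done once and then fixed for test evaluation.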

The thresholds used for the main model settings are reported together with the routing-cost analysis in Table [13](https://arxiv.org/html/2606.31285#A5.T13). In additional smaller-model experiments, we use $(\tau_t, \tau_c) = (0.85, 0.45)$ for Qwen3-8B and Qwen3-14B, and $(0.75, 0.30)$ for LLaMA3.1-8B. The lower complexity threshold for LLaMA3.1-8B reflects its lower text-only robustness on moderately complex instances. Across exploratory runs, we observed that $(0.85, 0.50)$ provides a reasonable default across datasets and models, and we suggest using it as the starting point for new datasets or extended experiments.

### B.1 Short-Circuit Computation for Switching

To reduce the cost of adaptive switching, we avoid computing all trustworthiness and complexity signals when an earlier signal is sufficient to determine the route. We first check trustworthiness because low trust directly indicates that the model’s text-based answer is unreliable, making additional complexity estimation unnecessary for deciding whether to switch. Since trustworthiness is defined as $T = 0.6F + 0.4P$ and $P \in [0, 1]$, the faithfulness score $F$ provides an upper bound on the final trustworthiness value. After computing $F$, the maximum possible trustworthiness is $0.6F + 0.4$. If this upper bound is below the trust threshold $\tau_t$, the instance is already untrustworthy and is switched to grid-based reasoning without computing $P$ or $C$.

If trust cannot be ruled out from $F$ alone, we compute complexity $C$. When $C \geq \tau_c$, the instance is considered complex and is switched to grid-based reasoning without computing $P$. We compute plausibility $P$ only when the instance is not already switched by low faithfulness or high complexity. In that case, the final trustworthiness score $T = 0.6F + 0.4P$ is computed, and the policy keeps natural-language reasoning only if $T \geq \tau_t$ and $C < \tau_c$; otherwise, it switches to grid-based reasoning.
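The short-circuit order described above can be sketched in Python with lazily evaluated signals. The three callables stand in for the (potentially expensive) faithfulness, complexity, and plausibility computations; their names and the returned reason strings are hypothetical.

```python
def route_short_circuit(compute_f, compute_c, compute_p, tau_t, tau_c):
    """Route one instance, computing each signal only when it can still change the decision."""
    f = compute_f()
    # Upper bound on T is reached at P = 1; if even that is below tau_t, switch now.
    if 0.6 * f + 0.4 < tau_t:
        return "grid", "low faithfulness"
    c = compute_c()
    if c >= tau_c:
        return "grid", "high complexity"
    p = compute_p()
    t = 0.6 * f + 0.4 * p
    if t >= tau_t:
        return "text", "trusted and simple"
    return "grid", "low trust"


# Hypothetical signal values; the log records which signals were actually computed.
log = []


def signal(name, value):
    def compute():
        log.append(name)
        return value
    return compute


result = route_short_circuit(signal("F", 0.5), signal("C", 0.1), signal("P", 1.0),
                             tau_t=0.85, tau_c=0.50)
print(result, log)
```

In this example only the faithfulness signal is evaluated, since $0.6 \cdot 0.5 + 0.4 = 0.7$ already falls below the threshold of $0.85$.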

### B.2 Switching Analysis

#### B.2.1 Alignment Between Trustworthiness, Complexity, and Text Accuracy

Table [6](https://arxiv.org/html/2606.31285#A2.T6) reports the correlation between each switching factor and text-only correctness on StepGame and SpaRTUN. This analysis tests whether the proposed signals identify when natural language-based reasoning is likely to succeed or fail. Since the switching policy first decides whether the text answer can be trusted, trustworthiness factors should correlate positively with text correctness, while complexity factors should correlate negatively.
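As an illustration of this kind of analysis, the sketch below correlates a per-instance factor score with binary text-only correctness. The text does not state which correlation coefficient is used; the Pearson coefficient (equivalent to the point-biserial correlation when one variable is binary) is an assumption here, and the toy values are made up purely for demonstration.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; with binary ys this is the point-biserial correlation."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant variable
    return cov / sqrt(var_x * var_y)


# Toy data (illustrative only): 1 = text-only answer correct, 0 = incorrect.
text_correct = [1, 1, 0, 0, 1, 0]
faithfulness = [0.9, 0.8, 0.3, 0.2, 0.7, 0.4]  # a trustworthiness factor
chain_length = [1, 2, 5, 6, 2, 4]              # a complexity factor

r_faithfulness = pearson(faithfulness, text_correct)  # expected sign: positive
r_chain_length = pearson(chain_length, text_correct)  # expected sign: negative
```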

On StepGame, the alignment is strong and consistent across models. Trustworthiness has a clear positive correlation with text correctness, especially through faithfulness $F$ and its sufficiency component. This indicates that when the model’s answer can be recovered from its cited support, the text answer is much more likely to be correct. Plausibility $P$, including paraphrase stability and flip consistency, is also positive, showing that stable and logically consistent answers tend to be more reliable. Complexity shows the opposite trend: support burden, chain length, hard language, selection difficulty, and entity load are generally negatively correlated with text correctness. This is expected because StepGame is designed around controlled multi-hop spatial reasoning, where longer chains, distractor sentences, and heavier support requirements make text-only reasoning harder. Diagonal burden is weaker for some models, but still generally negative.

SpaRTUN shows a more linguistically driven pattern. Trustworthiness remains positive, especially for LLaMA3.1-70B and GPT-5.1, where faithfulness and sufficiency are strongly aligned with text correctness. The necessity component is negative across models, which is desirable: when the model can still answer after the relevant support is removed, the original text answer is less likely to be reliable. Complexity is also negative overall, but the strongest factors differ from StepGame. Chain length is less dominant, while hard language and coreference difficulty are stronger predictors of text failure. This reflects SpaRTUN’s richer linguistic structure, where errors often come from resolving entity references, quantifiers, and mixed topological/directional descriptions rather than only from hop depth.

Overall, the two datasets validate different parts of the switching signal\. StepGame supports the structural complexity factors, because difficulty is mainly controlled by multi\-hop depth, support burden, and distractor burden\. SpaRTUN supports the linguistic and grounding factors, because correctness depends more on evidence grounding, hard language, and coreference resolution\. These trends justify using both trustworthiness and complexity: trustworthiness captures whether the text answer is faithful and stable, while complexity captures whether the instance itself is likely to make text\-only reasoning unreliable\.

Table 6: Correlation between switching factors and text-only correctness. Text acc. reports exact-match accuracy. Trustworthiness factors include faithfulness $F$, sufficiency $S$, necessity $N$, plausibility $P$, paraphrase stability $PS$, and flip consistency $FC$. Complexity factors include support burden $SB$, chain length $CL$, hard language $HL$, diagonal burden $DB$, selection difficulty $SD$, entity load $EL$, and coreference difficulty $CF$. Positive correlations indicate factors associated with higher text correctness, while negative correlations indicate factors associated with text-only failure.

## Appendix C Modalities in Small Language Models

Behavior of Small Language Models\.

Table [7](https://arxiv.org/html/2606.31285#A3.T7) compares smaller open-source models on StepGame across text, relational triples, coordinates, grids, and ToT-CoT prompting. The results demonstrate that structured spatial representations become more useful as hop length increases. Text-only reasoning is competitive for short chains, especially at one hop, but drops as the model must compose multiple relations for reasoning. Relational triples make the extracted relations explicit but still require internal symbolic composition. In contrast, grids externalize the spatial layout, making them the most stable representation overall. Pruned grids further reduce irrelevant entities and keep the reasoning context compact, while full grids preserve the full scene but may force smaller models to reason over unnecessary entities. This matters less for stronger reasoning models: Qwen models show similar full-grid and pruned-grid performance, whereas LLaMA3.1-8B benefits more clearly from pruning. Coordinate representations also help, but their reliance on positional comparison and numerical or symbolic manipulation makes them less consistently reliable than grids.

Table [8](https://arxiv.org/html/2606.31285#A3.T8) extends the analysis to SpaRTUN, which includes both Yes/No (YN) and Find-Relation (FR) questions. Here, grid-based methods are not always dominant because they depend on the quality of upstream relation extraction. For LLaMA3.1-8B, CoS performs best, suggesting that very weak models may benefit more from guided symbolic reasoning than from grids built from noisy extracted relations. For Qwen3-8B, performance varies by question type: CoS is strongest on YN questions, relational triples perform best on FR and overall accuracy, while pruned grids remain competitive. For Qwen3-14B, pruned grids perform best overall and are especially strong on FR questions, improving overall accuracy from 64.90% to 72.97%. This suggests that compact structured layouts become more useful when the model can extract relations more reliably.

To test generalization beyond StepGame and SpaRTUN, we also evaluate ReSQ, where grids are constructed using GPT\-5\.1 from relations extracted by each model\. Since ReSQ contains shorter Yes/No spatial reasoning chains, gains from grid grounding are smaller than on StepGame but still model\-dependent\. For LLaMA3\.1\-8B, text\-only reasoning remains strongest\. In contrast, Qwen3\-8B and Qwen3\-14B benefit from grid grounding, with full or pruned grids giving the best accuracy on ReSQ\. This suggests that when relation extraction is reliable enough, grid\-based layouts help focus the model on the spatial evidence relevant to the question\. Relational triples show less consistent gains, making them useful in some cases but less reliable than grids for ReSQ\.

Table 7: Accuracy (%) across k-hop levels on StepGame for text-, relation-triple-, grid-, coordinate-, and ToT-CoT-based representations. Each k-hop column is computed over 500 samples, and Overall is computed over 5k samples. Best results are bolded and second-best results are underlined within each model block.

Table 8: Accuracy (%) on SpaRTUN and ReSQ. SpaRTUN reports Yes/No (YN), Find Relations (FR), and overall accuracy. ReSQ reports overall Yes/No accuracy. For ReSQ, grids are constructed using GPT-5.1 and then used by the QA model. FG and PG denote full-grid and pruned-grid reasoning, respectively; Rel. Trip. denotes relational-triple reasoning. Best and second-best results are bolded and underlined within each model block.

## Appendix D Switching on Small Language Models

Table 9: Accuracy and switching behavior on StepGame. Results are reported on the StepGame test set with 1000 instances per hop. Txt denotes text-only accuracy, G denotes grid-only accuracy, AS denotes adaptive switching accuracy, and Orc denotes oracle accuracy. S is the overall switch rate; T, C, and B denote the percentage of switched instances triggered by low trustworthiness, high complexity, and both signals, respectively.

Table [9](https://arxiv.org/html/2606.31285#A4.T9) shows that adaptive switching remains close to or improves over grid-only performance for small open-source models. This indicates that the switching policy can preserve the benefit of structured reasoning while still selecting text when it is reliable enough. The switch rate also reflects model strength. Weaker models such as LLaMA3.1-8B switch most often, with a 94.0% switch rate, suggesting that their text-only answers are rarely reliable enough for multi-hop spatial reasoning. In contrast, Qwen3-14B switches less often at 65.6% and achieves the best adaptive accuracy. This suggests that stronger reasoning models can rely on text-based reasoning more selectively while still benefiting from grid-based reasoning on harder or less trustworthy instances. The remaining gap to oracle accuracy shows that better modality selection could further improve performance, especially in deciding when text is sufficient and when grid grounding is necessary.

## Appendix E Ablation Study

We conduct ablation studies to isolate the contribution and reliability of each component in our structured reasoning pipeline\. First, we evaluate ground\-truth intermediate representations to estimate the upper bound of relational triples, full grids, and pruned grids when upstream extraction noise is removed\. Second, we analyze construction reliability by separating errors from relation extraction and errors introduced during grid construction\. Third, we evaluate the spatial relation extraction module itself, comparing the baseline extraction setting with our sentence\-level, context\-aware extraction pipeline\. Finally, we analyze adaptive routing cost and signal\-specific contribution by comparing trust\-only, complexity\-only, and full switching policies\. Together, these ablations show where performance gains come from, when grid\-based reasoning is limited by upstream extraction, and how switching balances accuracy with computational cost\.

### E.1 Ground-Truth Representation Upper Bound

To isolate upstream relation extraction errors, we evaluate reasoning directly on ground-truth intermediate representations in Table [10](https://arxiv.org/html/2606.31285#A5.T10). Specifically, we compare reasoning over ground-truth relational triples, ground-truth full grids, and ground-truth pruned grids. Ground-truth full grids are constructed directly from ground-truth relational triples using our grid-construction pipeline, while pruning is performed model-wise before applying the same construction code. We use model-wise pruning rather than dataset-provided reasoning annotations because the annotations are Prolog-based and answer-oriented, and do not translate directly into the entity set needed for grid construction. This is especially important for Find-Relations multi-label questions, where pruning must retain the queried entities and enough surrounding context to preserve the answer. Model-wise pruning is therefore more consistent with the deployed pipeline, remains dataset-agnostic, and attributes pruning errors to the same model used for reasoning. Ground-truth triples are passed directly to the model without any spatial layout construction. This analysis measures the upper bound of each representation when extraction noise is removed.

Table 10: Ground-truth upper bound analysis. Rel. Trip., FG, and PG denote ground-truth relational triples, full grids, and pruned grids, respectively.

Ground-truth grids substantially outperform relational triples on both datasets, especially where deeper multi-hop spatial reasoning benefits from explicit structured layouts. This suggests that even perfect qualitative relational triples are often insufficient on their own and that models benefit from representations that combine qualitative relations with an explicit quantitative spatial structure.

For Qwen3-32B, pruning remains nearly identical to full grids on both SpaRTUN and StepGame, likely because the model already performs strongly on structured reasoning tasks, consistent with observations in the Qwen3 technical report (Qwen Team, [2025](https://arxiv.org/html/2606.31285#bib.bib25)). On SpaRTUN, pruning itself is also a harder task because many Yes/No questions contain quantifiers while Find Relation questions often require complete block-level reasoning. As a result, the pruned and full grids produce identical answers for 91.6% of examples overall on a subset of $n = 561$ samples (50% of our test set while maintaining the same split proportions), with agreement reaching 95.8% for YN questions and 86.7% for FR questions. This suggests that much of the useful spatial structure in SpaRTUN must often be preserved even after pruning.

### E.2 Construction Reliability

Since the grid\-based route depends on the quality of the intermediate structure, we separately analyze the reliability of relation extraction and grid construction\. This section reports how accurately spatial relations are extracted from text and how reliably those relations are converted into valid grid representations for downstream reasoning\.

Table 11: Reliability of relation extraction and grid construction. Relation Extraction evaluates how many predicted spatial relations match the gold relation set. Gold Rel. with Grid evaluates grid encoding fidelity when the grid is constructed from gold relations, isolating the grid construction step from upstream extraction errors. Pred. Rel. with Grid evaluates the full predicted pipeline, where extracted relations are first generated by the model and then encoded into a grid.

#### E.2.1 Relation Extraction and Grid Construction Fidelity

Table [11](https://arxiv.org/html/2606.31285#A5.T11) evaluates two sources of error in the structured pipeline: relation extraction and grid construction. Relation extraction is evaluated as a relation-level matching problem, measuring how many predicted spatial triples match the gold triples. Grid construction is different: it evaluates whether a complete grid layout faithfully encodes the target relation graph. In this setting, recall measures how many intended relations are preserved in the grid, while precision can decrease if the concrete placement of objects induces additional spatial relations that were not explicitly specified in the ground-truth relations.

This distinction matters most for SpaRTUN. Unlike StepGame, which mainly contains controlled directional relations, SpaRTUN includes directional, topological, distance-based, and RCC-style relations. Encoding all of these constraints into a single discrete row-column grid is a harder geometric realization problem and can become computationally difficult in the presence of mixed spatial constraints. Even when the ground-truth relations are used, the grid achieves very high recall but slightly lower precision, because assigning each entity a concrete grid position can introduce extra directional cues between pairs that are not explicitly constrained in the original relation graph. Thus, the grid should be viewed as a compact reasoning interface rather than a perfect reconstruction of every spatial relation.
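The precision/recall asymmetry described above can be illustrated with a small sketch. It assumes directional relations only, hypothetical canonical relation names such as `upper-left`, and a gold set that is closed under inversion before matching; the exact matching protocol behind Table 11 is not specified here, so this is an approximation of the idea rather than the evaluation code.

```python
from itertools import permutations

# Hypothetical canonical relation names and their inverses.
INVERSE = {
    "above": "below", "below": "above", "left": "right", "right": "left",
    "upper-left": "lower-right", "lower-right": "upper-left",
    "upper-right": "lower-left", "lower-left": "upper-right",
}


def direction(a, b):
    """Direction of cell a relative to cell b; cells are (row, col), row 1 on top."""
    vertical = "upper" if a[0] < b[0] else "lower" if a[0] > b[0] else ""
    horizontal = "left" if a[1] < b[1] else "right" if a[1] > b[1] else ""
    if vertical and horizontal:
        return f"{vertical}-{horizontal}"
    if vertical:
        return "above" if vertical == "upper" else "below"
    return horizontal or "same-cell"


def induced_relations(grid):
    """Every directional relation implied by the concrete placement."""
    return {(h, direction(grid[h], grid[t]), t) for h, t in permutations(grid, 2)}


def grid_fidelity(grid, gold):
    """Return (precision, recall) of grid-induced relations against gold triples."""
    gold_closed = set(gold) | {(t, INVERSE[r], h) for h, r, t in gold}
    induced = induced_relations(grid)
    hits = induced & gold_closed
    precision = len(hits) / len(induced) if induced else 0.0
    recall = len(hits) / len(gold_closed) if gold_closed else 0.0
    return precision, recall


# Toy example: the gold graph never relates A and C, but the placement does.
toy_grid = {"A": (1, 1), "B": (1, 2), "C": (2, 2)}
toy_gold = [("A", "left", "B"), ("B", "above", "C")]
toy_precision, toy_recall = grid_fidelity(toy_grid, toy_gold)
```

In the toy case every gold relation is preserved, so recall is 1.0, while the placement also implies that A is upper-left of C (and the inverse), which lowers precision to 4/6.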

The ground-truth relations results show that the grid construction procedure itself is reliable: it is perfect on StepGame and preserves almost all gold relations on SpaRTUN. When predicted relations are used, performance decreases, especially on SpaRTUN, showing that most remaining errors come from upstream relation extraction and from the difficulty of realizing richer spatial graphs in a single grid. This supports our design choice: the goal is not to solve exact geometric realization for every spatial relation in a story but to test whether a structured grid representation can still provide a useful intermediate reasoning interface for LLMs.

#### E.2.2 Spatial Relation Extraction

Grid construction relies heavily on extracted spatial relations. Therefore, we evaluate relation extraction as a preprocessing step. The task converts each spatial story into triples of the form (head, relation, tail), which are then passed to the grid-construction module.

Our extraction pipeline processes the story sentence by sentence rather than extracting all relations in one pass. For StepGame, we use a line-by-line extraction module with in-context learning, which helps the model focus on local directional relations. For SpaRTUN, we use the same sentence-level extraction followed by a coreference-aware refinement pass, where each line is reprocessed using the previous two lines and the already extracted relations as context. This is important for SpaRTUN because its stories contain richer language, pronouns, and entity references that require coreference resolution for better relation extraction. To assess whether this decomposition is necessary, we compare our pipeline against a baseline that gives the model the full story at once and asks it to extract all spatial relations in a single call. The baseline requires the model to resolve relation extraction, entity grounding, and coreference jointly, which leads to missed or noisy triples. In contrast, our sentence-level extraction with context-aware refinement decomposes the task into smaller decisions and improves extraction accuracy by about 20% on both datasets. This improvement is important because errors in relation extraction directly propagate to grid construction and downstream reasoning.
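The control flow of the two-pass extraction can be sketched as below. Only the orchestration is shown: `extract_fn` and `refine_fn` are hypothetical stand-ins for the model calls, and the toy versions used here (a whitespace parser and a naive pronoun rule) exist only to make the sketch runnable.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def extract_relations(
    sentences: List[str],
    extract_fn: Callable[[str], List[Triple]],
    refine_fn: Callable[[str, List[Triple], List[str], List[Triple]], List[Triple]],
    window: int = 2,
) -> List[Triple]:
    """Sentence-level extraction followed by a context-aware refinement pass."""
    # Pass 1: each sentence is processed on its own.
    drafts = [extract_fn(sentence) for sentence in sentences]

    # Pass 2: each line is reprocessed with the previous `window` lines and
    # the relations accepted so far as context.
    relations: List[Triple] = []
    for i, sentence in enumerate(sentences):
        context = sentences[max(0, i - window):i]
        for triple in refine_fn(sentence, drafts[i], context, list(relations)):
            if triple not in relations:
                relations.append(triple)
    return relations


def toy_extract(sentence: str) -> List[Triple]:
    # Toy parser for sentences shaped like "<head> is <relation> [of] <tail>."
    words = sentence.rstrip(".").split()
    return [(words[0], words[2], words[-1])]


def toy_refine(sentence, draft, context, so_far):
    # Toy coreference rule: "It" refers to the head of the latest relation.
    resolved = []
    for head, rel, tail in draft:
        if head == "It" and so_far:
            head = so_far[-1][0]
        resolved.append((head, rel, tail))
    return resolved


story = ["Box is left of lamp.", "It is above rug."]
triples = extract_relations(story, toy_extract, toy_refine)
```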

Table 12: Spatial relation extraction accuracy before grid construction. Baseline denotes extracting relations directly from the story, while Rel. Extr. Pipeline denotes our sentence-level relation extraction pipeline.

### E.3 Cost Analysis

Table [13](https://arxiv.org/html/2606.31285#A5.T13) reports the average token cost per record for adaptive routing across StepGame and SpaRTUN. We compare the fixed text-only route, the fixed grid pipeline, the cost of computing each switching signal, and the final adaptive route. The switch total includes both the switching-signal cost and the cost of the selected reasoning path.

The main observation is that adaptive switching changes the cost-accuracy tradeoff differently across datasets. On StepGame, adaptive switching improves or preserves accuracy relative to the fixed grid route, while token cost varies by model: it decreases slightly for Qwen3-32B, remains close for LLaMA3.1-70B, and increases for GPT-5.1. This mixed pattern indicates that, for StepGame, the switching signal is useful mainly as a routing and accuracy mechanism rather than as a guaranteed token-saving mechanism. Since the grid pipeline is already highly effective on this dataset, the additional cost of computing the switch decision can offset the savings from routing some examples through the text path.

SpaRTUN shows a different pattern. Here, adaptive switching reduces average token cost relative to always using the grid pipeline across all three models, while keeping accuracy competitive with the stronger fixed route. This suggests that the routing policy is more cost-effective when the dataset contains a mixture of examples, some that benefit from structured reasoning and others that can remain on the cheaper text-only path. The reduction is largest for GPT-5.1; however, because GPT-5.1 also has lower costs across most fixed and adaptive rows, we interpret this as partly a model-specific efficiency effect rather than only an effect of routing.

The single-signal rows help explain why the combined policy is preferable. In these ablations, one signal is disabled while the other remains active: trust-only routing zeroes out the complexity signal and uses the trust threshold $\tau_t$, while complexity-only routing zeroes out the trustworthiness signal and uses the complexity threshold $\tau_c$. Each signal provides useful but incomplete information. Trustworthiness captures whether the model’s current text answer appears grounded and stable, while complexity captures factors that make the instance difficult before or during reasoning. As shown in Table [6](https://arxiv.org/html/2606.31285#A2.T6), these factors correlate with text-only correctness in the expected directions: trust-related factors are mostly positive, while complexity-related factors are mostly negative. This suggests that the signals reflect different aspects of model behavior, including cases where the model can overcome apparent difficulty and cases where perceived difficulty corresponds to actual failure.

Together, the two signals are stronger in terms of accuracy because they solve complementary cases\. Trust\-only routing can miss examples that look reliable but are structurally difficult, while complexity\-only routing can over\-switch examples that look difficult but are still answered correctly in text\. The full adaptive policy combines evidence reliability with instance difficulty, giving a more stable accuracy\-cost tradeoff than either signal alone\. Overall, adaptive switching should be interpreted as a helpful token\-reduction method in some settings and as a mechanism for allocating expensive structured reasoning to examples where the model is more likely to need it\.

Table 13: Average token cost and routing performance for adaptive switching. Text reports the average token cost of text-only reasoning. Grid pipeline reports the average token cost of the full grid-based reasoning route. Trust signal only and Complexity signal only report the average cost of computing only the trustworthiness or complexity signal, respectively; their corresponding accuracy rows report the final routing accuracy when the policy uses only that signal. Switch signal reports the average token cost of computing the full adaptive switching decision using both trustworthiness and complexity, with early stopping for the plausibility computation when the faithfulness score is already sufficient to determine the routing decision. Switch total reports the total average token cost after adding the switching signal cost to the selected reasoning route. Switch rate is the percentage of examples routed to the grid pipeline. Adaptive accuracy is the final exact-match accuracy of the full switching policy. Thresholds $(\tau_t, \tau_c)$ are tuned on the validation split and evaluated on the test split.

## Appendix F Error Analysis

This appendix analyzes the main failure modes behind text\-only reasoning, grid\-based reasoning, and adaptive switching\. We focus on three diagnostic views: the error taxonomy, disagreements between text and grid routes, and how often switching recovers different categories of text\-only failures\.

### F.1 Error Taxonomy

We use error analysis as a diagnostic tool to understand when switching from text-only reasoning to grid-based reasoning helps, and when it can still fail. This analysis is separate from the larger switching experiments reported in the main results. It is conducted on a diagnostic subset of the StepGame test set for which we inspect model traces, generated grids, and gold grids.

We classify errors at two levels. Input-level labels describe structural properties of the story that make an instance difficult independent of the model’s behavior. Output-level labels describe the specific failure mode exhibited by the model. Since a single example may involve both a difficult input structure and an incorrect model behavior, labels are not mutually exclusive.

Input-level labels. *Composite spatial* marks stories that contain diagonal or composite directional relations, such as upper-right or clock-face expressions like “at the 7 o’clock position.” These cases require decomposing a relation across two spatial axes. *Multi-hop* marks stories where the question cannot be answered from a single stated relation and instead requires chaining two or more relations across sentences. For example, the model may need to invert “H is right of Z” into “Z is left of H” and then compose it with another relation such as “H is upper-right of G.”

Table 14: Agreement and disagreement between text-only and grid-based reasoning on the diagnostic StepGame subset. N is the number of evaluated instances. Text ✓ Grid × counts cases where text-only reasoning is correct but grid-based reasoning is incorrect. Grid ✓ Text × counts cases where grid-based reasoning is correct but text-only reasoning is incorrect. Both ✓ counts cases where both routes are correct. Both × counts cases where both routes fail. The large number of Grid ✓ Text × cases shows that grids often recover text-only failures, while the non-zero Text ✓ Grid × column shows that grids can also introduce errors. This motivates adaptive switching rather than always using the grid.

Output-level labels. For text-only failures, we assign one or more of four labels using a GPT-based classifier that observes the story, question, gold answer, and sentence-level reasoning trace. *Multi-hop reasoning failure* is assigned when the model has access to the necessary relations but fails to compose them correctly. *Composite failure* is assigned when the gold answer is diagonal but the model collapses it to a single axis, for example predicting below when the gold answer is lower-left. *Linguistic difficulty* is assigned when the model misinterprets non-canonical spatial language, including clock-face references, informal directional phrases, or unusual prepositions. *Hallucination* is assigned when the reasoning trace introduces a relation that is not grounded in the story.

For grid failures, we use a parallel classifier that also observes the generated grid and the gold-standard grid. We attribute each grid failure to either relation-extraction failure, where the grid is incorrectly constructed because the extracted relations do not match the gold relations, or grid-reasoning failure, where the grid is correctly constructed but the model misreads or fails to reason well over the layout.

### F.2 Reasoning Modality Disagreement

Before analyzing individual error types, we first compare whether text-only and grid-based reasoning succeed on the same subset of examples. This helps separate two effects: cases where the grid recovers an error made by text-only reasoning and cases where the grid introduces a new error even though text-only reasoning was correct. This disagreement analysis is computed on the same diagnostic StepGame subset and is intended to explain the behavior of the switching policy rather than replace the larger-scale switching results in the main experiments.

Table [14](https://arxiv.org/html/2606.31285#A6.T14) shows that grid-based reasoning frequently corrects text-only failures across models. This effect is especially visible for smaller or weaker models, where Grid ✓ Text × is much larger than Text ✓ Grid ×. However, the grid is not universally beneficial. In some cases, text-only reasoning gives the correct answer while the grid route fails, either because relation extraction produces an imperfect grid or because the model misreads the generated layout. These disagreements support the need for a selective switching policy rather than a fixed choice of reasoning modality.
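The four agreement cells can be tallied from per-instance correctness flags, as in the sketch below. The function name, the key names, and the toy flags are illustrative and not taken from the paper.

```python
from typing import Dict, Sequence


def agreement_counts(
    text_correct: Sequence[bool], grid_correct: Sequence[bool]
) -> Dict[str, int]:
    """Tally the four agreement cells between the text-only and grid routes."""
    if len(text_correct) != len(grid_correct):
        raise ValueError("both routes must be evaluated on the same instances")
    counts = {
        "text_ok_grid_fail": 0,  # Text correct, Grid incorrect
        "grid_ok_text_fail": 0,  # Grid correct, Text incorrect
        "both_ok": 0,
        "both_fail": 0,
    }
    for t, g in zip(text_correct, grid_correct):
        if t and g:
            counts["both_ok"] += 1
        elif t:
            counts["text_ok_grid_fail"] += 1
        elif g:
            counts["grid_ok_text_fail"] += 1
        else:
            counts["both_fail"] += 1
    return counts


# Toy flags for five instances (illustrative only).
toy_counts = agreement_counts(
    [True, False, False, True, False],
    [True, True, True, False, False],
)
```

The `grid_ok_text_fail` cell counts the instances an ideal switching policy should route to the grid, and `text_ok_grid_fail` counts those it should keep on the text path.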

Table 15: Diagnostic error-based switching analysis for text-only failures on the StepGame subset for stronger models. Each model block reports the number of evaluated instances, the number of text-only failures analyzed, and the final adaptive-switch accuracy on this subset. Level indicates whether the label describes an input-level story difficulty or an output-level model failure. Err./Diff. Type gives the specific difficulty or failure type. %Fail is the percentage of text-only failures assigned that label. Sw. and No Sw. are the percentages of labeled cases that the policy switches or does not switch to the grid route. Rec. is the percentage of switched cases that are recovered after switching. Grid Fail is the percentage of switched cases that remain incorrect after switching. RE Fail is the percentage of remaining grid failures caused by relation-extraction errors. Grid Rsn. is the percentage of remaining grid failures caused by grid-reasoning errors after grid construction. Dashes indicate that no cases fall into that category or that the corresponding failure type was not observed. Labels are not mutually exclusive, so percentages should not be summed across rows.

Table 16: Diagnostic error-based switching analysis for text-only failures on the StepGame subset for smaller/open models. Each model block reports the number of evaluated instances, the number of text-only failures analyzed, and the final adaptive-switch accuracy on this subset. Level indicates whether the label describes an input-level story difficulty or an output-level model failure. Err./Diff. Type gives the specific difficulty or failure type. %Fail is the percentage of text-only failures assigned that label. Sw. and No Sw. are the percentages of labeled cases that the policy switches or does not switch to the grid route. Rec. is the percentage of switched cases that are recovered after switching. Grid Fail is the percentage of switched cases that remain incorrect after switching. RE Fail is the percentage of remaining grid failures caused by relation-extraction errors. Grid Rsn. is the percentage of remaining grid failures caused by grid-reasoning errors after grid construction. Dashes indicate that no cases fall into that category or that the corresponding failure type was not observed. Labels are not mutually exclusive, so percentages should not be summed across rows.

### F.3 Diagnostic Switching Analysis

We next analyze how the switching policy behaves on text-only failures in the diagnostic StepGame subset. Tables [15](https://arxiv.org/html/2606.31285#A6.T15) and [16](https://arxiv.org/html/2606.31285#A6.T16) report which text-only failures are routed to the grid, which switched cases are recovered, and which failures remain after grid-based reasoning. We split the analysis into stronger models and smaller/open models for readability, while keeping the same error categories and metrics across both tables. For this analysis, we randomly sample 250 instances from the switching test set. We use this subset to keep the evaluation computationally tractable, as computing these diagnostics requires LLM-based analysis with models such as GPT.

Tables [15](https://arxiv.org/html/2606.31285#A6.T15) and [16](https://arxiv.org/html/2606.31285#A6.T16) show that multi-hop reasoning failures are the most consistently recoverable category once the policy routes an instance to the grid. Among stronger models, recovery for multi-hop failures is 80.6% for GPT-5.1, 82.4% for Qwen3-32B, and 83.6% for LLaMA3.1-70B. The same trend appears for the smaller/open models, with 82.3% recovery for Qwen3-14B, 83.1% for Qwen3-8B, and 82.9% for LLaMA3.1-8B among switched multi-hop failure cases. This indicates that grid-based reasoning is particularly helpful when the text-only route fails because it cannot compose spatial relations across multiple steps.
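For reference, the per-label switch and recovery percentages can be computed from annotated failure records along the following lines. This sketch follows the column definitions given in the captions of Tables 15 and 16; the record fields (`labels`, `switched`, `grid_correct`) and the toy records are hypothetical.

```python
from typing import Dict, List, Optional


def switching_stats(failures: List[dict], label: str) -> Dict[str, Optional[float]]:
    """Summarise routing outcomes for text-only failures that carry `label`."""
    labeled = [f for f in failures if label in f["labels"]]
    switched = [f for f in labeled if f["switched"]]
    recovered = [f for f in switched if f["grid_correct"]]

    def pct(part: list, whole: list) -> Optional[float]:
        # None plays the role of a dash in the tables: no cases in the category.
        return 100.0 * len(part) / len(whole) if whole else None

    sw = pct(switched, labeled)
    rec = pct(recovered, switched)
    return {
        "n_labeled": len(labeled),
        "Sw.": sw,                                           # share of labeled cases switched
        "No Sw.": None if sw is None else 100.0 - sw,
        "Rec.": rec,                                         # share of switched cases recovered
        "Grid Fail": None if rec is None else 100.0 - rec,   # switched but still wrong
    }


# Toy text-only failures (illustrative only). Labels are not mutually exclusive.
toy_failures = [
    {"labels": {"multi-hop"}, "switched": True, "grid_correct": True},
    {"labels": {"multi-hop", "composite"}, "switched": True, "grid_correct": False},
    {"labels": {"multi-hop"}, "switched": False, "grid_correct": False},
    {"labels": {"composite"}, "switched": True, "grid_correct": True},
]
multi_hop_stats = switching_stats(toy_failures, "multi-hop")
```

Because a record can carry several labels, the same failure contributes to more than one row, which is why the percentages should not be summed across rows.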

However, the overall benefit of switching still depends on whether the model can construct and use the grid reliably\. This is clearest for LLaMA3\.1\-8B: although switched multi\-hop failures are often recovered, the model has many text\-only failures overall and reaches only 36\.8% adaptive accuracy on the diagnostic subset, recovering 10\.2% of text\-only failures\. This suggests that switching frequently is not sufficient when the upstream relation extraction and grid\-use pipeline are weak\. In contrast, Qwen3\-8B and Qwen3\-14B achieve much higher adaptive accuracies, showing that smaller models can benefit substantially from switching when their structured pipeline is strong enough\.

Composite failures are less consistently recovered than multi\-hop failures\. Their recovery rates are lower for all models, ranging from 36\.4% for LLaMA3\.1\-70B to 57\.1% for Qwen3\-14B\. This suggests that composite or diagonal spatial relations remain difficult because the pipeline must first extract, decompose, and encode the relation correctly before grid reasoning can help\.

The residual failure columns further show that relation extraction is the main bottleneck in many switched failures\. For Qwen3\-8B and LLaMA3\.1\-8B, all observed remaining grid failures in this diagnostic subset are attributed to relation\-extraction errors\. For GPT\-5\.1, Qwen3\-32B, Qwen3\-14B, and LLaMA3\.1\-70B, some remaining failures are instead due to grid reasoning, meaning that the grid may be sufficiently informative but the model still misreads the layout or reverses the reference frame\. Overall, these diagnostics support the main conclusion that switching is most useful for recoverable multi\-hop composition errors, while its ceiling is limited by relation extraction quality and by the model’s ability to interpret the constructed grid\.

## Appendix G Examples

### G.1 StepGame

Figures [5](https://arxiv.org/html/2606.31285#A7.F5) and [6](https://arxiv.org/html/2606.31285#A7.F6) show StepGame examples using Qwen3-8B for question answering and relation extraction, with grids constructed deterministically from extracted relations. The examples include a case where grids recover a multi-hop text-only failure and a case where grid reasoning still fails because the model misreads row and column positions.

Both Grids Recover a Text-Only Multi-Hop Failure

Dataset ID: 5\_1103. Hop level: $k_{\text{hop}} = 5$.

Story: *P is to the bottom right of V. E is over V. X presents lower left to P. Y is positioned in the top left corner of C. Y is positioned right to X.*

Question: What is the relation of agent E to agent Y?

Gold: `upper-left`. Text-only prediction: `above` ($\neq$ gold). Relation Extraction: Correct.

Full Grid:
```
        Col(1) Col(2) Col(3)
Row(1)  E      -      -
Row(2)  V      -      -
Row(3)  -      P      -
Row(4)  X      Y      -
Row(5)  -      -      C
```

Pruned Grid:
```
        Col(1) Col(2)
Row(1)  E      -
Row(2)  -      Y
```

Full-Grid Justification: `upper-left`. The full grid places $E$ at Row(1), Col(1) and $Y$ at Row(4), Col(2). Since $E$ is in a smaller row and a smaller column than $Y$, the model correctly infers that $E$ is `upper-left` of $Y$.

Pruned-Grid Justification: `upper-left`. The pruned grid keeps only the queried entities while preserving their relative positions. $E$ remains above and left of $Y$, so the same row/column comparison recovers the gold answer.

Takeaway: This example shows how grid grounding can recover a text-only multi-hop failure. The text route collapses the 5-hop chain $E \rightarrow V \rightarrow P \rightarrow X \rightarrow Y$ into the answer `above`, missing the horizontal component. Both grid views externalize the composed relation as explicit coordinates, reducing the final inference to a direct row/column comparison. The pruned grid further removes distractor entities without changing the relevant relation.

Figure 5: Success case where text-only reasoning fails on a 5-hop diagonal chain, while both full and pruned grids recover the gold answer `upper-left` through explicit row/column comparison.

Grid Reasoning Fails Due to Layout Misreading

Dataset ID: 6\_6123. Hop level: $k_{\text{hop}} = 6$.

Story: *W is over M. U and T are side by side with U to the left and T to the right. U is diagonally below M to the left at a 45 degree angle. X and W are in a horizontal line with X on the left. L is diagonally to the bottom right of R. X is below L.*

Question: What is the relation of agent L to agent T?

Gold: `upper-left`. Text-only prediction: `upper-right` ($\neq$ gold). Relation Extraction: Correct.

Full Grid:
```
        Col(1) Col(2) Col(3)
Row(1)  R      -      -
Row(2)  -      L      -
Row(3)  -      X      W
Row(4)  -      -      M
Row(5)  -      U      T
```

Pruned Grid:
```
        Col(1) Col(2)
Row(1)  L      -
Row(2)  -      T
```

Full-Grid Justification: `above` ($\neq$ gold). The full grid correctly places $L$ at Row(2), Col(2), but the model misreads $T$'s column as Col(2) instead of Col(3). This collapses the diagonal relation into a vertical comparison, leading to the incomplete prediction `above`.

Pruned-Grid Justification: `lower-left` ($\neq$ gold). The pruned grid removes distractors and keeps only $L$ and $T$, but the model reverses their row order during interpretation. It therefore treats $L$ as below $T$, producing `lower-left` instead of the correct `upper-left`.

Takeaway: This example illustrates that grid construction alone is not sufficient when the model misreads the layout. The full grid fails because the reader misaligns $T$'s column in a denser layout, while the pruned grid fails for a different reason: it inverts the vertical order of the two retained entities. Thus, even with correct relation extraction, the grid route can fail when the model's first step is an incorrect interpretation of row or column positions.

Figure 6: Failure case where both grid views are incorrect despite correct relation extraction. The full grid misreads $T$'s column, while the pruned grid inverts the row order between the queried entities.

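The deterministic grid construction, pruning, and row/column reading used in the two StepGame examples above can be illustrated with a minimal sketch. It assumes unit-step offsets for every relation and a breadth-first placement order; the names `OFFSETS`, `place_entities`, `prune`, and `read_relation` are hypothetical and are not taken from the paper's implementation. The relations are those of the Figure 5 story (Dataset ID 5\_1103).

```python
from collections import deque

# Assumed unit offsets (row, col) for "A is REL of B"; rows grow downward.
OFFSETS = {
    "above": (-1, 0), "below": (1, 0), "left": (0, -1), "right": (0, 1),
    "upper-left": (-1, -1), "upper-right": (-1, 1),
    "lower-left": (1, -1), "lower-right": (1, 1),
}

def place_entities(relations):
    """Place entities on a 1-indexed grid from (a, rel, b) triples."""
    neighbors = {}
    for a, rel, b in relations:
        dr, dc = OFFSETS[rel]
        neighbors.setdefault(b, []).append((a, dr, dc))    # a = b + offset
        neighbors.setdefault(a, []).append((b, -dr, -dc))  # b = a - offset
    start = relations[0][2]
    coords = {start: (0, 0)}
    queue = deque([start])
    while queue:  # entities not connected to `start` are left unplaced
        cur = queue.popleft()
        r, c = coords[cur]
        for other, dr, dc in neighbors[cur]:
            if other not in coords:
                coords[other] = (r + dr, c + dc)
                queue.append(other)
    min_r = min(r for r, _ in coords.values())
    min_c = min(c for _, c in coords.values())
    return {e: (r - min_r + 1, c - min_c + 1) for e, (r, c) in coords.items()}

def prune(coords, keep):
    """Keep only the queried entities, preserving their relative order."""
    rows = sorted({coords[e][0] for e in keep})
    cols = sorted({coords[e][1] for e in keep})
    return {e: (rows.index(coords[e][0]) + 1, cols.index(coords[e][1]) + 1)
            for e in keep}

def read_relation(coords, a, b):
    """Relation of a to b from a direct row/column comparison."""
    (ra, ca), (rb, cb) = coords[a], coords[b]
    vertical = "upper" if ra < rb else "lower" if ra > rb else ""
    horizontal = "left" if ca < cb else "right" if ca > cb else ""
    if vertical and horizontal:
        return f"{vertical}-{horizontal}"
    if vertical:
        return "above" if vertical == "upper" else "below"
    return horizontal or "overlap"

story = [
    ("P", "lower-right", "V"), ("E", "above", "V"), ("X", "lower-left", "P"),
    ("Y", "upper-left", "C"), ("Y", "right", "X"),
]
full = place_entities(story)
pruned = prune(full, ["E", "Y"])
print(read_relation(full, "E", "Y"), read_relation(pruned, "E", "Y"))
# prints: upper-left upper-left
```

Under these assumptions the sketch places $E$ at Row(1), Col(1) and $Y$ at Row(4), Col(2), matching the full grid in Figure 5, and the pruned coordinates match the pruned grid. The Figure 6 failure arises in the reading step performed by the model, which this deterministic comparison does not reproduce.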
### G.2 SpaRTUN

Figures [7](https://arxiv.org/html/2606.31285#A7.F7), [8](https://arxiv.org/html/2606.31285#A7.F8), and [9](https://arxiv.org/html/2606.31285#A7.F9) show SpaRTUN examples using Qwen3-32B for question answering and relation extraction. These examples focus on topological and containment reasoning, where grids encode nested boxes with bracketed structures and RCC8-style relations such as `tpp`, `ntpp`, and `dc`.

Correct Topological Reasoning with Nested Blocks

Dataset ID: 3999-0.

Story: *There exist two blocks, called HHH and LLL. Block HHH covers a medium grey hexagon. This block covers block LLL. A medium grey hexagon is inside and touching block LLL.*

Question: What is the position of LLL relative to HHH?

Gold: `['TPP']`. Relation Extraction: Correct. All relevant containment relations are extracted, including that block LLL is contained in block HHH and touches its boundary.

Pruned Grid:
```
        Col(1)
Row(1)  [block HHH:
Row(2)    medium grey hexagon #1_in(block HHH)   #touch-edge
Row(3)    [block LLL:                            #touch-edge
Row(4)      medium grey hexagon #2_in(block LLL) #touch-edge
Row(5)    ]
Row(6)  ]
```

Full Grid:
```
        Col(1)
Row(1)  [block HHH:
Row(2)    medium grey hexagon #1_in(block HHH)   #touch-edge
Row(3)    [block LLL:                            #touch-edge
Row(4)      medium grey hexagon #2_in(block LLL) #touch-edge
Row(5)    ]
Row(6)  ]
```

Pruned Justification: `['tpp']`. The pruned grid preserves the nested structure by showing block LLL inside block HHH. Since block LLL is contained in block HHH and carries the boundary tag `#touch-edge`, the model correctly infers tangential proper-part containment:

$$\text{block LLL} \in \text{block HHH} \quad \text{and touches its boundary} \Rightarrow \texttt{tpp}.$$

Full Justification: `['tpp']`. The full grid encodes the same containment structure. It also represents block LLL as nested inside block HHH, with the `#touch-edge` tag indicating boundary contact. Therefore, the full grid also recovers the correct topological relation `tpp` rather than relying on row-based directional comparison.

Takeaway: This example shows that when relation extraction and grid construction preserve nested containment, both full and pruned grids can support correct RCC8-style topological reasoning. The pruned grid does not lose the nesting relation; instead, it retains the relevant containment structure while removing unnecessary context.

Figure 7: A successful topological reasoning case where both full and pruned grids preserve nested containment. Since block LLL is represented inside block HHH with boundary contact, both grid readers correctly output `TPP`.

Containment Partially Recovered, but Extra Direction Added

Dataset ID: 4042-3.

Story: *Two boxes, called one and two, exist in an image. Box one has box two. A medium green thing is inside and touching box one. Box two covers an orange apple.*

Question: Where is the orange thing relative to box one?

Gold: `['NTPP']`. Relation Extraction: Correct. All 3 relations matched.

Pruned Grid:
```
        Col(1)
Row(1)  [box one:
Row(2)    medium green thing_in(box one)   #touch-edge
Row(3)    [box two_in(box one):            #inside-clear
Row(4)      orange apple_in(box two)       #touch-edge
Row(5)    ]
```

Full Grid:
```
        Col(1)
Row(1)  [box one:
Row(2)    medium green thing_in(box one)   #touch-edge
Row(3)    [box two_in(box one):            #inside-clear
Row(4)      orange apple_in(box two)       #touch-edge
Row(5)    ]
Row(6)  ]
```

Pruned Justification: `['below', 'right']`. The pruned run treats box two as a separate entity rather than preserving the nested form `box two_in(box one)`. As a result, it falls back to comparing box headers and row/column positions, inferring a spurious directional relation. It predicts `below` and `right`, neither of which captures the correct topological containment relation.

Full Justification: `['below', 'ntpp']`. The full run preserves the nesting relation explicitly through `box two_in(box one)` and correctly recovers multi-hop containment:

$$\text{orange apple} \in \text{box two} \in \text{box one} \Rightarrow \texttt{ntpp}.$$

However, because the orange apple is also listed at a lower row with `#touch-edge`, the model adds an unsupported directional relation `below` based on row comparison. Since the gold answer contains only `['NTPP']`, the full output is still incorrect.

Takeaway: The full run recovers the correct topological relation but still fails because it adds an unsupported directional label. The pruned run misses the containment structure entirely and falls back to row-based comparison.

Figure 8: An example in which containment is partially recovered but an extra directional label is added.

Full Grid Misses Containment Due to Extra Objects

Dataset ID: 293-0.

Story: *There are three boxes, named one, two, and three. Box one contains a medium yellow apple and covers a small orange melon. […] Box two covers a big orange melon which is to the south of a medium green watermelon. Box two covers the medium green watermelon. It contains a small orange apple […]. Box three covers box two. A big green melon and a medium green apple are covered by this box. […]*

Question: What is the position of the small orange apple relative to box three?

Gold: `['NTPP']`. Relation Extraction: Incorrect. Genuine missing RE: (medium yellow apple number two, ntpp, box one), a non-query relation.

Pruned Grid:
```
        Col(1)
Row(1)  [box three:
Row(2)    medium green apple number two   #touch-edge
Row(3)    big green melon                 #touch-edge
Row(4)    [box two_in(box three):         #touch-edge
Row(5)      small orange apple            #inside-clear
Row(6)      medium green watermelon #1    #touch-edge
Row(7)    ]
```

Full Grid:
```
        Col(1)
Row(1)  [box three:
Row(2)    medium green apple number two   #touch-edge
Row(3)    big green melon                 #touch-edge
Row(4)    [box two_in(box three):         #touch-edge
Row(5)      small orange apple            #inside-clear
Row(6)      medium green watermelon #1    #touch-edge
Row(7)      big orange melon              #touch-edge
Row(8)      medium green watermelon #2    #inside-clear
...         [box one_in(box three/two): ...]
```

Pruned Justification: `['ntpp']`. The pruned grid preserves the relevant nesting chain: the small orange apple is inside box two, and box two is inside box three. Therefore, the model can apply multi-step containment reasoning:

$$\text{small orange apple} \in \text{box two} \in \text{box three} \Rightarrow \texttt{ntpp}.$$

Because the pruned layout is sparse and does not provide a reliable directional comparison, it correctly avoids adding an unsupported directional label.

Full Justification: `['below']`. The full grid contains additional boxes and a longer layout, so the model focuses on row positions instead of propagating containment through box two and box three. It treats box two as lower than box three and predicts the directional relation `below`. This misses the correct topological relation, since the gold answer depends on nested containment rather than row-based comparison.

Takeaway: This is a clear case where the pruned run succeeds because it preserves the essential nesting structure and applies multi-step containment:

$$\text{apple} \in \text{box two} \in \text{box three} \Rightarrow \texttt{ntpp}.$$

The full run, however, includes additional objects and boxes, making row-order reasoning more salient. As a result, it outputs only `below` and misses the correct topological relation.

Figure 9: An example in which the full grid misses containment due to extra objects.

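The multi-step containment reasoning in the three SpaRTUN examples above can be sketched as a walk up the nesting chain. This is a minimal illustration, not the paper's procedure: the function `containment_relation`, the `parent` dictionary layout, and the mapping of `#touch-edge` to `tpp` and `#inside-clear` to `ntpp` are assumptions made here. Links are combined with the standard RCC8 composition for proper parts: any `ntpp` link yields `ntpp`, while a chain made only of `tpp` links is ambiguous between `tpp` and `ntpp`.

```python
# Assumed mapping from the grid's boundary tags to RCC8 proper-part relations.
TAG_TO_RCC8 = {"touch-edge": "tpp", "inside-clear": "ntpp"}

def containment_relation(parent, entity, container):
    """Return the set of possible RCC8 relations of `entity` to `container`.

    `parent` maps each child to (its direct container, its boundary tag).
    Returns None if `container` is not an ancestor of `entity`.
    """
    links = []
    seen = set()
    cur = entity
    while cur != container:
        if cur not in parent or cur in seen:  # no path, or a cycle
            return None
        seen.add(cur)
        cur, tag = parent[cur]
        links.append(TAG_TO_RCC8[tag])
    if not links:
        return None              # entity and container are the same object
    if "ntpp" in links:
        return {"ntpp"}          # one non-tangential link fixes the result
    if len(links) == 1:
        return {"tpp"}           # direct containment touching the boundary
    return {"tpp", "ntpp"}       # tpp composed with tpp stays ambiguous

# Nesting read off the grids in Figures 7, 8, and 9.
fig7 = {"block LLL": ("block HHH", "touch-edge")}
fig8 = {"orange apple": ("box two", "touch-edge"),
        "box two": ("box one", "inside-clear")}
fig9 = {"small orange apple": ("box two", "inside-clear"),
        "box two": ("box three", "touch-edge")}

print(containment_relation(fig7, "block LLL", "block HHH"))          # {'tpp'}
print(containment_relation(fig8, "orange apple", "box one"))         # {'ntpp'}
print(containment_relation(fig9, "small orange apple", "box three")) # {'ntpp'}
```

The sketch never consults row order, which is the behaviour the failing runs in Figures 8 and 9 lack: they add or substitute `below` from row comparison instead of propagating containment.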
### G.3 ReSQ

Figures [10](https://arxiv.org/html/2606.31285#A7.F10) and [11](https://arxiv.org/html/2606.31285#A7.F11) show qualitative examples from ReSQ using Qwen3-32B for question answering and relation extraction, and GPT-5.1 for grid construction. The grid is represented as a $5 \times 5$ markdown table with directional tags such as `#above(X)`, `#below(X)`, `#left(X)`, `#right(X)`, `#front(X)`, `#behind(X)`, and `#near(X)`.

Above/Below Relation Correct in Text and Both Grids

Image ID: 1088. Ground Truth: Yes. Text-based: Yes. Full Grid: Yes. Pruned Grid: Yes.

Story: *A photo of a room with white walls, two single beds with a night table in between and a picture on the wall above the beds.*

Question: Are the beds below the picture?

Selected Entities (after pruning): `['picture on the wall', 'two single beds']`

Full Grid:
```
|          | **Col1** | **Col2** | **Col3** |
| **Row1** | picture on the wall #above(two single beds), #near(two single beds), #near(night table), #front(white walls) | — | — |
| **Row2** | white walls #behind(two single beds), #behind(night table), #behind(picture on the wall) | — | — |
| **Row3** | two single beds #below(picture on the wall), #near(night table), #near(picture on the wall) | night table | — |
```

Pruned Grid:
```
|          | **Col1** | **Col2** | **Col3** |
| **Row1** | picture on the wall #above(two single beds) | — | — |
| **Row2** | — | — | — |
| **Row3** | two single beds #below(picture on the wall), #near(picture on the wall) | — | — |
```

Text-Only Justification: Yes. The story states that there is a picture on the wall “above the beds,” which directly implies that the beds are below the picture.

Full-Grid Justification: Yes. The two single beds are in Row3, Col1 and the picture on the wall is in Row1, Col1. Since row indices increase downward, the beds are below the picture. The tag `#below(picture on the wall)` also makes this relation explicit.

Pruned-Grid Justification: Yes. The pruned grid preserves the two relevant entities and their vertical relation. The beds remain in Row3 and the picture remains in Row1, with the `#below(picture on the wall)` tag attached to the beds.

Takeaway: This is a clean success case across all three modes. The above/below relation is stated directly in the story, encoded positionally in the full grid, and preserved by pruning. The pruned grid removes distractor entities such as *white walls* and *night table* while retaining the relation needed to answer the question.

Figure 10: An example from the ReSQ dataset answered correctly by text-only reasoning and by both grid views.

Commonsense Recovery in the Pruned Grid

ID: 1145. Ground Truth: Yes. Text-based: No. Full Grid: No. Pruned Grid: Yes.

Story: *A courtyard with stairs on the left, a big palm tree in the centre, a few tables and chairs and a light brown house with red rooftiles in the background.*

Question: Is the house below the rooftiles?

Selected Entities (after pruning): `['light brown house with red rooftiles']`

Full Grid:
```
|          | **Col1** | **Col2** | **Col3** |
| **Row1** | — | light brown house with red rooftiles #behind(stairs), #behind(big palm tree), #behind(tables), #behind(chairs) | — |
| **Row2** | — | big palm tree #right(stairs), #near(tables), #near(chairs), #front(light brown house with red rooftiles) | — |
| **Row3** | stairs #left(big palm tree), #left(tables), #left(chairs), #front(light brown house with red rooftiles) | — | — |
| **Row4** | — | tables #right(stairs), #near(big palm tree), #near(chairs), #front(light brown house with red rooftiles) | — |
| **Row5** | — | chairs #right(stairs), #near(big palm tree), #near(tables), #front(light brown house with red rooftiles) | — |
```

Pruned Grid:
```
|          | **Col1** | **Col2** | **Col3** |
| **Row1** | light brown house with red rooftiles  | —   | —  |
| **Row2** | —        | —        | —       |
| **Row3** | —        | —        | —       |
```

Text-Only Justification: No. The text-only route treats “light brown house with red rooftiles” as a single compound entity and therefore does not infer a separate above/below relation between the house and the rooftiles.

Full-Grid Justification: No. The full grid also represents “light brown house with red rooftiles” as a single object in Row1, Col3. Because the house and rooftiles are not separated into distinct grid entities, the model does not derive a vertical relation between them.

Pruned-Grid Justification: Yes. The pruned grid contains only the compound entity. With distractor objects removed and no competing row-based comparison, the model relies on commonsense knowledge that rooftiles are located on top of a house, correctly answering that the house is below the rooftiles.

Takeaway: This example highlights a limitation and a benefit of pruning. The relation is not represented explicitly because the house and rooftiles are fused into one entity mention. Text-only and full-grid reasoning both reject the relation due to this surface-form constraint. The pruned grid removes irrelevant spatial anchors and allows the model to recover the gold answer through commonsense about the structure.

Figure 11: An example from the ReSQ dataset answered incorrectly by text-only and full-grid reasoning and correctly by the pruned grid.

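The two signals the ReSQ grid reader relies on in Figure 10, explicit directional tags and row order, can be sketched as follows. This is an illustrative sketch only: the helper names `parse_cell` and `is_below`, the `(row, col) -> cell text` dictionary layout, and the rule of checking tags before rows are assumptions, not the paper's implementation.

```python
import re

# Matches tags of the form #relation(target), e.g. #below(picture on the wall).
TAG = re.compile(r"#([\w-]+)\(([^)]*)\)")

def parse_cell(cell):
    """Split a grid cell into its entity name and a set of (relation, target) tags."""
    name = cell.split("#", 1)[0].strip()
    tags = {(rel, target.strip()) for rel, target in TAG.findall(cell)}
    return name, tags

def is_below(grid, a, b):
    """Answer 'is a below b?' from tags first, then from row indices.

    `grid` maps (row, col) to the cell text; rows increase downward.
    Returns None when the grid holds no evidence about the pair.
    """
    rows, tags = {}, {}
    for (row, _col), cell in grid.items():
        name, cell_tags = parse_cell(cell)
        rows[name], tags[name] = row, cell_tags
    if ("below", b) in tags.get(a, set()) or ("above", a) in tags.get(b, set()):
        return True
    if a in rows and b in rows:
        return rows[a] > rows[b]
    return None

# Pruned grid from Figure 10 (empty cells omitted).
fig10 = {
    (1, 1): "picture on the wall #above(two single beds)",
    (3, 1): "two single beds #below(picture on the wall), #near(picture on the wall)",
}
# Pruned grid from Figure 11: house and rooftiles are fused into one entity.
fig11 = {(1, 1): "light brown house with red rooftiles"}

print(is_below(fig10, "two single beds", "picture on the wall"))  # True
print(is_below(fig11, "house", "rooftiles"))                      # None
```

The `None` result for the Figure 11 grid reflects the point made in that example: the grid carries no explicit relation between the house and the rooftiles, so the correct answer there comes from the model's commonsense rather than from the layout.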
### G.4 Switching

Figure [12](https://arxiv.org/html/2606.31285#A7.F12) shows a StepGame switching example using Qwen3-8B as the candidate model and `gpt-5-mini` for complexity signals. The example shows that an answer can appear faithful to its cited support while still being implausible under stability checks. Together with high complexity, this triggers a switch to grid-based reasoning, which recovers the gold answer.

Worked Example of Trust, Complexity, and Switching

Model / thresholds: Qwen3-8B, $\tau_t = 0.85$, $\tau_c = 0.40$. Hop level: $k_{\text{hop}} = 8$.

Story: *U is placed in the right direction of V. M is positioned in the lower left corner of A. K is on the right side and top of M. T is to the left of C with a small gap between them. U is sitting at the 9:00 position of T. E is positioned in the lower right corner of X. X is below and to the left of C. The object K is positioned below and to the left of the object E.*

Question: What is the relation of the agent T to the agent E?

Gold: `upper-left`.

Faithfulness $F$. The model identifies the relevant support chain through $C$ and $X$:
```
"T is to the left of C with a small gap between them."
"X is below and to the left of C."
"E is positioned in the lower right corner of X."
```

The support-only answer matches the original answer, and removing this support makes the answer unavailable. Thus, both sufficiency and necessity hold:

$$F_S = 1, \qquad F_C = 1, \qquad F = \tfrac{1}{2}(F_S + F_C) = 1.0.$$

Since $F = 1.0$, trust cannot be ruled out from faithfulness alone: the maximum possible trust is $0.6F + 0.4 = 1.0$, but plausibility can still reduce the final trust score.

Plausibility $P$. The paraphrase and flip checks reveal instability. Two paraphrased variants preserve the answer `upper-left`, but one simplified variant drops the vertical component and answers only `left`. The flipped question should invert the relation from $E$ to $T$ as `lower-right`, but the model again predicts `upper-left`. Therefore,

$$PS = \tfrac{2}{3}, \qquad FC = 0, \qquad P = \tfrac{1}{2}(PS + FC) = \tfrac{1}{2}(0.667 + 0) = 0.333.$$

Complexity $C$. The instance is structurally difficult: it requires a three-link chain $T \rightarrow C \rightarrow X \rightarrow E$, contains diagonal relations such as lower-left and lower-right, and involves eight entities. The StepGame complexity components are:

$$SB = 0.375, \quad SD = 0.625, \quad CL = 0.615, \quad HL = 0.300, \quad DB = 0.667, \quad EL = 0.600.$$

Using the StepGame weights from Table [5](https://arxiv.org/html/2606.31285#A2.T5),

$$
\begin{aligned}
C &= 0.20(SB) + 0.15(SD) + 0.20(CL) + 0.25(HL) + 0.10(DB) + 0.10(EL) \\
  &= 0.20(0.375) + 0.15(0.625) + 0.20(0.615) + 0.25(0.300) \\
  &\quad + 0.10(0.667) + 0.10(0.600) \\
  &= 0.493.
\end{aligned}
$$

Switch Decision. The final trust score is

$$T = 0.6F + 0.4P = 0.6(1.0) + 0.4(0.333) = 0.733.$$

Since

$$T = 0.733 < \tau_t = 0.85 \qquad \text{and} \qquad C = 0.493 \geq \tau_c = 0.40,$$

both low trust and high complexity trigger the switch:

$$\boxed{\text{SWITCH to grid}}$$

Text-only answer: `left` ($\neq$ gold). Grid answer: `upper-left` (= gold).

Efficiency note. In this example, $F = 1.0$, so the policy cannot short-circuit after faithfulness and must evaluate plausibility. If faithfulness had already bounded trust below $\tau_t$, the policy would switch immediately and skip plausibility; similarly, if complexity alone had exceeded $\tau_c$ before plausibility was needed, the policy could switch without computing the remaining trust checks.

Takeaway. This example shows why faithfulness alone is insufficient. The answer appears grounded in the correct support chain, but plausibility reveals that the model's internal spatial interpretation is brittle: one paraphrase loses the diagonal component, and the flipped question fails completely. Because the instance is also diagonal-heavy and multi-hop, both trustworthiness and complexity indicate that text-only reasoning is unreliable. The text-only route predicts only `left`, while the grid representation makes the composed relation explicit and recovers the gold answer `upper-left`.

Figure 12: Worked switching example on an 8-hop StepGame instance. Faithfulness is high ($F = 1.0$), but low plausibility reduces trust to $T = 0.733$; together with high complexity ($C = 0.493$), this triggers a switch to the grid route, which recovers the gold answer `upper-left`.
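The arithmetic of the worked example can be checked with a short sketch. The weights, thresholds, and component scores are the ones quoted in Figure 12; the function names are hypothetical, and the decision rule `T < tau_t or C >= tau_c` is an assumption inferred from the efficiency note (either signal alone may trigger the switch), not a rule stated explicitly in this section.

```python
# StepGame complexity weights as quoted in the worked example.
STEPGAME_WEIGHTS = {"SB": 0.20, "SD": 0.15, "CL": 0.20,
                    "HL": 0.25, "DB": 0.10, "EL": 0.10}

def faithfulness(f_s, f_c):
    """F = (F_S + F_C) / 2."""
    return 0.5 * (f_s + f_c)

def plausibility(ps, fc):
    """P = (PS + FC) / 2."""
    return 0.5 * (ps + fc)

def trust(f, p):
    """T = 0.6 F + 0.4 P."""
    return 0.6 * f + 0.4 * p

def trust_upper_bound(f):
    """Largest trust reachable once F is known (P = 1)."""
    return 0.6 * f + 0.4

def complexity(components, weights=STEPGAME_WEIGHTS):
    """Weighted sum of the complexity components."""
    return sum(weights[k] * components[k] for k in weights)

def should_switch(t, c, tau_t=0.85, tau_c=0.40):
    """Assumed rule: switch on low trust or high complexity."""
    return t < tau_t or c >= tau_c

F = faithfulness(1, 1)
P = plausibility(2 / 3, 0)
T = trust(F, P)
C = complexity({"SB": 0.375, "SD": 0.625, "CL": 0.615,
                "HL": 0.300, "DB": 0.667, "EL": 0.600})

print(round(F, 3), round(P, 3), round(T, 3), round(C, 3))
# prints: 1.0 0.333 0.733 0.493
print(should_switch(T, C))
# prints: True
```

Because `trust_upper_bound(1.0)` equals 1.0, which is not below $\tau_t = 0.85$, faithfulness alone cannot settle the decision in this instance, which is why the plausibility checks have to be run.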
