Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

arXiv cs.AI Papers

Summary

This research paper analyzes LLM reasoning traces in the game four-in-a-row, finding that LLMs exhibit myopic planning where performance is driven by shallow search breadth rather than deep lookahead, unlike human experts.

arXiv:2605.06840v1 Announce Type: new Abstract: Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.

Cached at: 05/11/26, 07:07 AM

# Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
Source: [https://arxiv.org/html/2605.06840](https://arxiv.org/html/2605.06840)
Sixing Chen \(Department of Psychology, New York University, sixing\.chen@nyu\.edu\); Ji\-An Li \(New York University, jian\.li\.acad@gmail\.com\); Saner Cakir \(Generality, Inc\., saner@generality\.inc\); Sinan Akcali \(Generality, Inc\., sinan@generality\.inc\); Kayla Lee \(Generality, Inc\., kayla@generality\.inc\); Marcelo G\. Mattar \(Department of Psychology, New York University, marcelo\.mattar@nyu\.edu\)

###### Abstract

Large language models \(LLMs\), especially reasoning models, generate extended chain\-of\-thought \(CoT\) reasoning that often contains explicit deliberation over future outcomes\. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood\. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four\-in\-a\-row board game\. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions\. We find that LLMs’ search is shallower than humans’, and that performance is predicted by search breadth rather than depth\. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely\. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes\. These patterns contrast with human planning, where performance is driven primarily by deep search\. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead\. This dissociation offers targeted guidance for aligning LLM and human planning\. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains\.

## 1 Introduction

Large language models \(LLMs\), especially reasoning models, have shown a striking capability for extended chain\-of\-thought \(CoT\) reasoning, in which models generate lengthy reasoning traces before producing an answer\[[33](https://arxiv.org/html/2605.06840#bib.bib1)\]\. In reasoning models such as DeepSeek\-R1\[[6](https://arxiv.org/html/2605.06840#bib.bib3)\] and OpenAI o1\[[21](https://arxiv.org/html/2605.06840#bib.bib2)\], reasoning traces can span thousands of tokens and contain explicit deliberation over hypothetical futures\. This deliberation resembles the mental simulation that underlies human planning\[[16](https://arxiv.org/html/2605.06840#bib.bib4)\], raising the possibility that these models engage in prospective planning\.

In both classical artificial intelligence \(AI\) and cognitive science, planning has long been formalized as tree search, with deep forward search as the key driver of planning capability\. In AI, game\-playing agents such as AlphaGo achieve superhuman performance by systematically searching deep into the future\[[27](https://arxiv.org/html/2605.06840#bib.bib5),[28](https://arxiv.org/html/2605.06840#bib.bib6),[24](https://arxiv.org/html/2605.06840#bib.bib7)\]\. In cognitive science, tree search has likewise served as the primary computational framework for modeling human planning\. Research suggests that humans mentally simulate sequences of future actions to inform their decisions\[[19](https://arxiv.org/html/2605.06840#bib.bib9),[13](https://arxiv.org/html/2605.06840#bib.bib8),[3](https://arxiv.org/html/2605.06840#bib.bib10),[8](https://arxiv.org/html/2605.06840#bib.bib11)\], and the depth of this simulation scales with expertise\[[32](https://arxiv.org/html/2605.06840#bib.bib12),[13](https://arxiv.org/html/2605.06840#bib.bib8)\]\.

Whether LLMs engage in this kind of search\-based planning, however, remains deeply controversial\. One view holds that LLMs are fundamentally incapable of planning, as their autoregressive generation cannot support the systematic search and backtracking that planning requires\[[11](https://arxiv.org/html/2605.06840#bib.bib13)\]\. Consistent with this, several studies using behavioral benchmarks report that LLMs can fail at systematic multi\-step planning and that their outputs are better explained by pattern completion than genuine planning\[[31](https://arxiv.org/html/2605.06840#bib.bib14),[36](https://arxiv.org/html/2605.06840#bib.bib15)\]\. The opposing view points to the evidence that reasoning models perform well on challenging tasks that appear to require multi\-step planning, including competitive programming, mathematical reasoning, and strategic gameplay\[[21](https://arxiv.org/html/2605.06840#bib.bib2),[6](https://arxiv.org/html/2605.06840#bib.bib3),[5](https://arxiv.org/html/2605.06840#bib.bib16)\]\. Yet, these conclusions have been drawn primarily by analyzing behavioral outcomes, without examining the structure of the reasoning that produced those outcomes\.

Resolving the controversy therefore requires asking different questions\. First, do LLM reasoning traces exhibit the structural hallmarks of systematic search? To date, this question remains largely unaddressed, in part because reasoning traces are long, verbose, and unstructured, making it difficult to extract structures from them\. Recent work has begun extracting structured graphs from reasoning traces to predict reasoning quality, but has been applied to single\-answer reasoning tasks \(e\.g\., math, science, and coding\)\[[9](https://arxiv.org/html/2605.06840#bib.bib18),[20](https://arxiv.org/html/2605.06840#bib.bib19)\]\. Planning poses a different computational challenge: rather than finding a single correct answer, it requires evaluating sequences of *future* actions and their consequences\. Second, if LLMs do engage in search, does the search actually drive their decisions? Crucially, even if LLM reasoning traces look like search, the search may or may not be what drives the final decision, a gap invisible to behavioral benchmarks and largely unexplored in the existing literature\.

In this work, we address this gap by introducing a method to extract and quantify search trees from LLM reasoning traces in a two\-player board game, and fitting computational models to characterize how those trees influence move decisions\. The board game we consider is “four\-in\-a\-row” \([Figure 1](https://arxiv.org/html/2605.06840#S2.F1)A\)\. Four\-in\-a\-row is well\-suited for this investigation for several reasons\. First, it is a well\-defined strategic game, making tree extraction tractable and verifiable\. Second, human planning in the game is well\-characterized by an established computational cognitive model\[[32](https://arxiv.org/html/2605.06840#bib.bib12)\], providing a rigorous baseline for direct comparison with humans\. Third, popular games like chess or Go are heavily represented in LLM training data, so models may rely on memory rather than plan from scratch\[[23](https://arxiv.org/html/2605.06840#bib.bib20),[17](https://arxiv.org/html/2605.06840#bib.bib21)\]\. In contrast, four\-in\-a\-row games are less likely to be overrepresented on the internet, making it a cleaner testbed of planning ability\.

Analyzing reasoning traces from LLMs playing four\-in\-a\-row, we found that LLMs’ search was shallower than humans’, and search depth explained no additional variance in performance controlling for search breadth\. Crucially, although LLMs expanded deep nodes, their move choices were best explained by a myopic model that ignores those nodes entirely\. A causal intervention study, in which we selectively pruned CoT paragraphs, further suggested that move selection was driven predominantly by shallow rather than deep search\. These patterns contrast with human planning, where expertise is driven primarily by deeper search\. Together, our findings reveal that LLMs do not act on deep lookahead, and that their planning strategy differs fundamentally from the depth\-driven expertise in humans\.

## 2 Game setup and search tree extraction

![Refer to caption](https://arxiv.org/html/2605.06840v1/x1.png)

Figure 1: Game setup and search tree extraction\. \(A\) An example board position in the four\-in\-a\-row game\. Two players \(black and white\) alternate placing pieces on a $4\times 9$ board, and the first player who achieves four\-in\-a\-row wins the game\. \(B\) Task prompt\. The system prompt describes the rules of four\-in\-a\-row, the board representation \(FEN notation\), and the move submission format\. The user message provides the current board state and the active player\. \(C\) Reasoning trace and move output\. The model generates CoT reasoning traces before committing to a final move in the output\. In the example reasoning traces, the model’s deliberated moves are highlighted in blue, while the deliberated opponent moves are highlighted in orange\. \(D\) Search tree extraction\. An LLM judge \(GPT\-5\) parses the reasoning trace to extract the search tree of considered moves\. In the example search tree, the top square shows the current board state \(denoted by the board’s FEN notation\)\. Each circle represents a state resulting from a model’s own simulated move, and each square represents a state resulting from a simulated opponent’s move\. Numbers inside each node indicate the board coordinate of the corresponding move \(zero\-indexed\)\. The search tree shown is for illustration only and does not correspond to the example board position in \(A\)\.

### 2\.1 Four\-in\-a\-row tournament with LLMs

We used four\-in\-a\-row to study planning in LLMs\. Four\-in\-a\-row is a two\-player zero\-sum board game \([Figure 1](https://arxiv.org/html/2605.06840#S2.F1)A\)\. Two players \(white and black\) alternate placing pieces on a $4\times 9$ grid\. White moves first\. The first player who places four of their pieces consecutively along a horizontal, vertical, or diagonal line wins\. If the board fills without a winner, the game is a draw\.
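The win and draw conditions above can be sketched in a few lines of Python. This is an illustrative check only, not the tournament code: the `'.'` symbol for an empty cell and the function names `winner` and `is_draw` are assumptions made for this sketch.

```python
from typing import List, Optional

ROWS, COLS = 4, 9  # board size stated in the text

# Scan directions: right, down, down-right, down-left.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def winner(board: List[List[str]]) -> Optional[str]:
    """Return 'W' or 'B' if that player has four in a line, else None.

    board[r][c] is 'W', 'B', or '.' for an empty cell (hypothetical encoding).
    """
    for r in range(ROWS):
        for c in range(COLS):
            piece = board[r][c]
            if piece == ".":
                continue
            for dr, dc in DIRECTIONS:
                end_r, end_c = r + 3 * dr, c + 3 * dc
                if not (0 <= end_r < ROWS and 0 <= end_c < COLS):
                    continue  # a line of four would leave the board
                if all(board[r + k * dr][c + k * dc] == piece for k in range(4)):
                    return piece
    return None


def is_draw(board: List[List[str]]) -> bool:
    """A draw is a full board with no winner."""
    full = all(cell != "." for row in board for cell in row)
    return full and winner(board) is None
```

Scanning only four of the eight directions is enough, because every line of four is found from one of its two end cells.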

In the game, each model received a system prompt describing the rules \(see [Appendix C](https://arxiv.org/html/2605.06840#A3)\.1 for game prompts\)\. Board states were communicated using a FEN\-style notation\[[35](https://arxiv.org/html/2605.06840#bib.bib36),[25](https://arxiv.org/html/2605.06840#bib.bib37)\]: each row is encoded as a sequence of piece symbols \(`W` for white, `B` for black\), integers represent runs of consecutive empty cells, and rows are separated by slashes\. For example, `1WBB6/2BW1W4/1W1BW5/10` describes a four\-row board in which the first row contains one empty cell, followed by a white piece, two black pieces, and six empty cells \([Figure 1](https://arxiv.org/html/2605.06840#S2.F1)A\)\. In each turn, the board state and current player were passed as a user message, and the model was asked to respond with a move in the form `m <row> <col>`, where `<row>` and `<col>` are the zero\-indexed row and column of the target cell \([Figure 1](https://arxiv.org/html/2605.06840#S2.F1)B\)\.
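As a rough illustration of this notation, the sketch below expands a FEN\-style string into a grid and formats a move reply. The helper names `parse_fen` and `format_move`, and the `'.'` symbol for empty cells, are assumptions of this sketch and do not come from the paper. The parser does not enforce a board width.

```python
import re
from typing import List


def parse_fen(fen: str) -> List[List[str]]:
    """Expand a FEN-style string into rows of 'W', 'B', or '.' (empty)."""
    board = []
    for row in fen.split("/"):
        # Reject anything other than digit runs and piece symbols.
        if not re.fullmatch(r"(?:\d+|[WB])*", row):
            raise ValueError(f"unexpected symbol in row {row!r}")
        cells: List[str] = []
        # A token is either a run of digits (empty cells) or one piece symbol.
        for token in re.findall(r"\d+|[WB]", row):
            if token.isdigit():
                cells.extend("." * int(token))
            else:
                cells.append(token)
        board.append(cells)
    return board


def format_move(row: int, col: int) -> str:
    """Format a zero-indexed move in the 'm <row> <col>' reply format."""
    return f"m {row} {col}"
```

Digit runs are matched as whole tokens so that a multi\-digit count such as `10` is read as ten empty cells, not as a one followed by a zero.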

We ran a round\-robin tournament in which 27 models competed, with each pair playing 4 games \(alternating who moves first\), yielding 1404 games in total \(see [Appendix B](https://arxiv.org/html/2605.06840#A2) for the list of all models\)\. Participating models spanned both proprietary models \(e\.g\., GPT\-5, Claude Opus 4\.1\) and open\-weight models \(e\.g\., DeepSeek\-R1, Qwen3\-235B\)\. Because proprietary models returned only summaries of their reasoning traces that omitted intermediate reasoning steps, all subsequent analyses were restricted to the 14 models whose reasoning traces were fully accessible\. This yielded 9696 reasoning traces across 1092 games\.

### 2\.2 Transcribing reasoning traces into search trees

Reasoning traces are unstructured natural language, making it difficult to directly measure planning\. To address this, we transcribed each trace into a formal search tree using an LLM judge \(GPT\-5\)\. For each turn, the judge was given the model’s full response \(the concatenation of its reasoning content and output\) and asked to extract every move explicitly deliberated in the reasoning trace \([Figure 1](https://arxiv.org/html/2605.06840#S2.F1)C\-D\)\. In the search tree, coordinates are coded in a zero\-indexed $(\text{row},\text{column})$ format\. Each depth\-1 node represents a candidate first\-ply move the model explicitly considered, and each depth\-2 node represents a reply the model considered the opponent might make, and so on\. \(Footnote 1: We use *depth* to denote distance from the current board state, which is the root of the search tree\. A depth\-1 node is the board state after one move by the model, a depth\-2 node is the board state after the opponent’s reply, and so on\. In game terminology, one *ply* is a single move by one player; a $d$\-th ply move corresponds to a move that leads to a depth\-$d$ state\.\) The judge produced search trees in a nested list format\. For example, the nested list `[[(2,4), [(1,3), (2,2)]], [(0,3)]]` encodes two first\-ply moves considered by the model: $(2,4)$ and $(0,3)$\. Under $(2,4)$, the model anticipates two opponent replies at $(1,3)$ and $(2,2)$\. The other depth\-1 node $(0,3)$ is a leaf, meaning the model considered it without further lookahead\. Only moves explicitly named in the trace were included; the judge was instructed not to infer or hallucinate moves\. This process was applied to all reasoning traces, yielding a structured search tree for each turn\.

We constructed a human\-annotated validation set of reasoning traces and used it to optimize the extraction prompt before applying the extractor to the full dataset \(see [Appendix C](https://arxiv.org/html/2605.06840#A3)\.2 for detailed extraction methods\)\.
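To make the nested list format concrete, the following sketch computes breadth, maximum depth, and size for a tree such as the example above. It rests on two assumptions that the text does not spell out: a node is either a bare tuple \(a leaf\) or a list `[move]` or `[move, children]` with `children` a list of nodes, and tree size counts every non\-root node. The function names are hypothetical.

```python
from typing import Any, List, Tuple

Move = Tuple[int, int]


def _split(node: Any) -> Tuple[Move, list]:
    """Return (move, children) for a node in either accepted form."""
    if isinstance(node, tuple):              # bare leaf, e.g. (1, 3)
        return node, []
    move = node[0]
    children = node[1] if len(node) > 1 else []
    return move, children


def node_depth(node: Any) -> int:
    """Depth of the deepest state reachable from this node (a leaf counts as 1)."""
    _, children = _split(node)
    return 1 + max((node_depth(ch) for ch in children), default=0)


def node_size(node: Any) -> int:
    """Number of nodes in the subtree rooted at this node."""
    _, children = _split(node)
    return 1 + sum(node_size(ch) for ch in children)


def tree_metrics(tree: List[Any]) -> dict:
    """Breadth = depth-1 candidates, depth = max ply, size = non-root nodes."""
    return {
        "breadth": len(tree),
        "depth": max((node_depth(n) for n in tree), default=0),
        "size": sum(node_size(n) for n in tree),
    }
```

For the example list in the text, this gives a breadth of 2 \(two first\-ply candidates\), a depth of 2 \(one opponent reply level\), and a size of 4\.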

![Refer to caption](https://arxiv.org/html/2605.06840v1/x2.png)

Figure 2: Planning effort and game performance across models\. \(A\) Winning rate as a function of search tree size\. \(B\) Search breadth \(number of root candidate moves considered\) as a function of depth \(max ply, i\.e\., the maximum number of alternating moves simulated ahead\) across models\. \(C\) Winning rate as a function of breadth\-depth ratio\. Dashed lines connect models in a model family\. Asterisks denote significance levels \(\* $p<0.05$, \*\* $p<0.01$\)\.

## 3 Quantifying search trees extracted from reasoning traces

### 3\.1 Search effort predicts winning rate

We first asked whether the amount of search a model performed was predictive of its game performance\. For each model, we computed its average tree size across all turns and its overall winning rate in the tournament\. Across models, we found a positive relationship between search effort and winning rate \([Figure 2](https://arxiv.org/html/2605.06840#S2.F2)A\), suggesting that models that search more tend to play better\. This relationship held not only across all models but also within model families: within the same model family \(e\.g\., DeepSeek, Qwen, Kimi\), models that searched more consistently achieved higher winning rates\.

A particularly informative case is GPT\-OSS\-120B, where the same model was run at two reasoning effort levels: medium and high\. The high setting allocated more tokens for reasoning, resulting in larger search trees and a higher winning rate \([Figure 2](https://arxiv.org/html/2605.06840#S2.F2)A\)\. Because model architecture, weights, and training were identical across conditions, the only difference was the amount of inference\-time deliberation\. This provides causal evidence that search effort drives the performance gain\.

### 3\.2 LLMs search shallower than humans

Having established that the amount of search predicts performance, we next examined which aspect of search drove this gain\. We considered two dimensions of search effort: depth \(the maximum number of steps the model looks ahead\) and breadth \(the number of candidate moves considered at the first ply\)\. These two dimensions characterize different search strategies: greater depth reflects a tendency to anticipate consequences further into the future \(as in depth\-first search\), whereas greater breadth reflects a tendency to evaluate a wider range of alternatives \(as in breadth\-first search\)\. We found that depth and breadth were positively correlated across models \([Figure 2](https://arxiv.org/html/2605.06840#S2.F2)B\)\. Strikingly, the average maximum depth across all models ranged between 1\.00 and 3\.48 plies \($1.78\pm 0.71$; see [Appendix D](https://arxiv.org/html/2605.06840#A4)\.1 for depth histograms\)\. This is substantially shallower than what previous work found for human planners, where search depth in four\-in\-a\-row was inferred to be between 4 and 6, with expert players exhibiting deeper search\[[32](https://arxiv.org/html/2605.06840#bib.bib12)\]\.

To quantify the relative contributions of depth and breadth in predicting performance, we fit a linear regression predicting winning rate from both measures\. We found that depth was not a significant predictor controlling for breadth \(depth: $\beta=-0.015$, $p=0.82$; breadth: $\beta=0.075$, $p=0.0025$\)\. Consistent with this, models that explored more root candidates relative to their search depth, as reflected in a higher breadth\-to\-depth ratio, tended to achieve higher winning rates \([Figure 2](https://arxiv.org/html/2605.06840#S2.F2)C\)\. Together, these results suggest that LLMs’ search is shallower than humans’, and the performance advantage of higher\-effort models is driven primarily by broader candidate consideration instead of deeper search\.
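A two\-predictor regression of this kind can be sketched with the closed\-form least\-squares solution. The function name `ols_two_predictors` is hypothetical, and any numbers passed to it for illustration are synthetic, not the paper's data. The sketch returns only the coefficients; the $p$\-values reported above would need standard errors and a $t$\-distribution, which are omitted here.

```python
from statistics import fmean
from typing import Sequence, Tuple


def ols_two_predictors(
    x1: Sequence[float], x2: Sequence[float], y: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (intercept, beta1, beta2) for y ~ 1 + x1 + x2 by least squares."""
    m1, m2, my = fmean(x1), fmean(x2), fmean(y)
    # Centered sums of squares and cross-products.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (v - my) for a, v in zip(x1, y))
    s2y = sum((b - m2) * (v - my) for b, v in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    # Solve the 2x2 normal equations for the centered predictors.
    beta1 = (s22 * s1y - s12 * s2y) / det
    beta2 = (s11 * s2y - s12 * s1y) / det
    intercept = my - beta1 * m1 - beta2 * m2
    return intercept, beta1, beta2
```

Fitting both predictors jointly is what allows the depth coefficient to be read as the effect of depth with breadth held fixed, which matters here because the two measures are positively correlated across models.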

## 4 LLMs do not act on deeper search

### 4\.1 Predicting moves from search trees with computational modeling

To examine how LLMs integrate search tree information into their move decisions, we fit computational cognitive models adapted from the heuristic search model of Van Opheusden et al\.\[[32](https://arxiv.org/html/2605.06840#bib.bib12)\]\. In their model, candidate moves are evaluated by propagating a heuristic value up a search tree via minimax backup, and the move with the highest backed\-up value is selected\. Because human search trees are unobserved, their model was used to infer the underlying search trees from the board state, conditioned on the player’s chosen move\. In our setting, LLM reasoning traces give us direct access to their search trees, so we instead predict move decisions directly from the extracted search trees\.

Our cognitive models take an extracted search tree as input and output a probability distribution over candidate first\-ply moves\. All cognitive models share two core components: a heuristic function that assigns a scalar value to any board state, and a value backup procedure that propagates those values up the tree to score each candidate first\-ply move\. In our examination, we held the heuristic function fixed across cognitive models and varied only the value backup rule, allowing us to isolate which backup strategy best captures how LLMs integrate information across the search tree\.

##### Heuristic function

Following Van Opheusden et al\.\[[32](https://arxiv.org/html/2605.06840#bib.bib12)\], we used a heuristic function $h(s)$ defined as a linear combination of spatial pattern features on the board state $s$:

$$h(s)=w_{\text{centre}}\bigl(\phi_{\text{centre}}(s,\text{self})-\phi_{\text{centre}}(s,\text{opp})\bigr)+\sum_{i=1}^{4}w_{i}\bigl(C\,\phi_{i}(s,\text{self})-\phi_{i}(s,\text{opp})\bigr), \tag{1}$$

where the centre feature $\phi_{\text{centre}}$ sums the inverse Euclidean distance to the board centre over all of a player’s pieces, giving higher value to pieces closer to the centre\. The remaining four features $\phi_{i}$ count pattern occurrences across all rows, columns, and diagonals\. These four patterns are connected two\-in\-a\-row, unconnected two\-in\-a\-row, three\-in\-a\-row, and four\-in\-a\-row \([Figure 3](https://arxiv.org/html/2605.06840#S4.F3)A\)\. A pattern fires only in windows unobstructed by the opponent, reflecting viable threats that can still become four\-in\-a\-row\. We allowed the weights of features belonging to the active player \(but not the centre feature\) to be scaled by a factor $C$, which captures the relative weight placed on offensive versus defensive considerations\. Specifically, $C>1$ reflects an offensive bias and $C<1$ a defensive bias\.
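A minimal sketch of Equation \(1\) follows, assuming the four pattern counts have already been computed for each side; the window\-based pattern detection itself is not reproduced here. The function names, argument names, and the choice of the geometric centre of a $4\times 9$ grid as the board centre are assumptions of this sketch.

```python
import math
from typing import List, Sequence, Tuple

ROWS, COLS = 4, 9
# Geometric centre of the grid, (1.5, 4.0). No cell lies exactly on it,
# so the inverse distance below is always finite.
CENTRE = ((ROWS - 1) / 2, (COLS - 1) / 2)


def centre_feature(pieces: List[Tuple[int, int]]) -> float:
    """Sum of inverse Euclidean distances from each piece to the board centre."""
    return sum(1.0 / math.dist(p, CENTRE) for p in pieces)


def heuristic(
    phi_self: Sequence[float],   # four pattern counts for the active player
    phi_opp: Sequence[float],    # four pattern counts for the opponent
    centre_self: float,
    centre_opp: float,
    w: Sequence[float],          # four pattern weights w_1..w_4
    w_centre: float,
    C: float,                    # offensive-bias scale on the self features
) -> float:
    """Equation (1): weighted feature difference, with C scaling self patterns."""
    value = w_centre * (centre_self - centre_opp)
    for w_i, f_self, f_opp in zip(w, phi_self, phi_opp):
        value += w_i * (C * f_self - f_opp)
    return value
```

With $C=1$ the two sides are weighted symmetrically; raising $C$ increases the value of the player's own patterns while leaving the penalty for the opponent's patterns unchanged.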

![Refer to caption](https://arxiv.org/html/2605.06840v1/x3.png)

Figure 3: Predicting moves from extracted search trees with cognitive modeling\. \(A\) Features used in the heuristic value function\. Features include connected two\-in\-a\-row \(blue\), unconnected two\-in\-a\-row \(orange\), three\-in\-a\-row \(purple\), a four\-in\-a\-row feature \(not shown in the figure\), and a central tendency feature\. Features with identical colors are constrained to have identical weights\. \(B\) Schematics of computational models\. The top square represents the current board state\. Each downstream circle represents a state resulting from a model’s own simulated move, and each downstream square represents a state resulting from a simulated opponent’s move\. Red edges mark states evaluated by the heuristic function\. Red arrows illustrate value backups\. In the discount model, deeper nodes are down\-weighted \(grayed\)\. \(C\) Negative log\-likelihood per sample\. \(D\) Move prediction accuracy\. \(E\) Depth harm, defined as the NLL gap between the full\-tree model and the myopic model, as a function of search depth\. \(F\) Candidate gain, defined as the NLL gap between the no\-tree model and the myopic model, as a function of search breadth\. \(G\) Winning rate as a function of fitted offensive bias\. Dashed lines connect models in a model family\. Asterisks denote significance levels \(\* $p<0.05$, \*\* $p<0.01$\)\.

##### Value backup rules

Given a search tree, the model computes a value $V(s_{i})$ for each depth\-1 state $s_{i}$\. We considered four backup rules corresponding to four model variants \([Figure 3](https://arxiv.org/html/2605.06840#S4.F3)B\)\. In the *full\-tree* model, the heuristic $h(s)$ is evaluated at all leaf nodes and recursively propagated upward using the minimax rule, with the active player maximizing and the opponent minimizing at alternating plies:

$$V(s)=\begin{cases}h(s)&\text{if }s\text{ is a leaf}\\ \max_{s^{\prime}}V(s^{\prime})&\text{if the model is to move at }s\\ \min_{s^{\prime}}V(s^{\prime})&\text{if the opponent is to move at }s.\end{cases} \tag{2}$$

Here $s^{\prime}$ denotes the child states of $s$ in the extracted search tree\. For the *myopic* model, the heuristic is applied directly to the depth\-1 states resulting from each first\-ply move, $V(s_{i})=h(s_{i})$, ignoring all deeper nodes in the tree\. The *discount* model interpolates between these two extremes, where the backed\-up value is a weighted sum of the local heuristic and the minimax value of its children:

$$V(s)=\begin{cases}h(s)&\text{if }s\text{ is a leaf}\\ (1-\gamma)\,h(s)+\gamma\max_{s^{\prime}}V(s^{\prime})&\text{if the model is to move at }s\\ (1-\gamma)\,h(s)+\gamma\min_{s^{\prime}}V(s^{\prime})&\text{if the opponent is to move at }s.\end{cases} \tag{3}$$

The free parameter $\gamma\in[0,1]$ controls the influence of deeper search\. Setting $\gamma=1$ reduces the discount model to the full\-tree model and setting $\gamma=0$ reduces it to the myopic model\. Finally, the *no\-tree* model ignores the extracted search tree entirely, scoring all legal first\-ply moves using the heuristic function alone\. This serves as a baseline to test whether LLMs’ search trees carry predictive information beyond what the heuristic function captures\.
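The three tree\-based backup rules can be sketched as one recursive function, since Equation \(3\) contains the other two as special cases\. The `Node` container and function names below are hypothetical, and heuristic values are assumed to be precomputed and stored on each node\.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One state in an extracted search tree (hypothetical container)."""
    h: float                                   # heuristic value h(s)
    children: List["Node"] = field(default_factory=list)


def discount_value(node: Node, gamma: float, depth: int = 1) -> float:
    """Equation (3): blend the local heuristic with the minimax value of children.

    Odd depths follow a move by the model, so the opponent moves next (min);
    even depths follow an opponent move, so the model moves next (max).
    """
    if not node.children:
        return node.h
    values = [discount_value(ch, gamma, depth + 1) for ch in node.children]
    backed_up = min(values) if depth % 2 == 1 else max(values)
    return (1 - gamma) * node.h + gamma * backed_up


def full_tree_value(node: Node) -> float:
    """Equation (2): pure minimax backup, i.e. gamma = 1."""
    return discount_value(node, gamma=1.0)


def myopic_value(node: Node) -> float:
    """Myopic rule: score a depth-1 state by its own heuristic only."""
    return node.h
```

Each function is meant to be called on a depth\-1 node, that is, on the state reached after one candidate first\-ply move\. Calling `discount_value` with `gamma=0.0` returns the same number as `myopic_value`, which mirrors the reduction stated in the text\.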

##### Model fitting

Move choice is modeled as a softmax over the backed\-up values of candidate moves:

$$P(s_{i}\mid s)=\frac{\exp\bigl(V(s_{i})\bigr)}{\sum_{s_{j}\in\mathcal{A}}\exp\bigl(V(s_{j})\bigr)}, \tag{4}$$

where $\mathcal{A}$ is the set of depth\-1 states resulting from each candidate first\-ply move\. It includes the depth\-1 states in the extracted tree for tree\-based models, and all legal depth\-1 states for the no\-tree model\. We did not include a softmax temperature parameter, because temperature is not separately identifiable from the scale of feature weights and is absorbed into the weights\. Parameters were fit by minimizing the negative log\-likelihood over all observed moves using L\-BFGS\-B\[[2](https://arxiv.org/html/2605.06840#bib.bib38)\]with 20 random restarts\. We excluded Llama\-3\.3\-70B from the following analyses because it produced fewer than 20 parseable reasoning trees, leaving insufficient samples for reliable parameter estimation \(see [Appendix C](https://arxiv.org/html/2605.06840#A3)\.3 for detailed model fitting methods\)\.
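The choice rule and the fitting objective can be sketched as follows\. This shows only the softmax of Equation \(4\) and a per\-sample negative log\-likelihood; the optimizer loop is omitted, and the function names and the `(values, chosen_index)` data layout are assumptions of this sketch\.

```python
import math
from typing import List, Sequence, Tuple


def softmax(values: Sequence[float]) -> List[float]:
    """Equation (4), with the maximum subtracted for numerical stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(turns: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Mean NLL over turns; each turn is (candidate values, index of chosen move)."""
    total = 0.0
    for values, chosen in turns:
        total -= math.log(softmax(values)[chosen])
    return total / len(turns)
```

Because every $V(s_{i})$ is linear in the feature weights, multiplying all weights by a constant has the same effect as dividing by a softmax temperature, which is why a separate temperature would not be identifiable\. In a full fit, an objective like `negative_log_likelihood` would be recomputed from the weights at each step and passed to an optimizer such as `scipy.optimize.minimize` with `method="L-BFGS-B"`\.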

### 4\.2 LLM moves are best explained by a myopic model

We compared four cognitive model variants across all LLMs\. Tree\-based models consistently outperformed the no\-tree baseline, confirming that LLM search trees carry predictive information about move decisions\. Among tree\-based models, the myopic model achieved lower negative log\-likelihood and higher prediction accuracy than the full\-tree model across all LLMs \([Figure 3](https://arxiv.org/html/2605.06840#S4.F3)C\-D; see [Appendix D](https://arxiv.org/html/2605.06840#A4)\.2 for model recovery analysis\)\. When the two models disagreed, the myopic model was uniquely correct more than twice as often \(1236 vs\. 512 turns\)\. This is a striking reversal of the pattern reported in Van Opheusden et al\.\[[32](https://arxiv.org/html/2605.06840#bib.bib12)\], where the tree search model consistently outperforms the myopic model in predicting human moves\. The extent to which the full\-tree model underperformed the myopic model \(“depth harm”, defined as the NLL increase from using the full tree over the myopic model\) was positive for every model and grew with search depth \([Figure 3](https://arxiv.org/html/2605.06840#S4.F3)E\), confirming that backing up values from deeper nodes consistently impaired rather than improved prediction\. The discount model further confirmed this finding: the discount factor $\gamma$ converged to near zero for every model, collapsing to the myopic model and indicating that deeper nodes received negligible weight in the value backup\. Together, these results suggest that LLMs do not perform value backups over the search trees in the way humans do\. Although LLMs expand deeper nodes in their reasoning traces, value information in deeper nodes is not propagated upward to influence first\-ply move decisions\.

When does access to the search tree improve move prediction? The gain from having a search tree \(“candidate gain”, defined as the NLL decrease from the no\-tree model to the myopic model\) decreased strongly with search breadth \([Figure 3](https://arxiv.org/html/2605.06840#S4.F3)F\)\. This reveals where the key decision happens in different models\. For narrow\-breadth models that consider only a few candidates, knowing which moves make it into the candidate set is highly predictive of the final move, suggesting that the candidate proposal itself is the decisive step\. For wide\-breadth models that consider many candidates, the candidate set barely filters the action space and adds little information beyond the heuristic value function alone\. In short, the fewer moves a model considers, the more consequential its choice of what to even consider becomes\. This is consistent with our earlier finding that search breadth, but not depth, predicts winning rate\. What matters for LLM performance is not how deeply a model searches, but how broadly it covers the candidate space\.

Next, we examined whether the parameters of the myopic model predicted game performance across LLMs\. To control for the overall scale of the parameters, we normalized all feature weights relative to the four\-in\-a\-row feature weight\. Under this normalization, only the offensive bias parameter $C$ significantly predicted winning rate \([Figure 3](https://arxiv.org/html/2605.06840#S4.F3)G; see [Appendix D](https://arxiv.org/html/2605.06840#A4)\.3 for all feature weights\)\. This suggests that LLMs that weighted their own threats more heavily relative to the opponent’s achieved higher performance\.

## 5 Reasoning traces causally drive move selection, but not through deeper search

![Refer to caption](https://arxiv.org/html/2605.06840v1/x4.png)

Figure 4: Causal intervention on reasoning traces\. \(A\) An LLM judge \(Claude Opus 4\.7\) labels each paragraph of the reasoning trace as preamble, branch, final decision, or meta\. Branch paragraphs are associated with a specific candidate move\. The judge additionally annotates all moves mentioned within each paragraph, together with their search depths\. We then prune the trace according to these labels and feed the pruned trace back to the LLM player, which generates a new move\. \(B\) Move change rate after five pruning conditions: removing only the final decision, removing both the final decision and an entire branch, and progressively adding back branch paragraphs by depth\. Orange: chosen branch; gray: largest unchosen branch\.

We have shown that LLM reasoning traces structurally resemble tree search and that the relationship between extracted search tree and move choice is best explained by a model that selects the highest\-valued first\-ply move\. However, correlation between reasoning structure and move quality does not establish that the reasoning trace causally drives the final decision\. The model might reach the same decision regardless of what appears in its trace\. To directly test the causal role of reasoning, we designed a causal intervention study where we surgically removed segments of the reasoning trace, reran inference on the pruned trace, and measured how often the model’s final move changed\.

We conducted this analysis using Qwen3\-Next\-80B\-Thinking\. We first labeled each paragraph of the reasoning trace using an LLM judge \(Claude Opus 4\.7\), assigning one of four categories: *preamble* \(board parsing and threat scanning before any branch exploration\), *branch* \(analysis of a specific candidate move and its consequences, corresponding to a branch of the search tree\), *final decision* \(confirmation of the selected move\), or *meta* \(general meta\-commentary not tied to a specific move, e\.g\., “Let me think…”\) \([Figure 4](https://arxiv.org/html/2605.06840#S5.F4)A; see [Appendix C](https://arxiv.org/html/2605.06840#A3)\.4 for detailed intervention methods\)\. The judge additionally annotated all moves evaluated within each paragraph, together with their tree depths\. This procedure allowed us to determine whether a paragraph evaluated a future move, which branch of the search tree it belonged to, and which specific moves were being evaluated within that paragraph\.

We first removed the final\-decision paragraphs \([Figure 4](https://arxiv.org/html/2605.06840#S5.F4)B\)\. This intervention changed the move in only 2\.1% of cases, indicating that the concluding paragraphs did not critically determine the choice\. We then additionally removed the paragraphs corresponding to a whole branch\. Removing the branch corresponding to the chosen move changed the move 32\.0% of the time, whereas removing the largest unchosen branch changed it only 1\.2% of the time\. These results demonstrate that reasoning traces causally drive move selection\.

We then identified which content within the chosen branch was causally responsible for the decision\. Starting from full branch removal \(32\.0% change rate\), we progressively added back paragraphs by depth class\. Adding back only depth\-1 paragraphs—those that mention the candidate move and its immediate consequences—reduced the change rate to 4\.1%, close to the control rate of 2\.1% observed when removing the unchosen branch\. Critically, adding deeper paragraphs on top reduced the rate only marginally further to 3\.7%, indistinguishable from the depth\-1 condition and from the corresponding control \(1\.7%\), suggesting that deeper lookahead content has negligible causal effect\. In other words, even though the model writes out deep lookahead in its reasoning, removing that content does not change its decision; only the shallowest evaluation of each candidate move actually drives the choice\.
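The pruning conditions described above can be sketched as follows\. This is a minimal illustration rather than the authors' implementation: the Paragraph record, its field names, and the prune and change\_rate helpers are hypothetical, the four labels are the simplified categories from Figure 4, and the rule used to add paragraphs back by depth is an assumption\. Re\-running the LLM player on the pruned trace is not shown\.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paragraph:
    text: str
    label: str                         # "preamble", "branch", "final_decision", or "meta"
    branch_root: Optional[str] = None  # candidate move the branch analyzes, e.g. "1,4"
    max_depth: int = 0                 # deepest search depth annotated in this paragraph


def prune(trace: List[Paragraph], branch: Optional[str] = None,
          add_back_depth: int = 0) -> List[Paragraph]:
    """Remove final-decision paragraphs and, optionally, one branch.

    With add_back_depth=0 the whole branch is removed; with add_back_depth=d,
    branch paragraphs whose max_depth <= d are kept (added back).
    """
    kept = []
    for p in trace:
        if p.label == "final_decision":
            continue
        in_branch = p.label == "branch" and p.branch_root == branch
        if branch is not None and in_branch:
            if add_back_depth == 0 or p.max_depth > add_back_depth:
                continue
        kept.append(p)
    return kept


def change_rate(original_moves: List[str], rerun_moves: List[str]) -> float:
    """Fraction of positions where the move after pruning differs from the original."""
    changed = sum(o != n for o, n in zip(original_moves, rerun_moves))
    return changed / len(original_moves)


# Toy trace (hypothetical content) covering three of the pruning conditions.
trace = [
    Paragraph("Board scan...", "preamble"),
    Paragraph("If I play (1,4)...", "branch", "1,4", max_depth=1),
    Paragraph("Then they block, then I...", "branch", "1,4", max_depth=3),
    Paragraph("What about (2,2)?", "branch", "2,2", max_depth=1),
    Paragraph("I'll play (1,4).", "final_decision"),
]
no_final = prune(trace)                                      # final decision removed
no_branch = prune(trace, branch="1,4")                       # plus the chosen branch
depth1_back = prune(trace, branch="1,4", add_back_depth=1)   # depth-1 paragraphs restored
```

Comparing the change rate of the depth\-1 condition with that of the full add\-back condition is what isolates the contribution of deeper paragraphs\.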

## 6 Discussion

In this study, we developed a method to extract and quantify search trees from LLM reasoning traces during four\-in\-a\-row gameplay and examined how those extracted trees relate to move choices\. We found that LLMs conducted shallower search than humans, and their behavior was best explained by a myopic strategy that makes choices based on immediate consequences of candidate moves while ignoring deep search entirely\. These findings reveal a fundamental difference between LLM and human planning: while human expertise is driven by deeper lookahead, LLMs do not use deeper search to guide their decisions\.

Our findings speak to the ongoing debate about whether LLMs can plan \[[31](https://arxiv.org/html/2605.06840#bib.bib14),[11](https://arxiv.org/html/2605.06840#bib.bib13)\], but suggest that the central question should be reframed\. Prior work has largely evaluated planning through behavioral outcomes, focusing on whether models reach correct solutions on specific tasks\. This leads to contested conclusions that depend heavily on task design and prompting strategy \[[31](https://arxiv.org/html/2605.06840#bib.bib14),[26](https://arxiv.org/html/2605.06840#bib.bib22),[29](https://arxiv.org/html/2605.06840#bib.bib23)\]\. Rather than asking only whether LLMs *succeed* at planning, here we ask what kind of planning *algorithm* their decisions reflect\. Answering this question requires looking beyond accuracy to the internal structure of reasoning traces\. This approach yields a more fine\-grained characterization of LLM planning: LLMs generate the surface structure of tree search while their decisions are driven by a myopic mechanism\. This dissociation would be invisible to behavioral benchmarks alone, highlighting the need for mechanistic analyses of reasoning processes to understand planning in LLMs\.

More broadly, our results have implications for interpretability and scalable oversight\. The observed dissociation between reasoning trace and move decision cautions against treating reasoning traces as transparent records of model deliberation\. Prior work has similarly shown that model\-generated explanations can be plausible without being faithful to the model’s actual decision process, and that LLMs do not always reliably use their own intermediate reasoning steps when producing final answers \[[7](https://arxiv.org/html/2605.06840#bib.bib32),[30](https://arxiv.org/html/2605.06840#bib.bib31),[14](https://arxiv.org/html/2605.06840#bib.bib26),[22](https://arxiv.org/html/2605.06840#bib.bib25)\]\. If a model can generate structured search without relying on the search to make decisions, then oversight methods based only on reasoning traces may fail to detect the mechanisms that actually drive behavior\. This concern bears directly on the broader challenge of scalable oversight \[[1](https://arxiv.org/html/2605.06840#bib.bib33)\]\.

Why do LLMs rely on immediate consequences even when they generate deeper search? Several non\-exclusive explanations are possible\. First, the bottleneck may be algorithmic\. Effective planning requires representing the tree structure and propagating values via Bellman maximization\. Transformer attention may fail to implement these backup operations, especially when the relevant information is distributed across a long context \[[18](https://arxiv.org/html/2605.06840#bib.bib24),[22](https://arxiv.org/html/2605.06840#bib.bib25),[14](https://arxiv.org/html/2605.06840#bib.bib26)\]\. Second, models may learn heuristics other than minimax, such as evaluating salient tactical threats rather than backing up leaf values\. Third, myopia may be adaptive under the model’s own uncertainty\. If predicted game states become increasingly unreliable at greater depths, using them for decisions could hurt more than help \[[4](https://arxiv.org/html/2605.06840#bib.bib28),[34](https://arxiv.org/html/2605.06840#bib.bib27),[15](https://arxiv.org/html/2605.06840#bib.bib29),[10](https://arxiv.org/html/2605.06840#bib.bib30)\]\. Under this view, shallow reliance is not a failure but a learned policy for treating distant futures as untrustworthy\.

Our findings offer targeted guidance for improving LLM planning\. Standard approaches that increase test\-time compute, lengthen reasoning traces, or encourage deeper search implicitly assume that additional generated content will influence the model’s final decision\. Our results instead suggest that the bottleneck is not trace length but the model’s ability to act on what it generates\. Beyond outcome supervision, additional training signals that explicitly reward the use of deep lookahead may therefore be necessary to close this gap\.

Our study has several limitations\. First, all analyses were conducted on a single game domain; whether our conclusions generalize to tasks with different structures and demands remains an open question\. Second, our cognitive models rely on a specific parametric heuristic function\. Although this heuristic has been shown to capture human behavior well \[[32](https://arxiv.org/html/2605.06840#bib.bib12)\], alternative feature sets or value architectures may better characterize the computations underlying LLM decisions\.

## References

- \[1\] S\. R\. Bowman, J\. Hyun, E\. Perez, E\. Chen, C\. Pettit, S\. Heiner, K\. Lukošiūtė, A\. Askell, A\. Jones, A\. Chen, et al\. \(2022\) Measuring progress on scalable oversight for large language models\. arXiv preprint arXiv:2211\.03540\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p3.1)\.
- \[2\] R\. H\. Byrd, P\. Lu, J\. Nocedal, and C\. Zhu \(1995\) A limited memory algorithm for bound constrained optimization\. SIAM Journal on Scientific Computing 16\(5\), pp\. 1190–1208\. Cited by: [§4\.1](https://arxiv.org/html/2605.06840#S4.SS1.SSS0.Px3.p1.1)\.
- \[3\] F\. Callaway, B\. Van Opheusden, S\. Gul, P\. Das, P\. M\. Krueger, T\. L\. Griffiths, and F\. Lieder \(2022\) Rational use of cognitive resources in human planning\. Nature human behaviour 6\(8\), pp\. 1112–1125\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p2.1)\.
- \[4\] S\. Chen, K\. T\. Jensen, and M\. G\. Mattar \(2025\) Rational decisions in multi\-step environments with few rollouts\. PsyArXiv\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p4.1)\.
- \[5\] A\. El\-Kishky, A\. Wei, A\. Saraiva, B\. Minaiev, D\. Selsam, D\. Dohan, F\. Song, H\. Lightman, I\. Clavera, J\. Pachocki, et al\. \(2025\) Competitive programming with large reasoning models\. arXiv preprint arXiv:2502\.06807\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p3.1)\.
- \[6\] D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, et al\. \(2025\) Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. arXiv preprint arXiv:2501\.12948\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p1.1), [§1](https://arxiv.org/html/2605.06840#S1.p3.1)\.
- \[7\] A\. Jacovi and Y\. Goldberg \(2020\) Towards faithfully interpretable nlp systems: how should we define and evaluate faithfulness?\. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp\. 4198–4205\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p3.1)\.
- \[8\] K\. T\. Jensen, G\. Hennequin, and M\. G\. Mattar \(2024\) A recurrent network model of planning explains hippocampal replay and human behavior\. Nature neuroscience 27\(7\), pp\. 1340–1348\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p2.1)\.
- \[9\] G\. Jiang, Y\. Liu, Z\. Li, W\. Bi, F\. Zhang, L\. Song, Y\. Wei, and D\. Lian \(2025\) What makes a good reasoning chain? uncovering structural patterns in long chain\-of\-thought reasoning\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 6501–6525\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p4.1)\.
- \[10\] N\. Jiang, A\. Kulesza, S\. Singh, and R\. Lewis \(2015\) The dependence of effective planning horizon on model accuracy\. In Proceedings of the 2015 international conference on autonomous agents and multiagent systems, pp\. 1181–1189\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p4.1)\.
- \[11\] S\. Kambhampati, K\. Valmeekam, L\. Guan, M\. Verma, K\. Stechly, S\. Bhambri, L\. P\. Saldyt, and A\. B\. Murthy \(2024\) Position: llms can’t plan, but can help planning in llm\-modulo frameworks\. In Forty\-first International Conference on Machine Learning\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p3.1), [§6](https://arxiv.org/html/2605.06840#S6.p2.1)\.
- \[12\] O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Moazam, et al\. \(2023\) Dspy: compiling declarative language model calls into self\-improving pipelines\. arXiv preprint arXiv:2310\.03714\. Cited by: [§C\.2](https://arxiv.org/html/2605.06840#A3.SS2.p2.1)\.
- \[13\] I\. Kuperwajs, E\. M\. Russek, M\. G\. Mattar, W\. J\. Ma, and T\. L\. Griffiths \(2025\) Looking deeper into the algorithms underlying human planning\. Trends in Cognitive Sciences\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p2.1)\.
- \[14\] T\. Lanham, A\. Chen, A\. Radhakrishnan, B\. Steiner, C\. Denison, D\. Hernandez, D\. Li, E\. Durmus, E\. Hubinger, J\. Kernion, et al\. \(2023\) Measuring faithfulness in chain\-of\-thought reasoning\. arXiv preprint arXiv:2307\.13702\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p3.1), [§6](https://arxiv.org/html/2605.06840#S6.p4.1)\.
- \[15\] J\. Lei, J\. Olieslagers, N\. Arfaei, D\. Xinlei Lin, and W\. J\. Ma \(2025\) Human planning in stochastic environments\. PsyArXiv\. https://osf\.io/bh56p\_v1\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p4.1)\.
- \[16\] Z\. Li, D\. Zhang, M\. Zhang, J\. Zhang, Z\. Liu, Y\. Yao, H\. Xu, J\. Zheng, P\. Wang, X\. Chen, et al\. \(2025\) From system 1 to system 2: a survey of reasoning large language models\. arXiv preprint arXiv:2502\.17419\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p1.1)\.
- \[17\] J\. Liu, S\. He, J\. Wu, X\. Wang, Y\. Chen, Z\. Kuang, S\. Bao, and Y\. Yao \(2025\) ChessArena: a chess testbed for evaluating strategic reasoning capabilities of large language models\. arXiv preprint arXiv:2509\.24239\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p5.1)\.
- \[18\] N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\) Lost in the middle: how language models use long contexts\. Transactions of the association for computational linguistics 12, pp\. 157–173\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p4.1)\.
- \[19\] M\. G\. Mattar and M\. Lengyel \(2022\) Planning in the brain\. Neuron 110\(6\), pp\. 914–934\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p2.1)\.
- \[20\] S\. Mukherjee, A\. Chinta, T\. Kim, T\. A\. Sharma, and D\. Hakkani\-Tür \(2025\) Premise\-augmented reasoning chains improve error identification in math reasoning with llms\. arXiv preprint arXiv:2502\.02362\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p4.1)\.
- \[21\] OpenAI \(2024\) OpenAI o1 system card\. Note: [https://openai\.com](https://openai.com/) Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p1.1), [§1](https://arxiv.org/html/2605.06840#S1.p3.1)\.
- \[22\] D\. Paul, R\. West, A\. Bosselut, and B\. Faltings \(2024\) Making reasoning matter: measuring and improving faithfulness of chain\-of\-thought reasoning\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 15012–15032\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p3.1), [§6](https://arxiv.org/html/2605.06840#S6.p4.1)\.
- \[23\] L\. S\. Pleiss, M\. Schiffer, and R\. K\. von Weizsäcker \(2026\) Trapped in the past? disentangling fluid and crystallized intelligence of large language models using chess\. arXiv preprint arXiv:2601\.16823\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p5.1)\.
- \[24\] J\. Schrittwieser, I\. Antonoglou, T\. Hubert, K\. Simonyan, L\. Sifre, S\. Schmitt, A\. Guez, E\. Lockhart, D\. Hassabis, T\. Graepel, et al\. \(2020\) Mastering atari, go, chess and shogi by planning with a learned model\. Nature 588\(7839\), pp\. 604–609\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p2.1)\.
- \[25\] J\. Schultz, J\. Adamek, M\. Jusup, M\. Lanctot, M\. Kaisers, S\. Perrin, D\. Hennes, J\. Shar, C\. Lewis, A\. Ruoss, et al\. \(2024\) Mastering board games by external and internal planning with language models\. arXiv preprint arXiv:2412\.12119\. Cited by: [§2\.1](https://arxiv.org/html/2605.06840#S2.SS1.p2.1)\.
- \[26\] B\. Sel, R\. Jia, and M\. Jin \(2025\) LLMs can plan only if we tell them\. arXiv preprint arXiv:2501\.13545\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p2.1)\.
- \[27\] D\. Silver, A\. Huang, C\. J\. Maddison, A\. Guez, L\. Sifre, G\. Van Den Driessche, J\. Schrittwieser, I\. Antonoglou, V\. Panneershelvam, M\. Lanctot, et al\. \(2016\) Mastering the game of go with deep neural networks and tree search\. Nature 529\(7587\), pp\. 484–489\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p2.1)\.
- \[28\] D\. Silver, T\. Hubert, J\. Schrittwieser, I\. Antonoglou, M\. Lai, A\. Guez, M\. Lanctot, L\. Sifre, D\. Kumaran, T\. Graepel, et al\. \(2018\) A general reinforcement learning algorithm that masters chess, shogi, and go through self\-play\. Science 362\(6419\), pp\. 1140–1144\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p2.1)\.
- \[29\] T\. Silver, S\. Dan, K\. Srinivas, J\. B\. Tenenbaum, L\. Kaelbling, and M\. Katz \(2024\) Generalized planning in pddl domains with pretrained large language models\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 38, pp\. 20256–20264\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p2.1)\.
- \[30\] M\. Turpin, J\. Michael, E\. Perez, and S\. Bowman \(2023\) Language models don’t always say what they think: unfaithful explanations in chain\-of\-thought prompting\. Advances in Neural Information Processing Systems 36, pp\. 74952–74965\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p3.1)\.
- \[31\] K\. Valmeekam, K\. Stechly, A\. Gundawar, and S\. Kambhampati \(2025\) A systematic evaluation of the planning and scheduling abilities of the reasoning model o1\. Transactions on Machine Learning Research\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p3.1), [§6](https://arxiv.org/html/2605.06840#S6.p2.1)\.
- \[32\] B\. Van Opheusden, I\. Kuperwajs, G\. Galbiati, Z\. Bnaya, Y\. Li, and W\. J\. Ma \(2023\) Expertise increases planning depth in human gameplay\. Nature 618\(7967\), pp\. 1000–1005\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p2.1), [§1](https://arxiv.org/html/2605.06840#S1.p5.1), [§3\.2](https://arxiv.org/html/2605.06840#S3.SS2.p1.1), [§4\.1](https://arxiv.org/html/2605.06840#S4.SS1.SSS0.Px1.p1.2), [§4\.1](https://arxiv.org/html/2605.06840#S4.SS1.p1.1), [§4\.2](https://arxiv.org/html/2605.06840#S4.SS2.p1.1), [§6](https://arxiv.org/html/2605.06840#S6.p6.1)\.
- \[33\] J\. Wei, X\. Wang, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in Neural Information Processing Systems\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p1.1)\.
- \[34\] C\. Xiao, Y\. Wu, C\. Ma, D\. Schuurmans, and M\. Müller \(2019\) Learning to combat compounding\-error in model\-based reinforcement learning\. arXiv preprint arXiv:1912\.11206\. Cited by: [§6](https://arxiv.org/html/2605.06840#S6.p4.1)\.
- \[35\] Y\. Zhang, X\. Han, H\. Li, K\. Chen, and S\. Lin \(2025\) Complete chess games enable llm become a chess master\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\), pp\. 1–7\. Cited by: [§2\.1](https://arxiv.org/html/2605.06840#S2.SS1.p2.1)\.
- \[36\] H\. S\. Zheng, S\. Mishra, H\. Zhang, X\. Chen, M\. Chen, A\. Nova, L\. Hou, H\. Cheng, Q\. V\. Le, E\. H\. Chi, et al\. \(2024\) Natural plan: benchmarking llms on natural language planning\. arXiv preprint arXiv:2406\.04520\. Cited by: [§1](https://arxiv.org/html/2605.06840#S1.p3.1)\.

## Appendix A Code and data availability

## Appendix B List of LLMs used in the study

We ran a tournament with 27 models \(Table 1\)\.

Table 1: Models evaluated in this study and their APIs\. Note that although Qwen3\-Max does not have publicly released weights, its reasoning traces are fully accessible, and we include it in our analyses accordingly\.

## Appendix C Additional methods

### C\.1 Game play and trace collection

Models played four\-in\-a\-row on a 4×9 board against each other via an automated game loop\. At each turn the model received a fixed system prompt defining the game rules and move notation, followed by a user message containing the current board state in FEN notation\. Models with extended thinking enabled produced both a reasoning trace and a final move answer wrapped in <next\_move\> tags\. We retained both for downstream analysis\.

System Prompt: Four\-in\-a\-row Game

Let’s play a game of Four in a Row\. You are playing as \[White/Black\]\. You will be given the current game state and you will need to give the next move in a standard algebraic notation specific to this game\. Feel free to think about the move, only the final answer you provide in <next\_move\> </next\_move\> tags will be played\.

Game Rules: Four in a row is played on a four\-by\-nine grid by two players, who alternately place the marks W and B in one of the thirty\-six spaces in the grid\. A player wins when they get four pieces in a row horizontally, vertically or diagonally\. Player W plays first\.

The standard game state representation is in the following format: The game state will be represented in FEN notation, a compact algebraic representation inspired by chess’s Forsyth\-Edwards Notation\. Each row of the 4×9 board is encoded as a string where ‘W’ represents a White piece, ‘B’ represents a Black piece, and numbers indicate consecutive empty spaces\. Rows are separated by forward slashes \(‘/’\), reading from top to bottom\. An empty board is represented as 9/9/9/9, with each ‘9’ indicating that all nine columns in that row are empty\.

Standard Algebraic Notation \(SAN\) Explanation: Issue moves in the notation m <row\> <col\>, for example m 0 0 to place your mark in the top leftmost square and m 3 8 to place your mark in the bottom rightmost square\.

User Prompt: Four\-in\-a\-row Game

The current board state is:

FEN: \{fen string\}

Current player: \{White/Black\} \(\{W/B\}\)
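The board encoding and move notation described in the prompts can be decoded with a few lines of code\. The sketch below is illustrative only: the function names parse\_fen and apply\_move and the use of "\." for an empty cell are assumptions introduced here, not part of the original game loop\.

```python
from typing import List

ROWS, COLS = 4, 9


def parse_fen(fen: str) -> List[List[str]]:
    """Decode a FEN-style string into a 4x9 grid of 'W', 'B', or '.' (empty)."""
    rows = fen.split("/")
    if len(rows) != ROWS:
        raise ValueError(f"expected {ROWS} rows, got {len(rows)}")
    board = []
    for row in rows:
        cells: List[str] = []
        for ch in row:
            if ch in "WB":
                cells.append(ch)
            elif ch in "123456789":
                cells.extend("." * int(ch))  # a digit encodes a run of empty cells
            else:
                raise ValueError(f"unexpected character {ch!r}")
        if len(cells) != COLS:
            raise ValueError(f"row {row!r} does not cover {COLS} columns")
        board.append(cells)
    return board


def apply_move(board: List[List[str]], move: str, mark: str) -> None:
    """Apply a move written as 'm <row> <col>' (row 0 is the top row)."""
    tag, r, c = move.split()
    if tag != "m":
        raise ValueError(f"unrecognized move {move!r}")
    row, col = int(r), int(c)
    if board[row][col] != ".":
        raise ValueError("cell is already occupied")
    board[row][col] = mark


board = parse_fen("9/3W5/2BW5/9")
apply_move(board, "m 3 8", "B")  # bottom rightmost square
```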

### C\.2 Search tree extraction

Reasoning traces are free\-form natural language and cannot be reliably parsed by rule\-based methods\. We used GPT\-5 \(gpt\-5\-2025\-08\-07, medium reasoning effort\) as an extractor model: given a reasoning trace, it reconstructed the set of moves the model explicitly considered as a nested JSON array\. Each node in the array is a list whose first element is a coordinate string "r,c" and whose remaining elements are child nodes representing replies\.
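To make the nested\-array format concrete, the sketch below loads one hand\-written example and computes simple summary statistics\. The example tree is invented for illustration, the "trees" wrapper follows the extraction prompt, and the definitions of breadth \(number of depth\-1 nodes\) and depth \(longest root\-to\-leaf path\) are assumptions that may differ from the metrics used in the main text\.

```python
import json
from typing import Any, List

Node = List[Any]  # ["r,c", child Node, child Node, ...]


def depth(node: Node) -> int:
    """Length of the longest path from this node down to a leaf."""
    return 1 + max((depth(child) for child in node[1:]), default=0)


def count_nodes(node: Node) -> int:
    """Total number of moves in the subtree rooted at this node."""
    return 1 + sum(count_nodes(child) for child in node[1:])


# Hypothetical extractor output: two first-move candidates, one explored to depth 3.
raw = '{"trees": [["1,4", ["2,4", ["0,4"]], ["1,5"]], ["2,2"]]}'
trees = json.loads(raw)["trees"]

breadth = len(trees)                      # first-ply candidates considered
max_depth = max(depth(t) for t in trees)  # deepest lookahead in the trace
```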

The system prompt for this step was automatically optimized using DSPy \[[12](https://arxiv.org/html/2605.06840#bib.bib34)\] with the GEPA optimizer\. Starting from a hand\-written specification of the extraction task, the optimizer iteratively proposed and evaluated candidate instruction sets against a labeled validation set, selecting the instruction text that maximized extraction accuracy\. The resulting optimized prompt is reproduced in full below\.

System Prompt: Search Tree Extraction

You are given a natural\-language reasoning trace about a 4×9 Four\-in\-a\-Row \(Connect\-X\-like\) game\. Your task is to reconstruct exactly the explicitly considered move trees and output strict JSON in a prescribed nested array format\.

Domain and board facts \(authoritative; use as\-is, do not infer/alter\):

- Board: 4 rows × 9 columns, zero\-based\. Rows 0–3 \(top to bottom\), columns 0–8 \(left to right\)\.
- No gravity: a move may be played in any empty cell\. Ignore any “falling piece” assumptions\.
- Win: four in a row horizontally, vertically, or diagonally \(both down\-right and down\-left\)\.
- Vertical four is exactly a whole column \(rows 0–3 in that column\)\.
- Diagonals of length 4 are exactly:
  - Down\-right \(top\-left → bottom\-right\): start only at row 0, columns 0–5\. Each is \(0,c\),\(1,c\+1\),\(2,c\+2\),\(3,c\+3\)\.
  - Down\-left \(top\-right → bottom\-left\): start only at row 0, columns 3–8\. Each is \(0,c\),\(1,c\-1\),\(2,c\-2\),\(3,c\-3\)\.

Accepted coordinate inputs and normalization:

- Accept these explicit coordinate forms:
  - "\(r,c\)"
  - "row r col c"
  - "m r c" \(move notation; treat exactly as r,c\)
- Accept explicit enumerations and expand to individual coordinates:
  - "row 2 cols 6,7,8" → add "2,6", "2,7", "2,8"
  - "column 5 rows 0,1,3" → add "0,5", "1,5", "3,5"
  - "row 1 col 2" \(single\), "row 1 col 1 or col 0" → add "1,1" and "1,0"
- Normalize every coordinate to the exact string "r,c" \(no spaces\)\.
- Unambiguous\-only: Ignore vague phrases \(e\.g\., “left side”, “near the center”, “the other end”\) unless the exact endpoint squares were enumerated earlier in that same branch context\.

Required output format \(strict JSON\):

- Produce exactly: \{"trees": Node\[\]\}
- Node := \["r,c", Node, Node, \.\.\.\]
  - First element is the move coordinate string "r,c"\.
  - Following elements \(if any\) are child Node values\.
- A leaf node is just \["r,c"\]\.
- Strict JSON only:
  - Use double quotes for all strings\.
  - No comments, no trailing commas, no extra fields\.
  - Do not wrap your final JSON in Markdown/code fences\.

Tree semantics and strict alternation:

- Each depth\-1 node is a model\-side first move explicitly considered in the trace\.
- Depth alternates by side:
  - Depth 1: model move
  - Depth 2: opponent reply
  - Depth 3: model reply
  - Depth 4: opponent reply
  - … and so on\. Maintain alternation at every depth\.
- Children under any node are exactly the explicitly described replies for that exact position/branch\.

How to extract moves from the trace \(precise, comprehensive\):

1\. Identify model\-side root moves \(first\-move candidates only\)\.

- Create a separate node for every distinct coordinate the trace explicitly considers as the model’s first move\.
- Treat as model\-first\-move candidates when phrased as:
  - “If I play \(r,c\)…”, “I can/could/should play \(r,c\)…”, “What if I play \(r,c\)…”
  - “Another idea: \(r,c\)…”, “The best move is \(r,c\)…” \(final chosen move still counts as a node\)
  - Enumerations: “I could play \(a,b\) or \(c,d\)” → add both as separate node\.
  - Series: “I could fill \(2,1\),\(2,2\),\(2,3\)…” → add each as a separate node\.
  - "m r c" used to propose the model’s move choice → add as a node\.
- Include every distinct explicitly proposed model\-first\-move, even if later rejected or deemed inferior\. Do not drop quick tests or briefly examined options\.
- Do NOT promote coordinates if they appear only as opponent moves or only as deeper replies \(unless the trace also proposes them as a model\-first move elsewhere\)\.

2\. Build subtrees per depth\-1 node \(no cross\-pollination across nodes\)\.

- Under a specific depth\-1 node, attach opponent replies tied to that exact node scenario\. Use phrasing such as: “then they can…”, “the opponent can reply at…”, “they must block at…”, “they’ll answer with…”, “they can try…”, “they target…”\.
- Include opponent move enumerations tied to that node, in the order explicitly mentioned:
  - Column/row enumerations: e\.g\., “Their vertical in column 5 needs rows 0,1,3” → add "0,5", "1,5", "3,5"\.
  - Horizontal targets: “After \(0,3\), they threaten row 2 cols 6,7,8” → add "2,6", "2,7", "2,8"\.
  - Diagonal endpoints: If a diagonal is described or implied by endpoints \(e\.g\., “\(0,3\)\-\(1,4\)\-\(2,5\)\-\(3,6\)” or “they could play \(0,3\) or \(3,6\)”\), add each explicitly named endpoint as an opponent child\.
- For each opponent child, attach exactly the model’s explicitly stated reply\(ies\) for that branch, maintaining alternation\. If multiple model replies are given \(“I must play \(2,2\) or \(2,6\)”\), add both as children under that opponent node\.
- Continue deeper when the trace specifies the next opponent move after the model reply \(e\.g\., “after I block at \(2,2\), they win by the other end \(2,6\)”\), adding that opponent move as the next child node at the correct depth\.
- Only include moves explicitly tied to the current context\. Do NOT transfer opponent replies or deeper sequences from one node to another unless the trace explicitly restates them for that other node\.

3\. Mirroring, “if instead/similarly”, and “the other end” within the same node\.

- If the trace presents mirrored alternatives \(“or instead at \(c,d\)”, “similarly at \(x,y\)”\), include each as a sibling child under the same parent, in order of mention\.
- “The other end” is allowed only if both endpoints of the line were explicitly established earlier in that branch\. Add the mirrored endpoint if explicitly identified or implied by previously listed endpoints of that same line\.
- When both ends of a horizontal or diagonal threat are explicitly named, include both as children\.

4\. Context tracking and deeper sequences\.

- Maintain the exact branch context as described\. When the trace walks through a sequence \(“If I play \(0,1\), they must play \(3,4\); then I play \(0,5\); then they must block \(3,2\); then I can play \(2,5\)…”\), build the path depth\-by\-depth under the same depth\-1 node with strict alternation\.
- If the narrative continues with “after that”, “then”, “from there”, or refers back to a previously described branch \(without switching to a new branch\), continue that exact subtree\.
- Do not assume or create branches not explicitly stated\.

5\. Unambiguous\-only policy\.

- Include only coordinates that are explicitly specified \(via accepted formats or explicit enumerations\)\.
- Ignore vague references \(“block there”, “left side”, “near \(r,c\)”\) unless the exact squares were enumerated earlier in that same branch\.

6\. De\-duplication and ordering\.

- Within the same parent, include each child coordinate at most once\.
- If a node reappears later under the same parent, merge newly stated children into that existing node \(preserve alternation\)\.
- Preserve the order of children exactly as they are mentioned in the trace\.

7\. Interpreting "m r c" notation\.

- Treat "m r c" exactly as the explicit coordinate \(r,c\) and normalize to "r,c"\.
- Add it as a depth\-1 node only if used to propose the model’s first move \(including the final chosen move\)\.
- When "m r c" appears as an opponent reply in\-branch, add it at the appropriate depth under the correct parent\.

Generalizable extraction strategy \(to avoid common mistakes seen in prior attempts\):

- Be exhaustive in capturing nodes: every time the narrator explicitly proposes a candidate for “my first move” \(including quick tests, alternatives, and series\), create a node for that coordinate\. Do not skip less\-preferred or rejected first\-move candidates\.
- Under each depth\-1 node, be exhaustive in capturing opponent replies explicitly tied to that node, including:
  - All row/column enumerations, expanded to individual squares\.
  - Both explicitly named endpoints of diagonals/horizontals\.
  - Any “they could play/try/need/must” squares enumerated as part of threats on that branch\.
- Keep branches separate by nodes; do not mix children across nodes\.
- Maintain strict side alternation at all depths\.
- Do not invent nodes; only add squares explicitly named or enumerated under the current branch context\.
- Do not promote opponent\-only or deeper\-branch\-only squares to depth\-1 nodes unless they were also explicitly presented as model\-first\-move options elsewhere in the trace\.

Final validation checklist before output:

- Top\-level object is exactly \{"trees": \[\.\.\.\]\} with Node\[\] content\.
- Every node is \["r,c", \.\.\.children\.\.\.\] with coordinates normalized to "r,c" \(no spaces\)\.
- Depth\-1 nodes include every distinct explicit model\-first\-move candidate from the trace \(including all enumerated candidates and quick tests\), in order of first mention\.
- Under each depth\-1 node, all explicitly stated opponent replies for that node are present \(including all expanded row/column enumerations and diagonal/horizontal endpoints\), in order of first mention\.
- Under each opponent node, all explicitly stated model replies are present; include deeper opponent follow\-ups when specified\.
- No duplicates at the same parent; merge re\-mentioned nodes’ children; preserve mention order\.
- Strict alternation by depth is preserved everywhere\.
- Strict JSON only; use double quotes; no trailing commas; no extra fields; do not wrap in Markdown\.

User Prompt: Search Tree Extraction

\[\[ \#\# trace \#\# \]\] \{reasoning trace\}

Respond with a JSON object in the following order of fields: trees \(must be formatted as a valid Python list\[list\[Any\]\]\)\.
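The board facts stated in the extraction prompt \(horizontal fours, whole\-column verticals, and diagonals starting only in row 0\) can be checked by enumerating every winning line on the 4×9 board\. The sketch below is illustrative; the function name winning\_lines and the representation of a line as a list of \(row, col\) tuples are choices made here, not taken from the original code\.

```python
from typing import List, Tuple

ROWS, COLS = 4, 9
Cell = Tuple[int, int]


def winning_lines() -> List[List[Cell]]:
    """Enumerate all four-cell winning lines on the 4x9 board."""
    lines: List[List[Cell]] = []
    # Horizontal: any four consecutive columns within a row.
    for r in range(ROWS):
        for c in range(COLS - 3):
            lines.append([(r, c + k) for k in range(4)])
    # Vertical: exactly one whole column (rows 0-3).
    for c in range(COLS):
        lines.append([(r, c) for r in range(ROWS)])
    # Down-right diagonals: start at row 0, columns 0-5.
    for c in range(0, 6):
        lines.append([(k, c + k) for k in range(4)])
    # Down-left diagonals: start at row 0, columns 3-8.
    for c in range(3, 9):
        lines.append([(k, c - k) for k in range(4)])
    return lines


LINES = winning_lines()
```

Under these rules the board has 24 horizontal, 9 vertical, and 12 diagonal lines, 45 in total\.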

### C\.3 Computational model fitting

#### C\.3\.1 Data exclusion criteria for model fitting

We applied the following exclusion criteria to the raw game tree data before model fitting:

- Invalid move: turns where the model failed to produce a valid move or where the extracted move did not conform to the expected format were excluded\.
- No reasoning tree: turns where the model produced no reasoning tree were excluded, as these provide no information about the model’s search process\.
- Degenerate tree: turns where the tree contained fewer than two candidate first\-ply moves were excluded, as these leave no meaningful choice to model\.
- Chosen move not in tree: turns where the model’s chosen move did not appear among the candidate first\-ply moves in the tree were excluded\.
- Insufficient samples: models for which fewer than 20 turns remained after turn\-level exclusions were excluded as insufficient for reliable parameter estimation\. This criterion excluded Llama\-3\.3\-70B from the analysis\.
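The exclusion criteria above amount to a turn\-level filter followed by a model\-level threshold\. A minimal sketch, assuming a hypothetical Turn record with only the two fields the criteria need \(the real data format is not specified here\):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MIN_TURNS = 20  # models with fewer remaining turns are dropped


@dataclass
class Turn:
    chosen_move: Optional[str]      # None if no valid move could be parsed
    first_ply: Optional[List[str]]  # candidate first-ply moves; None if no tree


def keep_turn(turn: Turn) -> bool:
    """Apply the four turn-level exclusion criteria."""
    if turn.chosen_move is None:               # invalid move
        return False
    if not turn.first_ply:                     # no reasoning tree
        return False
    if len(set(turn.first_ply)) < 2:           # degenerate tree
        return False
    if turn.chosen_move not in turn.first_ply:  # chosen move not in tree
        return False
    return True


def filter_models(turns_by_model: Dict[str, List[Turn]]) -> Dict[str, List[Turn]]:
    """Keep valid turns, then drop models with fewer than MIN_TURNS of them."""
    kept = {}
    for model, turns in turns_by_model.items():
        valid = [t for t in turns if keep_turn(t)]
        if len(valid) >= MIN_TURNS:
            kept[model] = valid
    return kept
```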

#### C.3.2 Parameter ranges of computational models

We fit computational models to each LLM’s move choices. Parameters were estimated by minimizing the negative log-likelihood using L-BFGS-B with 20 random restarts to mitigate local minima. To enforce parameter constraints, the scaling factor $C$ was reparameterized as $\exp(\log C)$. Across the 13 models included in the analysis, fitted parameters ranged as follows: $w_{\text{centre}} \in [-0.17, 0.93]$, $w_{\text{connected-two-in-a-row}} \in [-0.06, 0.31]$, $w_{\text{unconnected-two-in-a-row}} \in [-0.23, 0.22]$, $w_{\text{three-in-a-row}} \in [0.06, 0.91]$, $w_{\text{four-in-a-row}} \in [0.10, 19.66]$, and $C \in [0.25, 5.00]$.
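The reparameterization can be illustrated with a small objective function. This is a sketch under stated assumptions, not the paper's model: this section does not specify how $C$ enters the value function, so the code assumes, purely for illustration, that $C$ scales a linear feature value before a softmax over candidate moves. The optimizer itself (L-BFGS-B with random restarts) is omitted; the point is that the search vector holds $\log C$, so $C = \exp(\log C)$ stays positive for any real input.

```python
import math


def unpack(theta):
    """Split an unconstrained vector into feature weights and a positive C."""
    *weights, log_c = theta
    return weights, math.exp(log_c)  # C = exp(log C) > 0 for any real log_c


def neg_log_likelihood(theta, turns):
    """NLL of the chosen moves under a softmax over scaled linear values.

    turns: list of (features, chosen) pairs, where features[i] is the
    feature vector of candidate i and chosen is the index of the move
    actually played. ASSUMPTION: value_i = C * (weights . features[i]).
    """
    weights, c = unpack(theta)
    nll = 0.0
    for features, chosen in turns:
        values = [c * sum(w * f for w, f in zip(weights, feats))
                  for feats in features]
        peak = max(values)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in values))
        nll -= values[chosen] - log_norm
    return nll
```

With all weights at zero every candidate is equally likely, so the objective reduces to the sum of $\log k$ over turns with $k$ candidates, which is a convenient sanity check before fitting.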

### C.4 Causal intervention study

#### C.4.1 Paragraph-level trace labeling

To surgically remove a reasoning branch, we must first identify which paragraphs of the trace structurally belong to it. We split each trace into paragraphs at double-newline boundaries and called Claude (`claude-opus-4-7`) to assign three annotations to every paragraph: a *structural type*, a *branch root*, and a *mentions* list recording the move coordinates explicitly simulated within that paragraph along with their lookahead depth.

The structural type captures the paragraph’s role in the reasoning process. `PREAMBLE` paragraphs describe the board state and scan for threats before any candidate move is introduced. `BRANCH_START` marks the first paragraph that explicitly proposes a specific move as a candidate (“If I play X…”, “Let me try X…”). `BRANCH_ANALYSIS` and `BRANCH_CONCLUSION` label subsequent paragraphs that continue and close the analysis within that branch (opponent replies, counter-replies, local evaluations). `COMPARISON` marks cross-branch paragraphs that weigh multiple candidate moves against each other. `FINAL_DECISION` marks the paragraph(s) that confirm the model’s chosen move (“I’ll play X”, “Therefore X”). Finally, `META` captures generic meta-commentary not tied to any move (“Let me think step by step…”).

In addition to the structural type, each paragraph receives two further annotations: a *branch root*, recording which candidate first-ply move (as a coordinate `"r,c"`) the paragraph structurally belongs to; and a *mentions* list, recording every coordinate that the paragraph explicitly simulates as a future move, each tagged with its depth in the lookahead sequence (depth 1 = model’s move, depth 2 = opponent reply, depth 3 = model counter-reply, etc.). Critically, the mentions list excludes coordinates that merely describe current board occupancy; only prospective moves count.

System Prompt: Reasoning Trace Labeling

You are assisting a research project studying how chain-of-thought reasoning influences LLM behavior. The goal is to surgically remove specific reasoning branches from a model’s thinking trace and then re-run the model with the edited trace as a prefill, observing whether the model changes its decision when it can no longer “see” the reasoning that led to a particular move. Accurate labeling is critical: if the wrong paragraphs are removed the intervention is invalid; if key paragraphs (especially final confirmations) are missed the model will trivially repeat the same answer from the residual context.

You are analyzing a reasoning trace from a Four-in-a-Row game player (4 rows × 9 columns, zero-indexed). Coordinates appear as `(r,c)`, `row r col c`, or `m r c`.

The trace has been split into numbered paragraphs. Your task: assign a label to EVERY paragraph so that we can later remove the branch for a specific target move.

Label definitions

`type`: one of

- `PREAMBLE`: Board parsing, board state description, threat scanning before any branch exploration
- `BRANCH_START`: First paragraph proposing a specific move as the model’s candidate first move (“If I play X…”, “Let me try X…”, “What if I place at X…”)
- `BRANCH_ANALYSIS`: Continuation of analysis within a branch (opponent replies, counter-replies, evaluations)
- `BRANCH_CONCLUSION`: Local verdict closing a branch (“X looks strong/weak because…”)
- `COMPARISON`: Cross-branch comparison mentioning multiple candidate moves
- `FINAL_DECISION`: Confirmation of the chosen move (“I’ll play X”, “My move is X”, “Therefore X”)
- `META`: Meta-commentary not tied to a specific move (“Let me think…”, “First I’ll consider…”)

`branch_root`: the depth-1 candidate move this paragraph structurally belongs to, as `"r,c"`. Use `null` for `PREAMBLE`, `COMPARISON`, `FINAL_DECISION`, `META`.

Important: assign the same `branch_root` to ALL paragraphs in a branch, even strategic preamble paragraphs that motivate the move before naming it.

`mentions`: list of coordinates that are part of MOVE SIMULATION only, each with:

- `coord`: `"r,c"`
- `depth`: 1=model’s move, 2=opponent reply, 3=model counter-reply, etc.

IMPORTANT: only include a coordinate if it is being proposed or simulated as a future move (e.g. “if I play X”, “opponent responds at Y”, “let me try X”). Do NOT include coordinates that merely describe the current board state (e.g. “there are pieces at X and Y”, “X is occupied”, “the board shows pieces at X, Y, Z”). Board-parsing references are not move simulations and must be excluded from `mentions`.

Output format

Return a JSON array with one object per paragraph (in order):

```
[
  {"para": 0, "type": "PREAMBLE", "branch_root": null, "mentions": []},
  {"para": 1, "type": "BRANCH_START", "branch_root": "1,4", "mentions": [{"coord": "1,4", "depth": 1}]},
  ...
]
```

Return ONLY the JSON array. No explanation, no markdown fences.

User Prompt: Reasoning Trace Labeling

The model is playing as {color}. So ‘{color} plays X’ = depth 1 (model’s move), ‘{opponent} plays Y’ = depth 2 (opponent reply).

Search tree (depth-0 = model’s candidate first moves):

root={r},{c} ({N} nodes) …

Numbered paragraphs of the reasoning trace:

```
<trace>
[0] paragraph 0 text ...
[1] paragraph 1 text ...
[2] paragraph 2 text ...
...
</trace>
```

Label every paragraph with its type, `branch_root`, and `mentions`. Return a JSON array with one object per paragraph.
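The pre- and post-processing around the labeling call can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the helper names are hypothetical, dropping empty paragraphs is an assumption, and the label checks cover only the rules stated in the system prompt (one label per paragraph in order, a known `type`, and a `null` `branch_root` for the four non-branch types).

```python
NO_ROOT_TYPES = {"PREAMBLE", "COMPARISON", "FINAL_DECISION", "META"}
BRANCH_TYPES = {"BRANCH_START", "BRANCH_ANALYSIS", "BRANCH_CONCLUSION"}


def split_paragraphs(trace):
    """Split at double newlines; empty paragraphs are dropped (assumption)."""
    return [p.strip() for p in trace.split("\n\n") if p.strip()]


def render_numbered(paragraphs):
    """Format paragraphs as the numbered <trace> block of the user prompt."""
    body = "\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs))
    return f"<trace>\n{body}\n</trace>"


def check_labels(labels, n_paragraphs):
    """Return a list of problems in a parsed labeling response."""
    problems = []
    if [lab.get("para") for lab in labels] != list(range(n_paragraphs)):
        problems.append("labels must cover every paragraph once, in order")
    for lab in labels:
        kind = lab.get("type")
        if kind not in NO_ROOT_TYPES | BRANCH_TYPES:
            problems.append(f"para {lab.get('para')}: unknown type {kind!r}")
        elif kind in NO_ROOT_TYPES and lab.get("branch_root") is not None:
            problems.append(f"para {lab.get('para')}: {kind} needs a null branch_root")
    return problems
```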

#### C.4.2 Trace editing

We applied four editing strategies to isolate which parts of a reasoning branch causally drive move selection. Across all strategies, `FINAL_DECISION` paragraphs are always removed so the model must re-decide from the remaining context rather than simply echoing its prior conclusion. The strategies differ in how much additional content is removed on top of this.

*Remove final decision* (`fd`) removes only `FINAL_DECISION` paragraphs, serving as a minimal baseline.

*Remove final decision + whole branch* (`fd+branch+comp`) additionally removes the entire reasoning branch for the chosen move: all paragraphs whose `branch_root` equals the target move (i.e., paragraphs of type `BRANCH_START`, `BRANCH_ANALYSIS`, and `BRANCH_CONCLUSION`), as well as any `COMPARISON` paragraphs in which the target move appears in `mentions` or the paragraph text. This is the maximal intervention.

*Add back depth-1 only* starts from `fd+branch+comp` and restores paragraphs whose `mentions` contain exclusively first-ply moves (the model’s own moves). The model retains information about how the opponent can respond, but not the model’s own analysis of those responses.

*Add back depth-1 only + depth-2 only* further restores paragraphs whose `mentions` contain exclusively second-ply moves (the opponent’s replies), in addition to the depth-1 paragraphs restored above.

In all strategies, contiguous removed spans were merged, trailing whitespace was collapsed, and the trace was re\-joined\. As a safeguard, any edit that would remove more than 85% of the original trace was discarded to avoid degenerately short prefills\.
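The four strategies can be expressed as one selection rule over the labeled paragraphs. The sketch below is an illustration under stated assumptions, not the authors' code: the strategy keys `add_depth1` and `add_depth1_depth2` are hypothetical names, the 85% safeguard is measured in characters (the text does not say which unit is used), and the check for the target move in a `COMPARISON` paragraph's text is a literal `"r,c"` substring match, which would miss other coordinate spellings.

```python
MAX_REMOVED_FRACTION = 0.85  # safeguard threshold stated in the text

# Depth sets whose paragraphs are restored under each strategy.
RESTORED_DEPTHS = {
    "fd+branch+comp": [],
    "add_depth1": [{1}],
    "add_depth1_depth2": [{1}, {2}],
}


def removed_indices(paragraphs, labels, target, strategy):
    """Indices of paragraphs to drop for one target move and strategy."""
    removed = set()
    for i, lab in enumerate(labels):
        if lab["type"] == "FINAL_DECISION":
            removed.add(i)  # removed under every strategy
            continue
        if strategy == "fd":
            continue
        in_branch = lab["branch_root"] == target
        in_comparison = lab["type"] == "COMPARISON" and (
            target in paragraphs[i]
            or any(m["coord"] == target for m in lab["mentions"]))
        if not (in_branch or in_comparison):
            continue
        depths = {m["depth"] for m in lab["mentions"]}
        if depths in RESTORED_DEPTHS[strategy]:
            continue  # mentions are exclusively depth 1 (or depth 2): add back
        removed.add(i)
    return removed


def edit_trace(paragraphs, labels, target, strategy):
    """Return the edited trace, or None if the edit removes over 85%."""
    removed = removed_indices(paragraphs, labels, target, strategy)
    edited = "\n\n".join(p for i, p in enumerate(paragraphs) if i not in removed)
    original = "\n\n".join(paragraphs)
    if len(edited) < (1 - MAX_REMOVED_FRACTION) * len(original):
        return None  # degenerately short prefill, discard the edit
    return edited
```

Joining only the kept paragraphs with double newlines has the same effect as merging contiguous removed spans before re-joining the trace.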

#### C.4.3 Model re-running with edited trace

The edited trace was used as a reasoning prefill. We reconstructed the original prompt (system and user messages), appended the model’s thinking-start token (`<think>`), injected the edited reasoning trace, closed the thinking block (`</think>`), and ran the model at temperature 0 with a maximum of 64 new output tokens. The model was required to emit its move in the standard `<next_move>` tags. We recorded (1) whether the model’s new move differed from its original move, and (2) whether the new move was among the other candidate root moves from the model’s original search tree.
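The prefill assembly and the two recorded outcomes can be sketched as below. This is a hedged illustration: `prompt_text` stands for the already-templated system and user messages (chat templating is model-specific and not shown), the exact whitespace around the thinking tokens is an assumption, and a closing `</next_move>` tag is assumed since the text only mentions `<next_move>` tags.

```python
import re

# ASSUMPTION: the move is wrapped as <next_move>...</next_move>.
MOVE_TAG = re.compile(r"<next_move>(.*?)</next_move>", re.DOTALL)


def build_prefill(prompt_text, edited_trace):
    """Append an opened and closed thinking block holding the edited trace."""
    return f"{prompt_text}<think>\n{edited_trace}\n</think>\n"


def score_rerun(completion, original_move, other_roots):
    """Record the two outcomes; returns None if no move can be parsed."""
    match = MOVE_TAG.search(completion)
    if match is None:
        return None
    new_move = match.group(1).strip()
    return {
        "new_move": new_move,
        "changed": new_move != original_move,        # outcome (1)
        "in_other_roots": new_move in other_roots,   # outcome (2)
    }
```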

## Appendix D Additional results

### D.1 Search tree depth

[Figure S1](https://arxiv.org/html/2605.06840#A4.F1) shows the mean number of nodes explicitly considered at each depth of the search tree, broken down by model. Depth 1 corresponds to the model’s own candidate first moves, depth 2 to the opponent’s replies, depth 3 to the model’s counter-replies, and so on. All models peak sharply at depth 1, reflecting that candidate first moves are always the most numerous nodes. The depth-2 bar is consistently smaller, indicating that models consider fewer opponent replies per candidate first move than candidate first moves overall. Beyond depth 2, node counts drop off rapidly for most models. There is nevertheless substantial variation across models in how far ahead they look.

![Refer to caption](https://arxiv.org/html/2605.06840v1/x5.png)

Figure S1: Distribution of node depth for each LLM.

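The per-depth node counts behind Figure S1 can be computed directly from extracted trees in the `["r,c", ...children...]` format. A minimal sketch, assuming the mean is taken over turns (one tree per turn); the averaging unit is an assumption, not something this section states:

```python
from collections import Counter


def count_by_depth(nodes, depth=1, counts=None):
    """Count nodes at each depth; depth 1 = the model's candidate first moves."""
    counts = Counter() if counts is None else counts
    for node in nodes:
        counts[depth] += 1
        count_by_depth(node[1:], depth + 1, counts)  # node[1:] are the children
    return counts


def mean_nodes_per_depth(trees):
    """Average node count per depth over a list of per-turn trees."""
    totals = Counter()
    for tree in trees:
        totals.update(count_by_depth(tree))
    return {d: totals[d] / len(trees) for d in sorted(totals)}
```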
### D.2 Model recovery analysis

The main analysis compared competing backup rules: a full\-tree model that backs up values via minimax across the whole search tree, and a myopic model that evaluates only the immediate position after each candidate first\-ply move\. A natural concern is whether the fitting procedure is powerful enough to distinguish between these two data\-generating mechanisms, that is, whether a good fit by the myopic model is genuinely informative rather than an artifact of the optimizer finding spurious parameter configurations\. We address this with a model recovery test\.
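To make the two backup rules concrete, the sketch below scores candidate first-ply moves both ways. It is an illustration under assumptions: each node is a hypothetical `(value, children)` pair whose value is a heuristic evaluation from the model's perspective, and the full-tree rule is plain minimax over the nodes present in the tree.

```python
def myopic_value(node):
    """Evaluate only the position right after the candidate first-ply move."""
    value, _children = node
    return value


def full_tree_value(node, maximizing=False):
    """Minimax backup over the tree below a node.

    At a depth-1 node the opponent moves next, so its children are
    aggregated with min (maximizing=False); the side alternates per level.
    Leaves return their own heuristic value.
    """
    value, children = node
    if not children:
        return value
    backed_up = [full_tree_value(child, not maximizing) for child in children]
    return max(backed_up) if maximizing else min(backed_up)


# Two candidate first moves with hypothetical values:
# A looks good immediately but allows a strong opponent reply.
move_a = (0.6, [(-1.0, []), (0.2, [])])
# B looks modest immediately but holds up under lookahead.
move_b = (0.3, [(0.1, [(0.5, []), (0.4, [])])])
```

Here the myopic rule prefers `move_a` (0.6 versus 0.3), while the full-tree rule prefers `move_b` (0.5 versus -1.0), which is the kind of disagreement the model comparison relies on.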

For each of the 13 models with sufficient data, we ran a two\-condition recovery test using the model’s own real game trees as the stimulus set:

1. Simulate from the full-tree model. Using the model’s fitted full-tree parameters, we sampled synthetic move choices from the full-tree softmax policy. We then fit both the full-tree model and the myopic model to these synthetic choices. If the fitting procedure is valid, the full-tree model should win ($\Delta > 0$, where $\Delta = (\text{NLL}_{\text{myopic}} - \text{NLL}_{\text{full}})/N$).
2. Simulate from the myopic model. Using the model’s fitted myopic parameters, we sampled synthetic move choices from the myopic softmax policy. We then fit both models to these synthetic choices. The myopic model should win ($\Delta < 0$).

Models with both $\Delta > 0$ in condition 1 and $\Delta < 0$ in condition 2 are counted as successfully recovered.
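The recovery logic can be sketched end to end on toy data. This is a simplified stand-in, not the authors' pipeline: candidate values under each model are given directly (hypothetical numbers), only a single scale parameter $C$ is fit, and a grid search over $\log C$ replaces the L-BFGS-B fit. The quantity computed is the $\Delta$ defined above.

```python
import math
import random


def softmax(values, c):
    """Softmax policy over candidate values with scale c."""
    scaled = [c * v for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def simulate(value_sets, c, rng):
    """Sample one synthetic move index per turn from the softmax policy."""
    return [rng.choices(range(len(v)), weights=softmax(v, c))[0]
            for v in value_sets]


def fit_nll(value_sets, choices):
    """Best NLL over a grid of log C (stand-in for the real optimizer)."""
    best = math.inf
    for step in range(-40, 41):
        c = math.exp(step / 10)  # C = exp(log C) stays positive
        nll = -sum(math.log(softmax(v, c)[k])
                   for v, k in zip(value_sets, choices))
        best = min(best, nll)
    return best


def delta(full_values, myopic_values, choices):
    """Delta = (NLL_myopic - NLL_full) / N; positive favors the full-tree model."""
    n = len(choices)
    return (fit_nll(myopic_values, choices) - fit_nll(full_values, choices)) / n


# Toy stimulus set where the two models disagree on the best move.
full_values = [[1.0, 0.0, 0.0]] * 500
myopic_values = [[0.0, 1.0, 0.0]] * 500
rng = random.Random(0)
delta_cond1 = delta(full_values, myopic_values, simulate(full_values, 2.0, rng))
delta_cond2 = delta(full_values, myopic_values, simulate(myopic_values, 2.0, rng))
recovered = delta_cond1 > 0 and delta_cond2 < 0
```

On real game trees the two models often agree on many turns, so the separation is weaker than in this toy example and depends on the number of turns available.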

Model recovery succeeded in 12 out of 13 cases ([Figure S2](https://arxiv.org/html/2605.06840#A4.F2)). In the left panel, all 13 models show positive $\Delta$ when data are generated from the full-tree model, confirming that the fitting procedure correctly identifies the full-tree model as superior. In the right panel, 12 of 13 models show negative $\Delta$ when data are generated from the myopic model. The single failure is Kimi-K2-Instruct, for which the myopic model fails to out-fit the full-tree model on its own simulated data. This model has the smallest sample in the dataset ($N=211$), so the signal is insufficient to overcome noise. Notably, all other models recover successfully in both directions.

![Refer to caption](https://arxiv.org/html/2605.06840v1/x6.png)

Figure S2: Model recovery analysis shows successful recovery of underlying ground-truth models.

These results establish that our fitting procedure can reliably distinguish full-tree from myopic decision-making given sufficient data. The model comparisons are therefore not artifacts of optimizer behavior or parameter degeneracy.

### D.3 Fitted feature weights

Each panel of Figure S3 shows the relationship between one feature weight (normalized by the four-in-a-row weight) and winning rate across models. Normalization removes the overall scale of the value function, isolating each model’s relative preference for a given feature. None of the four normalized weights correlates significantly with winning rate.

![Refer to caption](https://arxiv.org/html/2605.06840v1/x7.png)

Figure S3: Normalized feature weights versus winning rate.

## NeurIPS Paper Checklist

1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: [Yes]
   - Justification: The abstract and introduction accurately describe our empirical findings about LLM search behavior and the cognitive modeling approach used to quantify it.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: Limitations are discussed in the dedicated limitations paragraph, including the restriction to a single game domain and the assumptions of the cognitive model.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [N/A]
   - Justification: The paper does not include theoretical results; all contributions are empirical.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: The methods section describes the game setup, model prompting procedure, tree parsing, and cognitive model fitting in sufficient detail for reproduction. Code and data are provided (see [Appendix A](https://arxiv.org/html/2605.06840#A1)).
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: [Yes]
   - Justification: Code and data are publicly available at the repository described in [Appendix A](https://arxiv.org/html/2605.06840#A1).
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: [Yes]
   - Justification: The methods section specifies the prompting setup, model API settings, cognitive model parameters, and fitting procedure. No training or data splits are involved as this is a behavioral analysis study.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: [Yes]
   - Justification: Statistical significance of key comparisons is reported with $p$-values throughout.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   - Answer: [Yes]
   - Justification: LLM responses were collected via commercial APIs. Cognitive model fitting was run on an institutional high-performance computing cluster.
9. Code of ethics
   - Answer: [Yes]
   - Justification: The research analyzes the behavior of publicly available LLMs on a board game task and raises no ethical concerns.
10. Broader impacts
    - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    - Answer: [N/A]
    - Justification: This is foundational research on LLM planning behavior with no direct path to harmful applications. Positive impacts include informing training methods for more capable and interpretable reasoning models.
11. Safeguards
    - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
    - Answer: [N/A]
    - Justification: The paper does not release new models. The released dataset consists of LLM game traces and poses no misuse risk.
12. Licenses for existing assets
    - Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    - Answer: [Yes]
    - Justification: The original cognitive modeling paper is properly cited; no explicit license was found for this code, and it is used with attribution. LLMs are accessed through official APIs in accordance with their respective terms of service.
13. New assets
    - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    - Answer: [Yes]
    - Justification: We release a new dataset of LLM reasoning trees on the four-in-a-row task along with the analysis and fitting code, documented in the repository.
14. Crowdsourcing and research with human subjects
    - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    - Answer: [N/A]
    - Justification: The paper does not involve crowdsourcing or human subjects; all data is collected from LLM APIs.
70. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\.
    - According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\. Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer: \[N/A\]
74. Justification: The paper does not involve human subjects\.
75. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\.
    - We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\.
    - For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\. Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
78. Answer: \[Yes\]
79. Justification: LLMs are the primary object of study in this paper\. Their usage, including which models were queried, prompting details, and API settings, is fully described in the methods section\.
80. Guidelines:
    - The answer \[N/A\] means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\.
    - Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.
