Visual Graph Scaffolds for Structural Reasoning in Large Language Models

arXiv cs.AI 06/03/26, 04:00 AM Papers
graph-reasoning multi-hop-qa knowledge-distillation visual-scaffolds llm vlm
Summary
This paper explores using visual graph mind maps as reasoning scaffolds for LLMs, finding that visual guidance remains effective even without direct answer hints, while textual flattening of graphs loses benefits.
arXiv:2606.02673v1 Announce Type: new
# Visual Graph Scaffolds for Structural Reasoning in Large Language Models
Source: [https://arxiv.org/html/2606.02673](https://arxiv.org/html/2606.02673)
###### Abstract

Graphs have been used to enhance large language models \(LLMs\) for structured reasoning, mostly as external knowledge sources provided to models at test time\. In this paper, we take a different view: the value of graphs for LLMs lies not only in supplying information, but also in organizing reasoning\. Inspired by how humans use graph\-structured mind maps to organize branching and converging thoughts, we ask whether graphs can serve as an internal form of reasoning assistance\. We study this question on multi\-hop question answering tasks, where teacher\-provided reasoning traces are rewritten as graph mind maps and used to guide a student model\. Our experiments reveal a clear modality gap\. When graph structures are flattened into text, their benefits become limited once direct answer hints are removed\. Under this abstract guidance setting, both reasoning efficiency and answer quality degrade substantially\. In contrast, visual graph guidance remains effective without direct answer clues, and its advantage persists after supervised fine\-tuning and KL\-based distillation\. These findings support the claim that graphs should be studied not only as external knowledge structures for LLMs, but also as visual scaffolds for organizing reasoning\.

Machine Learning, ICML

## 1 Introduction

Graphs can serve as useful tools for enhancing Large Language Models \(LLMs\) and Vision Language Models \(VLMs\) in reasoning tasks\. In most existing settings, graphs are used as external support: they retrieve evidence, ground answers, or organize memories that the model may not possess \(Han et al\., [2025a](https://arxiv.org/html/2606.02673#bib.bib1); He et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib3); Zhang et al\., [2025](https://arxiv.org/html/2606.02673#bib.bib4)\)\. While effective, this view captures only part of what graphs can offer\. In human reasoning, graphs often function not only as information structures, but also as cognitive scaffolds\. For example, humans draw mind maps that make branching, convergence, hierarchy, and local relations easier to inspect than linear text\. This motivates the central question of this paper: can graphs help LLMs not only access knowledge, but also organize reasoning?

We study this question in a teacher–student setting\. A stronger teacher model first solves a multi\-hop question answering problem, and its reasoning process is rewritten into a graph\-structured scaffold for a weaker student model\. The goal is not to retrieve additional facts, but to transfer the organization of a successful reasoning process\. If such guidance can improve the student and later be internalized through fine\-tuning or distillation, then graphs serve not only as external knowledge but also as a medium for teaching structured thought\.

A natural way to implement graph\-structured reasoning is through text\. Prior work such as Graph\-of\-Thoughts and related methods has explored how non\-linear reasoning structures can be represented within language\-based prompting frameworks \(Besta et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib5); Yao et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib6); Han et al\., [2025b](https://arxiv.org/html/2606.02673#bib.bib7)\)\. However, text remains a linear medium\. Once a graph is flattened into sentences, its topology must be described indirectly, often making the guidance more redundant and harder to learn from\. This motivates our alternative interface: graph as image\. In our pipeline, teacher reasoning is rendered as a graph\-structured mind map and provided to a student VLM, while textual guidance serves as a controlled baseline\. This design lets us ask whether the benefit comes from reasoning content alone, or from preserving reasoning topology in a visual form\.

![Refer to caption](https://arxiv.org/html/2606.02673v1/x1.png)

Figure 1: Overview of the Graph\-Guided Reasoning framework\. \(a\) Teacher Reasoning & Answer: A strong teacher model solves a multi\-hop question and generates a detailed reasoning trace\. \(b\) Guidance Generation: The teacher’s reasoning is transformed into four types of guidance artifacts: Textual vs\. Visual modalities, and Direct vs\. Abstract styles\. \(c\) Student Answer: A student model utilizes the generated guidance to arrive at the correct answer\. \(d\) Distillation: The student internalizes the structured reasoning via SFT or KL\.

To make this comparison meaningful, we distinguish between two guidance settings\. In the direct setting, guidance may contain answer\-local hints such as key facts or intermediate conclusions\. In the abstract setting, such hints are forbidden: the guidance may only describe general reasoning strategies and structural relations, without leaking the final answer, answer\-specific facts, or intermediate conclusions\. The abstract setting is central to our study because it tests whether the student can use the graph as a reasoning scaffold rather than as a shortcut to the answer\. Our experiments reveal a clear modality gap\. When answer\-local hints are allowed, visual graph guidance and textual guidance perform similarly\. When guidance must remain abstract, however, visual graph guidance remains effective while textual guidance degrades sharply\. This advantage also persists after supervised fine\-tuning \(SFT\) and KL\-based distillation, and is accompanied by shorter reasoning outputs\.

These findings suggest a broader role for graphs in graph–LLM integration\. Beyond serving as external knowledge, graphs can act as interfaces for transferring the organization of reasoning itself\. By preserving branching, convergence, and local dependencies in a compact form, visual graphs expose structure that is difficult to maintain when the same reasoning process is flattened into text\. Our position is that visual graph guidance should be treated as a topology\-preserving interface for structured reasoning, not merely as another form of prompting\. In this interface, vision is a particularly promising modality because it can preserve and present graph topology instead of serializing it linearly into text\.

## 2 Related Work

Many graph\-LLM works treat graphs as external structures for retrieval or grounding\. GraphRAG constructs entity–relation graphs from corpora for retrieval over local facts and corpus\-level structure \(Edge et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib12)\)\. G\-Retriever retrieves compact, reasoning\-relevant subgraphs before generation \(He et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib3)\)\. ToG\-2 alternates knowledge\-graph traversal with textual context retrieval for deeper multi\-hop reasoning \(Ma et al\., [2025](https://arxiv.org/html/2606.02673#bib.bib13)\), while GNN\-RAG uses graph neural retrieval to identify question\-relevant nodes and paths before prompting the LLM \(Mavromatis and Karypis, [2025](https://arxiv.org/html/2606.02673#bib.bib14)\)\. A related line uses graphs to organize reasoning itself: Graph of Thoughts represents intermediate reasoning states as nodes and dependencies as edges, enabling non\-linear reasoning trajectories \(Besta et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib5); Yao et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib6)\)\. Recent work on visual graph reasoning further shows that vision\-language models can benefit from graph images\. Wang et al\. \(Wang et al\., [2023](https://arxiv.org/html/2606.02673#bib.bib8)\) show that LLMs struggle with text\-based graph reasoning, Zhao et al\. \(Zhao et al\., [2025](https://arxiv.org/html/2606.02673#bib.bib11)\) find that vision encoders can outperform GNNs on global structural understanding, and Wei et al\. \(Wei et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib10)\) and Zhu et al\. \(Zhu et al\., [2025](https://arxiv.org/html/2606.02673#bib.bib9)\) use image\-based representations of raw graph inputs to improve graph reasoning performance\. However, these studies mostly use graphs as external knowledge, textual reasoning structures, or visualizations of graph inputs\. They leave underexplored whether a graph image can externalize and transfer the structure of a reasoning process itself\. This paper focuses on that interface question: instead of asking whether a model can solve a graph problem from an image, we ask whether an image can carry a reasoning trajectory from one model to another\.

## 3 Visual Graphs Are Reasoning Interfaces

To study how the presentation of graphs affects reasoning in LLMs, we design a pipeline that compares different representations of the same teacher guidance, focusing on rendered graph images versus linear text\. As illustrated in Figure [1](https://arxiv.org/html/2606.02673#S1.F1), the pipeline has three main stages: teacher trajectory generation, guidance construction, and student use of guidance, with distillation as an internalization step\. This design isolates whether the same reasoning structure is more usable when exposed visually rather than flattened into text\.

### 3\.1 Teacher Trajectory Generation

We first identify QA examples that the base student answers incorrectly\. For each such case, a stronger teacher model solves the same question and produces an explicit reasoning trajectory\. We keep only cases where the teacher’s answer is verified as correct, so that the following comparison focuses on how the reasoning is transferred rather than whether the teacher solved the problem\. These verified teacher trajectories are then rewritten into guidance artifacts for the student\.
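The selection step above can be sketched as a simple filter. This is a minimal illustration rather than the authors' implementation: the `Record` fields and the `select_teacher_correct_failures` helper are hypothetical names, and the correctness flags are assumed to come from an external verifier.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One QA example with hypothetical correctness flags from a verifier."""
    question: str
    student_correct: bool  # base student answered correctly without guidance
    teacher_correct: bool  # teacher answer was verified as correct
    teacher_trace: str     # explicit teacher reasoning trajectory


def select_teacher_correct_failures(records):
    """Keep cases the student got wrong and the teacher verifiably got right."""
    return [r for r in records if not r.student_correct and r.teacher_correct]


# Toy records for illustration only.
records = [
    Record("q1", student_correct=True, teacher_correct=True, teacher_trace="..."),
    Record("q2", student_correct=False, teacher_correct=True, teacher_trace="..."),
    Record("q3", student_correct=False, teacher_correct=False, teacher_trace="..."),
]
pool = select_teacher_correct_failures(records)
print([r.question for r in pool])  # ['q2']
```

Only `q2` survives: `q1` is not a student failure, and `q3` has no verified teacher answer to transfer.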

### 3\.2 Guidance Construction

Each teacher trajectory is converted into guidance along two axes: modality and content style\. For modality, image guidance converts the teacher trajectory into Graphviz DOT code and renders it as a graph\-structured mind map, while text guidance expresses the same kind of support in plain text\. We also construct a graph\-to\-text control, which converts the teacher\-generated graph code into text after graph construction\. This control preserves graph node content while removing visual layout\.
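To make the two modalities concrete, here is a minimal sketch of serializing one small reasoning graph as Graphviz DOT source and as flattened text. The node labels, the `to_dot` and `graph_to_text` helpers, and the sentence template are illustrative assumptions; the paper's actual prompts and conversion rules are deferred to its appendix.

```python
def to_dot(nodes, edges):
    """Serialize a reasoning graph as Graphviz DOT source."""
    lines = ["digraph mindmap {", "  node [shape=box];"]
    for node_id, label in nodes.items():
        safe = label.replace('"', '\\"')  # escape quotes inside labels
        lines.append(f'  {node_id} [label="{safe}"];')
    for src, dst in edges:
        lines.append(f"  {src} -> {dst};")
    lines.append("}")
    return "\n".join(lines)


def graph_to_text(nodes, edges):
    """Flatten the same graph into sentences (hypothetical template)."""
    sentences = [f"Step {node_id}: {label}." for node_id, label in nodes.items()]
    sentences += [f"Step {src} leads to step {dst}." for src, dst in edges]
    return " ".join(sentences)


# A toy abstract-style graph with one branch and one convergence.
nodes = {
    "q": "Parse the question",
    "a": "Resolve the first entity",
    "b": "Resolve the second entity",
    "c": "Compare the two results",
}
edges = [("q", "a"), ("q", "b"), ("a", "c"), ("b", "c")]

dot_source = to_dot(nodes, edges)
flat_text = graph_to_text(nodes, edges)
print(dot_source)
print(flat_text)
```

Saving `dot_source` to a file such as `mindmap.dot` and running `dot -Tpng mindmap.dot -o mindmap.png` with Graphviz installed would produce the image form. The flattened text needs one sentence per edge to convey the branch and merge that the drawing shows at a glance.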

For content style, direct guidance includes task\-specific hints, key facts, and intermediate conclusions\. In contrast, abstract guidance may include only general reasoning strategies and logical operations, and must exclude answer\-specific clues\. Direct guidance is included because it provides answer\-local information, while abstract guidance tests the student’s ability to truly use the graph structure like a mind map to reason\.
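The text does not state how the abstract constraint is enforced. As one naive sanity check, and purely as an assumption for illustration, guidance could be screened for a verbatim occurrence of the gold answer. The `normalize` and `leaks_answer` helpers below are hypothetical, and a string match of this kind would miss paraphrased or partial leaks.

```python
import re


def normalize(text):
    """Lowercase and collapse non-alphanumeric runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def leaks_answer(guidance, gold_answer):
    """Return True if the normalized gold answer appears verbatim in the guidance."""
    answer = normalize(gold_answer)
    return bool(answer) and f" {answer} " in f" {normalize(guidance)} "


direct = "Key fact: the film was directed by Christopher Nolan."
abstract = "First identify the director of the film, then compare birth years."
print(leaks_answer(direct, "Christopher Nolan"))    # True
print(leaks_answer(abstract, "Christopher Nolan"))  # False
```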

### 3\.3 Student Use of Guidance

The student uses the constructed guidance in two ways\. First, in guided re\-evaluation, the student is frozen and re\-answers its original failure cases with teacher guidance\. This measures the immediate usefulness of each guidance interface\. Second, in internalization, the student is trained on successful guided behavior and later tested without guidance\. We consider both Self\-SFT, which fine\-tunes the student on its own correct guided responses, and KL distillation, where a guided branch provides soft targets to an unguided branch\. Together, these settings ask whether visual graphs help only as inference\-time prompts, or whether their structured signal can be absorbed into the model\.
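The KL distillation objective can be sketched as follows. The text only says that a guided branch provides soft targets to an unguided branch, so the direction of the divergence, the temperature, and the per-token averaging here are assumptions. The example uses plain Python lists of logits instead of a deep learning framework to stay self-contained.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(guided_logits, unguided_logits, temperature=1.0):
    """Mean per-token KL from guided soft targets to unguided predictions."""
    losses = [
        kl_divergence(softmax(g, temperature), softmax(u, temperature))
        for g, u in zip(guided_logits, unguided_logits)
    ]
    return sum(losses) / len(losses)


# Two token positions over a toy three-word vocabulary.
guided = [[2.0, 0.0, 0.0], [0.0, 1.5, 0.0]]
unguided = [[0.0, 0.0, 2.0], [0.0, 1.5, 0.0]]
loss = distillation_loss(guided, unguided)
print(loss > 0.0)
```

In training, only the unguided branch would receive gradients, so minimizing this loss pulls the student's guidance-free predictions toward what it produces when the guidance is present.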

## 4 Experiments

We study the following research questions:

- RQ1: Can visual graphs effectively guide the reasoning of student models?
- RQ2: Can student models internalize the advantages of graph\-based guidance?
- RQ3: What role does graph structure play in visual reasoning guidance?

To answer these questions, we first perform guided re\-evaluation on teacher\-correct failure cases with a frozen student model\. We then examine whether the observed advantage persists after internalization through Self\-SFT and KL distillation\. Finally, we analyze why images help by measuring output length and ablating graph topology\.

### 4\.1 Experiment Setting

The main experiments are conducted on three classic multi\-hop QA datasets: HotpotQA \(Yang et al\., [2018](https://arxiv.org/html/2606.02673#bib.bib18)\), 2WikiMultiHopQA \(Ho et al\., [2020](https://arxiv.org/html/2606.02673#bib.bib19)\), and MuSiQue \(Trivedi et al\., [2022](https://arxiv.org/html/2606.02673#bib.bib20)\)\. The supervision dataset is built from the training splits of these three datasets\. After semantic validation, it contains 14,490 teacher\-correct cases for guidance\-based re\-evaluation and downstream internalization\. The held\-out QA test set contains 3,000 questions in total, with 1,000 sampled from each dataset\. For the ablation study, we use a separate 3,000\-example subset sampled from the training\-split teacher\-correct pool, again with 1,000 examples per dataset\. We report two QA evaluations\. The first is guided re\-evaluation on the teacher\-guided failure set\. The second is QA on the test set after internalization\. In the experiments, we instantiate the teacher, student, and verifier with DeepSeek\-V3\.2 \(DeepSeek\-AI, [2025](https://arxiv.org/html/2606.02673#bib.bib15)\), Qwen3\-VL\-8B\-Instruct \(Qwen Team, [2025b](https://arxiv.org/html/2606.02673#bib.bib16)\), and Qwen3\-8B\-Instruct \(Qwen Team, [2025a](https://arxiv.org/html/2606.02673#bib.bib17)\), respectively\. Full prompts, setup details, and hyperparameters are deferred to the appendix\.
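A balanced sample of this shape (1,000 questions from each of three datasets) can be drawn as sketched below. The sampling procedure, the seed, and the `sample_balanced` helper are assumptions, since the text reports only the resulting counts. The pools are placeholder integer IDs, not real data.

```python
import random
from collections import Counter


def sample_balanced(pools, per_dataset, seed=0):
    """Draw the same number of examples from each dataset pool, reproducibly."""
    rng = random.Random(seed)
    sample = []
    for name in sorted(pools):  # fixed iteration order keeps the draw deterministic
        sample.extend((name, ex) for ex in rng.sample(pools[name], per_dataset))
    return sample


# Placeholder pools of example IDs; the sizes are arbitrary.
pools = {
    "HotpotQA": list(range(5000)),
    "2WikiMultiHopQA": list(range(5000)),
    "MuSiQue": list(range(2000)),
}
test_set = sample_balanced(pools, per_dataset=1000)
print(len(test_set), Counter(name for name, _ in test_set))
```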

### 4\.2 Main Results

Table 1: Main QA results\. Re\-eval reports guided re\-evaluation accuracy \(%\) on teacher\-correct QA failures with a frozen student\. Self\-SFT and KL report held\-out QA accuracy \(%\) on the test set after internalization\. "Train with teacher CoT" denotes SFT directly on the teacher’s reasoning content\.

##### Finding 1: The image advantage appears when reasoning guidance must remain structural rather than answer\-local\.

Table [1](https://arxiv.org/html/2606.02673#S4.T1) shows that modality matters little in the direct setting\. During re\-evaluation, direct image and direct text guidance are essentially tied\. This is consistent with the idea that once strong answer\-local hints are present, changing the modality matters little\. The pattern changes in the abstract setting\. Here the guidance must teach how to reason rather than reveal what answer to recover\. Under this constraint, abstract image guidance remains strong, while abstract text guidance falls sharply and the graph\-to\-text control drops even further\. These results suggest that images become especially valuable when the student must rely on structural guidance rather than answer\-local clues\.

##### Finding 2: This advantage survives internalization\.

The Self\-SFT and KL blocks in Table [1](https://arxiv.org/html/2606.02673#S4.T1) show that the same ordering persists after training\. In Self\-SFT, image guidance remains stronger than text in both styles\. KL distillation shows the same pattern, with the largest gap again appearing in the abstract setting\. The graph\-to\-text control again remains below image guidance\. This indicates that image\-based graph\-structured guidance is easier for the student to internalize than its text counterpart\.

### 4\.3 Why Images Help

##### Finding 3: Images help by offering a shorter but more structured interface\.

Table [2](https://arxiv.org/html/2606.02673#S4.T2) reports average output length in the abstract setting\. In re\-evaluation, abstract image guidance yields only 226 output tokens on average, compared with 703 for abstract text and 697 for graph\-to\-text\. After training, the abstract image models also remain far shorter than their text counterparts\. This suggests that the graph representation provides a strong compression effect\. Rather than spending tokens on unpacking a long sequential text guidance, the model can recover the reasoning process from a more concise structural interface via vision\. In this sense, the rendered image preserves the compactness with which graphs express complex relations while foregrounding the reasoning pattern\.
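As a quick worked calculation, the three re-evaluation averages quoted above translate into the relative reductions below. Only the token counts come from the text; the `relative_reduction` helper is an illustrative name.

```python
# Average output tokens in abstract-setting re-evaluation, as quoted in the text.
avg_tokens = {"abstract image": 226, "abstract text": 703, "graph-to-text": 697}


def relative_reduction(short, long):
    """Fraction of output tokens saved when going from `long` to `short`."""
    return 1.0 - short / long


for name in ("abstract text", "graph-to-text"):
    saved = relative_reduction(avg_tokens["abstract image"], avg_tokens[name])
    print(f"image vs {name}: {saved:.1%} fewer output tokens")
# image vs abstract text: 67.9% fewer output tokens
# image vs graph-to-text: 67.6% fewer output tokens
```

In both comparisons the image-guided student produces roughly one third as many output tokens.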

Table 2: Average output tokens in the abstract setting\. Image guidance stays substantially shorter than text\-based alternatives\.

##### Finding 4: Topology preservation is critical for abstract visual guidance\.

In Table [3](https://arxiv.org/html/2606.02673#S4.T3), we conduct an ablation study to analyze the effect of structure in image guidance\. When abstract image guidance is forced into a chain, or when its node budget is sharply reduced before rendering, re\-evaluation accuracy drops substantially\. The drop points to the role of preserved branching and convergence in abstract guidance, rather than to visual presentation alone\. In this setting, graph topology appears to carry part of the supervision signal\.
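The two ablations can be illustrated on a toy graph. The text does not specify how a graph is forced into a chain or how nodes are pruned, so the topological-order chain and the prefix-based node budget below are assumptions, and `force_chain` and `limit_nodes` are hypothetical helpers.

```python
from collections import deque


def topological_order(nodes, edges):
    """Kahn's algorithm; assumes the reasoning graph is a DAG."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order


def force_chain(nodes, edges):
    """Replace all edges with a single path through the nodes in topological order."""
    order = topological_order(nodes, edges)
    return list(zip(order, order[1:]))


def limit_nodes(nodes, edges, budget):
    """Keep the first `budget` nodes in topological order and the edges among them."""
    kept = set(topological_order(nodes, edges)[:budget])
    kept_nodes = [n for n in nodes if n in kept]
    kept_edges = [(s, d) for s, d in edges if s in kept and d in kept]
    return kept_nodes, kept_edges


nodes = ["q", "a", "b", "c"]
edges = [("q", "a"), ("q", "b"), ("a", "c"), ("b", "c")]
print(force_chain(nodes, edges))     # [('q', 'a'), ('a', 'b'), ('b', 'c')]
print(limit_nodes(nodes, edges, 2))  # (['q', 'a'], [('q', 'a')])
```

Both variants keep some node content but erase the branch at `q` and the merge at `c`, which is the structure the finding identifies as carrying part of the signal.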

Table 3: Abstract image\-guidance ablations on guided re\-evaluation accuracy \(%\)\.

## 5 Limitations

##### Limitation 1: Transfer beyond the QA family remains limited\.

The image advantage is clearest within the QA family on which guidance is constructed and the student is trained\. Although the same ordering holds on a separate six\-dataset reasoning benchmark, the absolute performance remains modest, as shown in Table [4](https://arxiv.org/html/2606.02673#S5.T4)\. This suggests that the current pipeline is still task\-specific and does not yet establish a broadly reusable structured\-reasoning capability\.

Table 4: Out\-of\-domain reasoning as a limitation signal\.

A stronger test would require guidance construction and training across more diverse reasoning families, rather than transfer from multi\-hop QA alone\.

##### Limitation 2: Image supervision does not yet match direct CoT supervision\.

As Table [1](https://arxiv.org/html/2606.02673#S4.T1) shows, image guidance is stronger than text guidance within the constrained guidance pipeline studied here, but it still does not outperform direct training on the teacher’s CoT\. The contribution should therefore be read as a claim about a better interface for transferring structure under restricted supervision, not as evidence that image guidance is a complete replacement for current reasoning distillation\.

## 6 Conclusion

This paper argues that graphs should be studied not only as external knowledge for LLMs, but also as topology\-preserving interfaces for organizing reasoning\. We find that visual graph guidance matches text guidance when answer\-local hints are available, but remains far more effective when guidance must stay abstract\. This advantage persists after internalization and weakens when graph topology is disrupted, suggesting that the useful signal lies in preserved structure rather than visual presentation alone\. A broader takeaway is that graph–LLM integration should not be framed only as retrieval or grounding\. Graphs can serve as compact abstractions for reasoning organization, and vision offers a promising modality for exposing this structure without flattening it into text\.

## References

- M\. Besta, N\. Blach, A\. Kubicek, R\. Gerstenberger, M\. Podstawski, L\. Gianinazzi, J\. Gajda, T\. Lehmann, H\. Niewiadomski, P\. Nyczyk, and T\. Hoefler \(2024\) Graph of thoughts: solving elaborate problems with large language models\. In Thirty\-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty\-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20\-27, 2024, Vancouver, Canada, M\. J\. Wooldridge, J\. G\. Dy, and S\. Natarajan \(Eds\.\), pp\. 17682–17690\. [Link](https://doi.org/10.1609/aaai.v38i16.29720)
- DeepSeek\-AI \(2025\) DeepSeek\-v3\.2: pushing the frontier of open large language models\. CoRR abs/2512\.02556\. [Link](https://doi.org/10.48550/arXiv.2512.02556)
- D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, and J\. Larson \(2024\) From local to global: A graph RAG approach to query\-focused summarization\. CoRR abs/2404\.16130\. [Link](https://doi.org/10.48550/arXiv.2404.16130)
- H\. Han, Y\. Wang, H\. Shomer, K\. Guo, J\. Ding, Y\. Lei, M\. Halappanavar, R\. A\. Rossi, S\. Mukherjee, X\. Tang, Q\. He, Z\. Hua, B\. Long, T\. Zhao, N\. Shah, A\. Javari, Y\. Xia, and J\. Tang \(2025a\) Retrieval\-augmented generation with graphs \(graphrag\)\. CoRR abs/2501\.00309\. [Link](https://doi.org/10.48550/arXiv.2501.00309)
- H\. Han, Y\. Xie, H\. Liu, X\. Tang, S\. Nag, W\. Headden, Y\. Li, C\. Luo, S\. Ji, Q\. He, and J\. Tang \(2025b\) Reasoning with graphs: structuring implicit knowledge to enhance llms reasoning\. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 \- August 1, 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Findings of ACL, pp\. 25698–25714\. [Link](https://aclanthology.org/2025.findings-acl.1319/)
- X\. He, Y\. Tian, Y\. Sun, N\. V\. Chawla, T\. Laurent, Y\. LeCun, X\. Bresson, and B\. Hooi \(2024\) G\-retriever: retrieval\-augmented generation for textual graph understanding and question answering\. In Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024, A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\)\. [Link](http://papers.nips.cc/paper_files/paper/2024/hash/efaf1c9726648c8ba363a5c927440529-Abstract-Conference.html)
- X\. Ho, A\. D\. Nguyen, S\. Sugawara, and A\. Aizawa \(2020\) Constructing A multi\-hop QA dataset for comprehensive evaluation of reasoning steps\. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain \(Online\), December 8\-13, 2020, D\. Scott, N\. Bel, and C\. Zong \(Eds\.\), pp\. 6609–6625\. [Link](https://doi.org/10.18653/v1/2020.coling-main.580)
- J\. Liu, L\. Cui, H\. Liu, D\. Huang, Y\. Wang, and Y\. Zhang \(2020\) LogiQA: A challenge dataset for machine reading comprehension with logical reasoning\. In Proceedings of the Twenty\-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, C\. Bessiere \(Ed\.\), pp\. 3622–3628\. [Link](https://doi.org/10.24963/ijcai.2020/501)
- S\. Ma, C\. Xu, X\. Jiang, M\. Li, H\. Qu, C\. Yang, J\. Mao, and J\. Guo \(2025\) Think\-on\-graph 2\.0: deep and faithful large language model reasoning with knowledge\-guided retrieval augmented generation\. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025\. [Link](https://openreview.net/forum?id=oFBu7qaZpS)
- C\. Mavromatis and G\. Karypis \(2025\) GNN\-RAG: graph neural retrieval for efficient large language model reasoning on knowledge graphs\. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 \- August 1, 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Findings of ACL, pp\. 16682–16699\. [Link](https://aclanthology.org/2025.findings-acl.856/)
- M\. Nezhurina, L\. Cipolina\-Kun, M\. Cherti, and J\. Jitsev \(2024\) Alice in wonderland: simple tasks showing complete reasoning breakdown in state\-of\-the\-art large language models\. CoRR abs/2406\.02061\. [Link](https://doi.org/10.48550/arXiv.2406.02061)
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\) Measuring and narrowing the compositionality gap in language models\. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6\-10, 2023, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Findings of ACL, pp\. 5687–5711\. [Link](https://doi.org/10.18653/v1/2023.findings-emnlp.378)
- Qwen Team \(2025a\) Qwen3 technical report\. CoRR abs/2505\.09388\. [Link](https://doi.org/10.48550/arXiv.2505.09388)
- Qwen Team \(2025b\) Qwen3\-vl technical report\. CoRR abs/2511\.21631\. [Link](https://doi.org/10.48550/arXiv.2511.21631)
- K\. Sinha, S\. Sodhani, J\. Dong, J\. Pineau, and W\. L\. Hamilton \(2019\) CLUTRR: A diagnostic benchmark for inductive reasoning from text\. pp\. 4505–4514\. [Link](https://doi.org/10.18653/v1/D19-1458)
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\) MuSiQue: multihop questions via single\-hop question composition\. Trans\. Assoc\. Comput\. Linguistics 10, pp\. 539–554\. [Link](https://doi.org/10.1162/tacl_a_00475)
- H\. Wang, S\. Feng, T\. He, Z\. Tan, X\. Han, and Y\. Tsvetkov \(2023\) Can language models solve graph problems in natural language?\. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023, A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\)\. [Link](http://papers.nips.cc/paper_files/paper/2023/hash/622afc4edf2824a1b6aaf5afe153fa93-Abstract-Conference.html)
- Y\. Wei, S\. Fu, W\. Jiang, Z\. Zhang, Z\. Zeng, Q\. Wu, J\. T\. Kwok, and Y\. Zhang \(2024\) GITA: graph to visual and textual integration for vision\-language graph reasoning\. In Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024, A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\)\. [Link](http://papers.nips.cc/paper_files/paper/2024/hash/00295cede6e1600d344b5cd6d9fd4640-Abstract-Conference.html)
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: A dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 \- November 4, 2018, E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\), pp\. 2369–2380\. [Link](https://doi.org/10.18653/v1/d18-1259)
- Y\. Yao, Z\. Li, and H\. Zhao \(2024\) GoT: effective graph\-of\-thought reasoning in language models\. In Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16\-21, 2024, K\. Duh, H\. Gómez\-Adorno, and S\. Bethard \(Eds\.\), Findings of ACL, pp\. 2901–2921\. [Link](https://doi.org/10.18653/v1/2024.findings-naacl.183)
- G\. Zhang, M\. Fu, G\. Wan, M\. Yu, K\. Wang, and S\. Yan \(2025\) G\-memory: tracing hierarchical memory for multi\-agent systems\. CoRR abs/2506\.07398\. [Link](https://doi.org/10.48550/arXiv.2506.07398)
- X\. Zhao, W\. Pang, Z\. Xue, X\. Jian, L\. Zhang, Y\. Xu, X\. Song, S\. Wu, and T\. Yu \(2025\) The underappreciated power of vision models for graph structural understanding\. CoRR abs/2510\.24788\. [Link](https://doi.org/10.48550/arXiv.2510.24788)
- W\. Zhong, S\. Wang, D\. Tang, Z\. Xu, D\. Guo, J\. Wang, J\. Yin, M\. Zhou, and N\. Duan \(2021\) AR\-LSAT: investigating analytical reasoning of text\. CoRR abs/2104\.06598\. [Link](https://arxiv.org/abs/2104.06598)
- Y\. Zhu, X\. Bai, K\. Chen, Y\. Xiang, J\. Yu, and M\. Zhang \(2025\) Benchmarking and improving large vision\-language models for fundamental visual graph understanding and reasoning\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), pp\. 30678–30701\. [Link](https://aclanthology.org/2025.acl-long.1482/)

## Appendix A Additional Details

### A\.1 Experimental Setup

##### Classic QA supervision pool\.

The main pipeline is built on the training splits of HotpotQA \(20,000 examples\), 2WikiMultiHopQA \(20,000\), and MuSiQue \(19,938\)\. The base student is first evaluated on these training questions with a direct\-answer prompt\. Teacher generation is then applied to the student\-failure pool\. After semantic validation, this yields 18,785 usable teacher records, of which 14,490 are teacher\-correct\. These teacher\-correct cases are the source of guided re\-evaluation and downstream internalization\. The number of evaluable examples can differ across guidance conditions because image guidance requires valid graph code and successful rendering, whereas text guidance only requires non\-empty textual guidance\. Consequently, the re\-evaluation accuracy in Table [1](https://arxiv.org/html/2606.02673#S4.T1) is computed with the condition\-specific evaluable denominator, which is not always 14,490\.
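The denominator rule above can be illustrated with a minimal Python sketch\. The pool sizes 18,785 and 14,490 come from the text; the class name `ConditionResult` and the per\-condition counts are hypothetical and chosen only to show that two conditions with the same number of correct responses can have different accuracies\.

```python
from dataclasses import dataclass

# Pool sizes reported in the text.
TEACHER_RECORDS = 18_785   # usable teacher records after semantic validation
TEACHER_CORRECT = 14_490   # teacher-correct subset


@dataclass
class ConditionResult:
    """Guided re-evaluation counts for one guidance condition (hypothetical helper)."""
    name: str
    evaluable: int  # examples with valid guidance under this condition
    correct: int    # guided student responses judged correct

    def accuracy(self) -> float:
        # Accuracy uses the condition-specific evaluable count,
        # not the full teacher-correct pool, as its denominator.
        if self.evaluable == 0:
            raise ValueError(f"condition {self.name!r} has no evaluable examples")
        return self.correct / self.evaluable


teacher_correct_rate = TEACHER_CORRECT / TEACHER_RECORDS

# Hypothetical counts, NOT taken from the paper: the image condition loses
# examples whose graph code fails to render, so its denominator is smaller.
image = ConditionResult("image", evaluable=900, correct=450)
text = ConditionResult("text", evaluable=1000, correct=450)

print(f"teacher-correct rate: {teacher_correct_rate:.3f}")
print(f"image accuracy: {image.accuracy():.3f}, text accuracy: {text.accuracy():.3f}")
```

With these illustrative counts, both conditions have 450 correct responses, yet the accuracies differ because each is divided by its own evaluable count\.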

##### Held\-out QA test set\.

The held\-out QA test set contains 3,000 examples in total\. It is constructed by sampling 1,000 test questions from each of HotpotQA \(Yang et al\., [2018](https://arxiv.org/html/2606.02673#bib.bib18)\), 2WikiMultiHopQA \(Ho et al\., [2020](https://arxiv.org/html/2606.02673#bib.bib19)\), and MuSiQue \(Trivedi et al\., [2022](https://arxiv.org/html/2606.02673#bib.bib20)\)\. All main QA results in the paper use this fixed test set\.
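A fixed, per\-dataset sample of this kind could be drawn as in the following sketch\. The function name `build_fixed_test_set`, the seed value, and the synthetic question pools are assumptions for illustration; the text does not state how the sampling was seeded or implemented\.

```python
import random


def build_fixed_test_set(pools, per_dataset=1000, seed=0):
    """Sample `per_dataset` questions from each pool with a fixed seed.

    `pools` maps a dataset name to a list of questions. The seed value is an
    assumption; it only makes the resulting test set reproducible.
    """
    rng = random.Random(seed)
    test_set = []
    for name in sorted(pools):  # sorted so the result does not depend on dict order
        sampled = rng.sample(pools[name], per_dataset)
        test_set.extend((name, question) for question in sampled)
    return test_set


# Synthetic stand-ins for the three QA test pools (placeholder question IDs).
pools = {
    name: [f"{name}-q{i}" for i in range(5000)]
    for name in ("HotpotQA", "2WikiMultiHopQA", "MuSiQue")
}

test_set = build_fixed_test_set(pools)
print(len(test_set))
```

Seeding a dedicated `random.Random` instance keeps the sample stable across runs, which is one way to keep every reported result on the same 3,000 questions\.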

##### Ablation diagnostic subset\.

The ablation study uses a separate 3,000\-example subset sampled from the training\-split teacher\-correct pool, with 1,000 examples from each QA dataset\. This subset is not the held\-out QA test set\. It is used because the ablation experiment must regenerate teacher graph code under alternative structural constraints and then re\-evaluate the frozen student on the same kind of teacher\-guided failure cases as the main re\-evaluation experiment\. After filtering to examples with valid guidance, the abstract\-image ablation baseline contains 2,827 examples\.

##### Reasoning benchmark construction\.

The out\-of\-domain reasoning benchmark is constructed separately from six reasoning datasets: CLUTRR \(Sinha et al\., [2019](https://arxiv.org/html/2606.02673#bib.bib21)\), LogiQA \(Liu et al\., [2020](https://arxiv.org/html/2606.02673#bib.bib22)\), AR\-LSAT \(Zhong et al\., [2021](https://arxiv.org/html/2606.02673#bib.bib23)\), AIW\-Easy \(Nezhurina et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib24)\), AIW\-Hard \(Nezhurina et al\., [2024](https://arxiv.org/html/2606.02673#bib.bib24)\), and Bamboogle \(Press et al\., [2023](https://arxiv.org/html/2606.02673#bib.bib25)\)\. Unlike the QA test set, this benchmark is not re\-sampled to a fixed per\-dataset size\. The resulting benchmark contains 2,544 examples in total\. Its dataset composition is shown in Table [5](https://arxiv.org/html/2606.02673#A1.T5)\.

Table 5: Construction of the out\-of\-domain reasoning benchmark\.

Three details matter for interpreting this benchmark\. First, the reasoning benchmark is used only for evaluation, not for the main guidance\-construction claim\. Second, CLUTRR, LogiQA, and AR\-LSAT use the context\-present prompt branch, whereas AIW\-Easy, AIW\-Hard, and Bamboogle are treated as no\-context datasets\. Third, LogiQA and AR\-LSAT additionally require the final answer to include both the option letter and the full option text\.

##### Prompting, rendering, and decoding\.

The base QA failure pool is created with a standard non\-chain\-of\-thought QA prompt and no external guidance\. Guided re\-evaluation, held\-out QA testing, and the downstream training data used in Self\-SFT and KL all use chain\-of\-thought QA prompts\. For image guidance, the teacher first generates Graphviz DOT graph code\. That graph code is then rendered at 8×8 inches and 150 DPI, and the resulting image is resized to 1024×1024\. Inference uses temperature 0\.2 and top\-p 0\.95\. The teacher\-generation API is called with temperature 0\.1\. The verifier uses Qwen3\-8B \(Qwen Team, [2025a](https://arxiv.org/html/2606.02673#bib.bib17)\)\.
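The rendering and decoding settings above can be collected in a short Python sketch\. It only builds a Graphviz `dot` command line and does the pixel arithmetic; it does not invoke Graphviz or resize an image\. The helper names, the file paths, and the choice of the `-Gsize` and `-Gdpi` graph attributes are assumptions, since the text gives the target size and resolution but not the exact invocation\.

```python
RENDER_INCHES = 8      # canvas side in inches, from the text
RENDER_DPI = 150       # rendering resolution, from the text
TARGET_PIXELS = 1024   # side length after resizing, from the text

# Decoding settings reported in the text.
STUDENT_SAMPLING = {"temperature": 0.2, "top_p": 0.95}
TEACHER_SAMPLING = {"temperature": 0.1}


def dot_render_command(dot_path, png_path):
    """Build a `dot` command that requests an 8x8 inch, 150 DPI PNG.

    Assumption: size and dpi are passed as graph attributes with -G.
    The trailing '!' asks Graphviz to scale the drawing up to the size.
    """
    return [
        "dot",
        "-Tpng",
        f"-Gsize={RENDER_INCHES},{RENDER_INCHES}!",
        f"-Gdpi={RENDER_DPI}",
        "-o",
        png_path,
        dot_path,
    ]


def rendered_side_pixels():
    # A full 8 inch side at 150 DPI is 8 * 150 = 1200 pixels.
    return RENDER_INCHES * RENDER_DPI


def resize_scale():
    # Scale factor needed to bring a 1200 pixel side down to 1024 pixels.
    return TARGET_PIXELS / rendered_side_pixels()


command = dot_render_command("mind_map.dot", "mind_map.png")  # hypothetical paths
print(" ".join(command))
print(rendered_side_pixels(), round(resize_scale(), 3))
```

Under these assumptions a full canvas is 1200 pixels per side before resizing, so the final 1024×1024 image is a modest downscale rather than an upscale\.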

##### Training setup\.

Train with teacher CoT uses all 14,490 teacher\-correct QA cases\. It always targets the teacher’s original round\-1 chain\-of\-thought response, even when correctness was confirmed by the later semantic\-validation step\. The correct re\-evaluation responses used for downstream Self\-SFT and KL training are condition\-specific\. Table [6](https://arxiv.org/html/2606.02673#A1.T6) reports the corresponding denominators and correct\-response counts\.

Table 6: Condition\-specific guided re\-evaluation denominators and training\-response counts\. The number of correct guided responses is also the number of responses used by the corresponding Self\-SFT/KL condition\.

##### Optimization details\.

Both Self\-SFT and KL use LoRA rank 32, LoRA alpha 64, dropout 0\.05, learning rate $10^{-5}$, weight decay 0\.01, warmup ratio 0\.05, batch size 2, gradient accumulation 32, and maximum sequence length 4096\. The LoRA adapters are attached to `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj`\. KL distillation uses temperature $\tau = 2.0$ and the three\-epoch schedule $\alpha = \{1.0, 0.5, 0.0\}$\.
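As a minimal sketch, the hyperparameters above can be gathered into one configuration object, together with a temperature\-softened KL term\. The class and function names are hypothetical\. Two points are assumptions rather than statements from the text: the KL direction \(teacher to student\) and the reading of the schedule as one $\alpha$ value per epoch; this section does not say which loss term $\alpha$ weights, so the sketch only looks the value up\.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in the text (container name is hypothetical)."""
    lora_rank: int = 32
    lora_alpha: int = 64
    lora_dropout: float = 0.05
    learning_rate: float = 1e-5
    weight_decay: float = 0.01
    warmup_ratio: float = 0.05
    batch_size: int = 2
    grad_accum: int = 32
    max_seq_len: int = 4096
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )
    kl_temperature: float = 2.0
    alpha_schedule: tuple = (1.0, 0.5, 0.0)

    @property
    def effective_batch_size(self) -> int:
        # Sequences per optimizer step on one device: 2 * 32 = 64.
        return self.batch_size * self.grad_accum

    def alpha_for_epoch(self, epoch: int) -> float:
        # Assumption: one alpha value per epoch, indexed from 0.
        return self.alpha_schedule[epoch]


def softmax_with_temperature(logits, temperature):
    """Numerically stable softmax of logits / temperature."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_teacher_to_student(teacher_logits, student_logits, temperature):
    """KL(teacher || student) over one token distribution (direction assumed)."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


cfg = TrainConfig()
print(cfg.effective_batch_size, [cfg.alpha_for_epoch(e) for e in range(3)])
print(kl_teacher_to_student([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], cfg.kl_temperature))
```

The effective batch size of 64 follows directly from batch size 2 and gradient accumulation 32, assuming a single device\.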

### A\.2 Prompt Templates

This section documents the prompt templates used in the experiments\. We show the templates with placeholders such as `<question>`, `<context>`, and `<text_guidance>`\. Unless noted otherwise, the boxes below show the context\-present branch used by the three QA datasets in the main experiments\. When a dataset has no usable context, the corresponding `Context:` block is omitted\. For multiple\-choice datasets such as LogiQA and AR\-LSAT, we append one extra instruction requiring both the option letter and the full option text in the final answer\.

#### A\.2\.1 Base QA and teacher verification

The base QA failure pool is created with a direct answer prompt rather than a chain\-of\-thought prompt\. Teacher supervision is then built with a chain\-of\-thought teacher prompt and, when needed, a semantic validation turn\.

- Base student QA prompt used to create the failure pool
- Teacher Round\-1 CoT QA prompt
- Teacher semantic validation prompt

#### A\.2\.2 Guided re\-evaluation prompts

The re\-evaluation stage uses chain\-of\-thought prompting for all guidance conditions\. The only difference across conditions is the guidance modality attached above or injected into the prompt\.

- Image\-guided CoT re\-evaluation prompt
- Text\-guided CoT re\-evaluation prompt
- Re\-evaluation answer verifier

#### A\.2\.3 Teacher guidance construction

The teacher first answers the question in a multi\-turn conversation\. Round 2 then rewrites that reasoning into one of four guidance artifacts\. For the image variants, the teacher outputs Graphviz DOT graph code, which is later rendered into an image\. The two image prompts differ only in content style, and the two text prompts mirror the same contrast\.

- Teacher image prompt: direct \(generate graph code\)
- Teacher image prompt: abstract \(generate graph code\)
- Teacher text prompt: direct
- Teacher text prompt: abstract

#### A\.2\.4 Graph\-to\-text control

The graph\-to\-text control is produced by converting the generated DOT graph code into linear text with a separate prompt\. This preserves node content more faithfully than the direct text\-guidance prompt, but it removes the visual and spatial structure of the rendered image\.

- Graph\-to\-text conversion prompt

#### A\.2\.5 Ablation prompts

The ablation study keeps the same multi\-turn teacher conversation and changes only the Round\-2 graph\-generation prompt\. The cleaner main\-text variants are the structure ablation, the 5\-node and 10\-node node\-budget ablations, and the combined 10\-node chain ablation\. The direct and abstract versions share the same structural constraints\. Their only difference is whether node content may include concrete hints or must remain abstract\.

- Ablation prompt: chain
- Ablation prompt: 5 nodes
- Ablation prompt: 10 nodes
- Ablation prompt: 10 nodes \+ chain