LLM-Guided Planning for Multi-hop Reasoning over Multimodal Nuclear Regulatory Documents
Summary
This paper frames regulatory document review as an LLM-guided planning problem, using a vectorless document tree with browse, read, and search tools and a dynamic knowledge graph as state. On a 200-question benchmark over NuScale FSAR documents, the system achieves 81.5% accuracy with 0.93 RAGAS Faithfulness, outperforming the evaluated RAG baselines (the margin over RAPTOR is not statistically significant).
# LLM-Guided Planning for Multi-hop Reasoning over Multimodal Nuclear Regulatory Documents
Source: [https://arxiv.org/html/2606.29399](https://arxiv.org/html/2606.29399)
###### Abstract
Reviewing nuclear regulatory documents requires multi-hop reasoning across tens of thousands of pages, where judgments depend on evidence assembled across multiple chapters. We frame this task as planning: an LLM-based agent observes the evidence collected so far, picks the next document fragment to inspect, and stops when the evidence is sufficient. The agent operates over a vectorless document tree using browse, read, and search tools, and maintains a dynamic knowledge graph (KG) as state. On a 200-question benchmark over NuScale Final Safety Analysis Report (FSAR) documents, the system reaches 81.5% accuracy with a RAGAS Faithfulness of 0.93. The dominant performance factor is planning: against PageIndex, which uses the same document tree without state-conditioned action selection, the gap is +38.0 pp (43.5% to 81.5%, p < 0.001). The system also outperforms LightRAG (73.0%, p < 0.05), HippoRAG (70.5%, p < 0.01), and GraphRAG (49.5%, p < 0.001), and matches RAPTOR (75.5%, p = 0.11) without offline indexing. Edge inference adds 2.8× cost without raising accuracy; we retain it as a traceability module. Of 7,391 inferred edges, 3 Violates edges (0.04%) flag scope boundaries (Q058) and partial conformance (Q176) as typed annotations that a human reviewer can audit.
Retrieval\-Augmented Generation, Knowledge Graphs, Agentic AI, Nuclear Regulatory Documents, Multi\-hop Reasoning, Planning, Multimodal AI
## 1 Introduction
LLM-based agents have advanced rapidly in general-purpose domains through paradigms such as ReAct (Yao et al., [2023b](https://arxiv.org/html/2606.29399#bib.bib31)), Toolformer (Schick et al., [2023](https://arxiv.org/html/2606.29399#bib.bib26)), and Self-RAG (Asai et al., [2024](https://arxiv.org/html/2606.29399#bib.bib2)). Safety-critical regulatory domains impose constraints these paradigms do not address. Osprey (Hellert et al., [2026](https://arxiv.org/html/2606.29399#bib.bib15)) identifies the absence of action pre-visibility in existing frameworks. Lee ([2025](https://arxiv.org/html/2606.29399#bib.bib19)) analyzes the conflict between LLM opacity and the quality assurance traceability requirements of 10 CFR (Code of Federal Regulations) 50 Appendix B. Nuclear regulatory document review, which examines Final Safety Analysis Reports (FSARs) and renders conformance judgments to enable plant licensing, is where these constraints bind most tightly.
FSAR review has three properties that distinguish it from single-shot question answering. First, judgments depend on evidence accumulated from multiple chapters: answering “Does the Emergency Core Cooling System (ECCS) design satisfy 10 CFR 50.46(b)?” requires specifications from Chapter 5 cross-referenced against requirements in Chapter 1, with the verdict synthesized from both. Second, evidence is multimodal: specification tables and engineering drawings co-determine answers alongside the prose. Third, the review process includes a sufficiency judgment in which the reviewer decides when collected evidence is enough to render a verdict. Existing RAG (Retrieval-Augmented Generation) methods address none of these. Chunking severs cross-references (Gao et al., [2023](https://arxiv.org/html/2606.29399#bib.bib11)). Graph-based methods (GraphRAG (Edge et al., [2024](https://arxiv.org/html/2606.29399#bib.bib7)), LightRAG (Guo et al., [2024](https://arxiv.org/html/2606.29399#bib.bib12))) build static global graphs and lack an iterative judgment loop. All require offline indexing that is incompatible with continuously revised FSARs.
Our key observation is that regulatory review is a planning problem: given a goal (regulatory judgment), state (evidence collected), and actions (document navigation), the reviewer must decide what to examine next. This is a sequential decision-making process irreducible to single-shot retrieval. We construct the document as a text-based environment with browse/read/search actions, enabling a closed planning loop in which 33% of queries terminate early and 67% use the full 4-hop budget.
We make four contributions. (1) Document as environment for planning: a state-conditioned planning loop over a vectorless document tree, isolated against PageIndex (same environment, no planning) by a +38.0 pp accuracy gap (p < 0.001). (2) Multimodal evidence handling integrated into the planning loop: vision processing applied at the answer step yields +18 pp over RAPTOR on table-only questions while keeping all intermediate operations text-only. (3) A 200-question nuclear regulatory benchmark along three orthogonal axes (reasoning type, evidence complexity, modality) that jointly determine the regulatory review process. (4) Traceability analysis via a Violates case study: post-retrieval edge inference adds 2.8× cost with no accuracy gain, but produces auditable reasoning paths. Among 7,391 inferred edges, only 3 (0.04%) are Violates, identifying scope exclusions (Q058) and partial conformance (Q176) as typed annotations that natural-language answers cannot represent structurally. These auditable paths address the traceability requirements of 10 CFR 50 Appendix B.
## 2 Related Work
#### RAG and graph\-based retrieval\.
Standard RAG (Lewis et al., [2020](https://arxiv.org/html/2606.29399#bib.bib20)) fragments context through chunking. GraphRAG (Edge et al., [2024](https://arxiv.org/html/2606.29399#bib.bib7)) and LightRAG (Guo et al., [2024](https://arxiv.org/html/2606.29399#bib.bib12)) construct global knowledge graphs but require expensive pre-indexing and generic edge schemas. RAPTOR (Sarthi et al., [2024](https://arxiv.org/html/2606.29399#bib.bib25)) builds recursive abstractive trees; HippoRAG (Gutiérrez et al., [2024](https://arxiv.org/html/2606.29399#bib.bib13)) uses hippocampal memory models. All perform single-shot retrieval without iterative evidence accumulation.
#### Agentic information retrieval\.
Iterative retrieval methods such as IRCoT, FLARE, and Iter-RetGen interleave retrieval with chain-of-thought generation but lack persistent state and dynamic termination. Self-RAG (Asai et al., [2024](https://arxiv.org/html/2606.29399#bib.bib2)) inserts reflection tokens for adaptive retrieval within a single pass. PRISM (Nahid & Rafiei, [2025](https://arxiv.org/html/2606.29399#bib.bib22)) separates precision and recall through an iterative Selector–Adder agent loop. APEX-Searcher (Chen et al., [2026](https://arxiv.org/html/2606.29399#bib.bib4)) combines reinforcement learning (RL) with supervised fine-tuning (SFT) for planning but requires training. For document navigation, ReadAgent (Lee et al., [2024](https://arxiv.org/html/2606.29399#bib.bib18)) uses gist memories, DocAgent (Sun et al., [2025](https://arxiv.org/html/2606.29399#bib.bib27)) extracts XML outlines, and BookRAG (Wang et al., [2025](https://arxiv.org/html/2606.29399#bib.bib28)) routes queries through hierarchical indices. PageIndex (Zhang & Tang, [2025](https://arxiv.org/html/2606.29399#bib.bib32)) builds vectorless trees but operates as a single-pass tool. Our work departs from these by constructing the document as a scalable text-based environment with persistent KG state, a planning loop, and dynamic termination, in a training-free setting.
#### Planning, world models, and KG\-RAG\.
ReAct (Yao et al., [2023b](https://arxiv.org/html/2606.29399#bib.bib31)), Tree of Thoughts (Yao et al., [2023a](https://arxiv.org/html/2606.29399#bib.bib30)), and LATS (Zhou et al., [2024](https://arxiv.org/html/2606.29399#bib.bib35)) establish LLM-based planning in PDDL/robotic/web environments through tree search with reflection; we extend this paradigm to information environments over structured documents. GWM (Feng et al., [2025](https://arxiv.org/html/2606.29399#bib.bib9)) uses graph-structured state with message-passing; we similarly employ a Dynamic Sub-KG but generate explicit relational edges through LLM inference rather than embeddings. Our edge ontology draws on SysML traceability (Friedenthal et al., [2014](https://arxiv.org/html/2606.29399#bib.bib10)), argumentation mining (Peldszus & Stede, [2013](https://arxiv.org/html/2606.29399#bib.bib24); Cabrio & Villata, [2012](https://arxiv.org/html/2606.29399#bib.bib3)), causal KG (Hassanzadeh et al., [2019](https://arxiv.org/html/2606.29399#bib.bib14)), and prerequisite learning (Pan et al., [2017](https://arxiv.org/html/2606.29399#bib.bib23)).
#### Nuclear NLP and evaluation\.
NuclearQA (Acharya et al., [2023](https://arxiv.org/html/2606.29399#bib.bib1)) and NukeBERT (Jain et al., [2020](https://arxiv.org/html/2606.29399#bib.bib16)) address single-hop factual extraction. Our benchmark is the first to target multi-hop, multimodal, cross-chapter regulatory judgment. We adopt dual evaluation: RAGAS (Es et al., [2024](https://arxiv.org/html/2606.29399#bib.bib8)) for grounding quality and LLM-as-Judge (Zheng et al., [2023](https://arxiv.org/html/2606.29399#bib.bib33)) with three-evaluator majority vote.
## 3 Method
Figure 1: Overall architecture. The vectorless document tree (left) serves as the environment. The planning loop (center) iterates through state estimation, action planning, execution, and sufficiency checking. Post-retrieval edge inference and vision-augmented answer generation (right) are applied at the output stage.

### 3.1 Problem Formulation: Regulatory Review as Planning
We formulate regulatory document exploration as a planning problem $\langle \mathcal{S}, \mathcal{A}, f_{\mathrm{tr}}, \phi \rangle$ with a single agent acting over a structured information environment:
- State $s_t \in \mathcal{S}$: the Dynamic Sub-Knowledge Graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$ collected through hop $t$, representing the agent’s current evidence and inferred relationships (§[3.3](https://arxiv.org/html/2606.29399#S3.SS3)).
- Action $a_t \in \mathcal{A}$: a tool invocation from $\{\texttt{browse}(d,v),\, \texttt{read}(d,v),\, \texttt{search}(\kappa)\}$ over documents $d$, tree nodes $v$, and keywords $\kappa$ (§[3.4](https://arxiv.org/html/2606.29399#S3.SS4)).
- Transition $f_{\mathrm{tr}}(s_t, a_t) \to s_{t+1}$: tool execution followed by node integration and (optionally) edge inference, producing the updated KG (§[3.5](https://arxiv.org/html/2606.29399#S3.SS5)).
- Goal test $\phi(s_t, q) \in \{0, 1\}$: an LLM-judged sufficiency check on whether $\mathcal{G}_t$ contains enough evidence to answer query $q$; termination occurs when $\phi = 1$ or $t = T_{\max}$.

Unlike plan-then-execute frameworks, the agent observes $s_t$, selects $a_t$ conditioned on accumulated state, and immediately incorporates environment feedback before selecting $a_{t+1}$. This state-conditioned planning structure is the central mechanism we isolate empirically against PageIndex (§[5.2](https://arxiv.org/html/2606.29399#S5.SS2)).
### 3.2 Environment: Vector-Free Multimodal Document Tree
The overall architecture is shown in Figure [1](https://arxiv.org/html/2606.29399#S3.F1). The planning loop described in this section (state estimation, action selection, dynamic termination) is architecturally domain-agnostic and applies to any hierarchically structured document corpus. Regulatory documents such as FSARs present the conditions under which this approach is most advantageous over conventional RAG: deep hierarchical structure that chunking destroys, dense cross-references between sections and figures, multimodal evidence (specification tables, engineering drawings) that co-determines answers, and a review process that inherently requires multi-hop evidence gathering with sufficiency judgment. The domain-specific component is the edge ontology (Section [3.5](https://arxiv.org/html/2606.29399#S3.SS5)), which encodes regulatory reasoning relations (Satisfies, Violates); the rest of the architecture transfers directly to other structured document domains.
The environment is represented as a JSON hierarchical tree organized into chapter → section → paragraph nodes, preserving the native structure of regulatory documents without any chunking or embedding. To support multimodal reasoning, the system parses the LIST OF FIGURES/TABLES and detects in-text references such as “Figure 5.1-1,” attaching figure and table metadata to the corresponding nodes via a `references` field. This directly addresses the “figure on different page” problem, wherein the referencing text and the actual diagram reside on different pages of the PDF.
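The reference-attachment step can be illustrated with a minimal sketch. The regular expression, the node fields `text` and `children`, and the sample section text are assumptions made for illustration; only the `references` field name and the “Figure 5.1-1” label format come from the description above.

```python
import re

# Matches labels such as "Figure 5.1-1" or "Table 5.1-1" (pattern is an assumption).
REF_PATTERN = re.compile(r"\b(Figure|Table)\s+(\d+(?:\.\d+)*-\d+)")

def attach_references(node: dict) -> dict:
    """Recursively add a `references` list to every node of a tree dict."""
    found = REF_PATTERN.findall(node.get("text", ""))
    # Deduplicate while preserving first-seen order.
    node["references"] = list(dict.fromkeys(f"{kind} {label}" for kind, label in found))
    for child in node.get("children", []):
        attach_references(child)
    return node

# Hypothetical two-node tree; titles and text are illustrative, not quoted from the FSAR.
tree = {
    "title": "5.1 Summary Description",
    "text": "The flow path is shown in Figure 5.1-1; parameters are in Table 5.1-1.",
    "children": [
        {"title": "5.1.1 Schematic", "text": "See Figure 5.1-1 and Figure 5.1-2.", "children": []}
    ],
}
attach_references(tree)
print(tree["references"])
print(tree["children"][0]["references"])
```

Because each node carries its own label list, the page that actually holds the figure can be resolved later, independently of where the referencing text sits.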
Rather than relying on dense vector retrieval, the system adopts a vector-free design using BM25Okapi keyword search over the full document tree. Section titles receive a 3× weight boost, and document-length normalization naturally promotes short, focused leaf nodes to higher rankings. At the scale evaluated in this work, the tree spans Ch.01 with 866 nodes (34 figures, 19 tables) and Ch.05 with 26 nodes (29 figures, 30 tables).
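A minimal sketch of title-boosted BM25 ranking follows, written from scratch so that it is self-contained. It assumes the 3× title boost is realized by repeating title tokens three times and uses a common smoothed IDF variant; the text does not state either detail, and `k1`, `b`, and the sample nodes are illustrative.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def build_docs(nodes: list[dict], title_boost: int = 3) -> list[list[str]]:
    # Repeating title tokens is one simple way to weight titles (assumption).
    return [tokenize(n["title"]) * title_boost + tokenize(n["text"]) for n in nodes]

def bm25_scores(query: str, docs: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: shorter documents get a smaller denominator.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical nodes: a short, focused section and a longer overview.
nodes = [
    {"title": "ECCS Design Basis", "text": "The system provides core cooling."},
    {"title": "Site Description",
     "text": "General plant overview. The ECCS is mentioned once in this longer "
             "overview paragraph of the site and plant layout."},
]
scores = bm25_scores("ECCS", build_docs(nodes))
print(scores)
```

In this toy case the short node with the query term in its title outranks the longer overview, which is the behavior the title boost and length normalization are meant to produce.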
Figure 2: Document tree environment (left), three agent tools (right), and multimodal reference resolution linking in-text references to actual PDF pages (bottom).
### 3.3 State (Short-Term Memory): Dynamic Sub-KG and Two-Tier Edge Ontology
The agent state at timestep $t$ is defined as a dynamic knowledge graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$. The node set $\mathcal{V}_t$ comprises document sections (evidence nodes) collected through exploration along with their associated multimodal references. The edge set $\mathcal{E}_t$ is governed by a domain-specific two-tier ontology, with edges retained only when confidence is $\geq 0.4$ (empirically set).
The ontology is summarized in Table [1](https://arxiv.org/html/2606.29399#S3.T1). Tier 1 consists of *structural* edges that organize the exploration trajectory (e.g., References, Specifies), and Tier 2 consists of *semantic* edges that relate evidence nodes for regulatory judgment (e.g., Satisfies, Supports).

Table 1: Two-tier regulatory edge ontology.

An example KG with edge distribution is shown in Figure [3](https://arxiv.org/html/2606.29399#S3.F3). Empirically, structural edges (References, Specifies) dominate in single-hop factual queries by forming the exploration path, while semantic edges (Satisfies, Supports) emerge in composite multi-hop judgment queries to support regulatory compliance synthesis. In correct answers relative to incorrect ones, Supports appears +6.8 percentage points more frequently and Satisfies +3.2 percentage points more frequently.
Figure 3: An example Dynamic Sub-Knowledge Graph showing five evidence nodes connected by structural (Tier 1) and semantic (Tier 2) edges. The edge distribution summary (right) shows the prevalence of each edge type across 7,391 edges from 200 questions.
### 3.4 Action Planning: LLM-Based Tool Selection
Rather than precomputing a full retrieval plan, the system performs closed-loop planning. At each hop, the agent observes the current KG state $\mathcal{G}_t$ and decides action $a_t$, with environment feedback (retrieved results) immediately incorporated into the subsequent plan. This constitutes a state-based iterative decision-making structure, distinct from the plan-then-execute separation of APEX-Searcher (Chen et al., [2026](https://arxiv.org/html/2606.29399#bib.bib4)) and the token-level reactive retrieval of Self-RAG (Asai et al., [2024](https://arxiv.org/html/2606.29399#bib.bib2)). Unlike passive embedding-similarity retrieval in conventional RAG, the LLM actively evaluates the current state and plans the next action.
The agent has access to three tools (Figure [2](https://arxiv.org/html/2606.29399#S3.F2)) that mirror filesystem operations: `browse` lists the child nodes of a tree node (analogous to `ls`), `read` extracts the full content of a specific node (analogous to `cat`), and `search` performs BM25-ranked keyword search across all documents (analogous to `grep`).
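The three tools can be sketched over a nested dictionary tree as follows. This is an illustrative sketch, not the system's implementation: the node fields (`id`, `title`, `text`, `children`) and the sample document are assumptions, and `search` uses plain substring matching as a stand-in for the BM25 ranking described earlier.

```python
from typing import Iterator, Optional

def iter_nodes(node: dict) -> Iterator[dict]:
    """Depth-first traversal of the document tree."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)

def find_node(root: dict, node_id: str) -> Optional[dict]:
    return next((n for n in iter_nodes(root) if n["id"] == node_id), None)

def browse(root: dict, node_id: str) -> list[str]:
    """List the children of a node (like `ls`)."""
    node = find_node(root, node_id)
    return [f'{c["id"]}: {c["title"]}' for c in node.get("children", [])]

def read(root: dict, node_id: str) -> str:
    """Return the full text of a node (like `cat`)."""
    return find_node(root, node_id)["text"]

def search(root: dict, keyword: str) -> list[str]:
    """Return ids of nodes mentioning the keyword (like `grep`); unranked stand-in."""
    kw = keyword.lower()
    return [n["id"] for n in iter_nodes(root) if kw in (n["title"] + " " + n["text"]).lower()]

# Hypothetical document; ids, titles, and text are illustrative only.
doc = {"id": "ch05", "title": "Chapter 5", "text": "", "children": [
    {"id": "5.1", "title": "Summary Description",
     "text": "Overview of the reactor coolant system.", "children": []},
    {"id": "5.4", "title": "Component Design",
     "text": "The pressurizer volume is listed in Table 5.4-1.", "children": []},
]}
print(browse(doc, "ch05"))
print(read(doc, "5.4"))
print(search(doc, "pressurizer"))
```

Here `root` plays the role of the document argument $d$ and `node_id` the tree node $v$ in the action signature of §3.1.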
A *browse-first* pattern is enforced at Hop 1, where the document structure (table of contents, ToC) is automatically injected so that the agent obtains a global map before searching. This intervention improved single-evidence Context Recall from 0.45 to 0.89. To address vocabulary mismatch, Pseudo-Relevance Feedback (PRF, RM3) automatically expands queries using the top-3 retrieved results at zero additional LLM cost. The agent also maintains a search history to prevent duplicate keyword queries across hops.
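The query-expansion idea can be sketched as below. This is a simplified term-frequency form of pseudo-relevance feedback, not full RM3 (which interpolates a relevance model with the original query model); the stopword list, the `n_terms` parameter, and the sample documents are assumptions.

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "and", "is", "in", "to", "for", "by"}

def expand_query(query: str, ranked_docs: list[str], top_docs: int = 3, n_terms: int = 2) -> str:
    """Append the most frequent new terms from the top-ranked documents to the query."""
    query_terms = query.lower().split()
    counts: Counter = Counter()
    for doc in ranked_docs[:top_docs]:  # only the top-3 results feed the expansion
        counts.update(
            t for t in doc.lower().split()
            if t not in STOPWORDS and t not in query_terms
        )
    expansion = [term for term, _ in counts.most_common(n_terms)]
    return " ".join(query_terms + expansion)

# Hypothetical first-pass results, already ranked; the fourth is ignored.
ranked = [
    "reactor vent valves",
    "reactor vent valves open",
    "reactor recirculation valves",
    "turbine turbine turbine turbine",
]
expanded = expand_query("ECCS valves", ranked)
print(expanded)
```

No LLM call is involved: the expansion uses only term counts from documents that were already retrieved.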
Dynamic termination is implemented as a plan sufficiency check beginning at Hop 2: before each hop, the LLM judges whether the current KG already contains sufficient evidence to answer the query. If so, the agent terminates early. This functions as a goal test within the planning loop, automatically calibrating exploration depth to query complexity. Across 200 questions, 33% of queries terminate early at 1–3 hops (mean 3.4 hops, maximum 4).
### 3.5 Post-Retrieval Edge Inference (Optional Component)
Edge inference makes explicit the relationships among collected evidence nodes and is performed concurrently with the state transition $f_{\mathrm{tr}}(\mathcal{G}_t, a_t) \to \mathcal{G}_{t+1}$. The accuracy impact of this component is evaluated in Section [6.2](https://arxiv.org/html/2606.29399#S6.SS2); we offer it as an optional module for use cases requiring traceability.
Inference proceeds in two stages. In Stage 1 (Description), the LLM produces a single natural-language sentence describing the relationship between two nodes, without imposing any classification pressure. This follows the free-form relation extraction approach of LightRAG (Guo et al., [2024](https://arxiv.org/html/2606.29399#bib.bib12)); for example: “The ECCS design of 3 RVV + 2 RRV is configured to meet the acceptance criteria of 10 CFR 50.46.” In Stage 2 (Ontology Mapping), the free-form description is mapped onto the regulatory domain ontology (Satisfies, Violates, etc.). When no mapping is applicable, the relationship is preserved as Semantic, ensuring no relational information is discarded.
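The second stage, together with the confidence filter from §3.3, can be sketched as follows. In the system the mapping is performed by an LLM; the keyword cues below are a purely illustrative stand-in, and the `Edge` fields, node identifiers, and cue phrases are assumptions. Only the edge type names, the `Semantic` fallback, and the 0.4 threshold come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Cue phrases are illustrative assumptions. Violates is checked first so that
# "does not meet" is not captured by the Satisfies cue "meet".
CUES = {
    "Violates": ("violat", "does not meet", "outside"),
    "Satisfies": ("meet", "satisf", "compl"),
    "Supports": ("support",),
    "Specifies": ("specif",),
    "References": ("refer",),
}

@dataclass
class Edge:
    source: str
    target: str
    relation: str
    description: str
    confidence: float

def map_to_ontology(description: str) -> str:
    text = description.lower()
    for relation, cues in CUES.items():
        if any(cue in text for cue in cues):
            return relation
    return "Semantic"  # fallback: keep the free-form relation rather than drop it

def infer_edge(source: str, target: str, description: str,
               confidence: float, min_conf: float = 0.4) -> Optional[Edge]:
    if confidence < min_conf:  # edges below the threshold are not retained
        return None
    return Edge(source, target, map_to_ontology(description), description, confidence)

# Stage 1 output is taken as given; this is the example sentence quoted above.
edge = infer_edge(
    "Ch05:ECCS", "10CFR50.46",
    "The ECCS design of 3 RVV + 2 RRV is configured to meet the acceptance criteria of 10 CFR 50.46.",
    confidence=0.85,
)
print(edge.relation)
```

Keeping the Stage 1 sentence on the edge is what makes the typed relation inspectable: a reviewer can read the description that justified the label.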
Verification is integrated into the evidence\-gathering loop at every hop rather than applied as a post\-processing step: immediately after new evidence is retrieved \(planning\), relationship inference is performed \(verification\), and the resulting enriched KG state informs the plan sufficiency judgment for the subsequent hop\. This design differs from embedding\-based implicit relations used in GraphRAG and GWM\(Feng et al\.,[2025](https://arxiv.org/html/2606.29399#bib.bib9)\); by grounding relationships in explicit LLM\-generated natural\-language descriptions, the inference results remain human\-inspectable\.
### 3.6 Vision-Augmented Final Answer Generation
Multimodal processing is applied exclusively at the final answer generation step, with all intermediate operations \(search, plan, infer\) remaining text\-only\. This cost\-efficient design avoids the expense of vision API calls during iterative exploration while still enabling visually grounded final answers\.
The implementation proceeds in three steps: (1) all Figure/Table references across KG nodes are collected; (2) the corresponding PDF pages are rendered to JPEG using PyMuPDF; and (3) the full text KG context together with the rendered images is passed to the GPT-4.1 vision-language model (VLM) API. For tables specifically, PyMuPDF’s `find_tables()` function extracts row and column structure directly as structured text, making VLM image processing unnecessary. This approach achieves 86.0% accuracy on table-only questions, compared to 68.0% for RAPTOR (+18 percentage points).
The complete pipeline is summarized in Algorithm[1](https://arxiv.org/html/2606.29399#alg1)\.
Algorithm 1: SubKG-Agent Planning Pipeline

    Input: query q, documents D, max hops T
    Initialize G_0 ← ∅; inject ToC at hop 0
    for t = 0 to T − 1 do
        a_t ← LLM(q, G_t, ToC)              {plan actions}
        Execute a_t via browse/read/search
        G_{t+1} ← f_tr(G_t, a_t)            {integrate + infer edges (optional)}
        if t ≥ 1 and SufficientEvidence(q, G_{t+1}) then
            break
        end if
    end for
    return VisionAugmentedAnswer(q, G_final)
## 4 Benchmark: Nuclear Regulatory Multi-hop QA
No existing benchmark combines nuclear regulatory domain, multi-hop reasoning, multimodal evidence (tables and engineering drawings), and regulatory judgment. We survey the most relevant benchmarks along these dimensions in Table [2](https://arxiv.org/html/2606.29399#S4.T2). FDARxBench (Xiong et al., [2026](https://arxiv.org/html/2606.29399#bib.bib29)) is the most similar (regulatory documents with judgment) but covers only single documents without engineering drawings. MMLongBench-Doc (Ma et al., [2024](https://arxiv.org/html/2606.29399#bib.bib21)) and M3DocRAG (Cho et al., [2024](https://arxiv.org/html/2606.29399#bib.bib5)) support multimodal understanding but lack regulatory judgment. DesignQA (Doris et al., [2024](https://arxiv.org/html/2606.29399#bib.bib6)) addresses engineering compliance but not multi-hop reasoning. NuclearQA (Acharya et al., [2023](https://arxiv.org/html/2606.29399#bib.bib1)) targets single-hop factual extraction only.
Table 2: Existing document QA benchmarks. NuScale-MQA is the first to combine all five properties.

### 4.1 Design and Composition
We construct 200 questions over NuScale FSAR Chapter 01 (352 pages) and Chapter 05 (160 pages), organized along three orthogonal axes. Reasoning type: factual (70), comparative (65), judgment (65). Evidence complexity: single-evidence (50), multi-evidence (75), cross-document (75). Modality: text-only (80), table-only (50), image-only (30), composite (40). The benchmark contains 357 ground-truth evidence items (152 text, 125 table, 80 figure). This distribution is summarized in Table [3](https://arxiv.org/html/2606.29399#S4.T3); the judgment × cross-document cell (35 questions) is the largest, reflecting the core regulatory review task.
Table 3: Question distribution in NuScale-MQA (200 total).
### 4.2 Dual Evaluation Framework
We adopt dual evaluation: RAGAS (Es et al., [2024](https://arxiv.org/html/2606.29399#bib.bib8)) for grounding quality (Faithfulness, Answer Relevancy, Context Recall, Factual Correctness) and LLM-as-Judge (Zheng et al., [2023](https://arxiv.org/html/2606.29399#bib.bib33)) with three independent evaluators, Eval-A (GPT-4-turbo), Eval-B (GPT-4o), and Eval-C (Claude Sonnet 4.5), combined via majority vote. The RAGAS–Judge agreement rate is 66.2%, with the 34% disagreement providing complementary information: RAGAS measures grounding fidelity while the Judge captures practical correctness.
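The three-evaluator majority vote reduces to a few lines. The sketch below assumes each evaluator returns a boolean correct/incorrect verdict per question; the verdict lists are made-up illustrations, not benchmark data.

```python
def majority_vote(verdicts: list[bool]) -> bool:
    """True when more than half of the evaluators mark the answer correct."""
    return sum(verdicts) > len(verdicts) / 2

def judge_accuracy(per_question: list[list[bool]]) -> float:
    """Fraction of questions whose majority verdict is 'correct'."""
    return sum(majority_vote(v) for v in per_question) / len(per_question)

# Hypothetical verdicts in the order [Eval-A, Eval-B, Eval-C] for four questions.
verdicts = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [True, False, True],
]
print(judge_accuracy(verdicts))
```

With an odd number of evaluators a tie cannot occur, so no tie-breaking rule is needed.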
## 5 Experiments
### 5.1 Setup and Baselines
All methods use GPT-4.1 for generation with temperature 0 and `max_tokens` 300. Our agent is configured with `max_hops` = 4 and `top_k` = 2. We compare against RAPTOR (Sarthi et al., [2024](https://arxiv.org/html/2606.29399#bib.bib25)) (recursive summarization tree), HippoRAG (Gutiérrez et al., [2024](https://arxiv.org/html/2606.29399#bib.bib13)) (hippocampal associative KG), LightRAG (Guo et al., [2024](https://arxiv.org/html/2606.29399#bib.bib12)) (dual-level graph + vector DB), GraphRAG (Edge et al., [2024](https://arxiv.org/html/2606.29399#bib.bib7)) (community-based local search), and PageIndex (Zhang & Tang, [2025](https://arxiv.org/html/2606.29399#bib.bib32)) as an ablation baseline. Full baseline configurations are in Appendix [A](https://arxiv.org/html/2606.29399#A1).
PageIndex is the critical comparison: it operates over the identical document tree with the same tools, browse-first ToC injection, BM25 configuration, PRF/RM3 query expansion, search history deduplication, and dynamic termination. The only difference is action selection: our system conditions on accumulated KG state; PageIndex selects actions from the query and immediate results alone. The gap therefore isolates state-conditioned planning.
### 5.2 Main Results
We report LLM-as-Judge accuracy in Table [4](https://arxiv.org/html/2606.29399#S5.T4). Our system (planning only) achieves 81.5% overall. The most important comparison is PageIndex at 43.5%, isolating a +38.0 pp planning contribution (McNemar p < 0.001). Among external baselines, we significantly outperform HippoRAG (p < 0.01), LightRAG (p < 0.05), and GraphRAG (p < 0.001). The +6.0 pp gap over RAPTOR is consistent across all categories but does not reach significance (McNemar p = 0.11, n = 200); however, our system requires zero pre-indexing cost versus RAPTOR’s 44 minutes and $1.4 of offline indexing.
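For readers unfamiliar with the test, a minimal sketch of an exact (binomial) McNemar test on paired correctness outcomes follows. The text does not state whether the exact or the chi-square variant was used, and the discordant counts below are hypothetical: only their difference (12 questions, i.e. 6.0 pp of 200) follows from the reported accuracies.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: questions our system answers correctly and the baseline does not.
    c: questions the baseline answers correctly and our system does not.
    Concordant pairs do not enter the statistic.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical split of discordant pairs with b - c = 12.
p_value = mcnemar_exact(b=34, c=22)
print(f"p = {p_value:.3f}")
```

The sketch makes one point concrete: the same 12-question net gain can be significant or not depending on how many discordant pairs there are in total, which is why a consistent +6.0 pp gap can still fall short of significance at n = 200.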
Judgment caveat. 98% of judgment questions have “Yes” as the correct answer (FSARs document compliant designs by definition), so the judgment column has limited discriminative power. Comparative and factual categories provide more meaningful comparisons.
Table 4: LLM-as-Judge accuracy on NuScale-MQA (200 questions). PageIndex table-only and composite results were not collected in the same evaluation run.
### 5.3 RAGAS and Efficiency
We report RAGAS metrics in Table [5](https://arxiv.org/html/2606.29399#S5.T5). Our system achieves Faithfulness of 0.93 and Context Recall of 0.93, ranking first across all metrics. GraphRAG scores 0.28/0.18, indicating factual loss during community summarization. PageIndex achieves only 0.58 Faithfulness, illustrating the limits of unguided retrieval. The per-reasoning-type breakdown (Table [6](https://arxiv.org/html/2606.29399#S5.T6)) confirms this dominance across all three categories: our system leads both Faithfulness and Context Recall on factual, comparative, and judgment questions. GraphRAG’s Context Recall of 0.11 on factual questions confirms that specific numerical values are systematically lost during community summarization.
Table 5: RAGAS comparison across all methods.

Table 6: RAGAS by reasoning type across all methods. Our system leads Faithfulness and Context Recall on all three reasoning types.

Efficiency. We report indexing and per-query costs across methods in Table [7](https://arxiv.org/html/2606.29399#S5.T7). Our system requires 8–20 min and $4.1 for pre-indexing, but higher per-query cost ($0.215, 93 s) due to multi-hop exploration. Total cost for 200 questions is ∼$46 vs. ∼$2.3 for RAPTOR, yielding a per-accuracy-point cost of $0.56/%p. Dynamic termination partially mitigates this: 33% of queries terminate early, with costs ranging from $0.03 (1-hop factual) to $0.29 (4-hop cross-document judgment). For regulatory applications, where human reviewers spend hours per question, the $46 total for 200 questions is negligible compared to manual review cost.
Table 7: Efficiency comparison across methods. Per-query cost extrapolated from 5-question sample with ±10–15% variance.
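The cost figures quoted in the efficiency paragraph can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text; the variable names are ours, and treating the ±10–15% variance as a tolerance band around the extrapolated total is an assumption.

```python
per_query_cost = 0.215      # USD per query, as reported
n_questions = 200
reported_total = 46.0       # USD, approximate total reported for 200 questions
accuracy_points = 81.5      # overall LLM-as-Judge accuracy in percent

extrapolated_total = per_query_cost * n_questions     # about 43 USD
cost_per_point = reported_total / accuracy_points     # USD per accuracy point
relative_gap = reported_total / extrapolated_total - 1.0

print(f"extrapolated total: ${extrapolated_total:.1f}")
print(f"cost per accuracy point: ${cost_per_point:.2f}")
print(f"gap between reported and extrapolated totals: {relative_gap:.1%}")
```

The reported total sits within the stated ±10–15% sampling variance of the extrapolated per-query figure, and dividing it by the accuracy reproduces the $0.56 per accuracy point quoted above.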
## 6 Analysis
### 6.1 Component Ablation (10Q)
To isolate each component’s contribution, we remove one at a time from the full system on a 10-question subset spanning diverse reasoning types (Table [8](https://arxiv.org/html/2606.29399#S6.T8)). Only the full system achieves 10/10, and each removal fails on a distinct question type: Vision RAG removal drops Q101 (table/comparative) and Q131 (composite/comparative), with Faithfulness falling from 0.96 to 0.83; Edge inference removal drops Q058 (seismic scope boundary) due to the missing Violates edge but saves 66% of cost; Browse-first removal drops Q191 (image/judgment/cross) due to lost structural orientation. At n = 10, differences are indicative rather than statistically conclusive, motivating the 200Q scale-up for the edge inference question.
Table 8: 10-question component ablation. Each removal fails on a distinct question type.
### 6.2 Ablation: Edge Inference at Scale (200Q)
Since edge inference accounted for 65% of per\-query cost in the 10Q ablation, we scale this comparison to all 200 questions \(Table[9](https://arxiv.org/html/2606.29399#S6.T9)\)\.
Table 9: Edge inference ablation (200 questions).

Key finding: Planning is the primary driver. Removing edge inference does not reduce accuracy; it provides traceability (human-readable paths like “Section A Satisfies Regulation B”) but no accuracy gain. Planning mechanisms alone, namely browse-first (CR 0.45 → 0.89), dynamic termination (33% early), and state-conditioned selection (+38.0 pp over PageIndex), suffice to outperform all baselines. This parallels APEX-Searcher (Chen et al., [2026](https://arxiv.org/html/2606.29399#bib.bib4)) and PRISM (Nahid & Rafiei, [2025](https://arxiv.org/html/2606.29399#bib.bib22)) in scope but diverges in method: both require RL/SFT training, while our system achieves equivalent planning in a training-free setting.
### 6.3 Edge Distribution and Case Studies
Across 200 questions, 7,391 edges are generated: Supports (34.3%), Specifies (31.5%), References (13.1%), Is\_Prerequisite\_Of (9.5%), Satisfies (8.4%), with Violates at only 0.04% (3 instances). Semantic edges correlate with correctness: Supports is +6.8 pp and Satisfies +3.2 pp more frequent in correct answers.
Why Violates appears in a certified document. The FSAR has already been certified by the U.S. Nuclear Regulatory Commission (NRC), so the emergence of Violates edges warrants explanation. All three instances capture scope boundary exclusions or partial conformance rather than design deficiencies. This reveals a regulatory reasoning capability absent from standard RAG:
- Q058 (Seismic scope, ×2): Both Violates edges (confidence 0.85, 0.90) identify that non-safety-related systems (Chilled Water, Condensate Storage) are intentionally placed outside Seismic Category I classification. The FSAR explicitly justifies this: “failure of non-safety systems, structures, and components (SSCs) does not affect safety-related SSCs.” The agent marks these scope boundaries in the KG, so that a human reviewer can immediately see which requirements do not apply. Without this marking, such information would require manual cross-referencing across chapters. The distinction matters for regulatory review: “requirement does not apply” and “requirement is not satisfied” carry different licensing consequences, and a representation that does not separate them structurally shifts the burden of distinguishing onto the human reviewer.
- Q176 (Partial conformance, ×1): The Violates edge (confidence 0.85) captures that NuScale’s integrated Steam Generator (SG) design eliminates the traditional containment bypass problem but introduces a leakage detection limitation: the system cannot distinguish identified from unidentified leakage, resulting in partial conformance with Design-Specific Review Standard (DSRS) 15.6.5. This is the space between full Satisfies and full Violates, the kind of state where regulatory review typically requires additional engineering analysis or compensatory measures. The agent represents the state as a typed edge with confidence 0.85; the baselines we evaluate in §[5.2](https://arxiv.org/html/2606.29399#S5.SS2) produce no equivalent typed annotation.
These cases show that while edge inference does not improve accuracy (Table [9](https://arxiv.org/html/2606.29399#S6.T9)), it provides concrete regulatory value: (1) explicit scope boundary identification, marking where requirements apply and do not apply; (2) partial conformance that mirrors actual regulatory judgment; and (3) auditable reasoning paths satisfying the traceability requirements of 10 CFR 50 Appendix B. The rarity of `Violates` (3 of 7,391 edges, 0.04%) itself serves as a quality signal consistent with a certified document. For practical use, these three typed edges identify the locations a human reviewer can focus on without re-reading every natural-language answer.
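The typed-edge annotations described above can be pictured as a small data structure plus a filter. The following is a minimal sketch, not the paper's implementation: the `Edge` class, the `review_queue` helper, the node identifiers, and the pairing of the two Q058 confidences with specific systems are all illustrative assumptions. Only the six relation names, the confidences, and the question IDs come from the text.

```python
from dataclasses import dataclass

# The six relation names reported in this section. The paper's ontology has
# eight relations; the other two are not named here and are left out.
RELATIONS = {
    "Supports", "Specifies", "References",
    "Is_Prerequisite_Of", "Satisfies", "Violates",
}


@dataclass(frozen=True)
class Edge:
    """One typed, confidence-scored KG edge (hypothetical schema)."""
    source: str
    target: str
    relation: str
    confidence: float
    question_id: str

    def __post_init__(self):
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def review_queue(edges, relation="Violates"):
    """Return edges of one relation type, highest confidence first."""
    selected = [e for e in edges if e.relation == relation]
    return sorted(selected, key=lambda e: e.confidence, reverse=True)


# Node ids are invented for illustration; which Q058 edge carries 0.85 and
# which carries 0.90 is an assumption. The Supports edge is fully hypothetical.
edges = [
    Edge("chilled_water_system", "seismic_category_I", "Violates", 0.85, "Q058"),
    Edge("condensate_storage", "seismic_category_I", "Violates", 0.90, "Q058"),
    Edge("sg_leakage_detection", "DSRS_15.6.5", "Violates", 0.85, "Q176"),
    Edge("fsar_section_a", "fsar_section_b", "Supports", 0.95, "Q001"),
]

for edge in review_queue(edges):
    print(edge.question_id, edge.source, "->", edge.target, edge.confidence)
```

Filtering on the relation type rather than on answer text is what would let a reviewer jump to the three flagged locations directly.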
The 66.2% RAGAS–Judge agreement leaves 67 disagreement cases that decompose into two qualitatively distinct categories.

- 29 cases (RAGAS Good + Judge X): answers grounded in the KG (Faith ≥ 0.8, CR ≥ 0.8) but penalized for wording differences from the reference. Q019 is representative: reference “the pressurizer volume is 578 ft³,” agent “the pressurizer region volume is 578 ft³ and the cylindrical section is 487 ft³,” RAGAS Faith 1.00 / CR 1.00, Judge X. These reflect benchmark-level evaluation strictness, not regulatory error.
- 38 cases (RAGAS Bad + Judge O): answers Judge marks correct despite low Context Recall, meaning the agent answered from knowledge not fully captured by the retrieved KG. This is the deployment-relevant category: 38 of the 200 answers (19%) are judged correct but cannot be fully verified by inspecting the KG alone, and a reviewer needing audit-grade reasoning should flag this fraction for separate verification.

The two categories motivate dual evaluation: RAGAS measures grounding, Judge measures correctness, and either alone misclassifies cases that matter for safety review.
### 6.4 Failure Mode Taxonomy
We classify the 38 incorrect answers (Judge X) by RAGAS scores, using the same Faith ≥ 0.8 and CR ≥ 0.8 thresholds as §[5.3](https://arxiv.org/html/2606.29399#S5.SS3).

- Expression / reasoning (29/38, 76.3%): well-grounded by both metrics, Judge X. The dominant pattern is wording mismatch with the reference, as in Q019 above. The penalty is for added detail, not regulatory error.
- Hallucination (2/38, 5.3%): Faith < 0.5. The answer asserts content the KG does not support.
- Retrieval miss (1/38, 2.6%): CR < 0.5. Retrieval missed evidence the reference relies on.
- Mixed (6/38, 15.8%): intermediate Faith and CR.

Excluding the wording-mismatch category, the deployment-relevant failure rate is 9/200 = 4.5%. Hallucination at scale is 2/200 = 1.0%. These are the rates a reviewer should expect to verify against external evidence.
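The taxonomy can be written as a short decision rule. The sketch below is one reading of the thresholds, not the authors' code: the function name `classify_failure` is hypothetical, and the text does not say which label wins when both Faith and CR fall below 0.5, so checking hallucination first is an assumption. The category counts are the ones reported above.

```python
def classify_failure(faith, cr, good=0.8, bad=0.5):
    """Assign a Judge-X answer to a failure category from its RAGAS scores."""
    if faith >= good and cr >= good:
        return "expression/reasoning"  # well grounded, wording mismatch
    if faith < bad:
        return "hallucination"         # assumption: checked before CR
    if cr < bad:
        return "retrieval miss"
    return "mixed"                     # intermediate Faith and CR


# Category counts as reported in this section (38 Judge-X answers in total).
counts = {
    "expression/reasoning": 29,
    "hallucination": 2,
    "retrieval miss": 1,
    "mixed": 6,
}
total_questions = 200

# Deployment-relevant failures exclude the wording-mismatch category.
deployment_relevant = sum(
    n for label, n in counts.items() if label != "expression/reasoning"
)
deployment_rate = deployment_relevant / total_questions
hallucination_rate = counts["hallucination"] / total_questions

print(f"deployment-relevant: {deployment_relevant}/200 = {deployment_rate:.1%}")
print(f"hallucination: {hallucination_rate:.1%}")
```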
### 6.5 Self-Assessment Calibration
Dynamic termination \(§[3\.4](https://arxiv.org/html/2606.29399#S3.SS4)\) lets the agent decide when its evidence is sufficient\. Whether the early\-stopping signal is reliable determines whether early termination introduces a deployment risk\. We stratify accuracy by hop count: early\-terminated queries \(1–3 hops\) are the high\-confidence subset; full\-budget queries \(4 hops\) are the low\-confidence subset\.
Table 10: Accuracy by hop count.

| Bucket | Hops | n | Accuracy |
|---|---|---|---|
| Early | 1 | 11 | 100.0% |
| Early | 2 | 13 | 84.6% |
| Early | 3 | 45 | 82.2% |
| Full | 4 | 129 | 78.3% |
| Early total | 1–3 | 69 | 85.5% |
| Full total | 4 | 129 | 78.3% |

Early-terminated queries reach 85.5% accuracy; full-budget queries reach 78.3%. Higher accuracy on the high-confidence subset is the calibration signature: the early-stopping signal is reliable. Question difficulty confounds the comparison (simple factual queries terminate at hop 1, cross-document judgment queries use 4 hops), but the ordering rules out the failure mode in which the agent quits too early on hard cases. That mode would invert the ordering.
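The pooled rows of Table 10 follow from the per-hop rows. As a quick arithmetic check, the sketch below recovers per-bucket correct counts by rounding n × accuracy and then pools them; the rounding step and the helper name `pooled_accuracy` are assumptions, since the paper reports percentages rather than raw counts.

```python
# (bucket, hops, n, accuracy) rows from Table 10.
rows = [
    ("Early", 1, 11, 1.000),
    ("Early", 2, 13, 0.846),
    ("Early", 3, 45, 0.822),
    ("Full", 4, 129, 0.783),
]


def pooled_accuracy(selected):
    """Pool buckets: recover correct counts by rounding, then divide."""
    n_total = sum(n for _, _, n, _ in selected)
    n_correct = sum(round(n * acc) for _, _, n, acc in selected)
    return n_correct, n_total, n_correct / n_total


early = pooled_accuracy([r for r in rows if r[0] == "Early"])
full = pooled_accuracy([r for r in rows if r[0] == "Full"])

print(f"early: {early[0]}/{early[1]} = {early[2]:.1%}")
print(f"full:  {full[0]}/{full[1]} = {full[2]:.1%}")
```

Under these assumptions the pooled values reproduce the 85.5% and 78.3% totals in the table.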
## 7 Limitations
#### System\.
Our system underperforms RAPTOR on text-only questions (76.2% vs. 80.0%, −3.8 pp); adding summary nodes to the tree is a potential improvement. Per-query cost ($0.215, 93 s) is 42× RAPTOR’s cost, which we consider justified for regulatory use where a human reviewer requires hours per question: the system reaches 94.3% accuracy on the 35 judgment × cross-document questions, the core regulatory review task, so the premium is paid where it matters most. The overall gap over RAPTOR (+6.0 pp) cannot be asserted as a reliable improvement at the current sample size (McNemar p = 0.11). A follow-reference tool for directly navigating “see Table 5.1-1”-style pointers remains unimplemented. The `max_hops` = 4 ceiling is used by 67% of queries; whether increasing the budget would improve accuracy on the most complex questions remains an open question.
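For readers unfamiliar with the paired test behind the RAPTOR comparison, the sketch below shows an exact two-sided McNemar test in pure Python. It is illustrative only: the paper does not state whether the exact or chi-square variant was used, and the discordant counts are not reported in this section, so the function name and the example counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: questions only system A answers correctly
    c: questions only system B answers correctly
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts. They are chosen only so that b - c = 12,
# the net difference implied by 81.5% vs. 75.5% on 200 questions.
p_value = mcnemar_exact(30, 18)
print(f"p = {p_value:.3f}")
```

Only the discordant pairs enter the test, which is why a 6-point accuracy gap on 200 questions can still fall short of significance.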
#### Benchmark\.
The benchmark exhibits five structural limitations identified during evaluation: (1) *Factual Correctness ceiling* (∼0.42 across all methods) reflects single-perspective expected answers rather than retrieval failure; (2) *judgment polarity bias*: 98% “Yes” answers, as FSARs by definition document compliant designs; (3) *limited evidence depth*: 66% of questions are effectively 2-hop; (4) *document coverage imbalance*: Ch. 01 uses only 19% of its pages; (5) *no external validation*: the benchmark is self-designed, with mitigating factors including the three-axis orthogonal design, 3-evaluator majority voting, and uniform comparison across 5 methods under identical conditions.
#### Edge ontology coverage\.
Our eight-relation ontology is chosen for FSAR conformance review and does not cover relations relevant to other regulatory frameworks, such as cross-jurisdictional applicability, temporal validity, conditional applicability, and engineering exception precedents. The three `Violates` instances we report come from a single FSAR corpus; whether similar typed-edge patterns appear in other regulatory domains is unknown. Deploying the system to a different framework requires domain-expert ontology curation and human oversight.
## 8 Conclusion
This paper frames multi-hop regulatory document review as a planning problem and instantiates it as an LLM-based agent that operates over a vectorless document tree with a dynamic knowledge graph as state. On a 200-question NuScale FSAR benchmark, the system reaches 81.5% accuracy with RAGAS Faithfulness 0.93, outperforming GraphRAG, HippoRAG, and LightRAG, and matching RAPTOR while eliminating offline indexing cost. The +38.0 pp gap over PageIndex isolates state-conditioned planning as the primary accuracy driver. A 200-question ablation finds edge inference contributes no accuracy gain at 2.8× cost, but produces typed regulatory relations: 3 `Violates` edges among 7,391 (0.04%) make scope-bounded inapplicability and partial conformance explicit, the audit form that 10 CFR 50 Appendix B requires. Failure mode analysis shows 29 of 38 errors are wording mismatch rather than regulatory error, leaving a deployment-relevant failure rate of 4.5%; hop-stratified accuracy (85.5% early vs. 78.3% full) confirms that the early-stopping mechanism is well-calibrated. The architecture (document-as-environment, action interface, KG state, dynamic termination) transfers to any domain whose review process requires multi-hop evidence gathering and explicit conformance judgment, with the edge ontology as the domain-specific component.
## Impact Statement
The primary goal of this work is to improve the efficiency and accuracy of agent\-based machine learning systems\. We anticipate no immediate negative societal impacts arising uniquely from our contributions beyond those generally associated with the deployment of large language model based agents\. On the contrary, by reducing the computational cost of multi\-hop reasoning over large document collections, our approach can help make sophisticated agentic workflows more accessible and environmentally efficient in settings where compute is a limiting factor\. In safety\-critical domains such as nuclear regulatory review, we emphasize that the system is intended to assist, not replace, expert human judgment, and that its outputs should remain subject to domain\-expert oversight\.
## Acknowledgements
This work was supported by the Substantiation Support Program, through the Korea Innovation Foundation funded by the Ministry of Science and ICT \(No\. 76170\-26\)\.
## References
- Acharya et al\. \(2023\)Acharya, A\., Munikoti, S\., Hellinger, A\., Smith, S\., Wagle, S\., and Horawalavithana, S\.NuclearQA: A human\-made benchmark for language models for the nuclear domain\.*arXiv preprint arXiv:2310\.10920*, 2023\.
- Asai et al\. \(2024\)Asai, A\., Wu, Z\., Wang, Y\., Sil, A\., and Hajishirzi, H\.Self\-RAG: Learning to retrieve, generate, and critique through self\-reflection\.In*Proceedings of the 12th International Conference on Learning Representations \(ICLR\)*, 2024\.Oral\.
- Cabrio & Villata \(2012\)Cabrio, E\. and Villata, S\.Combining textual entailment and argumentation theory for supporting online debates interactions\.In*Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics \(ACL\)*, pp\. 208–212, 2012\.
- Chen et al\. \(2026\)Chen, K\., Kong, Q\., Zhao, F\., and Mao, W\.APEX\-Searcher: Augmenting LLMs’ search capabilities through agentic planning and execution\.*arXiv preprint arXiv:2603\.13853*, 2026\.
- Cho et al\. \(2024\)Cho, J\. et al\.M3DocRAG: Multi\-modal retrieval is what you need for multi\-page multi\-document understanding\.*arXiv preprint arXiv:2411\.04952*, 2024\.
- Doris et al\. \(2024\)Doris, A\. C\. et al\.DesignQA: A multimodal benchmark for evaluating large language models’ understanding of engineering documentation\.*arXiv preprint arXiv:2404\.07917*, 2024\.
- Edge et al\. \(2024\)Edge, D\., Trinh, H\., Cheng, N\., Bradley, J\., Chao, A\., Mody, A\., Truitt, S\., and Larson, J\.From local to global: A Graph RAG approach to query\-focused summarization\.*arXiv preprint arXiv:2404\.16130*, 2024\.
- Es et al\. \(2024\)Es, S\., James, J\., Anke, L\. E\., and Schockaert, S\.RAGAs: Automated evaluation of retrieval augmented generation\.In*Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(EACL\)*, pp\. 150–158, 2024\.
- Feng et al\. \(2025\)Feng, T\., Wu, Y\., Lin, G\., and You, J\.Graph world model\.In*Proceedings of the 42nd International Conference on Machine Learning \(ICML\)*, 2025\.
- Friedenthal et al\. \(2014\)Friedenthal, S\., Moore, A\., and Steiner, R\.*A Practical Guide to SysML: The Systems Modeling Language*\.Morgan Kaufmann, 3rd edition, 2014\.
- Gao et al\. \(2023\)Gao, Y\., Xiong, Y\., Gao, X\., et al\.Retrieval\-augmented generation for large language models: A survey\.*arXiv preprint arXiv:2312\.10997*, 2023\.
- Guo et al\. \(2024\)Guo, Z\., Xia, L\., Yu, Y\., Ao, T\., and Huang, C\.LightRAG: Simple and fast retrieval\-augmented generation\.*arXiv preprint arXiv:2410\.05779*, 2024\.
- Gutiérrez et al\. \(2024\)Gutiérrez, B\. J\., Shu, Y\., Gu, Y\., Yasunaga, M\., and Su, Y\.HippoRAG: Neurobiologically inspired long\-term memory for large language models\.*arXiv preprint arXiv:2405\.14831*, 2024\.
- Hassanzadeh et al\. \(2019\)Hassanzadeh, O\., Bhattacharjya, D\., Feblowitz, M\., Srinivas, K\., Perrone, M\., Sohrabi, S\., and Katz, M\.Answering binary causal questions through large\-scale text mining\.In*Proceedings of the 28th International Joint Conference on Artificial Intelligence \(IJCAI\)*, pp\. 5003–5009, 2019\.
- Hellert et al\. \(2026\)Hellert, T\., Montenegro, J\., and Sulc, A\.Osprey: Production\-ready agentic AI for safety\-critical control systems\.*APL Machine Learning*, 4\(1\):016103, 2026\.
- Jain et al\. \(2020\)Jain, A\., Meenachi, N\. M\., and Venkatraman, B\.NukeBERT: A pre\-trained language model for low resource nuclear domain\.*arXiv preprint arXiv:2003\.13821*, 2020\.
- Lai et al\. \(2024\)Lai, V\. D\. et al\.SEC\-QA: A systematic evaluation corpus for financial QA\.*arXiv preprint arXiv:2406\.14394*, 2024\.
- Lee et al\. \(2024\)Lee, K\.\-H\., Chen, X\., Furuta, H\., Canny, J\., and Fischer, I\.ReadAgent: A human\-inspired reading agent with gist memory of very long contexts\.In*Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, 2024\.
- Lee \(2025\)Lee, Y\. P\.Mechanistic interpretability of LoRA\-adapted language models for nuclear reactor safety applications\.*arXiv preprint arXiv:2507\.09931*, 2025\.
- Lewis et al\. \(2020\)Lewis, P\., Perez, E\., Piktus, A\., Petroni, F\., Karpukhin, V\., Goyal, N\., Küttler, H\., Lewis, M\., Yih, W\.\-t\., Rocktäschel, T\., Riedel, S\., and Kiela, D\.Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, volume 33, pp\. 9459–9474, 2020\.
- Ma et al\. \(2024\)Ma, Y\. et al\.MMLongBench\-Doc: Benchmarking long\-context document understanding with visualizations\.*arXiv preprint arXiv:2407\.01523*, 2024\.
- Nahid & Rafiei \(2025\)Nahid, M\. M\. H\. and Rafiei, D\.PRISM: Agentic retrieval with LLMs for multi\-hop question answering\.*arXiv preprint arXiv:2510\.14278*, 2025\.
- Pan et al\. \(2017\)Pan, L\., Li, C\., Li, J\., and Tang, J\.Prerequisite relation learning for concepts in MOOCs\.In*Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(ACL\)*, pp\. 1447–1456, 2017\.
- Peldszus & Stede \(2013\)Peldszus, A\. and Stede, M\.From argument diagrams to argumentation mining in texts: A survey\.*International Journal of Cognitive Informatics and Natural and Artificial Intelligence*, 7:1–31, 2013\.
- Sarthi et al\. \(2024\)Sarthi, P\., Abdullah, S\., Tuli, A\., Khanna, S\., Goldie, A\., and Manning, C\. D\.RAPTOR: Recursive abstractive processing for tree\-organized retrieval\.In*Proceedings of the 12th International Conference on Learning Representations \(ICLR\)*, 2024\.
- Schick et al\. \(2023\)Schick, T\., Dwivedi\-Yu, J\., Dessì, R\., Raileanu, R\., Lomeli, M\., Hambro, E\., Zettlemoyer, L\., Cancedda, N\., and Scialom, T\.Toolformer: Language models can teach themselves to use tools\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, volume 36, 2023\.
- Sun et al\. \(2025\)Sun, L\., He, L\., Jia, S\., He, Y\., and You, C\.DocAgent: An agentic framework for multi\-modal long\-context document understanding\.In*Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, 2025\.
- Wang et al\. \(2025\)Wang, S\., Zhou, Y\., and Fang, Y\.BookRAG: A hierarchical structure\-aware index\-based approach for RAG on complex documents\.*arXiv preprint arXiv:2512\.03413*, 2025\.
- Xiong et al\. \(2026\)Xiong, B\. et al\.FDARxBench: Benchmarking regulatory and clinical reasoning on FDA generic drug assessment\.*arXiv preprint arXiv:2603\.19539*, 2026\.
- Yao et al\. \(2023a\)Yao, S\., Yu, D\., Zhao, J\., Shafran, I\., Griffiths, T\. L\., Cao, Y\., and Narasimhan, K\.Tree of thoughts: Deliberate problem solving with large language models\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, volume 36, 2023a\.
- Yao et al\. \(2023b\)Yao, S\., Zhao, J\., Yu, D\., Du, N\., Shafran, I\., Narasimhan, K\., and Cao, Y\.ReAct: Synergizing reasoning and acting in language models\.In*Proceedings of the 11th International Conference on Learning Representations \(ICLR\)*, 2023b\.
- Zhang & Tang \(2025\)Zhang, M\. and Tang, Y\.PageIndex: Next\-generation vectorless, reasoning\-based RAG\.[https://pageindex\.ai/](https://pageindex.ai/), 2025\.
- Zheng et al\. \(2023\)Zheng, L\., Chiang, W\.\-L\., Sheng, Y\., et al\.Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.*arXiv preprint arXiv:2306\.05685*, 2023\.
- Zhong et al\. \(2012\)Zhong, B\., Ding, L\., Luo, H\., Zhou, Y\., Hu, Y\., and Hu, H\.Ontology\-based semantic modeling of regulation constraint for automated construction quality compliance checking\.*Automation in Construction*, 28:58–70, 2012\.
- Zhou et al\. \(2024\)Zhou, A\., Yan, K\., Shlapentokh\-Rothman, M\., Wang, H\., and Wang, Y\.\-X\.Language agent tree search unifies reasoning, acting, and planning in language models\.In*Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, 2024\.
- Zhu et al\. \(2021\)Zhu, F\. et al\.TAT\-QA: A question answering benchmark on a hybrid of tabular and textual content in finance\.*arXiv preprint arXiv:2105\.07624*, 2021\.
## Appendix
## Appendix A Baseline Configurations
The configuration used for each baseline is detailed in Table[11](https://arxiv.org/html/2606.29399#A1.T11)\. All methods share the same generation LLM \(GPT\-4\.1, temperature 0, max\_tokens 300\) and operate on the same 200\-question benchmark over NuScale FSAR Chapters 01 and 05\. The configurations below reflect each method’s published defaults, tuned only where the original settings were incompatible with our document corpus \(e\.g\., chunk size adjusted to accommodate FSAR section lengths\)\.
RAPTOR constructs a recursive summarization tree using 100-token leaves, producing 3,422 leaves from our corpus, with retrieval via the `collapse_tree` strategy under a 2,000-token budget. HippoRAG extracts a hippocampal associative knowledge graph from 1,000-character passages (1,080 in total) and retrieves via Personalized PageRank combined with dense similarity (top-10). LightRAG performs three-pass entity-relation extraction over 1,200-token chunks (266 chunks), using hybrid retrieval of top-40 entities plus top-20 chunks. GraphRAG builds a community graph over 1,200-token chunks (267 chunks) and retrieves via local search at community level 2. PageIndex shares our system’s tree environment and tools (browse, read, search) exactly, differing only in action-selection logic (no KG state accumulation), and thus serves as the single-variable ablation for state-conditioned planning.

Table 11: Baseline configurations. All methods use GPT-4.1 for generation.
## Appendix B RAGAS: Our System by Reasoning Type
The cross\-method per\-type breakdown is shown in the main body \(Table[6](https://arxiv.org/html/2606.29399#S5.T6)\)\. Here we provide our system’s decomposition across all four RAGAS metrics \(Faithfulness, Answer Relevancy, Context Recall, Factual Correctness\) by reasoning type, giving a fuller picture of where the planning loop is strongest and where it struggles\.
Table 12: RAGAS metrics for our system by reasoning type.

Two patterns emerge. First, judgment questions dominate across three of four metrics (Faith. 0.97, AR 0.89, CR 0.96), indicating that the planning loop is particularly effective at assembling complete evidence chains for regulatory conformance reasoning, the scenario most aligned with multi-hop state-conditioned exploration. Second, Factual Correctness is the only metric where judgment does not lead: the 0.42 overall FC reflects the wording-sensitive nature of this metric (exact string match against the expected answer) rather than a retrieval failure. A closer look at flagged cases shows that FC penalizes correct answers phrased differently than the reference (e.g., “vertical helical once-through SG with 1,380 tubes” vs. the reference’s “helical coil SG integrated within RPV”). Both responses are factually correct; the metric penalizes the paraphrase. This is a benchmark-level limitation (single-perspective reference answers) rather than a model failure.
## Appendix C Per-Question Cost Breakdown
To illustrate the dynamic termination mechanism concretely, we profiled five representative questions drawn from different points in the three-axis taxonomy. Token counts were measured with `tiktoken` (`o200k_base` encoding) over the full LLM inputs (retrieved context, system prompts, and agent reasoning) and outputs, and costs were computed using GPT-4.1 pricing ($2/M input tokens, $8/M output tokens). Because the LLM is non-deterministic even at temperature 0 (due to backend batching), node and edge counts can vary slightly between runs; the numbers below are from a single dedicated profiling run.
Table 13: Per-question cost breakdown (5-question profiling sample). Illustrates dynamic termination: simple queries (Q001) cost $0.03, complex queries (Q191) cost $0.29.

The breakdown demonstrates that cost scales approximately with question complexity: the single-hop factual query Q001 consumes less than 10K tokens and costs $0.03, while cross-document queries (Q071, Q131, Q161, Q191) use the full 4-hop budget and cost between $0.18 and $0.30. Node counts also correlate with complexity, ranging from 4 nodes (single factual) to 19 nodes (multi-hop cross-document). Edge counts scale super-linearly with node counts, reflecting the pairwise nature of the two-stage edge inference. Total 200-question costs reported in the main body (Table [7](https://arxiv.org/html/2606.29399#S5.T7)) are extrapolated from this sample with an estimated ±10–15% variance; a larger profiling run would tighten these estimates but would not change the qualitative conclusion that per-query cost is dominated by a small number of 4-hop complex queries.
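The cost figures follow from token counts and the stated GPT-4.1 prices. The sketch below shows that arithmetic under the $2/M input and $8/M output rates quoted above. The function name `query_cost` and the example input/output split are hypothetical: the text gives only a total of under 10K tokens and a cost of about $0.03 for Q001, not the split.

```python
INPUT_USD_PER_MTOK = 2.0   # GPT-4.1 input price stated in the text
OUTPUT_USD_PER_MTOK = 8.0  # GPT-4.1 output price stated in the text


def query_cost(input_tokens, output_tokens):
    """Dollar cost of one query from its measured token counts."""
    return (
        input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    )


# Hypothetical split for a short single-hop query (under 10K tokens in
# total); the actual per-question token counts are in Table 13.
example_cost = query_cost(input_tokens=8_000, output_tokens=1_500)
print(f"${example_cost:.3f}")
```

Because output tokens are priced four times higher than input tokens, the split between retrieved context and generated reasoning matters as much as the total token count.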
## Appendix D 10Q Ablation: Per-Question Judge Detail
The 10-question ablation summary table is shown in the main text (Table [8](https://arxiv.org/html/2606.29399#S6.T8)). Here we provide the per-question O/X detail across the four variants (`full`, `no_vision`, `no_edges`, `no_browse_first`). The 10 questions were selected to span the three-axis taxonomy, including at least one question from each reasoning type and each modality.
Table 14: Per-question 3-Judge detail across ablation variants.

Robustness vs. diagnostic questions. Six of the ten questions (Q001, Q010, Q031, Q071, Q161, Q176) are answered correctly by all four variants, indicating that the baseline system, even with any single component removed, handles single-hop factual queries, multi-hop factual queries, and judgment queries over well-structured evidence paths. These six *robustness* questions confirm that the planning loop as a whole is resilient to individual component ablation for straightforward cases. The remaining four questions are *diagnostic*: each is failed by exactly one variant, revealing which component is the critical dependency for that question type.

Failure modes by component. Vision RAG removal drops Q101 (table/comparative) and Q131 (composite/comparative). These are the only two questions requiring numerical content from figure-rendered tables; without vision processing the system cannot recover the relevant cells. Faithfulness also drops from 0.96 to 0.83, reflecting ungrounded answers when tabular evidence is withheld. Edge inference removal drops Q058 (seismic scope boundary), where the `Violates` edge was required to mark non-safety-related systems as outside Category I scope. This failure does not replicate at the 200Q scale: both variants answer Q058 correctly in the main 200Q evaluation, indicating the 10Q Q058 result reflects execution-level variance rather than a systematic dependency. Browse-first removal drops Q191 (image/judgment/cross), where the agent, lacking the table-of-contents injection at Hop 1, selects poorly-targeted sections and never recovers the cross-document evidence chain.

Why scale up only edge inference to 200Q. Each component is responsible for a distinct failure mode at the 10Q level, supporting the complementarity claim. However, Q058 was the only failure that we suspected might be a sample-size artifact (because the `Violates` edge is a rare edge type, it is plausible that the dependency does not generalize). Scaling edge inference to 200Q confirms this suspicion: the effect is reliably null (Table [9](https://arxiv.org/html/2606.29399#S6.T9)), while the vision and browse-first effects continue to manifest (Q101, Q131, Q191 all remain diagnostic at larger scales). This selective scale-up also reflects practical constraints: the 200Q full ablation consumes roughly 200× the compute of the 10Q pilot, so we prioritized the ablation with the strongest a priori cost-accuracy hypothesis (edge inference accounts for 65% of per-query cost).