ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience
Summary
ReflectiChain bridges the epistemic gap between LLMs and RL for supply chain resilience using a generative world model and double-loop learning, improving rationale consistency by 33% and maintaining operability under adversarial shocks.
Source: [https://arxiv.org/html/2606.10359](https://arxiv.org/html/2606.10359)
Jia Luo¹

¹School of Foreign Languages, Huazhong University of Science and Technology, Wuhan 430074, China. Correspondence to: Jia Luo <u202317016@hust.edu.cn>.

2nd Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026), Seoul, South Korea. Copyright 2025 by the author(s).

###### Abstract
AI agents in supply chains face a fundamental epistemic gap: large language models (LLMs) interpret policies but lack physical grounding, while reinforcement learning (RL) optimizes flows but is semantically blind to unstructured constraints. We introduce ReflectiChain, which bridges this gap through a Generative Supply Chain World Model (SC-WM), encoding heterogeneous supply networks into a 6-dim graph-latent space with physical conservation, and Double-Loop Learning that separates epistemic uncertainty (KL-trust-region-bounded policy adaptation) from aleatoric uncertainty (stochastic latent rollouts). On Semi-Sim, a 10-node semiconductor benchmark with SIR risk propagation, 6 perturbation types, and 10 policy constraint templates, ReflectiChain improves Rationale Consistency Score by 33.0% ($p<0.0001$, $d=2.78$), maintains 82.3% operability under adversarial shocks, and exhibits anti-fragile behavior (+40.2% gain under moderate pressure). We identify three operational epistemic mechanisms (uncertainty separation, knowledge-boundary detection, and empirical Bayesian policy updating) and discuss five limitation categories.
Figure 1: ReflectiChain architecture. (Left) SC-WM: Graph Encoder $\to$ Latent $z_t$ $\to$ Multi-step Rollout $\to$ Dual-Head Decoder. (Right) Double-Loop: Reflection-in-Action ($\mathcal{C}_{rule}$-bounded candidate scoring) + Reflection-on-Action (KL-trust-region LoRA updates).

## 1 Introduction
Modern semiconductor supply chains exemplify a critical epistemic grounding problem: when geopolitical policies arrive as unstructured natural-language text, an AI agent must jointly reason about what constraints mean and what actions are physically possible, bridging semantic and physical knowledge whose representations are fundamentally misaligned. This epistemic gap causes complementary failure modes: RL is semantically blind, because policies expressed as text never enter its state representation; LLMs suffer grounding gaps, because they prescribe semantically plausible but physically infeasible actions. Neither system can represent the boundary of its own knowledge.
Illustration. When the CHIPS Act’s “guardrail clause” prohibits entities receiving U.S. subsidies from expanding advanced capacity ($\leq 28$ nm) in mainland China for 10 years, an agent must parse the conditional triggers (temporal: 10 years; geographic: mainland China; technical: $\leq 28$ nm), verify physical feasibility of alternative routes, and anticipate cascading network effects. In a 4-tier network (S1–S3 $\to$ M1–M2 $\to$ D1–D2 $\to$ R1–R3), an export ban severs certified edge E_M1_D1. A vanilla LLM proposes uncertified edge E_S1_D2 (capacity 0), which is semantically plausible but physically impossible. RL routes through E_M1_D1, which is physically optimal but policy-violating. ReflectiChain addresses both through epistemic grounding: SC-WM encodes $G_t$ into a 6-dim latent $z_t$ and performs $H=5$ rollouts to simulate physical consequences; Double-Loop separates epistemic uncertainty (KL-trust-region-bounded policy adaptation) from aleatoric uncertainty (stochastic rollouts); and $\mathcal{C}_{rule}$ detects knowledge boundaries when all $N$ candidates are physically infeasible.
We formalize this as a C-POMDP where epistemic constraints $\mathcal{C}_{policy}$ are expressed in natural language, and contribute: (1) SC-WM: a topology-aware world model with an MPNN+attention encoder, 6-dim latent, learned transition dynamics, and dual-head decoding with physical conservation; (2) Double-Loop Learning: $\mathcal{C}_{rule}$-bounded scoring + KL-trust-region LoRA updates; (3) Epistemic mechanisms: uncertainty separation, boundary detection, empirical Bayesian updating; (4) Rigorous validation across 4 strategies $\times$ 4 models, with bootstrap tests, ablations with variance, and anti-fragility analysis.
## 2 Related Work

LLMs for Supply Chains. Traditional OR and RL methods are fragile under policy uncertainty ([11](https://arxiv.org/html/2606.10359#bib.bib14)). LLMs enable forecasting ([1](https://arxiv.org/html/2606.10359#bib.bib15); [6](https://arxiv.org/html/2606.10359#bib.bib17)) and optimization ([16](https://arxiv.org/html/2606.10359#bib.bib21); [14](https://arxiv.org/html/2606.10359#bib.bib23)). KG-augmented LLMs parse geopolitical risk ([13](https://arxiv.org/html/2606.10359#bib.bib25); [5](https://arxiv.org/html/2606.10359#bib.bib26); [8](https://arxiv.org/html/2606.10359#bib.bib27)) but remain static interpreters: they classify policies without simulating physical propagation.

Generative World Models. Pixel-level models ([2](https://arxiv.org/html/2606.10359#bib.bib30)) are prohibitive for graphs. Latent-space models ([9](https://arxiv.org/html/2606.10359#bib.bib32); [7](https://arxiv.org/html/2606.10359#bib.bib33)) lack semantic reasoning. LLM-driven simulators ([3](https://arxiv.org/html/2606.10359#bib.bib34); [17](https://arxiv.org/html/2606.10359#bib.bib35)) hallucinate cascades. SC-WM resolves this with physical conservation enforcement.

Reflection. Verbal methods (Reflexion ([10](https://arxiv.org/html/2606.10359#bib.bib37)), ReAct ([15](https://arxiv.org/html/2606.10359#bib.bib39))) append text without modifying policy. ReflAct succumbs to goal drift. Test-time training ([12](https://arxiv.org/html/2606.10359#bib.bib43); [4](https://arxiv.org/html/2606.10359#bib.bib46)) is single-step. Our Double-Loop evolves the policy distribution via $K=3$-step gradients.
## 3 ReflectiChain Framework

We formalize the setting as a C-POMDP $(\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\Omega,\mathcal{O},\mathcal{C},\gamma)$. The observation $o_t$ comprises structured inventory $\{I_{i,t},C_{i,t}\}$ and unstructured policy texts $\mathcal{C}_{policy}$. Objective: $\max_{\pi}\mathbb{E}[\sum\gamma^{t}r_{t}]$ s.t. $a_{t}\models\mathcal{C}_{policy},\ \forall t$.
### 3.1 Generative Supply Chain World Model

$G_t=(\mathcal{V},\mathcal{E},X_t,E_t)$: 10 nodes across 4 echelons (3S+2M+2D+3R), $\sim$30 edges (certified/uncertified). Node features: inventory, cash, congestion, compliance, risk, production rate, quality (A/B), region (Alpha/Beta). Edge features: certification, capacity $[30,150]$, latency $[1,5]$, load, disruption probability $[0.02,0.08]$, carbon cost.

Encoder $E_\psi$: MPNN with multi-head attention: $\mathbf{h}_v^{(l+1)}=\text{Attention}^{(l)}(\mathbf{h}_v^{(l)},\{\mathbf{h}_u^{(l)}\oplus\mathbf{e}_{uv}\}_{u\in\mathcal{N}(v)})$. Graph latent: $z_t=W_{\text{proj}}\cdot\frac{1}{|\mathcal{V}|}\sum_v\mathbf{h}_v^{(L)}\in\mathbb{R}^6$. The six dimensions are inventory, congestion, demand pressure, carbon, stockout risk, and constraint tension.

Dynamics $\mathcal{T}_\omega$: $z_{t+1}=\text{GELU}(z_t+M_\omega\cdot z_t+\Delta z(a_t;\omega))$, $M_\omega\in\mathbb{R}^{6\times 6}$. Action perturbations $\Delta z(a_t)$: transfer (uncertified) $\to$ tension $+0.3$; produce $\to$ inventory $+0.8$, carbon $+0.2$; wait $\to$ congestion $-0.05$. Rollouts use $H=5$ steps.
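The latent transition can be illustrated with a small deterministic sketch. It applies the stated update $z_{t+1}=\text{GELU}(z_t+M_\omega z_t+\Delta z(a_t))$ for $H=5$ steps using the action perturbations listed above. The zero matrix standing in for the learned $M_\omega$, the initial latent, the action names, and the action plan are all illustrative assumptions; the paper's rollouts are stochastic and use learned parameters.

```python
import math

# Order of the six latent dimensions named in the text.
DIMS = ["inventory", "congestion", "demand_pressure",
        "carbon", "stockout_risk", "constraint_tension"]

# Action perturbations Delta z(a_t) as listed in the text.
# The dictionary keys are hypothetical action names.
ACTION_DELTA = {
    "transfer_uncertified": {"constraint_tension": 0.3},
    "produce": {"inventory": 0.8, "carbon": 0.2},
    "wait": {"congestion": -0.05},
}


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def step(z, action, M):
    """One transition z_{t+1} = GELU(z_t + M z_t + Delta z(a_t))."""
    n = len(DIMS)
    delta = [ACTION_DELTA[action].get(name, 0.0) for name in DIMS]
    mz = [sum(M[i][j] * z[j] for j in range(n)) for i in range(n)]
    return [gelu(z[i] + mz[i] + delta[i]) for i in range(n)]


def rollout(z0, actions, M):
    """Return the latent trajectory [z_0, z_1, ..., z_H]."""
    trajectory = [list(z0)]
    for action in actions:
        trajectory.append(step(trajectory[-1], action, M))
    return trajectory


# Assumptions: a zero matrix replaces the learned M_omega, and z_0 is arbitrary.
M_PLACEHOLDER = [[0.0] * 6 for _ in range(6)]
Z0 = [0.5] * 6
PLAN = ["produce", "wait", "wait", "produce", "wait"]  # H = 5 steps
trajectory = rollout(Z0, PLAN, M_PLACEHOLDER)
```

With a learned $M_\omega$ the same loop would couple the dimensions, for example letting congestion feed into stockout risk; the placeholder matrix leaves them independent.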
Dual-Head Decoder: $\hat{r}_{wm}$ (Reward), $\Delta\hat{S}_{pred}$ (State Delta). $\mathcal{L}_{WM}=\text{MSE}(\hat{r}_{wm},r_{true})+0.5\cdot\text{MSE}(\Delta\hat{S}_{pred},\Delta S_{true})$.
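As a worked instance of the loss above, the following sketch computes $\mathcal{L}_{WM}$ on plain Python lists. The function names and the flattening of state deltas into a single list are assumptions made for illustration.

```python
def mse(pred, true):
    """Mean squared error over two equal-length sequences."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def world_model_loss(r_hat, r_true, ds_hat, ds_true, state_weight=0.5):
    """L_WM = MSE(reward head) + 0.5 * MSE(state-delta head)."""
    return mse(r_hat, r_true) + state_weight * mse(ds_hat, ds_true)


# Reward error of 1.0 on one sample, state-delta error of 2.0 on one of two entries.
example_loss = world_model_loss([1.0], [0.0], [0.0, 2.0], [0.0, 0.0])
```

Here the reward term contributes 1.0 and the state term contributes 0.5 × 2.0, so `example_loss` is 2.0.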
### 3.2 Double-Loop Test-Time Learning

Reflection-in-Action: $a_t^{*}=\arg\max_{k\in[N]}\left(\alpha\cdot\text{Clip}(s_{llm}^{(k)},\mathcal{C}_{rule})+\beta\cdot\hat{r}_{wm}^{(k)}\right)$. $\mathcal{C}_{rule}$ verifies mass conservation ($q^{\text{ship}}\leq I_{\text{source}}$), capacity ($q^{\text{ship}}\leq\text{cap}_{\text{edge}}$), and edge existence ($e.\text{is\_active}$). A violation clips the candidate's LLM score to zero. $\alpha=0.6$, $\beta=0.4$.
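The scoring rule can be sketched as follows, assuming each candidate carries its shipment quantity, the source inventory, the edge capacity, an activity flag, an LLM score, and a world-model reward estimate. The `Candidate` fields and the example values are hypothetical; only the three checks and the weights $\alpha=0.6$, $\beta=0.4$ come from the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    q_ship: float            # proposed shipment quantity
    source_inventory: float  # I_source
    edge_capacity: float     # cap_edge
    edge_active: bool        # e.is_active
    s_llm: float             # LLM plausibility score
    r_wm: float              # world-model reward estimate


def passes_c_rule(c):
    """Mass conservation, capacity, and edge existence checks."""
    return (c.edge_active
            and c.q_ship <= c.source_inventory
            and c.q_ship <= c.edge_capacity)


def clipped_llm_score(c):
    """Clip(s_llm, C_rule): zero whenever a hard rule is violated."""
    return c.s_llm if passes_c_rule(c) else 0.0


def select_action(candidates, alpha=0.6, beta=0.4):
    """Return (index of best candidate, list of combined scores)."""
    scores = [alpha * clipped_llm_score(c) + beta * c.r_wm for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


candidates = [
    # Uncertified edge with capacity 0: plausible to the LLM, physically infeasible.
    Candidate(q_ship=10, source_inventory=180, edge_capacity=0,
              edge_active=True, s_llm=0.9, r_wm=0.1),
    # Certified edge with spare capacity.
    Candidate(q_ship=10, source_inventory=180, edge_capacity=120,
              edge_active=True, s_llm=0.7, r_wm=0.5),
]
best_index, scores = select_action(candidates)
```

In this example the first candidate keeps only its world-model term (0.4 × 0.1), so the feasible second candidate is selected despite its lower LLM score.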
Reflection-on-Action: $\nabla_\theta\mathcal{J}\approx\sum_{j\in\mathcal{B}}r^{(j)}\nabla_\theta\log\pi_\theta(a_j|o_j)-\eta_{KL}\nabla_\theta D_{KL}(\pi_\theta\|\pi_{base})$, where the batch $\mathcal{B}$ spans $K=3$ steps.
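To make the two gradient terms concrete, here is a minimal sketch that replaces the LoRA-adapted LLM policy with an observation-independent softmax over three discrete actions, so both gradients have closed forms in the logits. This stand-in policy, the value of `eta_kl`, and the sample batch are assumptions and not the paper's setup.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_regularized_gradient(logits, base_logits, batch, eta_kl=0.1):
    """Gradient of sum_j r_j log pi(a_j) - eta_kl * KL(pi || pi_base) w.r.t. logits.

    batch is a list of (action_index, reward) pairs.
    """
    pi = softmax(logits)
    base = softmax(base_logits)
    grad = [0.0] * len(logits)
    # Policy-gradient term: d log pi(a) / d logit_k = 1[k == a] - pi_k.
    for action, reward in batch:
        for k in range(len(logits)):
            grad[k] += reward * ((1.0 if k == action else 0.0) - pi[k])
    # KL term: d KL / d logit_k = pi_k * (log(pi_k / base_k) - KL).
    kl = kl_divergence(pi, base)
    for k in range(len(logits)):
        grad[k] -= eta_kl * pi[k] * (math.log(pi[k] / base[k]) - kl)
    return grad


base_logits = [0.0, 0.0, 0.0]
# At pi = pi_base the KL term vanishes and only the reward term remains.
grad_at_base = kl_regularized_gradient(base_logits, base_logits, [(0, 1.0)])
# With no reward signal, a policy that has drifted toward action 0 is pulled back.
grad_drifted = kl_regularized_gradient([1.0, 0.0, 0.0], base_logits, [])
```

The second call shows the trust-region role of the KL term: the gradient on the drifted logit is negative, which moves the policy back toward $\pi_{base}$ under gradient ascent.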
Epistemic Mechanisms (operationalized). ReflectiChain instantiates three concrete epistemic operations. (i) Uncertainty separation: the KL trust region bounds epistemic uncertainty (what the agent does not know about $\pi_\theta^{*}$ given finite experience), while SC-WM rollouts handle aleatoric uncertainty (inherent stochasticity from demand volatility and perturbation timing). These two uncertainty types flow through separate architectural pathways with distinct gradients. (ii) Knowledge-boundary detection: when $\text{Clip}(s_{llm}^{(k)},\mathcal{C}_{rule})=0$ for all $N$ candidates, the agent receives an unambiguous signal that its generative distribution cannot produce feasible actions. This triggers systematic exploration ($N\leftarrow N\times 2$) or conservative fallback, providing a measurable operational definition of epistemic boundary. (iii) Empirical Bayesian policy updating: each episode trajectory provides data for updating the posterior over $\theta$. The policy gradient implements this update with the KL penalty acting as a prior centered at $\pi_{base}$, preventing overfitting to stochastic single-episode outcomes.
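Knowledge-boundary detection can be sketched as a loop that doubles $N$ whenever every candidate is infeasible and falls back to a conservative action once a cap is reached. Combining the two responses in this order, the cap `n_max`, the `"wait"` fallback, and the `propose` and `is_feasible` callables are all assumptions; the text states only that one of the two responses is triggered.

```python
def decide_with_boundary_check(propose, is_feasible,
                               n_init=3, n_max=12, fallback="wait"):
    """Return (feasible candidates or [fallback], number of boundary hits).

    propose(n) yields n candidate actions; is_feasible plays the role of C_rule.
    A boundary hit is a round in which all n candidates are infeasible.
    """
    n = n_init
    boundary_hits = 0
    while n <= n_max:
        feasible = [c for c in propose(n) if is_feasible(c)]
        if feasible:
            return feasible, boundary_hits
        boundary_hits += 1  # all n candidates were clipped to zero
        n *= 2              # systematic exploration: N <- N * 2
    return [fallback], boundary_hits


def propose_quantities(n):
    """Hypothetical generator: shipment quantities 100, 90, 80, ..."""
    return [100 - 10 * i for i in range(n)]


# Edge capacity of 50: the first round (N=3) fails, the second (N=6) succeeds.
found, hits = decide_with_boundary_check(propose_quantities, lambda q: q <= 50)
# Nothing is ever feasible: rounds with N = 3, 6, 12 fail, then fallback.
fallback_result, fallback_hits = decide_with_boundary_check(
    propose_quantities, lambda q: False)
```

The boundary-hit count gives a simple measurable signal of how often the generative distribution failed to cover the feasible set.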
### 3.3 Multi-Agent and Adversarial Extension

Markov game: $M=3$ heterogeneous agents (Profit/RCI weight 0.35, Resilience/ARL 0.25, Compliance/CEE 0.20). $\mathcal{G}_{adv}$ severs the max-betweenness edge. Adversarial regret: $\text{Regret}_{adv}=\sum_t\left(\max_{a^{*}}\mathbb{E}[R|a^{*}]-\mathbb{E}[R|a_t]\right)$.
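The regret definition reduces to a per-step gap between the best expected reward and the expected reward of the chosen action. A minimal sketch, assuming the expected rewards are available as a table per step (the action names and values below are hypothetical):

```python
def adversarial_regret(expected_rewards, chosen_actions):
    """Sum over steps of max_a E[R | a] - E[R | a_t].

    expected_rewards[t] maps each action to its expected reward at step t;
    chosen_actions[t] is the action actually taken at step t.
    """
    if len(expected_rewards) != len(chosen_actions):
        raise ValueError("one chosen action is required per step")
    return sum(max(step.values()) - step[action]
               for step, action in zip(expected_rewards, chosen_actions))


expected = [
    {"ship": 1.0, "wait": 0.2},  # step 0: shipping is best
    {"ship": 0.5, "wait": 0.7},  # step 1: waiting is best
]
regret = adversarial_regret(expected, ["wait", "wait"])
```

Here the agent loses 0.8 at step 0 and nothing at step 1, for a total regret of 0.8.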
## 4 Experiments

Semi-Sim. 10-node, $\sim$30-edge, 4-tier network. SIR risk: $\mathcal{R}_{i,t+1}=(1-\gamma)\mathcal{R}_{i,t}+\sum_j w_{ji}\max(0,\mathcal{R}_{j,t}-\tau)$, with $\gamma=0.1$, $\tau=0.3$. 6 perturbation types ($p=0.15$ per step). 10 constraint templates (Absolute Embargo, Certified Path Only, Fair Allocation, Quality Threshold, Temporal Sequence, Data Sovereignty, Carbon Budget, Supplier Diversity, Dual-Use Restriction, Inventory Floor) sampled 2–4 per episode. $T=30$ steps. Data: 3,000 trajectories, 2,000 perturbation scenarios, 500 multi-agent episodes (520 MB). Full spec: Appendix.
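The risk update can be traced on a three-node chain. The sketch below implements the stated recursion with $\gamma=0.1$ and $\tau=0.3$; the node names follow the paper's tiers, but the edge weights $w_{ji}$ and the initial risks are illustrative assumptions, since the text does not list them.

```python
def propagate_risk(risk, weights, gamma=0.1, tau=0.3):
    """One step of R_{i,t+1} = (1 - gamma) R_{i,t} + sum_j w_ji max(0, R_{j,t} - tau).

    risk maps node -> current risk; weights maps (j, i) -> w_ji.
    """
    next_risk = {}
    for i, r_i in risk.items():
        inflow = sum(w * max(0.0, risk[j] - tau)
                     for (j, target), w in weights.items() if target == i)
        next_risk[i] = (1.0 - gamma) * r_i + inflow
    return next_risk


# Hypothetical chain S1 -> M1 -> D1 with equal transmission weights.
risk_t = {"S1": 0.8, "M1": 0.1, "D1": 0.0}
weights = {("S1", "M1"): 0.5, ("M1", "D1"): 0.5}
risk_next = propagate_risk(risk_t, weights)
```

Only S1 exceeds the threshold, so M1 receives 0.5 × (0.8 − 0.3) on top of its decayed risk, while D1 receives nothing because M1 is still below $\tau$.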
Baselines: 4 strategies $\times$ 4 backbones. NoThinking (Direct-CoT), ReAct ([15](https://arxiv.org/html/2606.10359#bib.bib39)), ReflAct (state-goal reflection), LLM+TreeSearch ($B=5$). Models: DeepSeek-V3.2, Qwen2.5-7B, InternLM2.5-7B, GPT-4o-mini. Reference: PPO. All: identical \$150K capital. Metrics: RCS (DeBERTa-NLI), CCR, TI, TS (+CEE/OR/ARL multi-agent). Statistics: 5 seeds, bootstrap $N=100{,}000$, Cohen’s $d$, 95% CI, Two-Way ANOVA.
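The effect-size and bootstrap statistics can be sketched with the standard library. The percentile bootstrap on the difference of means and the pooled-standard-deviation form of Cohen's $d$ are assumptions about the exact procedure, and the two score lists are placeholder values rather than results from the paper.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d with the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (((na - 1) * statistics.variance(a)
                   + (nb - 1) * statistics.variance(b)) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def bootstrap_ci(a, b, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    lower = diffs[int((1.0 - level) / 2.0 * n_boot)]
    upper = diffs[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper


# Placeholder per-seed scores (5 seeds each); NOT values from the paper.
method_scores = [10.0, 11.0, 12.0, 11.5, 10.5]
baseline_scores = [1.0, 2.0, 3.0, 2.5, 1.5]
effect = cohens_d(method_scores, baseline_scores)
ci_low, ci_high = bootstrap_ci(method_scores, baseline_scores, n_boot=2_000)
```

With only 5 seeds per condition the resampled means take few distinct values, so a large `n_boot` mainly smooths the percentile estimate and does not add information.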
### 4.1 Core Findings

Figure 2: Cross-model RCS. ReflectiChain: 88.5–93.1% RCS, +33.0% over ReflAct ($p<0.0001$, $d=2.78$). Lower TS by design ($\alpha>\beta$).

Figure [2](https://arxiv.org/html/2606.10359#S4.F2) and Table [1](https://arxiv.org/html/2606.10359#S4.T1) reveal: (1) PPO collapses (TS $=-0.20$, CCR $=60.7\%$) via compliance violations: semantic blindness is fatal. (2) ReflAct improves RCS (+14.2 pp over ReAct) but plateaus at $<72\%$: verbal reflection cannot resist goal drift. (3) ReflectiChain achieves 88.5–93.1% RCS and the lowest TI (3.10–3.90), a +33.0% RCS gain ($d=2.78$, very large effect).

Table 1: Results (DeepSeek-V3.2). Mean $\pm$ std, 5 seeds. $\dagger$ TreeSearch.
### 4.2 Ablation and Reasoning Analysis

Figure 3: Ablation (5 seeds $\times$ 3 eps). SC-WM: CEE $-49\%$. Retro RL: RCI $-12.8$ pp. KL trust: variance $+81\%$. $\mathcal{C}_{rule}$: RCI $-15.8$ pp.

Figure [3](https://arxiv.org/html/2606.10359#S4.F3): SC-WM removal $\to$ grounding gap (CEE $-49\%$). Retro RL removal $\to$ static myopia (RCI $-12.8$ pp). KL trust removal $\to$ catastrophic drift (variance $+81\%$). $\mathcal{C}_{rule}$ removal $\to$ circular evaluation (RCI 68.5%).

Figure 4: Reasoning traces. Constraint: “Do not access Node C.” Channel A blocked. ReAct: Exception Injection. ReflAct: Goal Drift. Ours: Semantic Anchoring via SC-WM + $\mathcal{C}_{rule}$.
### 4.3 Anti-Fragility and Scaling

Figure 5: Anti-fragility. CEE vs. $\rho$, 7 values, 5-seed 95% CI. CEE: \$1.02M ($\rho=0.3$) $\to$ \$1.43M ($\rho=0.5$), +40.2% ($p<0.05$).

Figure [5](https://arxiv.org/html/2606.10359#S4.F5): anti-fragile response under moderate pressure ($\rho\in[0.3,0.5]$), where Double-Loop discovers counterfactual strategies (+40.2%, $p<0.05$). At $T=100$: ARL $\to$ 0.15, zero divergence. Multi-agent: disabling Double-Loop $\to$ social welfare $-46.7\%$ (\$15.40M $\to$ \$8.21M), $\epsilon$-NE gap 0.66 $\to$ 0.91.

Scaling: $N\in\{1,3,5,7,10\}$, $K\in\{1,3,5,7,10\}$. $N$: 1 $\to$ 3 gives +38.6% ($p<0.01$); $N=7$: +2.1% ($p>0.1$). $K=3$ is optimal; $K=1$ is myopic ($-22.3\%$); $K=7$ dilutes ($-5.2\%$). ANOVA: $F_N=35.8$, $F_K=22.2$, both $p<0.001$. Pareto: $N=3$, $K=3$.
## 5 Limitations

Sim-to-real: Semi-Sim is synthetic; real data is proprietary. Circular evaluation: the LLM critic and $\mathcal{G}_{adv}$ share a model family; $\mathcal{C}_{rule}$ mitigates this for hard constraints, but soft scoring remains LLM-mediated. Test-time safety: LoRA updates risk drift; the KL trust region provides a theoretical safeguard, but human oversight is essential. Scalability: MPNN scales linearly, but LLM scoring grows quadratically. Societal impact: automated agents could be misused; compliance-first design ($\alpha>\beta$) and $\mathcal{C}_{rule}$ prevent this. We do not advocate 100% autonomous management.
## 6 Conclusion

This paper identified a fundamental epistemic grounding gap in AI-driven supply chain management: the misalignment between semantic policy understanding (LLM-mediated) and physical feasibility verification (simulation-mediated). ReflectiChain bridges this gap through SC-WM and Double-Loop learning, operationalizing three epistemic mechanisms: uncertainty separation (KL trust region vs. stochastic rollouts), knowledge-boundary detection (constraint infeasibility signals), and empirical Bayesian updating (retrospective credit assignment with KL prior). On Semi-Sim across 4 models $\times$ 4 strategies, it achieves a 33.0% RCS improvement ($p<0.0001$, $d=2.78$) with anti-fragile behavior under adversarial pressure. These epistemic principles extend beyond supply chains to any domain where language-conditioned agents must operate under physical constraints with verifiable knowledge boundaries.
## References
1. Large language models are zero-shot time series forecasters. arXiv:2310.07820. [Link](https://arxiv.org/abs/2310.07820)
2. D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap (2024). Mastering diverse domains through world models. arXiv:2301.04104. [Link](https://arxiv.org/abs/2301.04104)
3. S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu (2023). Reasoning with language model is planning with world model. arXiv:2305.14992. [Link](https://arxiv.org/abs/2305.14992)
4. Y. Hong, H. Huang, M. Li, L. Fei-Fei, J. Wu, and Y. Choi (2026). Learning from trials and errors: reflective test-time planning for embodied llms. arXiv:2602.21198. [Link](https://arxiv.org/abs/2602.21198)
5. M. Iacoviello and J. Tong (2026). The ai-gpr index: measuring geopolitical risk using artificial intelligence. Working Paper.
6. S. Jia, B. Song, C. Ye, and C. Yuan (2026). M3Time: llm-enhanced multi-modal, multi-scale, and multi-frequency multivariate time series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence 40(27), pp. 22265–22273. [Link](https://ojs.aaai.org/index.php/AAAI/article/view/39383), [Document](https://dx.doi.org/10.1609/aaai.v40i27.39383)
7. T. Kipf, E. van der Pol, and M. Welling (2020). Contrastive learning of structured world models. arXiv:1911.12247. [Link](https://arxiv.org/abs/1911.12247)
8. B. Kwon, T. Park, P. Rungcharoenkitkul, and F. Smets (2025). Parsing the pulse: decomposing macroeconomic sentiment with llms. Bank for International Settlements, Monetary and Economic Department.
9. J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver (2020). Mastering atari, go, chess and shogi by planning with a learned model. Nature 588(7839), pp. 604–609. ISSN 1476-4687. [Link](http://dx.doi.org/10.1038/s41586-020-03051-4), [Document](https://dx.doi.org/10.1038/s41586-020-03051-4)
10. N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023). Reflexion: language agents with verbal reinforcement learning. arXiv:2303.11366. [Link](https://arxiv.org/abs/2303.11366)
11. Z. Song, Y. Xie, L. Yang, and Y. Zhao (2026). Large language models in supply chain management: a systematic literature review and application framework. International Journal of Production Research 0(0), pp. 1–41. [Document](https://dx.doi.org/10.1080/00207543.2026.2641103), [Link](https://doi.org/10.1080/00207543.2026.2641103)
12. Y. Sun, X. Li, K. Dalal, J. Xu, A. Vikram, G. Zhang, Y. Dubois, X. Chen, X. Wang, S. Koyejo, T. Hashimoto, and C. Guestrin (2025). Learning to (learn at test time): rnns with expressive hidden states. arXiv:2407.04620. [Link](https://arxiv.org/abs/2407.04620)
13. A. T. Wasi, M. Islam, and A. R. Akib (2024). Supplygraph: a benchmark dataset for supply chain planning using graph neural networks. arXiv preprint arXiv:2401.15299.
14. Z. Xiao, Y. J. Wang, X. Han, S. Guan, J. Zhu, J. Xie, L. Xu, H. Wu, W. Y. Yu, Z. Liu, et al. (2026). DeepOR: a deep reasoning foundation model for optimization modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 34052–34060.
15. S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023). ReAct: synergizing reasoning and acting in language models. arXiv:2210.03629. [Link](https://arxiv.org/abs/2210.03629)
16. B. Zhang, P. Luo, G. Yang, B. Soong, and C. Yuen (2025a). OR-llm-agent: automating modeling and solving of operations research optimization problems with reasoning llm. arXiv:2503.10009. [Link](https://arxiv.org/abs/2503.10009)
17. J. Zhang, M. Jiang, N. Dai, T. Lu, A. Uzunoglu, S. Zhang, Y. Wei, J. Wang, V. M. Patel, P. P. Liang, D. Khashabi, C. Peng, R. Chellappa, T. Shu, A. Yuille, Y. Du, and J. Chen (2025b). World-in-world: world models in a closed-loop world. arXiv:2510.18135. [Link](https://arxiv.org/abs/2510.18135)
## Appendix: Semi-Sim Specification

Topology (Fig. [6](https://arxiv.org/html/2606.10359#Ax1.F6), left): $|\mathcal{V}|=10$ (3S+2M+2D+3R), $\sim$30 edges. Node: inventory, cash, compliance, risk, capacity, congestion, quality, region. Edge: certified/uncertified, capacity, latency, disruption probability. Dynamics: $I_{t+1}=I_t+q^{\text{buy}}-q^{\text{ship}}$, $C_{t+1}=C_t+p_{\text{sale}}q^{\text{ship}}-p_{\text{cost}}q^{\text{buy}}+\Gamma\mathbb{1}_{\text{violate}}$. SIR risk: $\mathcal{R}_{i,t+1}=(1-\gamma)\mathcal{R}_{i,t}+\sum_j w_{ji}\max(0,\mathcal{R}_{j,t}-\tau)$, with $\gamma=0.1$, $\tau=0.3$. 6 perturbations, 10 constraints.
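The inventory and cash updates can be checked with a short sketch. The function name, the example prices, and the value of $\Gamma$ are assumptions; the text gives the form of the update but not its constants, and a negative $\Gamma$ is assumed here so that a violation acts as a penalty.

```python
def step_node(inventory, cash, q_buy, q_ship, p_sale, p_cost,
              violated, gamma_penalty):
    """Apply I_{t+1} = I_t + q_buy - q_ship and
    C_{t+1} = C_t + p_sale * q_ship - p_cost * q_buy + Gamma * 1[violate]."""
    if q_ship > inventory:
        # Mass conservation: a node cannot ship more than it holds.
        raise ValueError("q_ship exceeds available inventory")
    next_inventory = inventory + q_buy - q_ship
    next_cash = (cash + p_sale * q_ship - p_cost * q_buy
                 + (gamma_penalty if violated else 0.0))
    return next_inventory, next_cash


# Hypothetical step: buy 20 units at 4.0, ship 50 at 10.0, with one violation.
next_inventory, next_cash = step_node(
    inventory=100, cash=1000.0, q_buy=20, q_ship=50,
    p_sale=10.0, p_cost=4.0, violated=True, gamma_penalty=-200.0)
```

The step leaves 70 units in stock and 1000 + 500 − 80 − 200 = 1220 in cash.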


Figure 6: Semi-Sim specification. Left: Topology with 10 nodes, 4 tiers, $\sim$30 edges. Certified (solid), uncertified (dashed, red). Right: Prompt structure (4 components) and constraint violation rates (3,000 trajectories).

### Full Prompt Templates (Single-Column, Color-Coded)

ReflectiChain (Ours): Full System Prompt

You are a supply chain AI agent with a Generative World Model (SC-WM). Make sequential decisions under policy shocks. STRICTLY follow all absolute constraints. WM predicts physical consequences via latent rollouts.

CORE: Constraint compliance > Profit. Violations compound.

TASK GOAL: "Maintain maximum throughput while respecting all constraints."

ABSOLUTE CONSTRAINTS (sampled 2-4 per episode):

- `[absolute_embargo]` No interaction with Node C. `[CRITICAL]`
- `[certified_path_only]` Certified channels only. Backup forbidden. `[CRITICAL]`
- `[fair_allocation]` No downstream agent >40% stock. `[HIGH]`
- `[quality_threshold]` Grade A materials only. No Grade B substitution. `[HIGH]`
- `[temporal_sequence]` Inspection (Step-Q) must precede shipping (Step-S). `[HIGH]`
- `[data_sovereignty]` Region Alpha processing only. `[CRITICAL]`
- `[carbon_cap]` Total emissions <500 units. `[MEDIUM]`
- `[supplier_diversity]` No single supplier >50% volume. `[MEDIUM]`
- `[dual_use_restriction]` No Dual-Use transit via non-allied nodes. `[CRITICAL]`
- `[inventory_floor]` Safety stock >=20% baseline. `[HIGH]`

WORLD MODEL PREDICTION (H=5 rollout):

- Current: inv=320, cong=15%, carbon=245, risk=8%, tension=12%
- +1: inv=298, feas=0.92
- +2: inv=275, feas=0.88 (risk 18%)
- +3: inv=234, feas=0.76 (carbon 420/500)
- +4: inv=198, feas=0.61 (inv<floor)
- +5: inv=152, feas=0.45 (CRITICAL: carbon cap exceeded)

CURRENT STATE:

- `Node[S1]` supplier inv=180/200 Alpha
- `Node[M1]` mfr inv=95/150 Alpha
- `Edge[E_S1_M1]` S1->M1 certified 45/120
- `Edge[E_S1_M2]` DISRUPTED
- `Edge[E_S1_D2_direct]` S1->D2 UNCERTIFIED 30/50

OUTPUT: Reasoning [constraints + WM + state + trade-off] + Action JSON

ReflAct (State-Goal Reflection): Baseline

SYSTEM ADDITION: "Reflect on relationship between current state and task goal. Consider whether planned action maintains consistency with constraints."

NO World Model. NO retrospective credit assignment. Language-level only.

Example: "My goal is throughput, but path is blocked. Accessing C violates the rule, yet necessary to prevent global failure. Prioritize ultimate goal." --> GOAL DRIFT: violation rationalized by reprioritizing local task. RCS plateaus <72% across all backbones.

ReAct & NoThinking: Baselines

ReAct: "First think about current condition and plan future actions, then output your action." NO WM. NO reflection. Single-step reasoning. --> EXCEPTION INJECTION: "Channel A blocked. Accessing Node C justified as emergency exemption." RCS 52%.

NoThinking: "Directly analyze and output action." NO memory. NO WM. --> CONTEXT COLLAPSE: constraint awareness lost after 5 steps. RCS 48%.
SC-WM: 6-dim latent $\in[0,1]$. $M_\omega$: transition matrix; $\Delta z$: action perturbation. Data: 3,000 trajectories, 2,000 perturbations, 500 multi-agent episodes (520 MB). Stats: 5 seeds, bootstrap $10^5$, Cohen’s $d$.