What Will Happen Next: Large Models-Driven Deduction for Emergency Instances
Summary
This paper introduces WLDS, a large-model-driven system for simulating and deducing emergency instances by leveraging controllable randomness and cross-domain knowledge. It presents the Emergency Instances Deduction (EID) benchmark and demonstrates high-fidelity simulation capabilities across multiple domains.
# What Will Happen Next: Large Models-Driven Deduction for Emergency Instances
Source: [https://arxiv.org/html/2605.08599](https://arxiv.org/html/2605.08599)
###### Abstract
Traditional simulation methods reproduce past emergency instances through preset rules to assist people in risk assessment and emergency decision-making. However, because they lack randomness and diversity, and because emergency instances are scarce, existing simulation systems struggle to fully explore potential risks. In contrast, Large Models (LMs) can dynamically adjust generation strategies to introduce controllable randomness, while also possessing extensive prior knowledge and cross-domain knowledge transfer capabilities. Inspired by this, we propose the LMs-driven World Line Divergence System (WLDS), which enables diversified visualization and deduction of emergency instances in different domains. WLDS leverages LMs to deduce emergency instances along various development directions, and introduces factual calibration and logical calibration mechanisms to ensure factual accuracy and logical rigor during the deduction process. The interactive module lets users independently select deduction directions, which avoids potential hallucinations that are difficult for the system to identify. Furthermore, by introducing a visualization module, WLDS produces simulation and deduction that combine text and images, which enhances interpretability. Extensive experiments on the proposed Emergency Instances Deduction (EID) benchmark dataset demonstrate that WLDS achieves high-precision and high-fidelity simulation and deduction of emergency instances in multiple specific domains. Further experiments show that WLDS can generate additional emergency instance deduction data for users and support better decision-making in similar emergency instances in the future.
###### keywords:
Large models , simulation , deduction , interactive , emergency instances
Journal: Nuclear Physics B
Affiliations: [inst1] Zhengzhou University, Zhengzhou 450001, Henan, China; [inst2] Zhejiang University, Hangzhou 310058, Zhejiang, China
Graphical abstract: ![[Uncaptioned image]](https://arxiv.org/html/2605.08599v1/x1.png)
Highlights:
WLDS: a simulation and deduction system for emergency instances in few\-shot, multi\-domain professional settings\.
Text–image fused, user\-steerable interactive deduction with multi\-branch world lines\.
EID benchmark for emergency deduction, covering 10 domains and 4,300 three\-step branched samples with expert labels\.
Superior factual and logical consistency and higher scenario prediction accuracy, corroborated by expert evaluations\.
## 1 Introduction
Digital simulation technologies for real-world scenarios greatly facilitate understanding and reproducing the processes and logic underlying the evolution of real events[[17](https://arxiv.org/html/2605.08599#bib.bib30)]. By simulating real-world scenarios and deducing how events develop, simulation systems not only provide training materials for operators but also give decision-makers a basis for risk assessment[[33](https://arxiv.org/html/2605.08599#bib.bib11)]. Their performance therefore directly determines how accurately risks in a scenario are predicted and how effective emergency decision-making is[[40](https://arxiv.org/html/2605.08599#bib.bib46)].
Existing simulation technologies are already effective at modeling normal scenarios, such as crowd simulation[[2](https://arxiv.org/html/2605.08599#bib.bib40)]. However, they lack the ability to simulate and deduce emergency instances[[24](https://arxiv.org/html/2605.08599#bib.bib20),[49](https://arxiv.org/html/2605.08599#bib.bib23)], because traditional simulation technologies suffer from the following problems. (1) Lack of randomness: Simulation technologies for real-world scenarios can achieve digital mapping of physical entities. However, in terms of logic deduction and state evolution, they rely too heavily on preset rules and cannot model the randomness of event states or the diversity of event development paths in the physical world. (2) Lack of diversity: Specific domains such as autonomous driving and urban rail transit are characterized by high potential risk and by rare but severe emergencies. For example, in urban rail transit scenarios, fires are infrequent but may lead to severe consequences such as traffic paralysis and stampedes. Such emergency instances are crucial for improving the accuracy of simulation and deduction. Because relevant emergency instances are lacking, existing simulation and deduction technologies are ineffective and suffer from problems such as deviation from facts and illogical deduction.
Figure 1: Directly using LMs to simulate and deduce the process of autonomous driving. The result exhibits two types of hallucination issues: factual deviation and logical deviation.
In recent years, some studies have used Large Models (LMs) to dynamically adjust generation strategies and introduce controllable randomness, breaking the limitation of single scenario evolution patterns in traditional rule-driven simulations[[38](https://arxiv.org/html/2605.08599#bib.bib47),[13](https://arxiv.org/html/2605.08599#bib.bib48),[32](https://arxiv.org/html/2605.08599#bib.bib49)]. For example, Li et al.[[23](https://arxiv.org/html/2605.08599#bib.bib1)] designed the ChatSUMO system, which combines LMs with the traffic simulation platform SUMO. It automates the full process from natural language input to urban-level traffic scenario generation and supports customized operations such as traffic signal optimization and vehicle path adjustment. However, these studies do not simulate emergency instances. Figure [1](https://arxiv.org/html/2605.08599#S1.F1) shows the result of directly using LMs to simulate and deduce the process of autonomous driving. We observe that directly using LMs for simulation and deduction is prone to the following two types of hallucination issues. Factual deviation: Generated content that violates physical laws or domain specifications severely undermines the reliability of the simulation and deduction. As shown in Figure [1](https://arxiv.org/html/2605.08599#S1.F1), in Step 3 the autonomous driving system reports only 5% remaining battery, yet still claims that the vehicle can continue driving for 500 kilometers, which is physically impossible. Logical deviation: During the simulation and deduction process, logical flaws such as causal disconnection and broken element consistency lead to a lack of logical rigor. As shown in Figure [1](https://arxiv.org/html/2605.08599#S1.F1), in Step 1 the vehicle is already on a busy highway, but in Step 4 the LMs deduce that the same vehicle encounters a pedestrian crosswalk, which should not exist on a highway. Furthermore, owing to the extensive prior knowledge and cross-domain knowledge transfer capabilities of LMs[[44](https://arxiv.org/html/2605.08599#bib.bib16),[42](https://arxiv.org/html/2605.08599#bib.bib17),[43](https://arxiv.org/html/2605.08599#bib.bib18),[41](https://arxiv.org/html/2605.08599#bib.bib19)], they can be used to migrate event knowledge of emergency instances from other domains to the target domain, thereby alleviating the scarcity of emergency instances[[10](https://arxiv.org/html/2605.08599#bib.bib41),[9](https://arxiv.org/html/2605.08599#bib.bib42),[12](https://arxiv.org/html/2605.08599#bib.bib43),[11](https://arxiv.org/html/2605.08599#bib.bib44)].
A world line represents the trajectory of an object or event in spacetime[[20](https://arxiv.org/html/2605.08599#bib.bib45)]. Inspired by this concept, we propose the LMs-Driven World Line Divergence System (WLDS), which aims to achieve high-precision and high-fidelity simulation and deduction of emergency instances, thereby providing references for safety assessment and decision support. WLDS leverages the cross-domain transfer capability of LMs to migrate knowledge of emergency instances from other domains to the target domain, thereby generating initial emergency instances. Starting from an initial instance, WLDS then uses LMs to generate multiple world lines with different development directions and allows users to independently select their desired deduction direction. Meanwhile, WLDS introduces a dual calibration mechanism. The factual calibration mechanism dynamically aligns generated content with domain facts through real-time knowledge retrieval, so that each world line is factually reliable. The logical calibration mechanism uses a logical discriminator to dynamically evaluate whether the current event is logically consistent with the previous content, so that each world line has rigorous internal logic. To address the deficiencies of existing evaluation systems, we propose an automated evaluation mechanism that quantitatively evaluates the performance of WLDS using factual consistency and logical consistency. Moreover, we construct the Emergency Instances Deduction (EID) benchmark dataset to facilitate dynamic modeling and evaluation of emergency instances deduction. It consists of 10 sub-datasets covering a wide range of domains, from urban rail transit to autonomous driving and others. The experimental results show that, compared to the baseline model, WLDS improves factual consistency by 7.08 percentage points and logical consistency by 8.34 percentage points in the urban rail transit domain. In the EID-Chemical plant sub-dataset, the scenario prediction accuracy of WLDS is 8.50 percentage points higher than that of the baseline model. Additionally, in the autonomous driving domain, WLDS received a high rating of 4.8 points from the domain experts.
The main contributions of this paper can be summarized as follows:
1. We systematically analyze the problems of existing simulation systems in the simulation and deduction of emergency instances, and discuss the two types of hallucination issues (factual deviation and logical deviation) that LMs exhibit in this field.
2. We propose the LMs-Driven World Line Divergence System (WLDS), which combines factual calibration and logical calibration mechanisms to achieve high-precision and high-fidelity simulation and deduction of emergency instances through LMs.
3. We construct the EID benchmark dataset, which consists of 10 sub-datasets with a total of 4,300 data entries. This dataset provides high-quality data support for optimizing and evaluating models for emergency instances deduction.
4. We design an automated evaluation mechanism based on factual consistency and logical consistency. Extensive experiments demonstrate the effectiveness of WLDS in simulating and deducing emergency instances.
## 2 Related work
### 2.1 Scenario Generation Technology
Scenario generation is a key supporting technology for complex environment modeling\[[8](https://arxiv.org/html/2605.08599#bib.bib3),[26](https://arxiv.org/html/2605.08599#bib.bib4),[46](https://arxiv.org/html/2605.08599#bib.bib5)\], and existing methods can be broadly classified into model\-based and data\-driven approaches\[[7](https://arxiv.org/html/2605.08599#bib.bib6),[30](https://arxiv.org/html/2605.08599#bib.bib7)\]\.
Model-based methods can generate continuous scenarios through mathematical modeling or rule systems. For example, DiffScene[[39](https://arxiv.org/html/2605.08599#bib.bib15)] employs diffusion models combined with adversarial optimization to produce high-quality safety-critical scenarios. Bagschik et al.[[3](https://arxiv.org/html/2605.08599#bib.bib13)] proposed an ontology-based highway scenario generation method. Li et al.[[22](https://arxiv.org/html/2605.08599#bib.bib14)] introduced a biologically inspired approach involving the exchange and mutation of scenario elements.
Data-driven methods rely on large-scale scenario datasets, reproducing scenario characteristics and distributions by mining implicit information in the data[[5](https://arxiv.org/html/2605.08599#bib.bib21)]. For example, Thal et al.[[36](https://arxiv.org/html/2605.08599#bib.bib22)] generated high-coverage test cases based on real driving data. Bäumler et al.[[6](https://arxiv.org/html/2605.08599#bib.bib24)] fused accident data with video-based traffic observations to produce more representative test scenarios.
However, model-based methods are constrained by preset rules, and data-driven methods are restricted by the original data distribution, which makes it difficult to generate emergency instance deductions beyond existing patterns. To address this problem, WLDS introduces controllable randomness through LMs, dynamically adjusting generation strategies to enhance the diversity and randomness of emergency instances deduction.
### 2.2 Simulation and Deduction Technology
Simulation and deduction are core technologies for risk assessment and decision\-making\[[16](https://arxiv.org/html/2605.08599#bib.bib2),[28](https://arxiv.org/html/2605.08599#bib.bib8)\]\. Traditional simulation models can integrate multiple perspectives to support complex decision\-making\[[4](https://arxiv.org/html/2605.08599#bib.bib9),[14](https://arxiv.org/html/2605.08599#bib.bib10)\], but their static nature limits applicability to dynamic scenarios\.
Digital twin technology, a key component of Industry 4.0[[19](https://arxiv.org/html/2605.08599#bib.bib25)], continuously synchronizes with the physical system through real-time multi-source data[[15](https://arxiv.org/html/2605.08599#bib.bib28),[25](https://arxiv.org/html/2605.08599#bib.bib26),[1](https://arxiv.org/html/2605.08599#bib.bib27),[34](https://arxiv.org/html/2605.08599#bib.bib29)]. For instance, Padovano et al.[[31](https://arxiv.org/html/2605.08599#bib.bib31)] combined BIM and sensor data to build pedestrian flow simulations and used LSTM to predict congestion, triggering automated alerts that reduced emergency response time by 40%.
Nevertheless, existing studies predominantly focus on the deduction of normal instances and lack research on emergency instances. Because emergency instances are scarce, models struggle to learn their unique evolutionary patterns, which leads to biased predictions and ineffective emergency decision-making. WLDS alleviates this by transferring knowledge of emergency instances from other domains to the target domain to support emergency instances deduction.
### 2.3 LMs-Driven Simulation Technology
Traditional platforms such as RLBench[[21](https://arxiv.org/html/2605.08599#bib.bib34)] and CALVIN[[27](https://arxiv.org/html/2605.08599#bib.bib35)] rely on manual design or simple randomization, which cannot meet the demands of complex tasks. With their powerful semantic understanding and cross-modal reasoning capabilities, LMs provide new impetus for advancing simulation technologies[[35](https://arxiv.org/html/2605.08599#bib.bib12),[48](https://arxiv.org/html/2605.08599#bib.bib39)].
Recent research has explored integrating LMs with simulation[[47](https://arxiv.org/html/2605.08599#bib.bib36),[18](https://arxiv.org/html/2605.08599#bib.bib37),[45](https://arxiv.org/html/2605.08599#bib.bib38)]. Grutopia[[37](https://arxiv.org/html/2605.08599#bib.bib32)] constructs object–spatial relationship graphs for large-scale indoor scenario generation. RoboCasa[[29](https://arxiv.org/html/2605.08599#bib.bib33)] incorporates human demonstrations to optimize scenario layouts. LLMScenario[[8](https://arxiv.org/html/2605.08599#bib.bib3)] employs prompt engineering and evaluation–feedback tuning to expand extreme cases in natural driving scenarios.
However, in highly specialized domains such as autonomous driving and urban rail transit, LMs are prone to factual deviation and logical deviation\. To address this issue, WLDS introduces the dual calibration mechanism to ensure that generated content adheres to physical laws and maintains logical rigor while preserving diversity\.
## 3 Method
Figure 2: The framework of the proposed WLDS. Step 1: WLDS uses LMs to generate the initial emergency instance for the target domain based on knowledge bases from other domains. Step 2: WLDS takes a generated emergency instance as the initial instance and uses LMs to generate descriptions of multiple potential scenarios. Step 3: For the potential scenarios, WLDS applies the dual calibration mechanism to address factual deviation and logical deviation, after which users can select one scenario as the direction for deduction. Step 4: WLDS matches the corrected text against a keyframe image library. If matching fails, it invokes a text-to-image model to generate images, and filters out images with significant semantic deviations via a text-image alignment discriminator.
As shown in Figure [2](https://arxiv.org/html/2605.08599#S3.F2), we propose WLDS, which aims at high-precision and high-fidelity simulation and deduction of emergency instances. It comprises four core steps: emergency instances knowledge transformation, world line deduction, world line calibration, and world line visualization. This section elaborates on the implementation of each step.
### 3.1 Knowledge Transformation
To address the scarcity of emergency instances in specialized domains, we leverage the cross-domain knowledge transfer capability of LMs. We first collect emergency instances from different domains to construct an accident dataset $\mathcal{D}_{\mathrm{acc}}$, which serves as the knowledge base for transfer. Given a target domain $B$, WLDS employs the LMs to generate domain-specific emergency instances under a tailored prompt.
The generation prompt is as follows: “Please generate possible emergency instances descriptions of the domain $B$ based on the accident dataset.” The generation process can be formalized as:

$$e_{B}\sim p_{\theta}\big(\cdot\mid\mathcal{D}_{\mathrm{acc}},\mathcal{K}_{B},\pi\big), \tag{1}$$

where $e_{B}$ denotes an emergency instance in the target domain $B$, $p_{\theta}$ is the conditional probability distribution defined by the LMs with parameters $\theta$, $\mathcal{K}_{B}$ represents the domain-specific knowledge base, and $\pi$ denotes the prompt.
By iteratively applying this process, we obtain a set of $N$ generated domain-specific instances: $\mathcal{D}_{\mathrm{trans}}=\{e_{1},e_{2},\dots,e_{N}\}$. This transferred dataset $\mathcal{D}_{\mathrm{trans}}$ provides the foundation for subsequent world line deduction.
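The transfer step of Eq. (1) can be sketched as a loop that conditions an LM call on the accident dataset, the domain knowledge base, and the prompt. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_prompt`, `transfer_instances`, and the `generate` callable are hypothetical names, the way the context is concatenated is an assumption, and `stub_lm` merely stands in for an API-backed model.

```python
from typing import Callable, List

# Prompt wording taken from the text; {domain} plays the role of B.
PROMPT_TEMPLATE = (
    "Please generate possible emergency instances descriptions of the domain "
    "{domain} based on the accident dataset."
)


def build_prompt(domain: str, accidents: List[str], knowledge: List[str]) -> str:
    """Assemble (D_acc, K_B, pi) into one prompt string (layout is an assumption)."""
    parts = [PROMPT_TEMPLATE.format(domain=domain), "Accident dataset:"]
    parts.extend(f"- {a}" for a in accidents)
    parts.append("Domain knowledge:")
    parts.extend(f"- {k}" for k in knowledge)
    return "\n".join(parts)


def transfer_instances(
    generate: Callable[[str], str],
    domain: str,
    accidents: List[str],
    knowledge: List[str],
    n: int,
) -> List[str]:
    """Draw n instances e_1..e_N by calling the LM n times on the same prompt."""
    prompt = build_prompt(domain, accidents, knowledge)
    return [generate(prompt) for _ in range(n)]


def stub_lm(prompt: str) -> str:
    """Hypothetical stand-in for a real LM call; returns a placeholder string."""
    return f"[stub instance conditioned on {len(prompt)} prompt characters]"


d_trans = transfer_instances(
    stub_lm,
    domain="urban rail transit",
    accidents=["A fire in a road tunnel forces an evacuation."],
    knowledge=["Stations have emergency ventilation and evacuation routes."],
    n=3,
)
```

Passing the LM as a callable keeps the sketch independent of any particular model API; a real system would replace `stub_lm` with a sampling call.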
### 3.2 World Line Deduction
To generate diverse world lines, WLDS introduces controlled randomness into the LMs generation process. This randomness is governed by a temperature parameter $\tau_{k}$, which balances logical plausibility with deduction diversity. Given an initial event $s_{0}\in\mathcal{D}_{\mathrm{trans}}$, the LMs generate multiple possible subsequent scenarios under a domain-specific prompt:

$$s_{k}\sim p_{\theta,\tau_{k}}\big(\cdot\mid s_{0},\mathcal{K}_{B}\big),\quad k=1,2,\dots,M, \tag{2}$$

where $\mathcal{K}_{B}$ is the target domain knowledge base, and $\tau_{k}>0$ controls the diversity of generation. A larger $\tau_{k}$ yields more divergent scenarios, while a smaller $\tau_{k}$ leads to more deterministic outputs.
The randomness introduced by the temperature parameter $\tau_{k}$ can be formalized as:

$$p_{\theta,\tau_{k}}(w_{i}\mid h)=\frac{\exp\left(z_{i}(h)/\tau_{k}\right)}{\sum_{j}\exp\left(z_{j}(h)/\tau_{k}\right)}, \tag{3}$$

where $w_{i}$ is the $i$-th candidate token, $h$ denotes the current context, and $z_{i}(h)$ is the unnormalized logit for token $i$.
All generated scenarios form $\mathcal{S}=\{s_{1},s_{2},\dots,s_{M}\}$. The user then selects a scenario $s_{\mathrm{sel}}\in\mathcal{S}$ to form the initial world line: $W=[\,s_{0},\,s_{\mathrm{sel}}\,]$. This step provides a diverse foundation for subsequent calibration.
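The effect of the temperature in Eq. (3) can be checked numerically. The sketch below implements the temperature-scaled softmax over a toy logit vector and samples one token index from it; the function names and the example logits are illustrative assumptions, and the max-subtraction is only a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math
import random
from typing import List, Sequence


def temperature_softmax(logits: Sequence[float], tau: float) -> List[float]:
    """Eq. (3): p_i = exp(z_i / tau) / sum_j exp(z_j / tau), with tau > 0."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    shift = max(scaled)  # subtracting the max avoids overflow, same result
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: Sequence[float], tau: float, rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical logits z_i(h) for three tokens
sharp = temperature_softmax(toy_logits, tau=0.1)   # close to deterministic
neutral = temperature_softmax(toy_logits, tau=1.0)
flat = temperature_softmax(toy_logits, tau=10.0)   # close to uniform
```

With the same logits, a small $\tau_{k}$ concentrates almost all probability on the top token, while a large $\tau_{k}$ flattens the distribution, which is the mechanism that lets different $\tau_{k}$ values produce the $M$ divergent candidates of Eq. (2).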
### 3.3 World Line Calibration
In multi-step deduction, event sequences generated by the model are prone to deviations caused by insufficient knowledge coverage or broken logical chains, which may lead to results that violate physical laws or domain-specific common sense. To address this, WLDS applies the dual calibration mechanism after the initial construction of the world line, improving both its reliability and interpretability from two complementary perspectives: factual calibration and logical calibration.
(1) Factual Calibration: It focuses on ensuring the consistency between individual events and the domain knowledge base $\mathcal{K}_{B}$. For each event $s$, the system retrieves the most relevant fact $f(s)$ from $\mathcal{K}_{B}$ and computes a factual consistency score $\phi_{\mathrm{fact}}(s,f(s))\in[0,1]$. If $\phi_{\mathrm{fact}}(s,f(s))<\delta_{\mathrm{fact}}$, the event is revised under factual constraints:

$$s^{\prime}\sim p_{\theta}\big(\cdot\mid s,f(s),\mathcal{K}_{B}\big), \tag{4}$$

where $\delta_{\mathrm{fact}}$ is the factual consistency threshold. This step corrects explicit factual errors and ensures that each event is well-grounded in the underlying knowledge base.
(2) Logical Calibration: It targets the causal and temporal relationships between consecutive events. For each adjacent pair $(s_{i},s_{j})$, a logical consistency function $\psi_{\mathrm{logic}}(s_{i},s_{j})\in\{\text{valid},\text{invalid}\}$ is applied. If $\psi_{\mathrm{logic}}(s_{i},s_{j})=\text{invalid}$, the subsequent event is regenerated with logic calibration:

$$s^{\prime\prime}\sim p_{\theta}\big(\cdot\mid s_{i},\mathcal{K}_{B},\text{logic\_fix}\big). \tag{5}$$

This mechanism mitigates accumulated reasoning errors, avoiding illogical jumps or contradictions in the world line.
(3) World Line Update: After both factual and logical calibration, the updated world line is:

$$W^{*}=[\,s_{0},\,s_{\mathrm{calibrated}}\,], \tag{6}$$

where $s_{\mathrm{calibrated}}$ satisfies both factual and logical constraints.
By combining these two forms of calibration, WLDS significantly mitigates the potential hallucination issue that LMs may exhibit, producing world lines that are both factually accurate and logically coherent\.
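The dual calibration described above can be sketched as a small control loop. Everything here is a hedged illustration: the scorer, retriever, and regeneration callables are hypothetical stand-ins for $\phi_{\mathrm{fact}}$, $f(\cdot)$, $\psi_{\mathrm{logic}}$, and the LM calls of Eq. (4) and Eq. (5); the threshold value and the retry limit `max_rounds` are assumptions, since the text does not specify how often a revised event is re-checked. The toy stubs reuse the two failures from Figure 1.

```python
from typing import Callable, Tuple


def calibrate_event(
    prev_event: str,
    event: str,
    retrieve_fact: Callable[[str], str],
    fact_score: Callable[[str, str], float],
    is_logical: Callable[[str, str], bool],
    revise_with_fact: Callable[[str, str], str],
    regenerate_logical: Callable[[str], str],
    delta_fact: float = 0.5,  # assumed threshold value
    max_rounds: int = 3,      # assumed retry limit
) -> Tuple[str, bool]:
    """Return (calibrated event, whether both checks finally passed)."""
    for _ in range(max_rounds):
        fact = retrieve_fact(event)
        if fact_score(event, fact) < delta_fact:
            event = revise_with_fact(event, fact)   # Eq. (4)
            continue
        if not is_logical(prev_event, event):
            event = regenerate_logical(prev_event)  # Eq. (5)
            continue
        return event, True
    return event, False


# Toy stubs (hypothetical) mirroring the Figure 1 failures.
def retrieve_fact(event: str) -> str:
    return "At 5% battery the remaining driving range is short."

def fact_score(event: str, fact: str) -> float:
    return 0.0 if "500 kilometers" in event else 1.0

def revise_with_fact(event: str, fact: str) -> str:
    return event.replace("500 kilometers", "a short distance")

def is_logical(prev_event: str, event: str) -> bool:
    return not ("highway" in prev_event and "crosswalk" in event)

def regenerate_logical(prev_event: str) -> str:
    return "The vehicle slows down and heads for the nearest exit."


s0 = "The vehicle is driving on a busy highway."
fixed_fact, ok_fact = calibrate_event(
    s0, "Battery is at 5%, but the vehicle can continue for 500 kilometers.",
    retrieve_fact, fact_score, is_logical, revise_with_fact, regenerate_logical,
)
fixed_logic, ok_logic = calibrate_event(
    s0, "The vehicle meets a pedestrian crosswalk.",
    retrieve_fact, fact_score, is_logical, revise_with_fact, regenerate_logical,
)
```

Re-running both checks after every revision is one possible design; it guards against a logic fix that reintroduces a factual error, at the cost of extra LM calls.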
### 3.4 World Line Visualization
To make the deduced world line more interpretable and accessible, WLDS incorporates a text\-image integrated visualization mechanism\. This mechanism not only generates keyframe images that are highly aligned with the semantics of each event, but also enhances the user’s perception and understanding of scenario evolution through multimodal fusion\.
First, a keyframe image library is constructed: $\mathcal{I}=\{I_{1},I_{2},\dots,I_{P}\}$, and a text-image alignment function $\alpha(s,I)\in[0,1]$ with a matching threshold $\delta_{\mathrm{align}}$ is defined. For each event $s$, the maximum alignment score is computed:

$$\alpha_{\max}(s)=\max_{I\in\mathcal{I}}\alpha(s,I). \tag{7}$$

Figure 3: The simulation and deduction results of the proposed WLDS in the aircraft’s emergency instance. The red part represents the world line selected by the user.
If $\alpha_{\max}(s)<\delta_{\mathrm{align}}$, a new candidate image $\hat{I}\sim p_{\varphi}(\cdot\mid s,\mathcal{K}_{B})$ is generated using a text-to-image model and added to the extended candidate set $\mathcal{I}^{+}(s)$. This ensures that events without suitable existing images can still be visually represented.
The final keyframe selection is:

$$I^{*}(s)=\begin{cases}\arg\max\limits_{I\in\mathcal{I}^{+}(s)}\alpha(s,I),&\text{if }\max\limits_{I\in\mathcal{I}^{+}(s)}\alpha(s,I)\geq\delta_{\mathrm{align}},\\[6pt] \varnothing,&\text{otherwise},\end{cases} \tag{8}$$

where $\mathcal{I}^{+}(s)$ is the extended candidate set.
The visualized world line can be formalized as:

$$V=\big[(s_{0},I^{*}(s_{0})),\;(s_{1},I^{*}(s_{1})),\;\dots,\;(s_{Q},I^{*}(s_{Q}))\big]. \tag{9}$$

Finally, the visualization sequence $\mathcal{V}=\{V_{0},V_{1},\ldots,V_{Q}\}$ is constructed by synchronizing the filtered keyframes with the text descriptions of events in $\mathcal{W}^{*}$. This multimodal representation enables intuitive visualization of the world line’s evolution. A demo of multiple world lines output by WLDS is shown in Figure [3](https://arxiv.org/html/2605.08599#S3.F3).
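The keyframe selection of Eq. (7) and Eq. (8) reduces to a retrieve-then-generate-then-filter routine. In the sketch below, images are represented by caption strings and the alignment function is a word-overlap (Jaccard) score; both are assumptions made purely so that the example runs, whereas a real system would use an image library and a learned text-image alignment model. `select_keyframe` and `generate_image` are hypothetical names.

```python
from typing import Callable, List, Optional


def jaccard_align(event: str, image_caption: str) -> float:
    """Toy stand-in for alpha(s, I): word overlap between two strings, in [0, 1]."""
    a, b = set(event.lower().split()), set(image_caption.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_keyframe(
    event: str,
    library: List[str],
    align: Callable[[str, str], float],
    generate_image: Callable[[str], str],
    delta_align: float,
) -> Optional[str]:
    """Return I*(s), or None when no candidate reaches the threshold."""
    # Eq. (7): best alignment score over the existing library.
    alpha_max = max((align(event, img) for img in library), default=0.0)
    candidates = list(library)
    if alpha_max < delta_align:
        # Matching failed: extend the candidate set with a generated image.
        candidates.append(generate_image(event))
    if not candidates:
        return None
    # Eq. (8): keep the best candidate only if it passes the threshold.
    best = max(candidates, key=lambda img: align(event, img))
    return best if align(event, best) >= delta_align else None


library = ["train at platform", "fire in tunnel with smoke"]
hit = select_keyframe("fire in tunnel", library, jaccard_align,
                      lambda s: "generated " + s, 0.5)
generated = select_keyframe("flooded track", library, jaccard_align,
                            lambda s: "generated " + s, 0.5)
rejected = select_keyframe("flooded track", library, jaccard_align,
                           lambda s: "unrelated picture", 0.5)
```

The third call shows the filtering role of the threshold: a generated image that does not align with the event is discarded rather than shown.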
### 3.5 Design of Evaluation Metrics
To comprehensively evaluate the performance of WLDS, we consider two complementary dimensions: factual consistency (FC) and logical consistency (LC).
(1) Factual Consistency (FC): Let $\mathcal{E}=\{e_{1},e_{2},\dots,e_{T}\}$ be the set of events in a world line, and let $\phi_{\mathrm{fact}}(e)\in\{0,1\}$ indicate whether event $e$ aligns with the domain knowledge base. FC is defined as:

$$\mathrm{FC}=\frac{\left|\{e\in\mathcal{E}\mid\phi_{\mathrm{fact}}(e)=1\}\right|}{|\mathcal{E}|}, \tag{10}$$

where the numerator counts factually consistent events, and the denominator is the total number of events.
This metric measures the proportion of events that are factually correct, reflecting the system’s accuracy and reliability under knowledge constraints.
(2) Logical Consistency (LC): Let $\mathcal{P}=\{(e_{i},e_{i+1})\}_{i=1}^{T-1}$ be the set of adjacent event pairs, and let $\psi_{\mathrm{logic}}(e_{i},e_{j})\in\{0,1\}$ indicate whether the pair is logically valid. LC is defined as:

$$\mathrm{LC}=\frac{\left|\{(e_{i},e_{j})\in\mathcal{P}\mid\psi_{\mathrm{logic}}(e_{i},e_{j})=1\}\right|}{|\mathcal{P}|}. \tag{11}$$

LC evaluates the stability and coherence of the reasoning chain across multi-step deductions.
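Both metrics are simple ratios and can be computed directly from Eq. (10) and Eq. (11). The sketch below assumes the per-event and per-pair judgments are supplied as callables; the function names and the four-step toy world line, loosely modeled on Figure 1, are illustrative assumptions rather than data from the paper.

```python
from typing import Callable, Sequence


def factual_consistency(
    events: Sequence[str], is_factual: Callable[[str], bool]
) -> float:
    """Eq. (10): share of events judged factually consistent."""
    if not events:
        raise ValueError("a world line needs at least one event")
    return sum(1 for e in events if is_factual(e)) / len(events)


def logical_consistency(
    events: Sequence[str], is_valid_pair: Callable[[str, str], bool]
) -> float:
    """Eq. (11): share of adjacent pairs (e_i, e_{i+1}) judged logically valid."""
    pairs = list(zip(events, events[1:]))
    if not pairs:
        raise ValueError("a world line needs at least two events")
    return sum(1 for a, b in pairs if is_valid_pair(a, b)) / len(pairs)


# Toy world line with one factual error (step 3) and one logical error (step 4).
world_line = [
    "vehicle enters a busy highway",
    "battery warning appears",
    "battery at 5% with 500 km of range left",
    "vehicle meets a crosswalk",
]
fc = factual_consistency(world_line, lambda e: "500 km" not in e)
lc = logical_consistency(world_line, lambda a, b: "crosswalk" not in b)
```

Here three of four events are factual and two of three adjacent pairs are valid, so a world line with $T$ events contributes $T$ terms to FC but only $T-1$ terms to LC.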
Both FC and LC take values in $[0,1]$, with higher scores indicating better factual alignment and logical coherence. This dual-metric design provides a robust and interpretable basis for quantitative performance analysis. The entire workflow of WLDS is summarized in Algorithm 1.
Algorithm 1: Workflow of the LMs-driven World Line Divergence System (WLDS)

Input: Accident dataset $\mathcal{D}_{\mathrm{acc}}$, domain-specific knowledge base $\mathcal{K}_{B}$, prompt template $\pi$

Output: Visualized world line $V$, evaluation metrics (FC, LC)

1. Generate domain-specific emergency instances using Eq. [1](https://arxiv.org/html/2605.08599#S3.E1), and construct the transferred dataset $\mathcal{D}_{\mathrm{trans}}$.
2. Select an initial event $s_{0}\in\mathcal{D}_{\mathrm{trans}}$, generate candidate scenarios under different temperature settings by Eq. [2](https://arxiv.org/html/2605.08599#S3.E2) and Eq. [3](https://arxiv.org/html/2605.08599#S3.E3), and form the candidate set $\mathcal{S}$ and the initial world line $W$.
3. For each event, compute the factual consistency score $\phi_{\mathrm{fact}}(s,f(s))$. If $\phi_{\mathrm{fact}}(s,f(s))<\delta_{\mathrm{fact}}$, revise the event using Eq. [4](https://arxiv.org/html/2605.08599#S3.E4).
4. For each adjacent event pair, check logical validity by $\psi_{\mathrm{logic}}(s_{i},s_{j})$. If invalid, regenerate the subsequent event using Eq. [5](https://arxiv.org/html/2605.08599#S3.E5), and update the world line as in Eq. [6](https://arxiv.org/html/2605.08599#S3.E6).
5. For each event, compute the maximum alignment score by Eq. [7](https://arxiv.org/html/2605.08599#S3.E7). If it is below $\delta_{\mathrm{align}}$, generate an additional image using a text-to-image model and select the final keyframe by Eq. [8](https://arxiv.org/html/2605.08599#S3.E8). Construct the visualized world line $V$ using Eq. [9](https://arxiv.org/html/2605.08599#S3.E9).
6. Compute factual consistency (FC) and logical consistency (LC) using Eq. [10](https://arxiv.org/html/2605.08599#S3.E10) and Eq. [11](https://arxiv.org/html/2605.08599#S3.E11).
7. Return $V$, (FC, LC).
## 4 Experiment
In our experiments, we aim to: (1) evaluate whether WLDS can achieve high-precision and high-fidelity simulation and deduction in specific domains, represented by carrier-based aircraft and urban rail transit, (2) assess the effectiveness of the factual calibration mechanism in calibrating factually deviating content in real time using domain knowledge bases, (3) evaluate the effectiveness of the logical calibration mechanism in calibrating logically deviating content by analyzing causal relationships, and (4) validate the effectiveness of the EID benchmark dataset for the performance evaluation and optimization of emergency instances deduction models. All experiments were run on two A6000 GPUs. The professional knowledge base used by the factual calibration mechanism is constructed from professional books and instruction manuals. The code and data for the proposed method are provided for research purposes. (Footnote 1: Code is included in the supplemental material and will be released upon acceptance of the paper.)
### 4.1 Introduction of EID benchmark dataset
Figure 4: Statistical distribution of the 10 sub-datasets of EID benchmark dataset.

Table 1: Quantitative comparison between WLDS and baseline models in the 10 sub-datasets of EID benchmark dataset. The evaluated models include WLDS+H (WLDS combined with Hunyuan-Turbos), WLDS+G (WLDS combined with GLM-4-Plus), and WLDS+Q (WLDS combined with Qwen-Max). The performance of each model is measured in terms of FC, LC and the scenario prediction accuracy on the sub-datasets of EID benchmark dataset (column EID).

| Datasets | Methods | FC | LC | EID |
| --- | --- | --- | --- | --- |
| Carrier-based aircraft | Hunyuan | 85.11% | 82.92% | 84.20% |
| | WLDS+H | 90.42% | 89.58% | 88.60% |
| | GLM | 85.42% | 83.75% | 85.40% |
| | WLDS+G | 89.33% | 87.43% | 89.20% |
| | Qwen | 86.25% | 82.08% | 84.80% |
| | WLDS+Q | 90.29% | 88.24% | 88.60% |
| Urban rail transit | Hunyuan | 86.67% | 83.33% | 82.40% |
| | WLDS+H | 93.75% | 91.67% | 90.20% |
| | GLM | 87.08% | 85.42% | 81.20% |
| | WLDS+G | 92.29% | 90.62% | 89.60% |
| | Qwen | 88.33% | 84.58% | 80.60% |
| | WLDS+Q | 91.43% | 91.18% | 88.60% |
| Autonomous driving | Hunyuan | 90.00% | 88.75% | 85.20% |
| | WLDS+H | 94.17% | 93.33% | 90.80% |
| | GLM | 89.19% | 87.50% | 84.00% |
| | WLDS+G | 93.75% | 91.17% | 87.80% |
| | Qwen | 90.32% | 87.88% | 85.80% |
| | WLDS+Q | 93.55% | 91.29% | 89.80% |
| Construction site | Hunyuan | 87.50% | 84.17% | 82.22% |
| | WLDS+H | 92.92% | 91.25% | 89.40% |
| | GLM | 86.25% | 84.58% | 81.56% |
| | WLDS+G | 92.57% | 90.91% | 88.22% |
| | Qwen | 87.14% | 83.75% | 80.89% |
| | WLDS+Q | 91.84% | 90.62% | 89.11% |
| Chemical plant | Hunyuan | 87.10% | 85.00% | 82.75% |
| | WLDS+H | 91.25% | 90.42% | 90.75% |
| | GLM | 88.33% | 84.83% | 83.00% |
| | WLDS+G | 91.98% | 89.58% | 91.50% |
| | Qwen | 88.75% | 85.96% | 82.75% |
| | WLDS+Q | 92.56% | 90.32% | 89.25% |
| Nuclear power plant | Hunyuan | 85.83% | 85.71% | 82.00% |
| | WLDS+H | 90.32% | 90.00% | 89.14% |
| | GLM | 85.71% | 86.11% | 83.71% |
| | WLDS+G | 89.97% | 90.62% | 87.43% |
| | Qwen | 85.42% | 84.58% | 82.86% |
| | WLDS+Q | 89.43% | 89.57% | 88.29% |
| Unmanned boat | Hunyuan | 88.57% | 87.21% | 86.60% |
| | WLDS+H | 93.33% | 91.41% | 90.60% |
| | GLM | 89.58% | 87.10% | 86.40% |
| | WLDS+G | 92.29% | 91.14% | 91.20% |
| | Qwen | 88.33% | 86.67% | 87.20% |
| | WLDS+Q | 93.57% | 92.50% | 90.40% |
| Mine | Hunyuan | 87.08% | 82.92% | 85.71% |
| | WLDS+H | 92.08% | 90.32% | 88.57% |
| | GLM | 85.83% | 84.17% | 84.29% |
| | WLDS+G | 91.18% | 90.32% | 88.00% |
| | Qwen | 86.67% | 83.33% | 85.43% |
| | WLDS+Q | 90.29% | 89.86% | 89.43% |
| Biochemical laboratory | Hunyuan | 87.33% | 86.01% | 86.00% |
| | WLDS+H | 91.67% | 90.91% | 89.75% |
| | GLM | 86.67% | 87.50% | 86.25% |
| | WLDS+G | 91.14% | 91.71% | 89.75% |
| | Qwen | 85.42% | 84.07% | 84.75% |
| | WLDS+Q | 89.74% | 88.29% | 88.75% |
| Automated port | Hunyuan | 89.07% | 87.54% | 87.43% |
| | WLDS+H | 91.55% | 92.92% | 90.85% |
| | GLM | 88.57% | 87.92% | 86.29% |
| | WLDS+G | 91.71% | 90.83% | 90.29% |
| | Qwen | 89.58% | 86.58% | 87.71% |
| | WLDS+Q | 92.05% | 91.88% | 91.43% |

To enhance the multi\-step deduction capability of reasoning models for emergency instances, we propose the Emergency Instances Deduction \(EID\) benchmark dataset, which focuses on multi\-step emergency instances deduction\. As shown in Figure [4](https://arxiv.org/html/2605.08599#S4.F4), the EID benchmark dataset contains 4,300 high\-quality three\-step deduction data entries of emergency instances, covering 10 sub\-datasets such as EID\-Carrier\-based aircraft and EID\-Urban rail transit\. Each data entry starts from an initial emergency instance and forms 14 diverse branch scenarios through a three\-stage structured deduction\. The human\-annotated labels include the most probable scenario, as well as the probability and loss severity of each world line\. This dataset fills a gap in existing benchmark datasets, which lack dynamic modeling of emergency instances deduction\. Furthermore, it provides high\-quality data for training and evaluating the multi\-step reasoning capability, scenario prediction accuracy, and domain\-adaptive reasoning performance of emergency instances deduction models\.
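The entry structure described above \(an initial emergency instance, branching scenarios, and per\-world\-line labels\) can be pictured as a small tree\. The following is a minimal sketch only: the names `Scenario`, `EIDEntry`, `world_lines` and `most_probable` are hypothetical, the example texts and numbers are invented for illustration, and the released dataset may use a different schema and branching layout\.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Scenario:
    """One deduced scenario, i.e. a node in the deduction tree (hypothetical schema)."""
    text: str
    children: List["Scenario"] = field(default_factory=list)
    # Labels of a complete world line; stored on its last scenario for convenience.
    probability: Optional[float] = None
    loss_severity: Optional[float] = None


@dataclass
class EIDEntry:
    """An initial emergency instance plus everything deduced from it."""
    domain: str
    initial_instance: Scenario

    def world_lines(self) -> List[List[Scenario]]:
        """Return every root-to-leaf path; each path is one world line."""
        paths: List[List[Scenario]] = []

        def walk(node: Scenario, prefix: List[Scenario]) -> None:
            prefix = prefix + [node]
            if not node.children:
                paths.append(prefix)
            for child in node.children:
                walk(child, prefix)

        walk(self.initial_instance, [])
        return paths

    def most_probable(self) -> List[Scenario]:
        """World line whose final scenario carries the highest probability label."""
        return max(self.world_lines(), key=lambda p: p[-1].probability or 0.0)


# Illustrative values only; they are not taken from the dataset.
root = Scenario(
    "A waste bin on the subway platform caught fire",
    children=[
        Scenario("Staff extinguish the fire", probability=0.6, loss_severity=0.1),
        Scenario(
            "The fire spreads",
            children=[
                Scenario("Passengers are evacuated", probability=0.3, loss_severity=0.5),
                Scenario("A stampede occurs", probability=0.1, loss_severity=0.9),
            ],
        ),
    ],
)
entry = EIDEntry(domain="urban rail transit", initial_instance=root)
```

Representing world lines as root\-to\-leaf paths keeps shared prefixes in one place, so a label attached to a leaf describes the whole path leading to it\.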
### 4\.2 Quantitative Analysis of WLDS
In the quantitative experiment, Fact Consistency \(FC\) and Logical Consistency \(LC\) are employed to verify the effectiveness of WLDS along the two dimensions of fact and logic\. In addition, comparative experiments between WLDS and three LMs \(Hunyuan\-Turbos, GLM\-4\-Plus, and Qwen\-Max\) are conducted to verify the supporting role of the EID benchmark dataset in the performance evaluation of reasoning models\. All models are invoked through APIs\.
The experimental results are shown in Table [1](https://arxiv.org/html/2605.08599#S4.T1)\. Averaged over the three WLDS variants and all 10 domains, WLDS achieves an FC of 91\.75% and an LC of 90\.66%, and every WLDS variant outperforms its corresponding baseline LM in both FC and LC in every domain, demonstrating its effectiveness in emergency instances deduction\. Specifically, in the domain of urban rail transit, WLDS shows clear improvements over the baseline model\. The FC and LC of Hunyuan\-Turbos are 86\.67% and 83\.33% respectively\. In contrast, when combined with Hunyuan\-Turbos, WLDS achieves FC and LC of 93\.75% and 91\.67%, representing absolute improvements of 7\.08 and 8\.34 percentage points\. This improvement indicates that WLDS can effectively enhance factual consistency and logical consistency during the deduction process in complex emergency instances deduction\. Furthermore, WLDS\+G achieves the highest scenario prediction accuracy of 91\.50% in EID\-Chemical plant, an improvement of 8\.50 percentage points over the baseline model GLM\. This result not only supports WLDS’s robustness in multi\-step emergency instances deduction but also highlights the supporting role of the EID benchmark dataset in performance evaluation\. Additionally, the FC of 94\.17% and LC of 93\.33% obtained by WLDS\+H in the autonomous driving domain are the best results across all tested domains\. WLDS\+H also maintains an FC above 90% in every other domain, further indicating its stability and reliability in emergency instances deduction\. The strong performance of WLDS across these specific domains reflects its capabilities in emergency instances deduction, providing a new approach for simulation and deduction in various domains\.
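The aggregate figures quoted from Table 1 can be reproduced with a few lines of arithmetic\. The sketch below copies the FC and LC values of the three WLDS variants from Table 1 and recomputes the averages\. It also separates an absolute gain \(a difference between two scores, in percentage points\) from a relative gain \(that difference divided by the baseline\)\. The helper name `gains` is hypothetical\.

```python
# (FC, LC) in percent for WLDS+H, WLDS+G, WLDS+Q, copied from Table 1.
WLDS_SCORES = {
    "Carrier-based aircraft": [(90.42, 89.58), (89.33, 87.43), (90.29, 88.24)],
    "Urban rail transit": [(93.75, 91.67), (92.29, 90.62), (91.43, 91.18)],
    "Autonomous driving": [(94.17, 93.33), (93.75, 91.17), (93.55, 91.29)],
    "Construction site": [(92.92, 91.25), (92.57, 90.91), (91.84, 90.62)],
    "Chemical plant": [(91.25, 90.42), (91.98, 89.58), (92.56, 90.32)],
    "Nuclear power plant": [(90.32, 90.00), (89.97, 90.62), (89.43, 89.57)],
    "Unmanned boat": [(93.33, 91.41), (92.29, 91.14), (93.57, 92.50)],
    "Mine": [(92.08, 90.32), (91.18, 90.32), (90.29, 89.86)],
    "Biochemical laboratory": [(91.67, 90.91), (91.14, 91.71), (89.74, 88.29)],
    "Automated port": [(91.55, 92.92), (91.71, 90.83), (92.05, 91.88)],
}

pairs = [pair for rows in WLDS_SCORES.values() for pair in rows]
avg_fc = sum(fc for fc, _ in pairs) / len(pairs)
avg_lc = sum(lc for _, lc in pairs) / len(pairs)


def gains(baseline: float, improved: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Urban rail transit: Hunyuan-Turbos alone versus WLDS+H.
fc_abs, fc_rel = gains(86.67, 93.75)
lc_abs, lc_rel = gains(83.33, 91.67)

print(f"average FC = {avg_fc:.2f}%, average LC = {avg_lc:.2f}%")
print(f"FC gain: {fc_abs:.2f} points ({fc_rel:.1f}% relative)")
print(f"LC gain: {lc_abs:.2f} points ({lc_rel:.1f}% relative)")
```

Under this reading, 7\.08 and 8\.34 are differences in percentage points, and the corresponding relative gains over the Hunyuan\-Turbos baseline are roughly 8\.2% and 10\.0%\.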
### 4\.3 Performance of WLDS
Figure 5: The results of emergency instances deduction by WLDS in the urban rail transit domain\. The red path denotes the user\-selected world line\.

Figure [5](https://arxiv.org/html/2605.08599#S4.F5) presents the simulation and deduction results of WLDS for an emergency instance in the urban rail transit domain\. WLDS takes “A waste bin on the subway platform caught fire, emitting thick smoke” as the initial emergency instance and generates 7 world lines, named World Line 0 to World Line 6\. Under the same initial emergency instance, each world line incorporates controlled randomness and diversity, producing multi\-stage evolutionary processes that range from mildly controllable fire situations to high\-risk states involving crowd stampedes\.
\(a\)Visualization of the probability and loss severity of the WLDS\-generated world lines\.
\(b\)Knowledge graph constructed from the seven world lines\.
Figure 6: WLDS outputs: \(a\) probability and loss severity across the seven generated world lines; \(b\) the corresponding knowledge graph\.

From the perspective of event evolution logic, the 7 world lines depicted in Figure [5](https://arxiv.org/html/2605.08599#S4.F5) span a range from low risk to high risk, with distinct differences in response strategies\. World Line 0 represents a typical low\-risk scenario, where the fire was effectively extinguished by staff using fire extinguishers at the initial stage, enabling the rapid restoration of normal operations\. World Line 1 is a low\-to\-moderate risk scenario in which the initial fire suppression failed to fully control the fire, leading to the activation of emergency plans and passenger evacuation\. After the fire department extinguished the fire, the smoke exhaust system was activated for post\-incident handling\. Both World Line 2 and World Line 3 represent moderate\-risk scenarios, where the fire spread rapidly, forcing trains to stop entering the station\. The fire department conducted extinguishing operations using equipment such as dry powder and carbon dioxide fire extinguishers\. The difference is that World Line 2 resumed operations directly after fire extinguishment and smoke exhaust, while World Line 3 implemented equipment inspection and platform maintenance after fire control to improve subsequent operational safety\. World Line 4 shows a risk\-recession path: after heavy smoke triggered the fire alarm system and passengers were evacuated, staff found that the fire had extinguished itself, and then shifted to fire cause investigation and full\-station safety inspection to eliminate potential hazards\.

World Line 5 is a moderate\-to\-high risk scenario, where fire expansion prompted simultaneous evacuation and fire\-fighting actions, with operations resumed after extinguishment and confirmation of no re\-ignition risk\. World Line 6 represents the highest\-risk scenario: during a fire development process similar to that of World Line 5, a stampede occurred during the evacuation phase\. In addition to fire extinguishment, the fire department also assisted in crowd management and safe evacuation, reflecting the need for multi\-departmental emergency coordination in complex disaster scenarios\.
From the perspectives of factual accuracy and operational standardization, each step complies with subway safety regulations\. Direct fire suppression in low\-risk scenarios aligns with the principle of on\-site rapid disposal for initial fires\. Procedures such as operation suspension, evacuation, smoke exhaust, and equipment inspection in moderate\-to\-high risk scenarios are consistent with subway operational safety standards\. The crowd control and medical assistance in stampede incidents meet the emergency response requirements for sudden crowd accidents\.
Figure 7: Partial screenshots of a conventional urban rail transit simulation system\.

In addition, for the 7 world lines deduced by WLDS, we employed the GLM4\-9B model, trained on the EID benchmark dataset, to analyze each world line from the dimensions of probability and loss severity, and visualized the results to assist the user in understanding each world line\. As shown in Figure [6a](https://arxiv.org/html/2605.08599#S4.F6.sf1), the 7 world lines generated by WLDS exhibit different probability and loss severity\. This demonstrates that WLDS not only generates diverse deduction directions but also presents a broader range of risk evolution patterns while maintaining high fidelity\.
As shown in Figure\.[6b](https://arxiv.org/html/2605.08599#S4.F6.sf2), WLDS generates a knowledge graph representation of the 7 deduced world lines\. The nodes cover multiple categories such as objects, roles, devices, events, phenomena, and states, while the edges represent their logical and causal relationships\. Through this knowledge graph, the causal chains and interactions of emergency instances under different deduction paths can be more intuitively revealed, thereby helping users understand the complex logical dynamics in multi\-path deductions\.
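A graph of this kind, with typed nodes and labelled edges, can be sketched in a few lines\. The example below is an illustration only, not the representation used by WLDS: the six category names follow the text, while the class `KnowledgeGraph`, its methods, the node names and the relation labels \(`uses`, `extinguishes`, `catches`, `emits`\) are hypothetical\.

```python
from collections import defaultdict

# Node categories named in the text.
CATEGORIES = {"object", "role", "device", "event", "phenomenon", "state"}


class KnowledgeGraph:
    """Typed nodes plus directed, labelled edges (hypothetical sketch)."""

    def __init__(self):
        self.category = {}                   # node name -> category
        self.out_edges = defaultdict(list)   # head -> [(relation, tail), ...]

    def add_node(self, name, category):
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self.category[name] = category

    def add_edge(self, head, relation, tail):
        if head not in self.category or tail not in self.category:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[head].append((relation, tail))

    def reachable(self, start):
        """Nodes reachable from `start` by following edges, in depth-first order."""
        seen, stack = [], [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.append(node)
            # Reverse so that edges are explored in insertion order.
            stack.extend(tail for _, tail in reversed(self.out_edges[node]))
        return seen


kg = KnowledgeGraph()
for name, cat in [
    ("waste bin", "object"),
    ("staff", "role"),
    ("fire extinguisher", "device"),
    ("fire", "event"),
    ("thick smoke", "phenomenon"),
]:
    kg.add_node(name, cat)

kg.add_edge("waste bin", "catches", "fire")
kg.add_edge("fire", "emits", "thick smoke")
kg.add_edge("staff", "uses", "fire extinguisher")
kg.add_edge("fire extinguisher", "extinguishes", "fire")

print(kg.reachable("staff"))
```

Following edges from a single node, as `reachable` does, is one simple way to read off a chain of related entities along a deduction path\.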
Table 2: Results of the ablation study, comparing WLDS performance after removing logical calibration or factual calibration, demonstrating the contribution of each core module\.

Figure [7](https://arxiv.org/html/2605.08599#S4.F7) illustrates the performance of the existing traditional urban rail transit simulation system\. From the figure, the shortcomings can be clearly observed: interpenetration between human models, visually unrealistic flame effects, duplicated avatars leading to a lack of character distinctiveness, and human postures that deviate from physical laws\. These defects not only weaken the understanding of crowd intentions but also reduce the reliability of risk assessment and causal reasoning\.

In contrast, WLDS offers significant advantages in the following three aspects: \(1\) Factual and logical constraints: Through dual factual and logical calibration, WLDS confines deduction to states and processes consistent with domain knowledge, reducing the non\-physical or program\-logic errors frequently seen in the traditional simulation\. \(2\) Diverse world\-line generation: By leveraging LMs for emergency instances deduction, WLDS generates multiple reasonable world lines under knowledge constraints, while maintaining both randomness and diversity\. \(3\) Semantically aligned keyframe visualization: By employing text–image alignment assessment and image generation, WLDS selects or synthesizes keyframe images consistent with the narrative of deduction, making high expressiveness the primary objective of simulation and deduction\.

In summary, the 7 world lines deduced by WLDS cover the complete risk spectrum, ranging from slightly controllable fires to incidents involving personal injuries\. These results highlight WLDS’s capability for high\-fidelity and diversified scenario generation and offer a visualized and verifiable experimental basis for risk assessment and decision support in the urban rail transit domain\.
### 4\.4 Ablation Study: Importance of Dual Calibration Mechanism
Figure 8: Comparative deduction results of WLDS in the urban rail transit domain without logical calibration and without factual calibration\.

WLDS introduces factual calibration and logical calibration to address factual deviation and logical deviation\. To quantitatively evaluate the specific contributions of these two mechanisms to the performance of WLDS, this section conducts an ablation study\. We removed factual calibration and logical calibration in turn and performed deduction on emergency instances\. All initial scenarios of the experiments are derived from emergency instances in the domain of urban rail transit\. The quantitative results are presented in Table [2](https://arxiv.org/html/2605.08599#S4.T2)\.
The results indicate that the full WLDS achieves a factual consistency of 93\.75%, a logical consistency of 91\.67%, and an accuracy of 90\.20% on the EID benchmark dataset, all of which exceed the results obtained when only the factual calibration mechanism or only the logical calibration mechanism is used\. This supports the view that the dual calibration mechanism both ensures the factual reliability of individual events through factual calibration and maintains the causal continuity between consecutive events via logical calibration\. Together, the two mechanisms help WLDS mitigate the hallucination issues of LMs and achieve high\-precision and high\-fidelity simulation and deduction of emergency instances\.
Table 3: Evaluation criteria used by domain experts for scoring WLDS deduction performance\.

Figure [8](https://arxiv.org/html/2605.08599#S4.F8) further demonstrates the significance of the dual calibration mechanism\. In the scenario without logical calibration \(left panel\), the sequence exhibits a distinct logical discontinuity\. For example, when the event progresses to “the fire was quickly brought under control and the station resumed normal operations”, the subsequent step describes “staff beginning to evacuate the crowd to a safe location”\. This constitutes a logical inconsistency, as an evacuation following the station’s return to normal operation lacks causal justification\. In the scenario without factual calibration \(right panel\), a factual deviation emerges, where the model deduces that “the fire department extinguished the fire using fire trucks”\. This is a factual inconsistency because fire trucks cannot access the enclosed underground space of a subway platform, making it impossible to use them for fire extinguishing operations\. The qualitative observations from Figure [8](https://arxiv.org/html/2605.08599#S4.F8) further confirm the critical role of the dual calibration mechanism\.
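The two failure modes in Figure 8 can be made concrete with a toy checker\. The sketch below is a keyword\-rule stand\-in written purely for illustration\. It is not the calibration mechanism of WLDS, and the rule table `UNAVAILABLE_EQUIPMENT` and both function names are hypothetical\. The step texts are the ones quoted above\.

```python
from typing import List

# Hypothetical domain fact: equipment that cannot reach a given location.
UNAVAILABLE_EQUIPMENT = {"subway platform": ["fire truck"]}


def factual_issues(location: str, steps: List[str]) -> List[str]:
    """Flag steps that mention equipment unavailable at the location."""
    issues = []
    for step in steps:
        for item in UNAVAILABLE_EQUIPMENT.get(location, []):
            if item in step.lower():
                issues.append(f"factual: '{item}' is not usable at a {location}")
    return issues


def logical_issues(steps: List[str]) -> List[str]:
    """Flag an evacuation that occurs after normal operations have resumed."""
    issues = []
    resumed = False
    for step in steps:
        text = step.lower()
        if resumed and "evacuate" in text:
            issues.append("logical: evacuation after normal operations resumed")
        if "resumed normal operations" in text:
            resumed = True
    return issues


# Sequences quoted in the ablation discussion.
WITHOUT_LOGICAL = [
    "the fire was quickly brought under control and the station resumed normal operations",
    "staff beginning to evacuate the crowd to a safe location",
]
WITHOUT_FACTUAL = [
    "the fire department extinguished the fire using fire trucks",
]

print(logical_issues(WITHOUT_LOGICAL))
print(factual_issues("subway platform", WITHOUT_FACTUAL))
```

The split mirrors the distinction drawn in the text: the factual check looks at each step in isolation against domain knowledge, whereas the logical check depends on the order of steps\.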
### 4\.5 Evaluation by Domain Experts
To further validate the effectiveness of WLDS in practical applications, 20 experts from the relevant domains were invited to conduct an evaluation\. The evaluation covers four dimensions: diversity, precision, rigor and feasibility\. Each dimension is scored on a scale from 0 \(worst performance\) to 5 \(best performance\)\. The relevant evaluation criteria are presented in Table [3](https://arxiv.org/html/2605.08599#S4.T3)\.
Figure 9: Visualization of the expert evaluation scores for WLDS on the 10 sub\-datasets of the EID benchmark dataset\.

Figure [9](https://arxiv.org/html/2605.08599#S4.F9) shows that the deductions produced by WLDS received consistently high ratings from experts across domains\. Specifically, the autonomous driving domain received the highest average score of 4\.8, reflecting strong expert recognition of WLDS’s ability to generate diverse, accurate, and logically rigorous emergency instances deduction in this domain\. Across all domains, the average score reached 4\.4, indicating that WLDS consistently meets the practical requirements of emergency instances deduction in diverse specific domains\.
## 5 Discussion
Implications: The proposed WLDS addresses key challenges in emergency instances deduction of specific domains, including the scarcity of emergency instances, limited scenario diversity, and logical inconsistency in simulation outputs\. By introducing cross\-domain knowledge transfer in the instance generation phase, WLDS can produce highly relevant and realistic initial emergency instances even in the absence of sufficient in\-domain samples\. The incorporation of controlled randomness during deduction enhances diversity while preserving logical plausibility, and the dual calibration mechanism supports both factual accuracy and logical coherence\. Furthermore, the integration of keyframe\-based visualization improves interpretability, providing an effective tool for training and evaluating decision\-making in response to emergency instances\.
Limitations and future work: Although WLDS achieves excellent performance in short\-step deduction, logical coherence tends to decrease to some extent as the number of reasoning steps increases\. In addition, while the current image\-text integrated visualization approach can intuitively present the evolution of scenarios, it remains insufficient in expressing dynamic interactions, temporal rhythms, and immersive experiences\. In future work, we plan to improve WLDS in two directions: \(1\) By incorporating more detailed domain knowledge representations and multi\-stage deduction mechanisms, we seek to enhance the accuracy and logical consistency of long\-chain deduction tasks\. \(2\) We plan to incorporate dynamic visualization methods, such as animated scenarios, interactive environments, and video generation, to improve temporal continuity and overall immersion\.
## 6 Conclusion
This paper proposes WLDS, an LMs\-driven system for simulation and deduction of emergency instances\. WLDS generates highly relevant emergency instances via cross\-domain knowledge transfer, introduces controlled randomness to produce diverse scenarios, and applies the dual calibration mechanism to ensure both factual accuracy and logical coherence\. Keyframe\-based visualization further enhances interpretability and user understanding\. Experimental results demonstrate that WLDS achieves superior performance in factual consistency, logical consistency, and scenario diversity compared with existing approaches, confirming its effectiveness and applicability in emergency instances deduction\.
## References
- \[1\]V\. Astarita, G\. Guido, S\. S\. Haghshenas, and S\. S\. Haghshenas\(2024\)Risk reduction in transportation systems: the role of digital twins according to a bibliometric\-based literature review\.Sustainability16\(8\)\.External Links:ISSN 2071\-1050,[Document](https://dx.doi.org/10.3390/su16083212)Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p2.1)\.
- \[2\]I\. Bae, J\. Lee, and H\. Jeon\(2025\)Continuous locomotive crowd behavior generation\.External Links:2504\.04756,[Document](https://dx.doi.org/https%3A//arxiv.org/abs/2504.04756)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p2.1)\.
- \[3\]G\. Bagschik, T\. Menzel, and M\. Maurer\(2018\)Ontology based scene creation for the development of automated vehicles\.In2018 IEEE Intelligent Vehicles Symposium \(IV\),Vol\.,pp\. 1813–1820\.External Links:[Document](https://dx.doi.org/10.1109/IVS.2018.8500632)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p2.1)\.
- \[4\]G\. Baudry, C\. Macharis, and T\. Vallée\(2018\)Range\-based multi\-actor multi\-criteria analysis: a combined method of multi\-actor multi\-criteria analysis and monte carlo simulation to support participatory decision making under uncertainty\.European Journal of Operational Research264\(1\),pp\. 257–269\.External Links:ISSN 0377\-2217,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ejor.2017.06.036)Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p1.1)\.
- \[5\]M\. Bäumler, F\. Linke, and G\. Prokop\(2024\)Categorizing data\-driven methods for test scenario generation to assess automated driving systems\.IEEE Access12\(\),pp\. 52030–52050\.External Links:[Document](https://dx.doi.org/10.1109/ACCESS.2024.3385646)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p3.1)\.
- \[6\]M\. Bäumler and G\. Prokop\(2024\)Test scenario fusion: how to fuse scenarios from accident and traffic observation data\.IEEE Access12\(\),pp\. 16354–16374\.External Links:[Document](https://dx.doi.org/10.1109/ACCESS.2023.3340442)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p3.1)\.
- \[7\]J\. Cai, S\. Yang, and H\. Guang\(2024\)A review on scenario generation for testing autonomous vehicles\.In2024 IEEE Intelligent Vehicles Symposium \(IV\),Vol\.,pp\. 3371–3376\.External Links:[Document](https://dx.doi.org/10.1109/IV55156.2024.10588675)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p1.1)\.
- \[8\]C\. Chang, S\. Wang, J\. Zhang, J\. Ge, and L\. Li\(2024\)LLMScenario: large language model driven scenario generation\.IEEE Transactions on Systems, Man, and Cybernetics: Systems54\(11\),pp\. 6581–6594\.External Links:[Document](https://dx.doi.org/10.1109/TSMC.2024.3392930)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p2.1)\.
- \[9\]D\. Chen, Z\. Hu, P\. Fan, Y\. Zhuang, Y\. Li, Q\. Liu, X\. Jiang, and M\. Xu\(2025\)KKA: improving vision anomaly detection through anomaly\-related knowledge from large language models\.External Links:2502\.14880,[Document](https://dx.doi.org/https%3A//arxiv.org/abs/2502.14880)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[10\]D\. Chen, S\. Zhang, F\. Gao, Y\. Zhuang, S\. Tang, Q\. Liu, and M\. Xu\(2024\)Logic distillation: learning from code function by function for planning and decision\-making\.External Links:2407\.19405,[Document](https://dx.doi.org/https%3A//arxiv.org/abs/2407.19405)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[11\]D\. Chen, S\. Zhang, Y\. Zhuang, S\. Tang, Q\. Liu, H\. Wang, and M\. Xu\(2024\)Improving large models with small models: lower costs and better performance\.External Links:2406\.15471,[Document](https://dx.doi.org/https%3A//arxiv.org/abs/2406.15471)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[12\]D\. Chen, Y\. Zhuang, S\. Zhang, J\. Liu, S\. Dong, and S\. Tang\(2024\-Mar\.\)Data shunt: collaboration of small and large models for lower costs and better performance\.Proceedings of the AAAI Conference on Artificial Intelligence38\(10\),pp\. 11249–11257\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v38i10.29003)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[13\]X\. Cheng, K\. Zhang, T\. Wu, Z\. Xu, and X\. Gou\(2024\)An opinions\-updating model for large\-scale group decision\-making driven by autonomous learning\.Information Sciences662,pp\. 120238\.External Links:ISSN 0020\-0255,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ins.2024.120238)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[14\]R\.T\.H\. Chin, S\.\-P\.A\. van Houten, and A\. Verbraeck\(2005\)Towards a simulation and visualization portal to support multi\-actor decision making in mainports\.InProceedings of the Winter Simulation Conference, 2005\.,Vol\.,pp\. 6 pp\.–\.External Links:[Document](https://dx.doi.org/10.1109/WSC.2005.1574544)Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p1.1)\.
- \[15\]A\. J\. G\. de Azambuja, T\. Giese, K\. Schützer, R\. Anderl, B\. Schleich, and V\. R\. Almeida\(2024\)Digital twins in industry 4\.0 – opportunities and challenges related to cyber security\.Procedia CIRP121,pp\. 25–30\.Note:11th CIRP Global Web Conference \(CIRPe 2023\)External Links:ISSN 2212\-8271,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.procir.2023.09.225)Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p2.1)\.
- \[16\]W\. de Paula Ferreira, F\. Armellini, and L\. A\. De Santa\-Eulalia\(2020\)Simulation in industry 4\.0: a state\-of\-the\-art review\.Computers & Industrial Engineering149,pp\. 106868\.External Links:ISSN 0360\-8352,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.cie.2020.106868)Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p1.1)\.
- \[17\]C\. H\. dos Santos, J\. A\. de Queiroz, F\. Leal, and J\. A\. B\. Montevechi\(2022\)Use of simulation in the industry 4\.0 context: creation of a digital twin to optimise decision making on non\-automated process\.Journal of Simulation16\(3\),pp\. 284–297\.External Links:[Document](https://dx.doi.org/10.1080/17477778.2020.1811172)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p1.1)\.
- \[18\]W\. Gan, M\. Dao, and K\. Zettsu\(2025\)Case\-based reasoning augmented large language model framework for decision making in realistic safety\-critical driving scenarios\.External Links:2506\.20531,[Document](https://dx.doi.org/https%3A//arxiv.org/abs/2506.20531)Cited by:[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p2.1)\.
- \[19\]A\. L\. Hananto, A\. Tirta, S\. G\. Herawan, M\. Idris, M\. E\. M\. Soudagar, D\. W\. Djamari, and I\. Veza\(2024\)Digital twin and 3d digital twin: concepts, applications, and challenges in industry 4\.0 for digital twin\.Computers13\(4\)\.External Links:ISSN 2073\-431X,[Document](https://dx.doi.org/10.3390/computers13040100)Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p2.1)\.
- \[20\]S\. Hawking and G\. EllisThe large scale structure of space\-time \(Cambridge University Press, Cambridge, England, 1973\)\.B\. Carter, Phys\. Rev\.2,pp\. 174\.Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p4.1)\.
- \[21\]S\. James, Z\. Ma, D\. R\. Arrojo, and A\. J\. Davison\(2020\)RLBench: the robot learning benchmark & learning environment\.IEEE Robotics and Automation Letters5\(2\),pp\. 3019–3026\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2020.2974707)Cited by:[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p1.1)\.
- \[22\]A\. Li, S\. Chen, L\. Sun, N\. Zheng, M\. Tomizuka, and W\. Zhan\(2022\)SceGene: bio\-inspired traffic scenario generation for autonomous driving testing\.IEEE Transactions on Intelligent Transportation Systems23\(9\),pp\. 14859–14874\.External Links:[Document](https://dx.doi.org/10.1109/TITS.2021.3134661)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p2.1)\.
- \[23\]S\. Li, T\. Azfar, and R\. Ke\(2024\)ChatSUMO: large language model for automating traffic scenario generation in simulation of urban mobility\.IEEE Transactions on Intelligent Vehicles\(\),pp\. 1–12\.External Links:[Document](https://dx.doi.org/10.1109/TIV.2024.3508471)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[24\]H\. Liu, L\. Zhang, S\. K\. Sastry Hari, and J\. Zhao\(2024\)Safety\-critical scenario generation via reinforcement learning based editing\.In2024 IEEE International Conference on Robotics and Automation \(ICRA\),Vol\.,pp\. 14405–14412\.External Links:[Document](https://dx.doi.org/10.1109/ICRA57147.2024.10611555)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p2.1)\.
- \[25\]X\. Liu and I\. David\(2024\)AI simulation by digital twins: systematic survey of the state of the art and a reference framework\.InProceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems,MODELS Companion ’24,New York, NY, USA,pp\. 401–412\.External Links:ISBN 9798400706226,[Document](https://dx.doi.org/10.1145/3652620.3688253)Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p2.1)\.
- \[26\]F\. Lu, F\. Meng, and H\. Bi\(2025\)Scenario deduction of explosion accident based on fuzzy dynamic bayesian network\.Journal of Loss Prevention in the Process Industries96,pp\. 105613\.External Links:ISSN 0950\-4230,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jlp.2025.105613)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p1.1)\.
- \[27\]O\. Mees, L\. Hermann, E\. Rosete\-Beas, and W\. Burgard\(2022\)CALVIN: a benchmark for language\-conditioned policy learning for long\-horizon robot manipulation tasks\.IEEE Robotics and Automation Letters7\(3\),pp\. 7327–7334\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2022.3180108)Cited by:[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p1.1)\.
- \[28\]D\. Mourtzis\(2020\)Simulation in the design and operation of manufacturing systems: state of the art and new trends\.International Journal of Production Research58\(7\),pp\. 1927–1949\.External Links:[Document](https://dx.doi.org/10.1080/00207543.2019.1636321),https://doi\.org/10\.1080/00207543\.2019\.1636321Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p1.1)\.
- \[29\]S\. Nasiriany, A\. Maddukuri, L\. Zhang, A\. Parikh, A\. Lo, A\. Joshi, A\. Mandlekar, and Y\. Zhu\(2024\)RoboCasa: large\-scale simulation of everyday tasks for generalist robots\.External Links:2406\.02523,[Document](https://dx.doi.org/https%3A//arxiv.org/abs/2406.02523)Cited by:[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p2.1)\.
- \[30\]T\. Niu, K\. Zhang, Z\. Gan, and W\. Ding\(2024\)Planning by simulation: motion planning with learning\-based parallel scenario prediction for autonomous driving\.External Links:2411\.09887,[Document](https://dx.doi.org/https%3A//arxiv.org/abs/2411.09887)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p1.1)\.
- \[31\]A\. Padovano, F\. Longo, L\. Manca, and R\. Grugni\(2024\)Improving safety management in railway stations through a simulation\-based digital twin approach\.Computers & Industrial Engineering187,pp\. 109839\.External Links:ISSN 0360\-8352,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.cie.2023.109839)Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p2.1)\.
- \[32\]N\. C\. Rajashekar, Y\. E\. Shin, Y\. Pu, S\. Chung, K\. You, M\. Giuffre, C\. E\. Chan, T\. Saarinen, A\. Hsiao, J\. Sekhon, A\. H\. Wong, L\. V\. Evans, R\. F\. Kizilcec, L\. Laine, T\. Mccall, and D\. Shung\(2024\)Human\-algorithmic interaction using a large language model\-augmented artificial intelligence clinical decision support system\.InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems,CHI ’24,New York, NY, USA\.External Links:ISBN 9798400703300,[Document](https://dx.doi.org/10.1145/3613904.3642024)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[33\]M\. Rissanen, L\. Metso, K\. Elfvengren, and T\. Sinkkonen\(2020\)Serious games for decision\-making processes: a systematic literature review\.InEngineering Assets and Public Infrastructures in the Age of Digitalization,J\. P\. Liyanage, J\. Amadi\-Echendu, and J\. Mathew \(Eds\.\),Cham,pp\. 330–338\.Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p1.1)\.
- \[34\]C\. Roumeliotis, M\. Dasygenis, V\. Lazaridis, and M\. Dossis\(2024\)Blockchain and digital twins in smart industry 4\.0: the use case of supply chain\-a review of integration techniques and applications\.Designs8\(6\)\.External Links:ISSN 2411\-9660,[Document](https://dx.doi.org/10.3390/designs8060105)Cited by:[§2\.2](https://arxiv.org/html/2605.08599#S2.SS2.p2.1)\.
- \[35\]T\. V\. Samak, C\. V\. Samak, B\. Li, and V\. Krovi\(2025\)When digital twins meet large language models: realistic, interactive, and editable simulation for autonomous driving\.External Links:2507\.00319,[Document](https://dx.doi.org/https%3A//arxiv.org/abs/2507.00319)Cited by:[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p1.1)\.
- \[36\]S\. Thal, R\. Henze, R\. Hasegawa, H\. Nakamura, H\. Imanaga, J\. Antona\-Makoshi, and N\. Uchida\(2022\)Generic detection and search\-based test case generation of urban scenarios based on real driving data\.In2022 IEEE Intelligent Vehicles Symposium \(IV\),Vol\.,pp\. 694–701\.External Links:[Document](https://dx.doi.org/10.1109/IV51971.2022.9827198)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p3.1)\.
- \[37\]H\. Wang, J\. Chen, W\. Huang, Q\. Ben, T\. Wang, B\. Mi, T\. Huang, S\. Zhao, Y\. Chen, S\. Yang, P\. Cao, W\. Yu, Z\. Ye, J\. Li, J\. Long, Z\. Wang, H\. Wang, Y\. Zhao, Z\. Tu, Y\. Qiao, D\. Lin, and J\. Pang\(2024\)GRUtopia: dream general robots in a city at scale\.External Links:2407\.10943,[Document](https://dx.doi.org/https%3A//arxiv.org/abs/2407.10943)Cited by:[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p2.1)\.
- \[38\]S\. Wang, M\. Han, Z\. Jiao, Z\. Zhang, Y\. N\. Wu, S\. Zhu, and H\. Liu\(2024\)LLM3: large language model\-based task and motion planning with motion failure reasoning\.In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\),Vol\.,pp\. 12086–12092\.External Links:[Document](https://dx.doi.org/10.1109/IROS58592.2024.10801328)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[39\]C\. Xu, A\. Petiushko, D\. Zhao, and B\. Li\(2025\-Apr\.\)DiffScene: diffusion\-based safety\-critical scenario generation for autonomous vehicles\.Proceedings of the AAAI Conference on Artificial Intelligence39\(8\),pp\. 8797–8805\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v39i8.32951)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p2.1)\.
- \[40\]Y\. Yu, Y\. Wang, Y\. Zhang, H\. Qu, and D\. Liu\(2025\)InclusiViz : visual analytics of human mobility data for understanding and mitigating urban segregation\.IEEE Transactions on Visualization and Computer Graphics31\(6\),pp\. 3836–3849\.External Links:[Document](https://dx.doi.org/10.1109/TVCG.2025.3567117)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p1.1)\.
- \[41\]J\. Yuan, X\. Ma, D\. Chen, K\. Kuang, F\. Wu, and L\. Lin\(2022\)Label\-efficient domain generalization via collaborative exploration and generalization\.InProceedings of the 30th ACM International Conference on Multimedia,MM ’22,New York, NY, USA,pp\. 2361–2370\.External Links:ISBN 9781450392037,[Document](https://dx.doi.org/10.1145/3503161.3548059)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[42\]J\. Yuan, X\. Ma, D\. Chen, K\. Kuang, F\. Wu, and L\. Lin\(2023\)Domain\-specific bias filtering for single labeled domain generalization\.International Journal of Computer Vision131\(2\),pp\. 552–571\.External Links:[Document](https://dx.doi.org/10.1007/s11263-022-01712-7)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[43\]J\. Yuan, X\. Ma, D\. Chen, F\. Wu, L\. Lin, and K\. Kuang\(2023\)Collaborative semantic aggregation and calibration for federated domain generalization\.IEEE Transactions on Knowledge and Data Engineering35\(12\),pp\. 12528–12541\.External Links:[Document](https://dx.doi.org/10.1109/TKDE.2023.3271851)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[44\]J\. Yuan, X\. Zhang, H\. Zhou, J\. Wang, Z\. Qiu, Z\. Shao, S\. Zhang, S\. Long, K\. Kuang, K\. Yao,et al\.\(2023\)Hap: structure\-aware masked image modeling for human\-centric perception\.Advances in Neural Information Processing Systems36,pp\. 50597–50616\.Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p3.1)\.
- \[45\]J\. Zhang, C\. Xu, and B\. Li\(2024\)ChatScene: knowledge\-enabled safety\-critical scenario generation for autonomous vehicles\.External Links:2405\.14062,[Document](https://arxiv.org/abs/2405.14062)Cited by:[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p2.1)\.
- \[46\]X\. Zhang, S\. Zeinali, and G\. Schildbach\(2025\)Interaction\-aware traffic prediction and scenario\-based model predictive control for autonomous vehicles on highways\.IEEE Transactions on Control Systems Technology33\(4\),pp\. 1235–1245\.External Links:[Document](https://dx.doi.org/10.1109/TCST.2024.3458817)Cited by:[§2\.1](https://arxiv.org/html/2605.08599#S2.SS1.p1.1)\.
- \[47\]S\. Zhao, J\. Zhang, N\. Masoud, H\. Huang, X\. Hou, and C\. He\(2025\)SACA: a scenario\-aware collision avoidance framework for autonomous vehicles integrating llms\-driven reasoning\.External Links:2504\.00115,[Document](https://arxiv.org/abs/2504.00115)Cited by:[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p2.1)\.
- \[48\]W\. X\. Zhao, K\. Zhou, J\. Li, T\. Tang, X\. Wang, Y\. Hou, Y\. Min, B\. Zhang, J\. Zhang, Z\. Dong, Y\. Du, C\. Yang, Y\. Chen, Z\. Chen, J\. Jiang, R\. Ren, Y\. Li, X\. Tang, Z\. Liu, P\. Liu, J\. Nie, and J\. Wen\(2025\)A survey of large language models\.External Links:2303\.18223,[Document](https://arxiv.org/abs/2303.18223)Cited by:[§2\.3](https://arxiv.org/html/2605.08599#S2.SS3.p1.1)\.
- \[49\]A\. Zlocki, A\. König, J\. Bock, H\. Weber, H\. Muslim, H\. Nakamura, S\. Watanabe, J\. Antona\-Makoshi, and S\. Taniguchi\(2022\)Logical scenarios parameterization for automated vehicle safety assessment: comparison of deceleration and cut\-in scenarios from japanese and german highways\.IEEE Access10\(\),pp\. 26817–26829\.External Links:[Document](https://dx.doi.org/10.1109/ACCESS.2022.3154415)Cited by:[§1](https://arxiv.org/html/2605.08599#S1.p2.1)\.