# Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence
Source: [https://arxiv.org/html/2605.11348](https://arxiv.org/html/2605.11348)
Ujun Jeong1, Saketh Vishnubhatla1, Bohan Jiang1, Andre Harrison2, Adrienne Raglin2, Huan Liu1
1Arizona State University, Tempe, Arizona, USA
2DEVCOM Army Research Laboratory, Adelphi, Maryland, USA
{ujeong1, svishnu6, bjiang14, huanliu}@asu.edu
{andre.v.harrison2.civ, adrienne.raglin2.civ}@army.mil
###### Abstract
During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However, disaster-related posts are often informal, fragmented, and context-dependent, and they may describe personal experiences rather than explicit causal relations. In this work, we examine whether Large Language Models (LLMs) can effectively extract causal relations from disaster-related social media posts. To this end, we (1) propose an expert-grounded evaluation framework that compares LLM-generated causal graphs with reference graphs derived from disaster-specific reports and (2) assess whether the extracted relations are supported by post-event evidence or instead reflect model priors. Our findings highlight both the potential and risks of using LLMs for causal relation extraction in disaster decision-support systems.
## 1 Introduction
Causality is a crucial yet underexplored dimension of disaster intelligence, as highlighted in a systematic review of disaster data from social media Wiegmann et al. ([2020](https://arxiv.org/html/2605.11348#bib.bib69)). Agencies such as NOAA produce detailed post-event reports documenting hazards, impacts, and contributing factors. However, these reports are retrospective and labor-intensive, and they often provide only limited insight into the rapidly evolving dynamics of disasters Kryvasheyeu et al. ([2016](https://arxiv.org/html/2605.11348#bib.bib17)).
In contrast, social media platforms such as Twitter provide real-time disaster information through user observations, warnings, and damage reports Vieweg et al. ([2010](https://arxiv.org/html/2605.11348#bib.bib21)). Hurricanes Harvey and Irma, for example, demonstrated how social media posts captured impacts such as flooding, transportation disruptions, and power outages as events unfolded, highlighting the potential of crowd-sourced data to reveal early signals of cause–effect relations during disasters King ([2018](https://arxiv.org/html/2605.11348#bib.bib83)).
However, extracting causal relations from social media remains challenging. Posts are often short, informal, fragmented, and context-dependent, with causality frequently implied rather than explicitly stated. Rule-based methods that rely on markers such as “caused by” or “led to” therefore miss many relevant relations. Although NLP methods have advanced structured event and knowledge extraction Liu et al. ([2023](https://arxiv.org/html/2605.11348#bib.bib48)); Lu et al. ([2021](https://arxiv.org/html/2605.11348#bib.bib65)), many still depend on human-annotated templates or domain-specific schemas, limiting their generalization to complex and rapidly changing disaster contexts.
Large Language Models (LLMs) offer a promising alternative because they can process noisy text, reason over incomplete context, and infer unstated relations Hao et al. ([2026](https://arxiv.org/html/2605.11348#bib.bib20)). Yet, this capability introduces a key risk: LLMs may generate plausible causal links from prior knowledge rather than evidence in the posts. This motivates our research question: “Can LLMs extract valid causal relations from social media posts about disaster events?”
To address this question, we develop an evaluation framework that compares LLM-generated causal graphs with expert-grounded reports and examines whether model outputs are supported by post-level evidence or influenced by prior knowledge. This framework enables three contributions:
- **Expert-grounded causal graphs:** We construct disaster-specific ground-truth causal graphs by aligning an impact-chain framework with post-event reports to retain expert-grounded evidence.
- **Social-media causal extraction:** We evaluate LLMs on extracting causal relations among a fixed set of variables for social media analysis, distinguishing this from open-ended causal discovery.
- **Evidence versus prior knowledge:** We examine how social media post quality affects LLMs’ causal extraction and reveal the risk of plausible but ungrounded outputs driven by model priors.
## 2 Related Work
### 2.1 Social Media for Disaster Intelligence
Social media has been widely used for disaster intelligence, including event detection, damage assessment, crisis classification, geolocation, and situational awareness Ma et al. ([2025](https://arxiv.org/html/2605.11348#bib.bib10)); Vishnubhatla et al. ([2025b](https://arxiv.org/html/2605.11348#bib.bib9)); Zou et al. ([2023](https://arxiv.org/html/2605.11348#bib.bib40)); Imran et al. ([2016](https://arxiv.org/html/2605.11348#bib.bib46)). CrisisMMD and HumAID provide annotated disaster posts for relevance and impact classification Alam et al. ([2018](https://arxiv.org/html/2605.11348#bib.bib35), [2021](https://arxiv.org/html/2605.11348#bib.bib41)). Prior approaches range from keyword- and location-based clustering Atefeh and Khreich ([2015](https://arxiv.org/html/2605.11348#bib.bib68)); Becker et al. ([2011](https://arxiv.org/html/2605.11348#bib.bib75)) to neural models that capture semantic and temporal patterns Zhao et al. ([2017](https://arxiv.org/html/2605.11348#bib.bib76)); Lee et al. ([2021](https://arxiv.org/html/2605.11348#bib.bib78)); Zhou et al. ([2022](https://arxiv.org/html/2605.11348#bib.bib79)). Jiang et al. ([2024](https://arxiv.org/html/2605.11348#bib.bib88)) leveraged GPT-2 to analyze social media posts but without validation.
### 2.2 Causal Relation Extraction from Text
Causal relation extraction has been extensively studied, from lexical cue-based systems to neural event extraction frameworks Liu et al. ([2023](https://arxiv.org/html/2605.11348#bib.bib48)); Lou et al. ([2023](https://arxiv.org/html/2605.11348#bib.bib67)); Lu et al. ([2021](https://arxiv.org/html/2605.11348#bib.bib65)); Susanti and Färber ([2025](https://arxiv.org/html/2605.11348#bib.bib54)). Recent work has also explored using LLMs for causal graph construction from natural text. Closest to our setting, [Saklad et al.](https://arxiv.org/html/2605.11348#bib.bib15) evaluate whether LLMs can infer causal graphs from academic documents, requiring external judgment to align the generated graph with the reference graph.
Our work is distinct in two ways. First, we use an impact-chain framework to define the target variables and their level of abstraction in advance, which removes ambiguity about node equivalence and enables direct graph evaluation. Second, to the best of our knowledge, this is the first study to extract disaster-related causal relationships from social media and validate them using post-event evidence documented in standardized expert reports.
## 3 Building the Ground-Truth Causal Graph
### 3.1 Expert Evidence Matching and Recording
Ground-truth causal graphs for specific disaster events, defined here as sets of directed (Cause, Effect) relations, are not readily available. To address this gap, we adopt a match-and-record approach that combines established disaster-management frameworks with event-specific expert evidence Zebisch et al. ([2021](https://arxiv.org/html/2605.11348#bib.bib22)). We start with a general causal graph based on the impact-chain framework for each disaster type, then refine it using causality evidence from post-event reports.
1. **Initial variables and edges:** We build on the impact-chain framework, which represents causal pathways among hazards, exposure, vulnerability, and impacts in disaster risk analysis Zebisch et al. ([2021](https://arxiv.org/html/2605.11348#bib.bib22), [2022](https://arxiv.org/html/2605.11348#bib.bib16)). Specifically, we use the storm-related framework proposed by [Pittore et al.](https://arxiv.org/html/2605.11348#bib.bib14) as the initial set of variables and edges to fix the level of abstraction.
2. **Collect post-event expert evidence:** For each disaster event, we consult official, standardized expert reports, such as NOAA’s post-event reports Cangialosi et al. ([2021](https://arxiv.org/html/2605.11348#bib.bib26), [2018](https://arxiv.org/html/2605.11348#bib.bib25)), rather than aggregating ad hoc sources. These reports provide expert-grounded evidence of causal relations specific to the disaster event.
3. **Evidence-based pruning:** We retain only the edges that are explicitly supported by textual evidence in the reports and remove unsupported relations. The causality evidence is managed in a tabulated document for review.
Figure [1](https://arxiv.org/html/2605.11348#S3.F1) summarizes the construction of the ground-truth causal graphs from the impact-chain framework and authoritative post-event reports. The resulting graphs and supporting evidence tables were reviewed by two Ph.D.-level researchers in disaster response and prediction, and will be released with the data sources upon acceptance.
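The three-step match-and-record procedure above can be sketched as simple set operations. This is only an illustration: the edges and evidence rows below are hypothetical placeholders, not the actual impact-chain graph or NOAA report contents.

```python
# Illustrative sketch of the match-and-record procedure (Sec. 3.1).
# All edge names and evidence quotes are invented for demonstration.

# Step 1: initial variables and edges from a storm impact-chain framework
impact_chain_edges = {
    ("Storm surge", "Coastal flooding"),
    ("Coastal flooding", "Infrastructure damage"),
    ("High winds", "Power outage"),
    ("High winds", "Tornado"),
}

# Step 2: tabulated causality evidence recorded from a post-event report
evidence_table = [
    {"cause": "Storm surge", "effect": "Coastal flooding",
     "quote": "Surge of 3 to 5 ft inundated coastal neighborhoods."},
    {"cause": "High winds", "effect": "Power outage",
     "quote": "Sustained winds downed lines, cutting power to thousands."},
]

# Step 3: evidence-based pruning — keep only edges explicitly supported
supported = {(row["cause"], row["effect"]) for row in evidence_table}
ground_truth_edges = impact_chain_edges & supported

print(sorted(ground_truth_edges))
```

Edges with no matching evidence row (e.g., the hypothetical wind-to-tornado link) are dropped, leaving only expert-grounded relations in the ground-truth graph.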
### 3.2 Disaster Cases and Data Curation
We focus on disasters that satisfy two criteria: (1) the availability of publicly accessible social media data and (2) the existence of standardized, authoritative post-event reports that enable consistent causal analysis. Using the CrisisMMD Alam et al. ([2018](https://arxiv.org/html/2605.11348#bib.bib35)) and HumAID Alam et al. ([2021](https://arxiv.org/html/2605.11348#bib.bib41)) benchmarks, we identified Hurricanes Irma and Harvey as the only events meeting both requirements. This constraint led our study to focus on U.S.-based disasters documented by NOAA reports, which provide structured, event-specific evidence of causal mechanisms. After removing duplicate post IDs, the curated datasets contain 9,824 posts for Hurricane Irma and 10,662 posts for Hurricane Harvey.
Figure 1: Matching the impact chain with a disaster-specific post-event report to record ground-truth causal relations.
## 4 Causal Relation Extraction
### 4.1 LLMs and Causal Graph Generation
We compare LLMs along two dimensions: (1) access to social media data and (2) model openness, distinguishing between closed-source and open-weight systems. This design allows us to examine whether platform-specific contextual knowledge and model scale affect causal relation extraction.
- **Grok 4.3**: Closed-source LLM by xAI with reasoning and native social media access for historical and real-time disaster-related posts.
- **GPT-5.5**: Closed-source LLM by OpenAI with reasoning and long-context capabilities, but no native social-media access.
- **Mistral-7B**: Open-weight LLM by Mistral AI, representing smaller, resource-constrained models with limited context windows.
Because smaller models such as Mistral-7B have limited context windows, we process posts in batches of 20 rather than as a single input. This approach follows a chunking-based strategy for long-context LLM processing, where smaller segments are analyzed independently and later combined. For each batch, the model extracts directed causal pairs, which are then aggregated across batches with duplicate edges merged Ratner et al. ([2023](https://arxiv.org/html/2605.11348#bib.bib18)). In practice, batching also helps mitigate the lost-in-the-middle effect that can occur when processing long-context prompts Friedman et al. ([2022](https://arxiv.org/html/2605.11348#bib.bib19)).
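The batch-then-aggregate strategy can be sketched as follows. This is a minimal illustration, not the paper's implementation: `extract_causal_pairs` is a hypothetical stand-in for the per-batch LLM call, and the sample posts are invented.

```python
# Sketch of the chunking strategy: process posts in batches of 20,
# extract directed (cause, effect) pairs per batch, then aggregate
# across batches with duplicate edges merged.
from collections import Counter

def batched(posts, batch_size=20):
    """Yield consecutive fixed-size slices of the post list."""
    for i in range(0, len(posts), batch_size):
        yield posts[i:i + batch_size]

def extract_causal_pairs(batch):
    # Hypothetical placeholder for the per-batch LLM extraction call.
    # Here each sample post already carries a parsed (cause, effect) pair.
    return [p["edge"] for p in batch if "edge" in p]

def aggregate_graph(posts):
    edge_counts = Counter()
    for batch in batched(posts):
        for edge in extract_causal_pairs(batch):
            edge_counts[edge] += 1  # merge duplicates, keep support counts
    return edge_counts

posts = [{"edge": ("High winds", "Power outage")} for _ in range(25)]
graph = aggregate_graph(posts)
print(graph)
```

Using a `Counter` rather than a plain set preserves how often each edge was extracted, which can later serve as a simple support score when merging duplicates.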
### 4.2 Prompt Template Standardization
The prompt design follows established practices in instruction-based information extraction by specifying the task, constraining the output schema, and grounding the extraction in event-specific context. In this study, the scope of causal relations is restricted to the target variables defined in the impact-chain framework for each disaster event type. Each model receives three inputs: the disaster event type, the list of canonical target variables from the impact-chain framework, and a batch of posts. The prompt instructs the model to identify causal mentions in the posts, map them to the supplied variables, and output directed causal pairs.
**Prompt Template: Causal Graph Generation**
**Task:** Identify cause and effect relations from social media posts related to [Disaster Event].
**Instructions:**
- If available, conduct a native social media search for posts related to [Disaster Event]; otherwise, rely solely on the provided posts.
- Restrict all causes and effects to these variables: [Variables for Disaster Type].
- Extract causal relations that are explicitly stated or reasonably implied in the posts.
- Represent each causal relation as a directed edge in the format of (Cause, Effect).
**Input:** [A batch of social media posts].
**Output:** A list of causal relations.
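A sketch of how such a standardized prompt might be assembled from the three inputs named above (event type, canonical variables, and a post batch). The wording mirrors the template; the function name, sample event, and sample post are our own illustration, not the paper's code.

```python
# Illustrative assembly of the standardized causal-graph prompt
# from the three inputs: event type, canonical variables, post batch.

def build_prompt(event, variables, posts):
    instructions = "\n".join([
        f"- If available, conduct a native social media search for posts "
        f"related to {event}; otherwise, rely solely on the provided posts.",
        f"- Restrict all causes and effects to these variables: "
        f"{', '.join(variables)}.",
        "- Extract causal relations that are explicitly stated or "
        "reasonably implied in the posts.",
        "- Represent each causal relation as a directed edge in the "
        "format of (Cause, Effect).",
    ])
    body = "\n".join(f"- {p}" for p in posts)
    return (f"Task: Identify cause and effect relations from social media "
            f"posts related to {event}.\n"
            f"Instructions:\n{instructions}\n"
            f"Input:\n{body}\n"
            f"Output: A list of causal relations.")

prompt = build_prompt(
    "Hurricane Irma",                                     # hypothetical event
    ["High winds", "Power outage", "Coastal flooding"],   # hypothetical variables
    ["Winds knocked out power across Miami."],            # hypothetical post
)
print(prompt)
```

Fixing the wording and output schema in one function keeps the prompt identical across models, so performance differences reflect the models rather than prompt variation.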
### 4.3 Evaluation Metrics
To align with standard practices in causal discovery, we evaluate the generated causal graphs using the comparison metrics adopted by [Saklad et al.](https://arxiv.org/html/2605.11348#bib.bib15). These include node and edge precision and recall, F1 score, structural Hamming distance (SHD), and normalized SHD (nSHD). Due to space constraints, detailed definitions are provided in Appendix [B](https://arxiv.org/html/2605.11348#A2).
When interpreting these metrics, it is important to note that our setting differs from that of [Saklad et al.](https://arxiv.org/html/2605.11348#bib.bib15). In their setup, LLMs generate both variables and edges from long-form text, requiring an external judge (e.g., another LLM) to resolve semantic mismatches between the generated and reference variables. In contrast, our evaluation provides a fixed set of canonical variables based on the impact-chain framework. Our task is limited to identifying which variables are relevant and extracting whether causal relations exist between them.
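The set-based metrics can be computed directly once predicted and reference graphs share a variable vocabulary. The sketch below uses the standard definitions of precision, recall, and F1 over node or edge sets, plus a plain directed-edge Hamming distance; the paper's exact SHD/nSHD conventions may differ slightly (see Appendix B), and the example graphs are invented.

```python
# Hedged sketch of graph-comparison metrics over sets of nodes or edges.

def precision_recall_f1(pred, ref):
    """Standard set-based precision, recall, and F1."""
    pred, ref = set(pred), set(ref)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def shd(pred_edges, ref_edges):
    """Plain structural Hamming distance: count of directed-edge
    mismatches (spurious plus missing edges, via symmetric difference)."""
    return len(set(pred_edges) ^ set(ref_edges))

ref = {("A", "B"), ("B", "C"), ("C", "D")}
pred = {("A", "B"), ("B", "C"), ("D", "C")}  # one edge reversed
p, r, f1 = precision_recall_f1(pred, ref)
print(round(p, 2), round(r, 2), round(f1, 2), shd(pred, ref))  # → 0.67 0.67 0.67 2
```

Note that a reversed edge counts twice under this symmetric-difference convention (one spurious edge plus one missing edge); some SHD variants count a reversal as a single error instead.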
Table 1: LLM performance on causal relation extraction from disaster-related posts across 10 runs. Boldface denotes significantly better performance between Grok 4.3 and GPT-5.5 based on paired t-tests (p < 0.05).
Table 2: Controlled evaluation of LLMs’ prior-knowledge reliance using non-informative posts across 10 runs. N/A indicates that the model refuses to generate a causal graph due to insufficient evidence in the non-informative posts.
## 5 Results and Analysis
### 5.1 Extraction from Relevant Disaster Posts
To confirm that LLMs extract meaningful causal relations from social media posts rather than relying on random guessing, we compared their performance against a random baseline. This random model generates causal graphs using the Erdős–Rényi model Erdős and Rényi ([1960](https://arxiv.org/html/2605.11348#bib.bib82)) over the fixed set of canonical variables provided in the prompt, connecting pairs with a probability of 0.5. As shown in Table [1](https://arxiv.org/html/2605.11348#S4.T1), all evaluated LLMs substantially outperform the random baseline across both disaster events. These findings demonstrate that the models benefit from understanding social media posts to infer disaster-specific causal relations.
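The Erdős–Rényi baseline described above can be sketched in a few lines: over the fixed canonical variable set, each ordered pair of distinct variables becomes a directed edge with probability 0.5. The variable names below are illustrative placeholders.

```python
# Sketch of the Erdős–Rényi random baseline over a fixed variable set.
import random
from itertools import permutations

def random_causal_graph(variables, p=0.5, seed=0):
    """Connect each ordered pair of distinct variables with probability p."""
    rng = random.Random(seed)  # seeded for reproducibility across runs
    return {(u, v) for u, v in permutations(variables, 2)
            if rng.random() < p}

variables = ["High winds", "Storm surge", "Power outage", "Flooding"]
edges = random_causal_graph(variables)
print(len(edges))  # roughly half of the 12 possible directed edges
```

Because the baseline draws edges uniformly at random over the same node set given to the LLMs, any gap between its scores and the models' scores is attributable to information extracted from the posts rather than to the shared variable vocabulary.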
Among LLMs, Grok 4.3 shows the highest F1 scores (0.75 on Irma and 0.71 on Harvey), driven mainly by higher edge recall while maintaining strong node precision. This superior performance arises from the combination of its advanced reasoning capabilities and native access to social media data. GPT-5.5 performs competitively and leads among models without native social media access, though its lower edge recall reflects a more conservative extraction approach. Mistral-7B shows the weakest overall performance, with notably lower edge precision and recall. Differences across models appear primarily at the edge level rather than in node identification, as the prompt already supplies the canonical variable set.
### 5.2 LLMs’ Reliance on Prior Knowledge
While LLMs can generate coherent causal graphs, their reasoning may rely more on internal priors than on the provided data. To evaluate this effect, we conducted a controlled experiment using only non-informative posts from the CrisisMMD dataset for Hurricane Irma (n = 677) and Hurricane Harvey (n = 799). Since these posts contain no information relevant to the disasters and any external search functions were disabled, any inferred causality reflects the models’ knowledge priors.
As shown in Table [2](https://arxiv.org/html/2605.11348#S4.T2), model performance drops substantially in this ablation setting, particularly in edge recall and overall F1, indicating that LLMs fall back on pre-trained priors when social media evidence is insufficient. GPT-5.5 maintains relatively higher F1 scores (0.58 on Irma and 0.41 on Harvey), suggesting greater reliance on prior knowledge to produce plausible structures. Mistral-7B exhibits the weakest performance overall. In contrast, Grok 4.3 refuses to generate any causal graph for Harvey and displays high selectivity on Irma, reflecting stricter evidence thresholds and better calibration against unsupported inferences. This behavior highlights the need for conservative grounding mechanisms before high-stakes disaster intelligence deployment.
## 6 Conclusion and Future Work
We proposed a framework to validate LLM-generated causal relations extracted from social media posts against a ground-truth graph built from expert disaster reports. Results show that Grok 4.3 achieves the highest F1 score using native social media access. While GPT-5.5 performs closely among models relying on provided disaster posts, our ablation study highlights a key limitation: without sufficient grounding evidence, even advanced models rely on internal priors to generate plausible but unverified causal links. Therefore, despite the promise LLMs show for disaster-specific extraction, robust evidence grounding is essential before their deployment in decision-support systems.
We will extend the framework to other disaster types as corresponding NOAA reports become available, investigate temporal causal relations on social media as crises unfold Vishnubhatla et al. ([2025a](https://arxiv.org/html/2605.11348#bib.bib7), [b](https://arxiv.org/html/2605.11348#bib.bib9)); Sheth et al. ([2021](https://arxiv.org/html/2605.11348#bib.bib13)), and develop a human-AI collaborative verification pipeline to enhance the reliability of extracted causal graphs.
## Limitations
Our study is limited to disasters with standardized NOAA reports, which constrains the evaluation primarily to U.S.-based events. Although other disasters, such as the earthquakes in CrisisMMD and HumAID, are well documented, their reports are often distributed across heterogeneous sources (e.g., EERI and regional agencies), making the consistent construction of expert-grounded causal graphs challenging. In addition, models differ not only in capability but also in data-access conditions. Accordingly, our evaluation reflects each model together with its accessible data sources, rather than isolating intrinsic reasoning ability alone. In particular, xAI’s access restrictions prevent downloading certain social-media inputs (e.g., posts retrieved by Grok 4.3), limiting cross-model comparisons under identical social-media access conditions. To address the concern that performance differences may be driven solely by data quality, we additionally conducted a controlled experiment in which Grok 4.3 was evaluated using the same public-post benchmark; the results are provided in Appendix [A](https://arxiv.org/html/2605.11348#A1).
## Ethical Considerations
Social media posts can be noisy, incomplete, and unevenly distributed across communities and locations Jeong ([2026](https://arxiv.org/html/2605.11348#bib.bib11)); Jeong et al. ([2025b](https://arxiv.org/html/2605.11348#bib.bib3), [a](https://arxiv.org/html/2605.11348#bib.bib12), [2024a](https://arxiv.org/html/2605.11348#bib.bib4), [2024b](https://arxiv.org/html/2605.11348#bib.bib5), [2024c](https://arxiv.org/html/2605.11348#bib.bib1), [2022b](https://arxiv.org/html/2605.11348#bib.bib6), [2022a](https://arxiv.org/html/2605.11348#bib.bib2), [c](https://arxiv.org/html/2605.11348#bib.bib8)). Posts may contain subjective experiences, so the extracted causal graphs should be used only as decision-support tools rather than standalone decision systems. Practical deployment requires source validation, cross-referencing with official emergency communications, and expert review of high-impact relationships. To protect user privacy, this work uses social media posts only to construct aggregated data for disaster analysis and does not perform user profiling or infer private attributes.
## Acknowledgment
This work was supported, in whole or in part, by the U.S. Army Materiel Command under Grant Award Number W911NF24-2-0175 and by the U.S. Army Research Laboratory under Grant Award Number W911NF2020124. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies of the U.S. Army Materiel Command or the U.S. Army Research Laboratory.
## References
- F. Alam, F. Ofli, and M. Imran (2018). CrisisMMD: multimodal Twitter datasets from natural disasters. In ICWSM.
- F. Alam, U. Qazi, M. Imran, and F. Ofli (2021). HumAID: human-annotated disaster incidents data from Twitter with deep learning benchmarks. In ICWSM.
- F. Atefeh and W. Khreich (2015). A survey of techniques for event detection in Twitter. Computational Intelligence.
- H. Becker, M. Naaman, and L. Gravano (2011). Beyond trending topics: real-world event identification on Twitter. In ICWSM.
- J. P. Cangialosi, A. S. Latto, and R. J. Berg (2018). Tropical cyclone report: Hurricane Harvey. Technical report, National Hurricane Center. [Link](https://www.nhc.noaa.gov/data/tcr/AL092017_Harvey.pdf)
- J. P. Cangialosi, A. S. Latto, and R. Berg (2021). Tropical cyclone report: Hurricane Irma. Technical Report AL112017, National Hurricane Center. [Link](https://www.nhc.noaa.gov/data/tcr/AL112017_Irma.pdf)
- P. Erdős and A. Rényi (1960). On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci. 5(1), pp. 17–61.
- S. Friedman, I. Magnusson, V. Sarathy, and S. Schmer-Galunder (2022). From unstructured text to causal knowledge graphs: a transformer-based approach. arXiv preprint arXiv:2202.11768.
- Z. Hao, Z. Chen, J. Lu, S. Yu, G. Hu, K. Zhang, R. Cai, and B. Xu (2026). SERE: structural example retrieval for enhancing LLMs in event causality identification. arXiv preprint arXiv:2605.03701.
- M. Imran, P. Mitra, and C. Castillo (2016). Twitter as a lifeline: human-annotated Twitter corpora for NLP of crisis-related messages. In ACL.
- U. Jeong, Z. Alghamdi, K. Ding, L. Cheng, B. Li, and H. Liu (2022a). Classifying COVID-19 related meta ads using discourse representation through a hypergraph. In SBP-BRiMS.
- U. Jeong, A. Beigi, A. Tahir, S. X. Tang, H. R. Bernard, and H. Liu (2025a). Fediverse sharing: cross-platform interaction dynamics between Threads and Mastodon users. In ASONAM.
- U. Jeong, K. Ding, L. Cheng, R. Guo, K. Shu, and H. Liu (2022b). Nothing stands alone: relational fake news detection with hypergraph neural networks. In IEEE Big Data.
- U. Jeong, B. Jiang, Z. Tan, H. R. Bernard, and H. Liu (2024a). Descriptor: a temporal multi-network dataset of social interactions in Bluesky Social (BlueTempNet). IEEE Data Descriptions.
- U. Jeong, L. H. X. Ng, K. M. Carley, and H. Liu (2025b). Navigating decentralized online social networks: an overview of technical and societal challenges in architectural choices. arXiv preprint arXiv:2504.00071.
- U. Jeong, A. Nirmal, K. Jha, S. X. Tang, H. R. Bernard, and H. Liu (2024b). User migration across multiple social media platforms. In SDM.
- U. Jeong, P. Rajak, V. M. Tammali, V. R. Surabhi, O. Boz, et al. (2025c). Reinforcement learning assisted dynamic large-scale graph learning. In Workshop on Differentiable Learning of Combinatorial Algorithms.
- U. Jeong, P. Sheth, A. Tahir, F. Alatawi, H. R. Bernard, and H. Liu (2024c). Exploring platform migration patterns between Twitter and Mastodon: a user behavior study. In ICWSM.
- U. Jeong (2026). User behavior across platforms: modeling groups and migration. Ph.D. Thesis, Arizona State University.
- X. Jiang, X. Li, Q. Zhou, and Q. Wang (2024). GRACE: generating cause and effect of disaster sub-events from social media text. In WWW.
- L. J. King (2018). Social media use during natural disasters: an analysis of social media usage during Hurricanes Harvey and Irma.
- Y. Kryvasheyeu, H. Chen, N. Obradovich, E. Moro, P. Van Hentenryck, J. Fowler, and M. Cebrian (2016). Rapid assessment of disaster damage using social media activity. Science Advances 2(3), e1500779.
- J. Lee, Y. Chen, Y. Wang, and X. Ren (2021). NewEvent: weakly supervised new event type induction and tagging. In NAACL.
- J. Liu, Z. Zhang, K. Wei, Z. Guo, X. Sun, L. Jin, and X. Li (2023). Event causality extraction via implicit cause-effect interactions. In EMNLP.
- J. Lou, Y. Lu, D. Dai, W. Jia, H. Lin, X. Han, L. Sun, and H. Wu (2023). Universal information extraction as unified semantic matching. In AAAI.
- Y. Lu, H. Lin, J. Xu, X. Han, J. Tang, A. Li, L. Sun, M. Liao, and S. Chen (2021). Text2Event: controllable sequence-to-structure generation for end-to-end event extraction. In ACL.
- P. Ma, C. Zhao, B. Jiang, S. Vishnubhatla, U. Jeong, A. Beigi, A. Raglin, and H. Liu (2025). CAMO: causality-guided adversarial multimodal domain generalization for crisis classification. arXiv preprint arXiv:2512.08071.
- M. Pittore, P. Campalani, K. Renner, M. Plörer, and F. Tagliavini (2023). Border-independent multi-functional, multi-hazard exposure modelling in alpine regions. Natural Hazards.
- N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar, O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown, and Y. Shoham (2023). Parallel context windows for large language models. In ACL.
- R. Saklad, A. Chadha, O. Pavlov, and R. Moraffah (2025). Can large language models infer causal relationships from real-world text? arXiv preprint arXiv:2505.18931.
- P. Sheth, U. Jeong, R. Guo, H. Liu, and K. S. Candan (2021). CauseBox: a causal inference toolbox for benchmarking treatment effect estimators with machine learning methods. In CIKM.
- Y. Susanti and M. Färber (2025). Paths to causality: finding informative subgraphs within knowledge graphs for knowledge-based causal discovery. In KDD.
- S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen (2010). Microblogging during two natural hazards events: what Twitter may contribute to situational awareness. In SIGCHI.
- S. Vishnubhatla, A. Beigi, R. H. Foo, U. Goel, U. Jeong, B. Jiang, A. Raglin, and H. Liu (2025a). An interventional approach to real-time disaster assessment via causal attribution. In CIKM.
- S. Vishnubhatla, U. Jeong, B. Jiang, P. Sheth, Z. Tan, A. Raglin, and H. Liu (2025b). Assessing on-the-ground disaster impact using online data sources. arXiv preprint arXiv:2509.11634.
- M. Wiegmann, J. Kersten, H. Senaratne, M. Potthast, F. Klan, and B. Stein (2020). Opportunities and risks of disaster data from social media: a systematic review of incident information. Natural Hazards and Earth System Sciences Discussions.
- M. Zebisch, S. Schneiderbauer, K. Fritzsche, P. Bubeck, S. Kienberger, W. Kahlenborn, S. Schwan, and T. Below (2021). The vulnerability sourcebook and climate impact chains: a standardised framework for a climate vulnerability and risk assessment. International Journal of Climate Change Strategies and Management.
- M. Zebisch, S. Terzi, M. Pittore, K. Renner, and S. Schneiderbauer (2022). Climate impact chains: a conceptual modelling approach for climate risk assessment in the context of adaptation planning.
- J. Zhao, B. Salehi, et al. (2017). EventFact: visual analytics of temporal relationships between online news events. In CHI.
- Z. Zhou, M. Zhang, et al. (2022). A weakly supervised framework for news event detection with distant supervision signals. In ACL Findings.
- L. Zou, D. Liao, N. S. Lam, M. A. Meyer, N. G. Gharaibeh, H. Cai, B. Zhou, and D. Li (2023). Social media for emergency rescue: an analysis of rescue requests on Twitter during Hurricane Harvey. International Journal of Disaster Risk Reduction.
## Appendix A: Grok 4.3 without Social-Media Access
To separate the effect of native social-media access from the effect of model capability, we conduct an additional diagnostic experiment: Grok 4.3 receives exactly the same public post batches used for GPT-5.5, bypassing its native social-media access. As shown in Table [3](https://arxiv.org/html/2605.11348#A1.T3), Grok 4.3 achieves F1 scores comparable to GPT-5.5 in this controlled setting. This result indicates that the public posts in the benchmark contain sufficient causal information to support expert-aligned graph construction, although native access remains advantageous.
Table 3: Evaluation of Grok 4.3 across two disasters when the public-post benchmark is provided as input.
## Appendix B: Causal Relation Extraction Metrics
We evaluate causal graph construction at the node and edge levels. Let $G_{\mathrm{ref}}=(V_{\mathrm{ref}},E_{\mathrm{ref}})$ denote the expert-grounded reference graph and $G_{\mathrm{pred}}=(V_{\mathrm{pred}},E_{\mathrm{pred}})$ denote the LLM-generated graph, where nodes represent causal variables and directed edges represent causal relations. For graph elements $X\in\{V,E\}$, precision and recall are defined as follows:

$$P_{X}=\frac{|X_{\mathrm{ref}}\cap X_{\mathrm{pred}}|}{|X_{\mathrm{pred}}|},\qquad R_{X}=\frac{|X_{\mathrm{ref}}\cap X_{\mathrm{pred}}|}{|X_{\mathrm{ref}}|}.$$

Node-level scores are obtained with $X=V$, and edge-level scores with $X=E$. Precision measures the validity of generated elements, while recall measures coverage of the reference causal graph.
We compute an overall micro-averaged F1 score across both nodes and edges as

$$F_{1}=\frac{2\,(|V_{\mathrm{ref}}\cap V_{\mathrm{pred}}|+|E_{\mathrm{ref}}\cap E_{\mathrm{pred}}|)}{|V_{\mathrm{ref}}|+|E_{\mathrm{ref}}|+|V_{\mathrm{pred}}|+|E_{\mathrm{pred}}|}.$$

We also report the structural Hamming distance (SHD), defined as the number of edge edits required to transform $G_{\mathrm{pred}}$ into $G_{\mathrm{ref}}$. Let

$$\mathcal{R}=\{(u,v)\in E_{\mathrm{ref}}:(v,u)\in E_{\mathrm{pred}}\}$$

denote the set of reversed causal relations. Then,

$$\mathrm{SHD}=|E_{\mathrm{pred}}\setminus E_{\mathrm{ref}}|+|E_{\mathrm{ref}}\setminus E_{\mathrm{pred}}|-|\mathcal{R}|.$$

Subtracting $|\mathcal{R}|$ ensures that each reversed edge is counted as one edit rather than separately as a false positive and a false negative. We normalize SHD by the number of possible directed edges among reference variables:

$$\mathrm{nSHD}=\frac{\mathrm{SHD}}{|V_{\mathrm{ref}}|\,(|V_{\mathrm{ref}}|-1)}.$$
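The metrics above can be sketched in a few lines of Python. The function below is a minimal illustration, assuming graphs are given as collections of normalized node labels and directed (cause, effect) edge pairs; the variable names in the example are hypothetical and not drawn from the paper's benchmark.

```python
def graph_metrics(ref_nodes, ref_edges, pred_nodes, pred_edges):
    """Node/edge precision-recall, micro F1, SHD, and nSHD for causal graphs."""
    ref_nodes, pred_nodes = set(ref_nodes), set(pred_nodes)
    ref_edges, pred_edges = set(ref_edges), set(pred_edges)

    def pr(ref, pred):
        # Precision and recall over set intersection.
        inter = len(ref & pred)
        p = inter / len(pred) if pred else 0.0
        r = inter / len(ref) if ref else 0.0
        return p, r

    node_p, node_r = pr(ref_nodes, pred_nodes)
    edge_p, edge_r = pr(ref_edges, pred_edges)

    # Micro-averaged F1 over nodes and edges combined.
    overlap = len(ref_nodes & pred_nodes) + len(ref_edges & pred_edges)
    total = len(ref_nodes) + len(ref_edges) + len(pred_nodes) + len(pred_edges)
    f1 = 2 * overlap / total if total else 0.0

    # Reversed edges count as a single edit, not a false positive plus
    # a false negative, hence the -|R| correction.
    reversed_edges = {(u, v) for (u, v) in ref_edges if (v, u) in pred_edges}
    shd = (len(pred_edges - ref_edges)
           + len(ref_edges - pred_edges)
           - len(reversed_edges))

    # Normalize by the number of possible directed edges among reference nodes.
    n = len(ref_nodes)
    nshd = shd / (n * (n - 1)) if n > 1 else 0.0
    return {"node_p": node_p, "node_r": node_r, "edge_p": edge_p,
            "edge_r": edge_r, "f1": f1, "shd": shd, "nshd": nshd}
```

For instance, a prediction that matches the reference nodes but reverses one of two reference edges yields SHD = 1 rather than 2, reflecting the reversed-edge correction.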