LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification


Source: [https://arxiv.org/html/2605.08448](https://arxiv.org/html/2605.08448)

*ISCRAM 2026 WiPe Paper — Social Media & Crisis Communication: narratives, signals, and sentiments*

**Authors:** Jacob Ativo (California State University, East Bay), Bharaneeshwar Balasubramaniyam (Kansas State University), Anh Tran (Independent Researcher), Khushboo Gupta (University of Illinois at Chicago), Hongmin Li (California State University, East Bay), Doina Caragea (Kansas State University), Cornelia Caragea (University of Illinois at Chicago). The first three authors contributed equally to this work.

###### Abstract

Semi-supervised learning approaches have been investigated as a means to enhance the analysis of social media data in disaster management contexts. In this work, we present the first empirical evaluation of large language model (LLM) guided semi-supervised learning for crisis-related tweet classification. We compare two recent LLM-assisted semi-supervised methods, VerifyMatch and LLM-guided Co-Training (LG-CoTrain), against established semi-supervised baselines. Our results show that LG-CoTrain significantly outperforms classical semi-supervised approaches in low-resource settings with 5, 10, and 25 labeled examples per class, achieving the highest averaged Macro-F1 across events. VerifyMatch achieves competitive performance while also demonstrating strong calibration properties. As the number of labeled examples increases, the performance gap narrows and Self-Training emerges as a strong baseline. We further observe that compact semi-supervised models can, in some cases, outperform very large LLMs operating in zero-shot settings. This finding highlights the potential of transferring knowledge from LLMs into smaller and more deployable models through LLM-guided semi-supervised learning, offering a practical pathway for real-world disaster response applications. Our project repository is available on GitHub [here](https://github.com/deeplearning-lab-csueb/LLM-guided-SSL-Crisis-Tweets-Classification/tree/main).

###### keywords:

Semi-supervised learning, large language model, social media crisis data, model calibration, disaster response

# 1 Introduction

During emergency events, individuals increasingly turn to social media platforms such as X (formerly Twitter), Reddit, and Instagram to seek information and share updates. From a communication perspective, these platforms function bidirectionally: authorities disseminate critical disaster response information (e.g., warnings or evacuation orders) to the public, while the public provides firsthand reports and situational updates that can be mined to enhance situational awareness [reuter2018_survey, DBLP:phd/dnb/Reuter22a, jeroen2021]. Consequently, both researchers and practitioners recognize the substantial value of such user-generated content for crisis response. However, effectively integrating social media streams into real-time operations remains challenging due to information overload, characterized by high volume, velocity, and varying levels of veracity [purohit2025engagemobilizeunderstandingevolving].

To address these challenges, extensive research over the past decade has focused on applying Machine Learning (ML) and Natural Language Processing (NLP) techniques to automatically classify social media data into actionable categories, such as infrastructure damage or requests for rescue. A wide range of ML models have been proposed for these classification tasks, including statistical learning approaches and supervised deep learning models [starbirdPHV2010, ImranECDM13, CarageaSSNT14, NguyenAJSIM16, BurelAlani2018, DBLP:conf/iscram/KerstenKWK19, DBLP:journals/ipm/GhafarianY20]. However, supervised models typically require substantial amounts of high-quality human-labeled data to achieve strong performance, and such data is often scarce in the time-sensitive context of disaster response.

To mitigate this limitation, researchers have explored domain adaptation, transfer learning, and semi-supervised learning approaches. Domain adaptation methods leverage labeled data from previous disaster events to alleviate label scarcity in newly emerging events [LiJCCM2017, imran2028domainadaptation]. In contrast, semi-supervised learning methods aim to train effective models by combining a small amount of labeled data with a large volume of unlabeled data through pseudo-labeling strategies. In a typical teacher–student semi-supervised learning framework, a teacher model trained on the limited labeled data first generates pseudo-labels for the unlabeled instances, and these pseudo-labeled examples are subsequently used to train a student model [Li-iscram-21, zou2023crisismatch, gupta_2025_calibrated]. In general, the performance of semi-supervised approaches depends heavily on the quality of the pseudo-labels. Therefore, a central research question in semi-supervised learning is how to effectively leverage unlabeled data to generate high-quality pseudo-labels that improve downstream model performance.
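
To make the pseudo-labeling loop concrete, the sketch below shows one teacher-to-student round. The 0.9 confidence threshold, the `select_pseudo_labels` helper, and the toy probabilities are illustrative assumptions, not values from any specific method discussed in this paper.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Keep only unlabeled examples whose top predicted probability
    exceeds the confidence threshold (hard pseudo-labels)."""
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    mask = confidence >= threshold
    return np.where(mask)[0], labels[mask]

def self_training_round(teacher_probs_on_unlabeled, threshold=0.9):
    """One teacher->student round: the returned indices and labels would
    be appended to the labeled set used to train the student model."""
    return select_pseudo_labels(teacher_probs_on_unlabeled, threshold)

# Toy example: teacher probabilities for 3 unlabeled tweets, 2 classes.
probs = np.array([[0.95, 0.05],   # confident -> kept
                  [0.55, 0.45],   # ambiguous -> dropped
                  [0.08, 0.92]])  # confident -> kept
idx, labels = self_training_round(probs)
```

In an iterative variant, the student then becomes the teacher for the next round, progressively expanding the pseudo-labeled pool.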

With the rapid advancement of Large Language Models (LLMs), recent studies have explored leveraging LLMs to improve the pseudo-labeling of semi-supervised models built on smaller pre-trained language models such as BERT [DBLP:journals/corr/abs-1810-04805-bert], particularly for text classification tasks [park_2024_verifymatch, rahman_caragea_2025_llm]. In the context of social media crisis data analysis, there has been a surge of work employing LLMs in zero-shot (i.e., making predictions with the LLM without task-specific labeled examples), few-shot (i.e., providing the LLM a small number of labeled examples in the prompt), and fine-tuning (i.e., updating the parameters of a smaller LLM on task-specific labeled data) settings to identify informative social media content for disaster management [imran2024-openai, Soudabeh-Caragea-2024, mcdaniel2024-zeroshot-crisisbench, yin2025-crisisSense, shrestha_crisis_tweets_2025-thesis, DBLP:conf/cogsima/SalfingerS24, lei2025harnessing, guo_2025_asonam]. However, to the best of our knowledge, no prior work has investigated semi-supervised models guided by LLMs in the crisis domain.

To this end, we study two BERT-based semi-supervised models enhanced with LLM-generated pseudo-labels for social media crisis classification: (1) VerifyMatch [park_2024_verifymatch], originally proposed for Natural Language Inference (NLI), and (2) LLM-guided Co-Training (LG-CoTrain) [rahman_caragea_2025_llm], designed for general text classification. Following the experimental protocol of [gupta_2025_calibrated], which evaluates several semi-supervised methods on 10 disaster events from the HumAID dataset [alam_2021_humaid], we experiment with VerifyMatch and LG-CoTrain, enhanced with GPT-4o pseudo-labels, on the same benchmark.

Specifically, we evaluate the performance of the VerifyMatch and LG-CoTrain approaches using the Macro-F1 score as well as the Expected Calibration Error (ECE), and compare the results with those of the existing baselines from [gupta_2025_calibrated] to form a more comprehensive study of semi-supervised learning algorithms for social media crisis data classification. To summarize, our main contributions are as follows:

- We evaluate two semi-supervised approaches, VerifyMatch and LG-CoTrain, using zero-shot pseudo-labels generated by GPT-4o on 10 disaster events from the HumAID dataset, a benchmark of disaster-related tweets annotated with humanitarian categories such as damage, injured people, and requests or urgent needs. We further compare these models against all semi-supervised methods examined by [gupta_2025_calibrated].
- Our experimental results show that LG-CoTrain significantly outperforms other approaches in low-resource settings (e.g., 5 or 10 labeled examples per category). Moreover, it demonstrates good model calibration. However, as the amount of labeled data increases (e.g., 50 labeled examples per category), the performance gap between LG-CoTrain and other semi-supervised models narrows, and Self-Training emerges as a competitive baseline that is difficult to surpass.
- Semi-supervised models based on smaller pre-trained language models outperform zero-shot GPT-4o only on a subset of disaster events. This may be due to limited and potentially unrepresentative unlabeled data in the HumAID benchmark, including missing class examples in the sampled unlabeled sets, an issue that can also arise in real-world scenarios. Larger and more representative unlabeled datasets could help mitigate these limitations.

Still, all these findings highlight the potential of transferring knowledge from LLMs into smaller and more deployable models through LLM-guided semi-supervised learning, offering a practical pathway for real-world disaster response applications.

## 2 Related Work

There is a vast body of literature on semi\-supervised learning \(SSL\) in machine learning\. In this section, we first provide an overview of SSL approaches, and then review prior work applying SSL and large language models \(LLMs\) to social media disaster data analysis\.

SSL Overview. A wide range of SSL approaches have been proposed for text classification, beginning with the original idea of Self-Training and pseudo-labeling [scudder1965probability]. In self-training, a teacher model is first trained on limited labeled data and then used to generate pseudo-labels for unlabeled instances. These pseudo-labeled examples are subsequently incorporated into training a student model, often in an iterative manner. Two critical design choices in this framework are (1) how to select pseudo-labeled examples for inclusion in training and (2) whether to use hard labels (the most likely class per example) or soft labels (predicted class probabilities). Incorporating low-confidence pseudo-labels may lead to error propagation and degrade student model performance.

Various pseudo-label selection strategies have been proposed. For example, FixMatch [sohn2020fixmatch] and related self-training methods adopt fixed confidence thresholds to filter pseudo-labels, while Uncertainty-Aware Self-Training (UST) [mukherjee2020uncertainty] employs more sophisticated uncertainty estimation techniques grounded in probability theory. However, threshold-based filtering may restrict the student model's access to potentially informative unlabeled data. To alleviate this limitation, methods such as MixMatch [berthelot2019mixmatch] and SoftMatch [chen2023softmatchaddressingquantityqualitytradeoff] were introduced. SoftMatch retains all pseudo-labeled samples but assigns lower weights to low-confidence instances during training, thereby balancing data quantity and label quality. MixMatch similarly leverages soft pseudo-labels and further incorporates MixUp [zhang2017mixup], which interpolates pseudo-labeled and human-annotated examples to generate smoother and potentially higher-quality training signals.
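
The MixUp interpolation mentioned above can be sketched in a few lines. The `alpha=0.4` value and the convention of keeping lambda at least 0.5 (so the mix stays closer to the first example) are common choices in MixUp implementations, used here purely for illustration.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4, rng=None):
    """MixUp: convex combination of two inputs and their (soft) label
    vectors, with the mixing weight drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # keep the mix closer to (x1, y1)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam

# Mix a labeled embedding with a pseudo-labeled one (toy 3-dim vectors,
# 2-class one-hot labels); the result carries a soft label.
x, y, lam = mixup(np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0]),
                  np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0]),
                  rng=np.random.default_rng(0))
```

The interpolated soft label sums to one by construction, which is why MixUp pairs naturally with soft pseudo-labeling.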

AUM-based Self-Training (AUM-ST) [sosea2022leveraging] takes a different perspective by filtering low-quality pseudo-labeled examples through tracking training dynamics using the Area Under the Margin (AUM). Building upon this idea, AUM-based MixUp in Self-Training (AUM-ST-MixUp) [gupta_2025_calibrated] integrates MixUp and additional confidence-tracking mechanisms on top of AUM-ST to further enhance pseudo-label reliability. Confidence-based MixUp in Self-Training (Conf-ST-MixUp) [gupta_2025_calibrated] enhances pseudo-labeling by defining prediction confidence as the probability gap between the top two classes, where a larger gap indicates higher confidence, allowing the model to distinguish easy-to-learn (reliable) from hard-to-learn (ambiguous) samples. It applies MixUp across labeled, high-confidence, and low-confidence pseudo-labeled data to regularize training, reduce error propagation, and promote smoother decision boundaries.

Despite these improvements, methods relying on a single model for pseudo-label generation remain vulnerable to reinforcing incorrect high-confidence predictions, particularly in early training stages [rahman_caragea_2025_llm]. To mitigate this issue, VerifyMatch [park_2024_verifymatch] incorporates LLM-generated pseudo-labels alongside a verifier model, enabling cross-validation of pseudo-label quality. Combined with MixUp, VerifyMatch achieves competitive performance in low-resource settings. Finally, LLM-guided Co-Training (LG-CoTrain) [rahman_caragea_2025_llm] integrates LLM-generated pseudo-labels within a dual-model co-training framework, where two models iteratively learn from each other while incorporating LLM guidance. Unlike MixUp-based approaches, LG-CoTrain retains all pseudo-labeled data without modification. LG-CoTrain outperforms zero-shot Phi-3 and other SSL approaches and achieves state-of-the-art performance on four out of five text classification benchmark datasets.

SSL for Social Media Crisis Data Analysis. Several studies have applied SSL techniques to social media crisis data analysis. For example, [alam2018graph] proposed a graph-based semi-supervised CNN model for Twitter data from two disaster events. [Li-iscram-21] applied self-training with BERT and CNN models to the CrisisLexT6 and CrisisLexT26 datasets [OlteanuCDV14, Olteanu2015], which contain tweets from various disaster events. CrisisLexT6 is annotated for whether tweets are related to the disaster, while CrisisLexT26 includes coarse- and fine-grained humanitarian labels similar to those in the HumAID dataset. [sirbu2022multimodal] extended FixMatch by incorporating soft labeling for multimodal disaster tweet classification (text and images) on the CrisisMMD dataset [crisismmd_2018_icwsm]. [zou2023crisismatch] proposed CrisisMatch, which differs from MixMatch by using hard pseudo-labeling with entropy maximization instead of sharpening, for text classification on the HumAID dataset. Meanwhile, [zou2023decrisismb] proposed DeCrisisMB, a memory-bank approach that addresses the bias in SSL whereby disproportionately many pseudo-labels are assigned to the more frequent classes in highly imbalanced datasets such as crisis-related tweet collections.

More recently, [gupta_2025_calibrated] proposed Confidence-based and AUM-based MixUp with Self-Training (Conf-ST-MixUp and AUM-ST-MixUp) and conducted a systematic evaluation across 10 disaster events in the HumAID dataset, comparing these methods with several SSL baselines discussed above. In addition, [gupta_2025_calibrated] employed Expected Calibration Error (ECE) to measure model calibration. ECE quantifies the extent to which predicted probabilities align with observed outcomes. A well-calibrated model produces confidence estimates that align with empirical correctness rates; for example, predictions assigned 85% confidence should be correct approximately 85% of the time. Calibration is particularly important for interpreting model outputs and supporting reliable decision-making in high-stakes contexts such as disaster response.
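
The ECE computation described above can be sketched as follows, using equal-width confidence bins; the 10-bin setting and the `expected_calibration_error` helper are common conventions used here for illustration, not necessarily the exact configuration of [gupta_2025_calibrated].

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then sum the absolute gap
    between mean confidence and empirical accuracy in each bin,
    weighted by the fraction of predictions falling in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# A model that claims 85% confidence but is right only half the time
# in that bin contributes a 0.35 gap, weighted by the bin's share of data.
score = expected_calibration_error([0.85, 0.85, 0.85, 0.85], [1, 1, 0, 0])
```

Lower is better: a perfectly calibrated model has ECE 0, since every bin's mean confidence equals its accuracy.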

LLMs for Social Media Crisis Data Analysis. With the rapid development of LLMs, increasing attention has been devoted to applying them to social media data analysis [lei2025harnessing, sanchez_2025_llmcifsc]. For instance, [guo_2025_asonam] compared fine-tuned Llama 3.2 11B models with zero-shot GPT-4o and prior approaches, demonstrating that fine-tuned Llama models achieve state-of-the-art performance on multimodal crisis classification using the CrisisMMD dataset. Most closely related to our work is [imran2024-openai], which systematically evaluated the robustness of several LLMs (including GPT-3.5, GPT-4, GPT-4o, Llama-2 13B, Llama-3 8B, and Mistral 7B) on the HumAID dataset under zero-shot and few-shot settings. Their results indicate that few-shot prompting does not consistently improve performance, and GPT-4 achieved the strongest overall results among the evaluated models.

In this work, we compare LLM-guided SSL approaches against the SSL methods evaluated in [gupta_2025_calibrated] with respect to model prediction accuracy, as well as calibration error (ECE metric).

## 3 Dataset

We use the same 10 disaster events from the HumAID dataset as our benchmark, following [gupta_2025_calibrated], as shown in Table [1](https://arxiv.org/html/2605.08448#S3.T1). HumAID is a human-annotated Twitter dataset comprising 77,196 tweets from 19 disaster events, categorized into 11 classes. Excluding the "Don't know or can't judge" category, the remaining 10 primary humanitarian categories are: 1. Caution and advice; 2. Sympathy and support; 3. Requests or urgent needs; 4. Displaced people and evacuations; 5. Injured or dead people; 6. Missing or found people; 7. Infrastructure and utility damage; 8. Rescue, volunteering, or donation effort; 9. Other relevant; 10. Not humanitarian.

Tables [1](https://arxiv.org/html/2605.08448#S3.T1) and [2](https://arxiv.org/html/2605.08448#S3.T2) also report the statistics of the training, validation, and test splits, the number of classes for each event, as well as the class distribution. For detailed class distributions within each event, we refer readers to [gupta_2025_calibrated]. Consistent with their experimental setup, we adopt the same data split configuration, in which the training set is further divided into labeled and unlabeled subsets, as described in the Experimental Setup section.

| Disaster Event | C | Train | Val | Test | L (5) | U (5) | L (10) | U (10) | L (25) | U (25) | L (50) | U (50) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| California Wildfires 2018 | 10 | 5163 | 752 | 1461 | 50 | 5113 | 100 | 5063 | 250 | 4913 | 500 | 4663 |
| Canada Wildfires 2016 | 8 | 1569 | 228 | 445 | 40 | 1529 | 80 | 1489 | 189 | 1380 | 364 | 1205 |
| Cyclone Idai 2019 | 10 | 2753 | 401 | 779 | 50 | 2703 | 100 | 2653 | 238 | 2515 | 453 | 2300 |
| Hurricane Dorian 2019 | 9 | 5329 | 776 | 1508 | 45 | 5284 | 90 | 5239 | 225 | 5104 | 442 | 4887 |
| Hurricane Florence 2018 | 9 | 4384 | 639 | 1241 | 45 | 4339 | 90 | 4294 | 225 | 4159 | 438 | 3946 |
| Hurricane Harvey 2017 | 9 | 6378 | 929 | 1805 | 45 | 6333 | 90 | 6288 | 225 | 6153 | 450 | 5928 |
| Hurricane Irma 2017 | 9 | 6579 | 954 | 1862 | 45 | 6534 | 90 | 6489 | 225 | 6354 | 450 | 6129 |
| Hurricane Maria 2017 | 9 | 5094 | 742 | 1442 | 45 | 5049 | 90 | 5004 | 225 | 4869 | 450 | 4644 |
| Kaikoura Earthquake 2016 | 9 | 1536 | 224 | 435 | 45 | 1491 | 90 | 1446 | 217 | 1319 | 417 | 1119 |
| Kerala Floods 2018 | 9 | 5588 | 814 | 1582 | 45 | 5543 | 90 | 5498 | 225 | 5363 | 439 | 5149 |

Table 1: Data distribution and splits under different labels-per-class (lb/cl) settings for each of the 10 events in the HumAID dataset, where C is the number of classes, L is the number of labeled instances, and U is the number of unlabeled instances.

Table 2: Disaster events with the corresponding number of classes and the per-class distribution of tweets in the training split (the detailed per-event class distributions are given in [gupta_2025_calibrated]).
## 4 Methods

We study semi-supervised learning (SSL) for crisis-related tweet classification under limited labeled data. All methods operate on a shared experimental setting consisting of a small labeled subset $\mathcal{D}_L = \{(x_i, y_i)\}_{i=1}^{n_L}$ and a larger unlabeled subset $\mathcal{D}_U = \{x_j\}_{j=1}^{n_U}$ drawn from the same disaster event. Our goal is to learn a classifier $f_\theta(x)$ that generalizes well to held-out test data while maintaining calibrated confidence estimates. Unless otherwise specified, all neural models use BERTweet as the encoder backbone, due to its overall good supervised performance [gupta_2025_calibrated], followed by a task-specific classification head.

### 4.1 Supervised Baselines

We include two supervised baselines to contextualize SSL performance\.

- Limited-Label Supervision (Lower Bound). A BERTweet classifier trained solely on $\mathcal{D}_L$ serves as a lower-bound reference, representing the performance achievable without unlabeled data.
- Full Supervision (Upper Bound). A BERTweet model trained on the complete labeled training split (including both the $\mathcal{D}_L$ and $\mathcal{D}_U$ subsets) provides an approximate upper bound for in-domain performance (Table [3](https://arxiv.org/html/2605.08448#S6.T3), BERTweet - All).

### 4.2 Zero-Shot LLM Baseline

Zero-shot setting: To contextualize LLM-guided SSL performance, we evaluate GPT-4o in a zero-shot classification setting, where the LLM directly predicts class labels for test instances without any labeled training examples and without task-specific fine-tuning. This baseline measures the standalone capability of LLMs relative to compact supervised and semi-supervised models. Specifically, we experimented with GPT-4.1, GPT-4o, GPT-4o mini, and GPT-5.1, and among them, GPT-4o consistently achieved the best performance on both the training and test splits. Therefore, we use GPT-4o to generate pseudo-labels for the entire unlabeled training set. We attempted to reproduce the best overall result obtained with zero-shot GPT-4 by [imran2024-openai]; however, due to API deprecation, exact replication was not possible. Nevertheless, our zero-shot GPT-4o results are, on average, comparable to (slightly better than) the zero-shot GPT-4o results reported by [imran2024-openai], as shown in Table [3](https://arxiv.org/html/2605.08448#S6.T3). While GPT-4o mini produced competitive results, it generated a number of out-of-source (OOS) predictions.

Prompt engineering: We evaluate three prompts with GPT-4o on the validation splits of all 10 events, ranging from the simple category definitions from the HumAID dataset to more detailed descriptions. As performance remains nearly identical across prompts (Macro-F1 range: 0.601–0.613), with the simplest version performing best, we adopt the simplest version. The detailed prompts are provided in the project repository.
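
As a rough illustration of the simple prompt style described above, a zero-shot classification prompt could be assembled as below. The template wording, the `build_prompt` helper, and the snake_case label identifiers are our own illustrative assumptions (the category names follow Section 3), not the exact prompt used in our experiments.

```python
# HumAID humanitarian categories, following the list in Section 3
# (snake_case identifiers are an illustrative naming choice).
CATEGORIES = [
    "caution_and_advice", "sympathy_and_support", "requests_or_urgent_needs",
    "displaced_people_and_evacuations", "injured_or_dead_people",
    "missing_or_found_people", "infrastructure_and_utility_damage",
    "rescue_volunteering_or_donation_effort", "other_relevant_information",
    "not_humanitarian",
]

def build_prompt(tweet: str) -> str:
    """Assemble a simple zero-shot classification prompt: list the
    allowed categories and ask the LLM to answer with exactly one."""
    labels = "\n".join(f"- {c}" for c in CATEGORIES)
    return (
        "Classify the following disaster-related tweet into exactly one "
        f"of these humanitarian categories:\n{labels}\n\n"
        f"Tweet: {tweet}\nAnswer with the category name only."
    )

prompt = build_prompt("Bridge on Route 9 collapsed, several cars trapped.")
```

Constraining the answer to the category list also makes it easy to detect out-of-source predictions: any response not in `CATEGORIES` can be flagged.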

### 4.3 Classical Semi-Supervised Learning

We also evaluate representative SSL approaches previously evaluated on the HumAID dataset [gupta_2025_calibrated].

- Self-Training (ST). A teacher model trained on $\mathcal{D}_L$ generates pseudo-labels $\hat{y}_j$ for unlabeled examples in $\mathcal{D}_U$. High-confidence pseudo-labeled instances are iteratively incorporated into training.
- Uncertainty-Aware Self-Training (UST). UST refines pseudo-label selection by incorporating uncertainty estimation, reducing the influence of noisy high-confidence predictions.
- MixMatch. MixMatch integrates soft pseudo-labeling with MixUp interpolation between labeled and unlabeled examples, encouraging smoother decision boundaries.
- AUM-based Self-Training (AUM-ST) and AUM-ST-MixUp. AUM-ST tracks training dynamics via the Area Under the Margin (AUM) to filter unreliable pseudo-labels, and AUM-ST-MixUp combines this filtering with MixUp-based regularization.
- Confidence-based MixUp in Self-Training (Conf-ST-MixUp). Conf-ST-MixUp enhances pseudo-labeling by defining prediction confidence as the probability gap between the top two classes, where a larger gap indicates a more reliable pseudo-label. It then applies MixUp across labeled, high-confidence, and low-confidence pseudo-labeled data to regularize training and reduce error propagation.
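
The top-2 probability gap that Conf-ST-MixUp uses as its confidence measure can be computed directly from the predicted class probabilities (the `top2_confidence_gap` helper name is illustrative):

```python
import numpy as np

def top2_confidence_gap(probs):
    """Confidence as the gap between the two highest class probabilities:
    a large gap marks an easy-to-learn (reliable) pseudo-label, a small
    gap an ambiguous (hard-to-learn) one."""
    ordered = np.sort(probs, axis=1)          # ascending per row
    return ordered[:, -1] - ordered[:, -2]    # top-1 minus top-2

probs = np.array([[0.70, 0.20, 0.10],   # gap 0.50 -> reliable
                  [0.40, 0.35, 0.25]])  # gap 0.05 -> ambiguous
gaps = top2_confidence_gap(probs)
```

A threshold on this gap then partitions the pseudo-labeled pool into the high- and low-confidence subsets that MixUp is applied across.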

### 4.4 LLM-Guided Semi-Supervised Learning

We investigate two SSL frameworks that incorporate zero-shot pseudo-labels generated by an LLM, in our case GPT-4o. These pseudo-labels are used to augment or guide the training of smaller, task-specific models.

- VerifyMatch. VerifyMatch integrates LLM-generated pseudo-labels with a verifier model that cross-validates predictions before incorporating them into training. Combined with confidence-aware MixUp, this mechanism aims to reduce confirmation bias and mitigate overconfident errors, two common issues in pseudo-labeling. Confirmation bias refers to the tendency of a model to reinforce its own incorrect pseudo-labels during self-training, for example, repeatedly assigning the same wrong label to similar inputs. Overconfident errors are incorrect predictions made with high confidence, for example, a wrong label assigned with near-certain probability.
- LLM-Guided Co-Training (LG-CoTrain). LG-CoTrain employs a dual-model co-training architecture in which two classifiers iteratively exchange pseudo-labels while incorporating LLM guidance. Unlike MixUp-based methods, LG-CoTrain retains all pseudo-labeled examples and relies on cross-model agreement to stabilize learning. This framework is particularly effective in extremely low-resource settings, where model-generated pseudo-labels alone may be unreliable.
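
To illustrate the co-training dynamic, the sketch below shows one simplified label-exchange step. The agreement-or-LLM-fallback rule and the `cotrain_exchange` helper are our own simplification for illustration, not the exact LG-CoTrain update.

```python
import numpy as np

def cotrain_exchange(preds_a, preds_b, llm_labels):
    """One illustrative co-training exchange step: where the two
    classifiers agree, trust their shared label; where they disagree,
    fall back to the LLM pseudo-label. All unlabeled data is retained,
    mirroring LG-CoTrain's no-filtering design."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    llm_labels = np.asarray(llm_labels)
    agree = preds_a == preds_b
    return np.where(agree, preds_a, llm_labels)

# Three unlabeled tweets: the models agree on the 1st and 3rd,
# so the LLM only arbitrates the 2nd.
labels = cotrain_exchange([0, 1, 2], [0, 2, 2], [1, 1, 0])
```

The key property is that no example is discarded; disagreement changes only which source supplies the training label.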

Overall, the evaluated methods differ along three dimensions: (1) the source of pseudo-label generation (model-based vs. LLM-based), (2) the pseudo-label filtering strategy (confidence thresholding, uncertainty-aware weighting, verification, or co-training), and (3) representation regularization (none, MixUp, or cross-model consistency). This structured comparison enables analysis of both predictive performance and calibration behavior under label scarcity.

## 5 Experimental Setup

We run experiments with all the approaches described in the Methods section. For each event, we use the same train/validation/test splits as [gupta_2025_calibrated]. We simulate low-resource settings by selecting a fixed number of labeled examples per class (lb/cl) from the training split. We evaluate four label budgets: 5, 10, 25, and 50 lb/cl. The remaining training instances are treated as unlabeled and are used by the SSL methods according to their respective learning objectives. This split configuration is kept consistent across all methods to ensure a fair comparison.
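
The lb/cl split simulation can be sketched as follows; the `split_labels_per_class` helper and the fixed seed are illustrative assumptions, not the exact sampling code used in our experiments.

```python
import random
from collections import defaultdict

def split_labels_per_class(examples, lb_per_class, seed=42):
    """Simulate a low-resource setting: sample `lb_per_class` labeled
    examples per class from the training split; the remaining examples
    are treated as unlabeled (their gold labels are hidden)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, (_, label) in enumerate(examples):
        by_class[label].append(i)
    labeled = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        labeled.extend(idxs[:lb_per_class])  # classes with fewer
                                             # examples contribute all
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(examples)) if i not in labeled_set]
    return sorted(labeled_set), unlabeled

# Toy training split: (tweet, label) pairs, 1 lb/cl budget.
train = [("t1", "damage"), ("t2", "damage"), ("t3", "rescue"),
         ("t4", "rescue"), ("t5", "rescue")]
L, U = split_labels_per_class(train, lb_per_class=1)
```

Classes with fewer than `lb_per_class` examples simply contribute everything they have, which is why some labeled counts in Table 1 fall below C times the budget.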

We use the same metrics as in [gupta_2025_calibrated], namely Macro-F1 and the Expected Calibration Error (ECE), averaged across the 10 disaster events, to evaluate all methods. Macro-F1 captures balanced performance across classes, while ECE quantifies how well predicted probabilities align with empirical correctness.
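
For reference, Macro-F1 is the unweighted mean of per-class F1 scores, so rare humanitarian categories count as much as frequent ones; a minimal self-contained implementation:

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: average the per-class F1 scores without weighting
    by class frequency."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with two classes: per-class F1 of 2/3 and 4/5.
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

This macro averaging is what makes the metric sensitive to minority classes such as *Missing or found people*, which micro-averaged accuracy would largely ignore.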

Hyperparameter tuning was performed using Weights & Biases and the Optuna package over the learning rate, batch size, number of epochs, and the main model-specific parameters. Detailed configurations will be released in the project repository. For Weights & Biases, we employed Bayesian sweeps to automate the search; however, we observed occasional optimization instability. Prior work [liu-wang-2021-empirical] shows that, in low-resource transformer fine-tuning, automated hyperparameter optimization may fail to outperform simple grid search under limited search budgets due to overfitting and instability. Moreover, repeated tuning on a small validation set can lead to meta-overfitting, where configurations that perform well on the development set do not generalize to test performance or calibration. This may also explain small discrepancies between our results and those reported in [gupta_2025_calibrated].

## 6 Results and Discussion

We show our experimental results in Tables [3](https://arxiv.org/html/2605.08448#S6.T3) and [4](https://arxiv.org/html/2605.08448#S6.T4), as well as Figure [1](https://arxiv.org/html/2605.08448#S6.F1). Table [3](https://arxiv.org/html/2605.08448#S6.T3) reports the zero-shot performance of GPT-4o, along with the performance of the fully supervised BERTweet upper bound. Table [4](https://arxiv.org/html/2605.08448#S6.T4) summarizes the Macro-F1 and ECE scores averaged across the 10 disaster events under varying label budgets.

| Model | Zero-shot GPT-4o [imran2024-openai] | Zero-shot GPT-4o (train) | Zero-shot GPT-4o (test) | BERTweet - All |
|---|---|---|---|---|
| F1 ↑ | 0.612 | 0.628 | 0.641 | 0.678 |
| ECE ↓ | - | - | - | 0.110 |

Table 3: Performance results for zero-shot GPT-4o and the supervised BERTweet model trained on the whole training split.

| Method | F1 ↑ (5) | F1 ↑ (10) | F1 ↑ (25) | F1 ↑ (50) | ECE ↓ (5) | ECE ↓ (10) | ECE ↓ (25) | ECE ↓ (50) |
|---|---|---|---|---|---|---|---|---|
| BERTweet | 0.423 | 0.517 | 0.563 | 0.606 | 0.206 | 0.222 | 0.247 | 0.256 |
| ST | 0.448 | 0.548 | 0.625 | **0.655** | 0.305 | 0.231 | 0.184 | 0.165 |
| UST | 0.465 | 0.546 | 0.609 | 0.641 | 0.342 | 0.271 | 0.225 | 0.191 |
| MixMatch | 0.459 | 0.553 | 0.624 | 0.647 | 0.374 | 0.297 | 0.246 | 0.228 |
| AUM-ST | 0.424 | 0.505 | 0.572 | 0.595 | 0.264 | 0.206 | 0.204 | 0.194 |
| Conf-ST-MixUp | 0.421 | 0.533 | 0.623 | 0.643 | 0.408 | 0.321 | 0.244 | 0.246 |
| AUM-ST-MixUp | 0.476 | 0.532 | 0.611 | 0.639 | 0.190 | **0.069** | **0.057** | **0.064** |
| VerifyMatch | 0.463 | 0.549 | 0.616 | 0.644 | **0.127** | 0.086 | 0.083 | 0.100 |
| LG-CoTrain | **0.608** | **0.619** | **0.631** | 0.645 | 0.174 | 0.160 | 0.122 | 0.108 |

Table 4: BERTweet and SSL performance on the 10 HumAID disaster events, reported as Macro-F1 and ECE averaged over the 10 events. Columns are labeled by the number of labeled examples per class (5, 10, 25, 50). The best result for each setup is highlighted in bold.

### 6.1 Model Comparisons

#### Zero\-Shot GPT\-4o vs\. BERTweet\-all \(upper bound\)\.

Table [3](https://arxiv.org/html/2605.08448#S6.T3) shows that GPT-4o achieves an averaged Macro-F1 of 0.628 on the training split and 0.641 on the test split across the 10 disaster events, slightly outperforming the GPT-4o results reported by [imran2024-openai]. These results provide an overall indication of the quality of the pseudo-labels generated by GPT-4o. Analyzing performance by category over the combined training splits of all 10 events shows that GPT-4o performs well on clear and concrete categories, achieving Macro-F1 scores above 0.6. For instance, it reaches 0.885 on *Injured or dead people* and 0.827 on *Rescue, volunteering, or donation effort*. However, GPT-4o struggles on three categories (F1 < 0.6), including *Other relevant information* (0.276), *Requests or urgent needs* (0.526), and *Not humanitarian* (0.569), which are broader or less precise. For the remaining categories, performance is 0.634 for *Caution and advice*, 0.739 for *Sympathy and support*, 0.766 for *Displaced people and evacuations*, 0.698 for *Missing or found people*, and 0.704 for *Infrastructure and utility damage*. A detailed breakdown is provided in the project repository. Overall, categories with more accurate pseudo-labels are expected to benefit more from LLM-guided SSL approaches such as LG-CoTrain.

Despite its relatively good zero-shot performance, GPT-4o remains below the fully supervised BERTweet model trained on the complete training set, which achieves a Macro-F1 of 0.678 and serves as an approximate upper bound. This comparison suggests that while zero-shot LLMs offer competitive performance without task-specific training, further gains can be achieved through supervised or semi-supervised adaptation.
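As a rough illustration of how zero-shot pseudo-labeling over a closed label set can be set up, the sketch below builds a single-label classification prompt over the ten HumAID categories and maps a free-form model reply back onto the label set. The prompt wording and the fallback label are assumptions; the paper's exact prompt is not reproduced here.

```python
# Hypothetical zero-shot prompt construction for GPT-4o pseudo-labeling.
CATEGORIES = [
    "injured_or_dead_people",
    "rescue_volunteering_or_donation_effort",
    "caution_and_advice",
    "sympathy_and_support",
    "displaced_people_and_evacuations",
    "missing_or_found_people",
    "infrastructure_and_utility_damage",
    "requests_or_urgent_needs",
    "other_relevant_information",
    "not_humanitarian",
]

def build_prompt(tweet: str) -> str:
    labels = ", ".join(CATEGORIES)
    return (
        "Classify the following disaster-related tweet into exactly one of "
        f"these humanitarian categories: {labels}.\n"
        f"Tweet: {tweet}\n"
        "Answer with the category name only."
    )

def parse_label(reply: str) -> str:
    # Map a free-form model reply back onto the closed label set; fall back
    # to the broad catch-all category when the reply does not match.
    reply = reply.strip().lower().replace(" ", "_")
    return reply if reply in CATEGORIES else "other_relevant_information"
```

Constraining the reply to the closed label set (and defining a fallback) is what makes the LLM output usable as pseudo-labels for the downstream SSL pipeline.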

#### VerifyMatch vs. classical SSL approaches.

Table [4](https://arxiv.org/html/2605.08448#S6.T4) shows that VerifyMatch yields competitive performance relative to the classical SSL baselines with consistently lower ECE, especially at the 25–50 lb/cl budgets, indicating that combining LLM pseudo-labels with an explicit verification mechanism can improve robustness to noisy pseudo-labels. Compared to threshold-based or training-dynamics filtering strategies, VerifyMatch provides a more conservative but stable learning signal.

#### LG-CoTrain vs. other SSL approaches.

Table [4](https://arxiv.org/html/2605.08448#S6.T4) also shows that under the most challenging conditions, LG-CoTrain performs best among all SSL approaches: 0.608 Macro-F1 at 5 lb/cl and 0.619 at 10 lb/cl. This is a substantial improvement over classical SSL baselines such as ST, UST, and MixMatch, and it also surpasses the best-performing baselines from [gupta_2025_calibrated] (Conf-ST-MixUp, AUM-ST-MixUp) and VerifyMatch in terms of Macro-F1 at both 5 and 10 lb/cl. These results suggest that incorporating LLM pseudo-labels within a co-training framework is particularly effective when the labeled set is too small for reliable self-training.

At higher label budgets, the performance gap between LG-CoTrain and other SSL baselines narrows, but with an average Macro-F1 of 0.631 at 25 lb/cl, LG-CoTrain still outperforms the other SSL approaches. Notably, Self-Training becomes a strong baseline and achieves the best Macro-F1 value of 0.655 at 50 lb/cl. In this regime, the advantage of LG-CoTrain diminishes, suggesting that once enough labeled data are available, the marginal benefit of LLM-guided pseudo-labeling is smaller. This may be due to the fact that the overall F1 obtained with GPT-4o is around 0.630-0.640. Furthermore, this can also be attributed to the limited amount of unlabeled data in the HumAID benchmark, and the fact that it may not be fully representative of minority classes, an issue that can also be encountered in real-world scenarios. Larger and more representative unlabeled datasets could help mitigate these limitations.
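The co-training idea behind LG-CoTrain can be sketched as follows on synthetic data: two classifiers start from the labeled seed plus (stand-in) LLM pseudo-labels, and each round hands its most confident unlabeled predictions to its peer. The features, scheduling, and confidence rule here are simplifying assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy 2-class data: labeled seed, LLM-pseudo-labeled pool, unlabeled pool.
X_lab = rng.normal([[0, 0]] * 5 + [[3, 3]] * 5)
y_lab = np.array([0] * 5 + [1] * 5)
X_pseudo = rng.normal([[0, 0]] * 20 + [[3, 3]] * 20)
y_pseudo = np.array([0] * 20 + [1] * 20)   # stand-in for GPT-4o pseudo-labels
X_unlab = rng.normal([[0, 0]] * 30 + [[3, 3]] * 30)

# Each model starts from the labeled data plus the LLM pseudo-labels.
pools = [
    (np.vstack([X_lab, X_pseudo]), np.concatenate([y_lab, y_pseudo])),
    (np.vstack([X_lab, X_pseudo]), np.concatenate([y_lab, y_pseudo])),
]
models = [LogisticRegression(), LogisticRegression()]

for _ in range(3):                          # a few co-training rounds
    for i in (0, 1):
        models[i].fit(*pools[i])
    for i in (0, 1):
        probs = models[i].predict_proba(X_unlab)
        conf = probs.max(axis=1)
        top = np.argsort(conf)[-10:]        # most confident unlabeled points
        j = 1 - i                           # hand them to the peer model
        pools[j] = (
            np.vstack([pools[j][0], X_unlab[top]]),
            np.concatenate([pools[j][1], probs[top].argmax(axis=1)]),
        )
```

The key property this sketch preserves is that the LLM pseudo-labels seed both training pools, so the co-trained models never have to bootstrap from the tiny labeled set alone.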

#### GPT-4o vs. LLM-guided SSL approaches.

In low-resource settings, LG-CoTrain provides consistent gains over classical SSL methods on many events, which explains its strong averaged performance. While zero-shot GPT-4o remains highly competitive across events, LG-CoTrain surpasses its performance in several cases. In particular, for the Hurricane Harvey, Irma, and Maria events under the 10, 25, and 50 labeled examples per class settings, LG-CoTrain achieves higher Macro-F1 scores than GPT-4o. Even under the most constrained setting of 5 labeled examples per class, LG-CoTrain still outperforms GPT-4o on Hurricane Harvey and Hurricane Maria. This observation reinforces an important takeaway: when deployment constraints allow direct LLM inference, zero-shot LLMs can serve as strong baselines; however, when practitioners require a smaller deployable model, LLM-guided SSL can distill and transfer some of the LLM's strengths into a compact classifier that can be executed efficiently and repeatedly at scale.

The amortized training paradigm of LLM-guided SSL approaches such as LG-CoTrain is particularly advantageous in crisis-response settings, where streaming data volumes are high and rapid, low-latency, and accurate decisions are required. Thus, the value of LLM-guided SSL lies not merely in marginal performance gains, but in enabling scalable, controllable, and cost-efficient deployment.

#### SSL approaches vs. BERTweet (lower bound).

Also worth noting, all SSL approaches outperform BERTweet trained only on the limited amount of labeled data, suggesting that SSL approaches represent a good option for classifying social media crisis data in a low-data regime.

#### Calibration Behavior.

Beyond classification accuracy, calibration is critical in crisis-response applications, where confidence scores may influence downstream triage and operational decisions. Table [4](https://arxiv.org/html/2605.08448#S6.T4) shows that AUM-ST-MixUp achieves the lowest ECE at 10, 25, and 50 lb/cl, reflecting strong calibration under severe label scarcity. VerifyMatch also yields consistently low ECE values across all budgets, indicating that its verification mechanism can help control overconfident errors. LG-CoTrain prioritizes Macro-F1 improvements in the lowest-resource settings, while its calibration improves as more labeled data are available (ECE decreases from 0.174 at 5 lb/cl to 0.108 at 50 lb/cl).
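For reference, ECE as reported here is typically computed by binning predictions by confidence and averaging the gap between confidence and accuracy per bin; a minimal sketch follows (equal-width bins; the bin count of 10 is an assumption).

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE with equal-width confidence bins over (0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's |confidence - accuracy| gap by its size.
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A perfectly calibrated model yields ECE near 0; an overconfident model whose confidence exceeds its accuracy yields a positive gap in the high-confidence bins.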

Overall, the results highlight a practical trade\-off: the most accurate method under extreme label scarcity is not always the best calibrated, and practitioners may need to balance performance and confidence reliability depending on operational needs\.

### 6.2 Per-event Analysis

Figure [1](https://arxiv.org/html/2605.08448#S6.F1) presents per-event Macro-F1 scores for all methods across label budgets, providing additional insight beyond the averaged results. Performance varies substantially across disasters, reflecting differences in event characteristics and class distributions (shown in Table [2](https://arxiv.org/html/2605.08448#S3.T2)).

![Refer to caption](https://arxiv.org/html/2605.08448v1/images/Macro-F1_20260417.png)

Figure 1: Per-event Macro-F1 scores for all methods across the 10 HumAID disaster events under different label budget settings. Values are rounded for readability.

Similar to [imran2024-openai], who find that LLMs struggle with flood events, we observe that both zero-shot GPT-4o and the SSL models perform relatively worse on the Kerala Floods 2018 event compared to others. We hypothesize that this variation may be attributable not only to the disaster type, but largely to two factors: class imbalance and pseudo-label quality. For events where a single class dominates the training set, such as Kerala Floods (53.8% of training tweets belong to `rescue_volunteering_or_donation_effort`) and Cyclone Idai (largest class 47.5%), the models consistently underperform because rare classes have too few examples for reliable classification, and Macro-F1 penalizes poor recall on any single class equally. Conversely, for more balanced events such as Kaikoura Earthquake (largest class 22.5%) and Hurricane Florence (largest class 23.6%), the models achieve better scores overall compared to the other events. GPT-4o pseudo-label quality also plays a role: events where GPT-4o achieves higher zero-shot F1 tend to produce better LG-CoTrain results, since higher-quality pseudo-labels provide a stronger training signal for co-training. The number of active classes matters as well: events with only 8 classes (Canada Wildfires) or with extremely rare classes, such as Cyclone Idai, where `missing_or_found_people` has only 13 training tweets, yield near-zero F1 on those classes, which disproportionately reduces the macro average.
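The Macro-F1 sensitivity described above is easy to see numerically: because the macro average weights every class equally, a single near-zero class F1 drags the mean down regardless of how rare that class is. The per-class values below are illustrative, not the paper's numbers.

```python
# Hypothetical per-class F1 scores for a 10-class event where one rare
# class gets F1 = 0 (values are illustrative, not the paper's results).
per_class_f1 = [0.85, 0.80, 0.75, 0.70, 0.72, 0.78, 0.81, 0.76, 0.74, 0.00]

macro_f1 = sum(per_class_f1) / len(per_class_f1)    # 0.691 with the failure
without_failed_class = sum(per_class_f1[:-1]) / 9   # ~0.768 without it
```

A class with only 13 training tweets can thus cost the event roughly 8 points of Macro-F1 on its own, even when every other class is classified well.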

### 6.3 Ablation Study

To quantify the contribution of the LLM in the LLM-guided SSL approaches, we compare LG-CoTrain with GPT-4o pseudo-labels against the same co-training approach without LLM pseudo-labels, a variant called Self-guided CoTrain (SG-CoTrain). SG-CoTrain replaces GPT-4o pseudo-labels with pseudo-labels from the BERTweet teacher model trained on the small labeled set. We run SG-CoTrain with 5 labeled examples per class across 10 events, with 3 runs for each event. Table [5](https://arxiv.org/html/2605.08448#S6.T5) reports per-event Macro-F1 on the test set for these two experiments.

| Event | LG-CoTrain | SG-CoTrain | Delta |
|---|---|---|---|
| California Wildfires 2018 | 0.608 | 0.408 | 0.200 |
| Canada Wildfires 2016 | 0.568 | 0.355 | 0.213 |
| Cyclone Idai 2019 | 0.552 | 0.314 | 0.237 |
| Hurricane Dorian 2019 | 0.554 | 0.409 | 0.145 |
| Hurricane Florence 2018 | 0.655 | 0.319 | 0.336 |
| Hurricane Harvey 2017 | 0.636 | 0.400 | 0.236 |
| Hurricane Irma 2017 | 0.626 | 0.422 | 0.204 |
| Hurricane Maria 2017 | 0.655 | 0.456 | 0.199 |
| Kaikoura Earthquake 2016 | 0.669 | 0.501 | 0.168 |
| Kerala Floods 2018 | 0.560 | 0.378 | 0.182 |
| Average | 0.608 | 0.396 | **0.212** |

Table 5: Ablation study for co-training with 5 labels per class, with LLM pseudo-labels (LG-CoTrain) and without LLM pseudo-labels (SG-CoTrain). The Delta improvement obtained from the LLM pseudo-labels is also shown.

LG-CoTrain consistently outperforms SG-CoTrain across all events, with gains ranging from 0.145 to 0.336 in F1 (average improvement of 0.212). Since both models share the same co-training pipeline and hyperparameter search space, this performance gap can be attributed primarily to differences in pseudo-label quality. This result is expected: a BERTweet teacher trained on only 50 labeled examples (5 lb/cl) produces pseudo-labels that are too noisy for effective co-training and can even degrade performance through error propagation. In contrast, GPT-4o's zero-shot predictions are sufficiently accurate to provide a strong initial signal, enabling more effective learning.

To further examine SG-CoTrain, we also run a simple experiment using fixed hyperparameters for all events, filtering the pseudo-labeled data from the BERTweet teacher by retaining only the 50 most confident predictions per class. These preliminary results show that applying confidence filtering improves performance, despite using fewer pseudo-labeled samples. We also observe that the gap between LG-CoTrain and the non-LLM variants tends to narrow as the amount of labeled data increases, which suggests that with sufficient labeled data, confidence-filtered self-guidance can partially substitute for LLM guidance. We leave the verification of this hypothesis and a more comprehensive analysis to future work.
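The confidence filtering described above can be sketched as keeping only the k most confident teacher predictions per predicted class (k = 50 in the experiment); the selection mechanics below are an assumed implementation, not the paper's code.

```python
import numpy as np

def filter_top_k_per_class(probs, k=50):
    """probs: (n_samples, n_classes) teacher probabilities.
    Returns indices of kept samples and their pseudo-labels."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    keep = []
    for c in range(probs.shape[1]):
        idx = np.where(preds == c)[0]
        idx = idx[np.argsort(conf[idx])[::-1][:k]]  # top-k by confidence
        keep.extend(idx.tolist())
    keep = np.array(sorted(keep))
    return keep, preds[keep]
```

Selecting per predicted class (rather than globally) keeps the filtered pool balanced, which matters when the teacher is much more confident on majority classes.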

### 6.4 Deployment Considerations

In our experiments, GPT-4o zero-shot classification of the evaluation set (6,463 tweets across 10 events) cost $17.83 (via the OpenAI Batch API). The unlabeled training pool totals 44,373 tweets (approximately 4,400 per event on average), so we estimate the pseudo-labeling cost at roughly $12 per event at this scale. Including hyperparameter tuning and model training on a dataset of this size, the full pipeline for a single new disaster event requires under one hour on an 8-GPU NVIDIA H100 cluster, which can cost from about $24 depending on the chosen cloud platform. The total deployment cost is therefore approximately $36 per event. More affordable GPUs may reduce this cost but would require more training time. These estimates are based on our dataset with an average tweet length typical of Twitter/X posts; actual costs will vary with the number of tweets, their average length, and the LLM API cost per token.
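The per-event figure can be checked with back-of-the-envelope arithmetic from the numbers above (the $24 compute figure is the cluster cost estimate quoted in the text):

```python
# Back-of-the-envelope check of the deployment cost estimate above.
eval_cost_usd = 17.83              # GPT-4o labeling of the evaluation set
eval_tweets = 6463
cost_per_tweet = eval_cost_usd / eval_tweets        # ~$0.0028 per tweet
tweets_per_event = 44373 / 10                       # ~4,437 unlabeled tweets
labeling_cost = cost_per_tweet * tweets_per_event   # ~$12 per event
compute_cost = 24                                   # quoted GPU cluster cost
total_cost = labeling_cost + compute_cost           # ~$36 per event
```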

Taken together, our experimental results suggest three practical implications. First, LLM-guided SSL (especially LG-CoTrain) is most beneficial when labeled data are extremely scarce (5–10 lb/cl), although the cost of using LLMs should be carefully considered in real-world settings with large volumes of noisy social media data. Second, as the amount of labeled data increases, simpler SSL methods such as Self-Training become highly competitive and can even outperform more complex approaches, making them strong baselines in moderate-resource settings. Third, calibration varies substantially across methods, and approaches such as AUM-ST-MixUp and VerifyMatch may be preferred when reliable confidence estimates are important.

In practice, this suggests the following. When only a very small labeled set is available (e.g., 5 lb/cl) and rapid deployment is needed, a cost-effective strategy is to generate a limited amount of pseudo-labeled data with an LLM and then train an LG-CoTrain model to filter actionable social media content. When a moderate amount of labeled data is available (e.g., 50 lb/cl), it may be preferable to avoid LLM costs and instead rely on task-specific SSL methods such as Self-Training. Prior work such as [guo_2025_asonam] also shows that task-specific models are often preferred when feasible, as they can outperform LLMs in zero-shot settings. Finally, when interpretability and well-calibrated confidence estimates are critical, methods such as AUM-ST-MixUp provide a more suitable choice.
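As a point of comparison for the moderate-budget recommendation, a minimal Self-Training baseline can be assembled with scikit-learn's `SelfTrainingClassifier` on synthetic data; the paper's ST implementation may differ, and the confidence threshold and base model here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(1)
# Toy 2-class data: 10 labeled examples per class, the rest unlabeled.
X = rng.normal([[0, 0]] * 50 + [[3, 3]] * 50)
y_true = np.array([0] * 50 + [1] * 50)
y = y_true.copy()
y[10:50] = -1          # -1 marks unlabeled examples for scikit-learn
y[60:100] = -1

# Iteratively pseudo-label unlabeled points whose confidence exceeds 0.9.
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
model.fit(X, y)
acc = (model.predict(X) == y_true).mean()
```

This is the "simple SSL" end of the spectrum: no LLM calls, a single model, and pseudo-labels gated by a fixed confidence threshold.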

## 7 Conclusion and Future Work

This paper provides the first empirical evaluation of LLM-guided semi-supervised learning for social media crisis classification on the HumAID benchmark. We compare two recent LLM-assisted SSL methods, VerifyMatch and LG-CoTrain, against widely used SSL baselines under multiple label budgets, and evaluate the results in terms of predictive performance (Macro-F1) as well as reliability (ECE).

Our results show thatLG\-CoTrainachieves the strongest Macro\-F1 in extremely low\-resource settings \(5–10 labeled examples per class\), demonstrating that LLM pseudo\-labels can meaningfully improve SSL when labeled data are severely limited\. VerifyMatch provides competitive classification performance and strong calibration, while AUM\-ST\-MixUp remains a strong choice when calibration is the primary concern\. As the label budget increases, the performance advantage of LLM\-guided SSL decreases and Self\-Training becomes a difficult\-to\-beat baseline, emphasizing that method choice should reflect the available annotation budget and deployment requirements\. We discuss corresponding recommendations for practical deployment\.

Future work will explore three directions\. First, we will investigate how unlabeled data scale and representativeness affect LLM\-guided SSL, especially under missing\-class or distribution\-shift scenarios that can arise in real operations\. Second, we will study calibration\-aware training objectives and pseudo\-label filtering strategies tailored to crisis informatics, aiming to improve both accuracy and reliability\. Finally, we will extend the evaluation to multimodal crisis datasets and cross\-event generalization settings, where LLMs may provide even greater benefits as a source of transferable knowledge\.

## 8 Acknowledgment

This work is supported by a collaborative CAHSI\-Google Institutional Research Program award\.

## References
