AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse
Summary
This paper introduces AraHopeCorpus, the first annotated dataset of Arabic hope speech, built from ten thousand YouTube comments related to the war on Gaza (2023 to 2024). It provides a detailed three-category annotation framework (hope speech, no hope speech, neutral or unclear) and reports that hopeful language accounts for more than 64 percent of the comments.
Cached at: 05/25/26, 09:00 AM
# AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse

Source: [https://arxiv.org/abs/2605.23325](https://arxiv.org/abs/2605.23325) [View PDF](https://arxiv.org/pdf/2605.23325)

> Abstract: Social media has become a crucial arena for shaping public narratives during armed conflicts, providing space for both harmful and constructive communication. While hate speech and misinformation have been widely studied, expressions that promote resilience, solidarity, and optimism remain underexplored, particularly in Arabic contexts. This paper introduces AraHopeCorpus, the first annotated dataset of Arabic hope speech collected from ten thousand YouTube comments related to the war on Gaza between 2023 and 2024. Using a detailed annotation framework, comments were classified into three categories: hope speech, no hope speech, and neutral or unclear discourse. The dataset shows that hopeful language dominates, accounting for more than sixty four percent of all comments. These expressions of hope appear mainly as religious encouragement, collective solidarity, and optimism for endurance and justice. No hope speech, representing about thirteen percent, reflects despair and disillusionment, while the rest of the comments contain neutral or mixed content. Inter-Annotator Agreement reached substantial levels (Cohen's Kappa equals 0.71), though dialectal variation, sarcasm, and implicit meaning posed annotation challenges. A comparative analysis between human annotators and ChatGPT revealed that large language models can support annotation but remain limited in handling dialectal and culturally embedded expressions. AraHopeCorpus will be released for research purposes under an open and non commercial license. It provides a valuable resource for studying constructive digital discourse, enabling further research on hope speech detection, crisis communication, and resilience in Arabic social media.
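
The abstract reports inter-annotator agreement as a Cohen's kappa of 0.71. As a reminder of what that statistic measures, here is a minimal Python sketch of the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. The label names and the ten toy annotations are hypothetical and are not taken from AraHopeCorpus; the paper's own computation may differ in detail.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations using three illustrative label names.
annotator_1 = ["hope", "hope", "no_hope", "neutral", "hope",
               "neutral", "hope", "no_hope", "hope", "neutral"]
annotator_2 = ["hope", "hope", "no_hope", "neutral", "hope",
               "hope", "hope", "neutral", "hope", "neutral"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa on the toy data: {kappa:.2f}")
```

Because kappa discounts chance agreement, a label distribution skewed toward one category (as with the majority hope class described above) raises p_e and makes a given kappa harder to reach than the same raw agreement rate would suggest.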
## Submission history

From: Wajdi Zaghouani [[view email](https://arxiv.org/show-email/05a31949/2605.23325)]

**[v1]** Fri, 22 May 2026 07:39:21 UTC (426 KB)
Similar Articles
Cohesion-6K: An Arabic Dataset for Analyzing Social Cohesion and Conflict in Online Discourse
Introduces Cohesion-6K, a manually and ChatGPT-assisted annotated dataset of 6,000 Arabic Facebook posts about the Israeli Occupation of Palestine, spanning conflict to cohesion categories. Analysis shows conflict-oriented posts receive 2-4x more engagement than resolution-oriented ones.
Audience Engagement with Arabic Women's Social Empowerment and Wellbeing: A Decadal Corpus
This paper presents the Arabic Women and Society Corpus, a ten-year collection of over 250,000 Arabic Facebook posts related to women's empowerment and social wellbeing, with engagement metrics for analyzing gender discourse and sentiment.
ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination
ArabDiscrim is a decade-long lexical resource and corpus of 293K Arabic Facebook posts about racism and discrimination, with engagement signals, morphological regex families, and discrimination axes, supporting fairness-oriented Arabic NLP research.
BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon
This paper introduces BOUTEF, a large-scale multilingual corpus for studying fake news in Algeria and Tunisia, covering Arabic dialects, Arabizi, French, English, and code-switching. It includes empirical analysis of linguistic strategies and engagement dynamics.
Linear Semantic Segmentation for Low-Resource Spoken Dialects
This paper introduces a benchmark for semantic segmentation in low-resource dialectal Arabic and proposes a model that improves performance on conversational speech compared to standard baselines.