Depression Risk Assessment in Social Media via Large Language Models

arXiv cs.CL 04/23/26, 04:00 AM Papers
Summary
Researchers present a zero-shot LLM system that assesses depression risk from Reddit posts, achieving competitive F1 scores and demonstrating scalable mental-health monitoring.
arXiv:2604.19887v1 Announce Type: new Abstract: Depression is one of the most prevalent and debilitating mental health conditions worldwide, frequently underdiagnosed and undertreated. The proliferation of social media platforms provides a rich source of naturalistic linguistic signals for the automated monitoring of psychological well-being. In this work, we propose a system based on Large Language Models (LLMs) for depression risk assessment in Reddit posts, through multi-label classification of eight depression-associated emotions and the computation of a weighted severity index. The method is evaluated in a zero-shot setting on the annotated DepressionEmo dataset (~6,000 posts) and applied in-the-wild to 469,692 comments collected from four subreddits over the period 2024-2025. Our best model, gemma3:27b, achieves micro-F1 = 0.75 and macro-F1 = 0.70, results competitive with purpose-built fine-tuned models (BART: micro-F1 = 0.80, macro-F1 = 0.76). The in-the-wild analysis reveals consistent and temporally stable risk profiles across communities, with marked differences between r/depression and r/anxiety. Our findings demonstrate the feasibility of a cost-effective, scalable approach for large-scale psychological monitoring.
Cached at: 04/23/26, 10:02 AM
# Depression Risk Assessment in Social Media via Large Language Models
Source: [https://arxiv.org/html/2604.19887](https://arxiv.org/html/2604.19887)
Giorgia Gulino¹, Manuel Petrucci¹

¹ Guglielmo Marconi University, Department of Human Sciences, Rome, Italy

Corresponding: g\.gulino1@studenti\.unimarconi\.it

###### Abstract

Depression is one of the most prevalent and debilitating mental health conditions worldwide, frequently underdiagnosed and undertreated\. The proliferation of social media platforms provides a rich source of naturalistic linguistic signals for the automated monitoring of psychological well\-being\. In this work, we propose a system based on Large Language Models \(LLMs\) for depression risk assessment in Reddit posts, through multi\-label classification of eight depression\-associated emotions and the computation of a weighted severity index\. The method is evaluated in a zero\-shot setting on the annotated *DepressionEmo* dataset \(≈6,000 posts\) and applied in\-the\-wild to 469,692 comments collected from four subreddits over the period 2024–2025\. Our best model, gemma3:27b, achieves micro\-F1 = 0\.75 and macro\-F1 = 0\.70, results competitive with purpose\-built fine\-tuned models \(BART: micro\-F1 = 0\.80, macro\-F1 = 0\.76\)\. The in\-the\-wild analysis reveals consistent and temporally stable risk profiles across communities, with marked differences between r/depression and r/anxiety\. Our findings demonstrate the feasibility of a cost\-effective, scalable approach for large\-scale psychological monitoring\.

Keywords: depression, digital mental health, LLM, Reddit, automatic detection, prompt engineering, severity index\.

## 1 Introduction

Depression is one of the most prevalent and debilitating mental disorders globally, recognized by the WHO as a leading cause of disability\[[11](https://arxiv.org/html/2604.19887#bib.bib5),[25](https://arxiv.org/html/2604.19887#bib.bib25)\]\. Despite the availability of effective treatments, a substantial proportion of cases remains undiagnosed or insufficiently treated, particularly among young adults\[[11](https://arxiv.org/html/2604.19887#bib.bib5)\]\. Early detection of depressive distress signals is therefore of critical importance for timely intervention\.

The pervasive digitalization of daily life has made social media a privileged observatory for studying psychological behaviors and emotional expressions\. Individuals routinely share thoughts and emotional states online, constituting an indirect yet relevant source of information on psychological well\-being\[[15](https://arxiv.org/html/2604.19887#bib.bib23),[5](https://arxiv.org/html/2604.19887#bib.bib8),[17](https://arxiv.org/html/2604.19887#bib.bib13)\]\. Computational psychology research has demonstrated that textual content published on platforms such as Reddit and Twitter contains linguistic markers of emotional distress, enabling the early identification of depressive risk\[[6](https://arxiv.org/html/2604.19887#bib.bib6),[8](https://arxiv.org/html/2604.19887#bib.bib11),[1](https://arxiv.org/html/2604.19887#bib.bib12)\]\.

In parallel, advances in Natural Language Processing \(NLP\) have made it possible to systematically analyze large volumes of online text\. The introduction of Transformer\-based architectures\[[7](https://arxiv.org/html/2604.19887#bib.bib26)\] and, more recently, Large Language Models \(LLMs\) has opened new possibilities: these models learn deep linguistic representations from massive text corpora, developing a sensitivity to emotional nuance that is difficult to capture with classical approaches\[[9](https://arxiv.org/html/2604.19887#bib.bib34),[12](https://arxiv.org/html/2604.19887#bib.bib17)\]\.

Compared to fine\-tuned models, LLMs offer a significant operational advantage: they can be deployed in zero\-shot mode without requiring domain\-specific labeled data, drastically reducing development costs while improving adaptability to rapidly evolving linguistic contexts\[[20](https://arxiv.org/html/2604.19887#bib.bib39)\]\.

This paper makes the following contributions:

1. A weighted depressive severity index based on eight clinically relevant emotion categories, inspired by the standardized PHQ\-9 and BDI\-II scales\.
2. A prompt engineering methodology for zero\-shot multi\-label classification of depressive emotions via LLMs\.
3. A systematic evaluation of nine locally\-run LLMs \(0\.6B–27B parameters\) against fine\-tuned baselines from the literature on the *DepressionEmo* dataset\[[21](https://arxiv.org/html/2604.19887#bib.bib36)\]\.
4. An in\-the\-wild analysis of 469,692 Reddit posts \(2024–2025\), examining emotion distributions, risk profiles, and longitudinal trends\.

The proposed system is not intended as a replacement for clinical assessment, but as a scalable support and triage tool for large\-scale psychological monitoring\[[4](https://arxiv.org/html/2604.19887#bib.bib14)\] that runs models locally on limited, affordable hardware\.

The paper is structured as follows: Section [2](https://arxiv.org/html/2604.19887#S2) reviews related work on the clinical measurement of depression severity, machine learning applied to depression detection, and the emotions relevant for depression identification\. Section [3](https://arxiv.org/html/2604.19887#S3) presents our approach, including the severity score and the prompt used to guide LLMs in the detection\. Section [4](https://arxiv.org/html/2604.19887#S4) describes two sets of experiments, controlled and in\-the\-wild, whose results are discussed in Section [5](https://arxiv.org/html/2604.19887#S5)\. Finally, Section [6](https://arxiv.org/html/2604.19887#S6) concludes the paper\.

## 2 Related Work

This section presents three key areas of prior research that contextualize our approach\. First, we review the clinical measurement of depression severity, focusing on standard psychometric scales and the relative clinical weight of specific symptoms \(Section[2\.1](https://arxiv.org/html/2604.19887#S2.SS1)\)\. Second, we trace the evolution of NLP and machine learning techniques for depression detection, from traditional algorithms to large language models \(Section[2\.2](https://arxiv.org/html/2604.19887#S2.SS2)\)\. Finally, we examine the literature concerning the specific emotional states associated with depression and their linguistic taxonomies \(Section[2\.3](https://arxiv.org/html/2604.19887#S2.SS3)\)\.

### 2\.1 Clinical Measurement of Depression Severity

Depression severity is traditionally assessed through standardized psychometric scales\. The DSM\-5 defines nine core symptoms of major depressive disorder, including depressed mood, anhedonia, sleep and appetite disturbances, fatigue, difficulty concentrating, feelings of worthlessness or guilt, and recurrent thoughts of death\[[3](https://arxiv.org/html/2604.19887#bib.bib20),[16](https://arxiv.org/html/2604.19887#bib.bib1)\]\. Instruments such as the Patient Health Questionnaire\-9 \(PHQ\-9\) and the Beck Depression Inventory\-II \(BDI\-II\) quantify these symptoms through additive scoring, with established thresholds distinguishing mild, moderate, and severe depression\[[19](https://arxiv.org/html/2604.19887#bib.bib2)\]\.

Although each symptom nominally contributes equally to the total score, clinical practice assigns de facto greater weight to specific indicators\. In particular, persistent hopelessness is recognized as a strong risk factor for suicidal ideation and attempts\[[22](https://arxiv.org/html/2604.19887#bib.bib4)\], while suicide intent warrants immediate clinical intervention\. These clinical considerations informed the weighting scheme adopted in the present work\.

### 2\.2 NLP and Machine Learning for Depression Detection

Early computational approaches to depression detection in social media relied on traditional machine learning methods \(SVM, logistic regression\) operating on hand\-crafted linguistic features: frequency of emotionally negative words, first\-person pronouns, and absolutist expressions\[[6](https://arxiv.org/html/2604.19887#bib.bib6),[1](https://arxiv.org/html/2604.19887#bib.bib12),[10](https://arxiv.org/html/2604.19887#bib.bib9)\]\. While effective in controlled settings, these methods exhibited limited generalization to variable and evolving language patterns\.

The advent of deep learning and Transformer architectures\[[7](https://arxiv.org/html/2604.19887#bib.bib26)\] brought substantial improvements\. Domain\-specialized variants such as BERTweet for social media language\[[18](https://arxiv.org/html/2604.19887#bib.bib38)\] and ClinicalBERT for clinical text\[[2](https://arxiv.org/html/2604.19887#bib.bib37)\] achieve 80–90% accuracy in depression identification when trained on in\-domain data\[[13](https://arxiv.org/html/2604.19887#bib.bib21),[9](https://arxiv.org/html/2604.19887#bib.bib34)\]\. More recently, LLMs such as GPT\-3\.5 and GPT\-4 have been explored in zero\-shot and few\-shot settings, demonstrating notable screening capability without any fine\-tuning\[[20](https://arxiv.org/html/2604.19887#bib.bib39),[23](https://arxiv.org/html/2604.19887#bib.bib18),[14](https://arxiv.org/html/2604.19887#bib.bib19)\]\. A fundamental limitation of LLMs, however, is the strong correlation between model size and performance, which entails significant computational costs for the largest models\[[20](https://arxiv.org/html/2604.19887#bib.bib39)\]\.

### 2\.3 Emotions Associated with Depression

The literature has identified a recurring set of emotion categories in the language of individuals experiencing depression\. The *DepressionEmo* dataset\[[21](https://arxiv.org/html/2604.19887#bib.bib36)\] provides an operational taxonomy of eight emotions annotated on Reddit posts: anger, cognitive dysfunction, emptiness, hopelessness, loneliness, sadness, suicide intent, and worthlessness\. Recent studies confirm that these emotions do not operate as independent signals but tend to co\-occur, forming a latent construct attributable to depressive distress\[[21](https://arxiv.org/html/2604.19887#bib.bib36),[24](https://arxiv.org/html/2604.19887#bib.bib32)\]\.

## 3 Methodology

### 3\.1 Depressive Severity Index

We propose a composite depressive severity index that exploits the emotion labels assigned by an LLM to the eight affective dimensions described in Section [2\.3](https://arxiv.org/html/2604.19887#S2.SS3)\. Each emotion is treated as a binary variable \(present = 1, absent = 0\) and multiplied by a weight $w_i$ proportional to its clinical relevance\. The resulting index is:

$$
\begin{aligned}
S ={}& 1\cdot\text{anger} + 1\cdot\text{cog\_dysfunction} \\
&+ 1\cdot\text{emptiness} + 2\cdot\text{hopelessness} \\
&+ 1\cdot\text{loneliness} + 1\cdot\text{sadness} \\
&+ 3\cdot\text{suicide\_intent} + 2\cdot\text{worthlessness}
\end{aligned}
\tag{1}
$$

The weight assignments are grounded in clinical considerations: sadness, loneliness, anger, emptiness, and cognitive dysfunction are indicators of distress but non\-specific in isolation \($w = 1$\); hopelessness and worthlessness signal advanced suffering and are associated with worse prognosis \($w = 2$\); suicide intent is the highest\-criticality indicator and demands immediate attention \($w = 3$\)\[[22](https://arxiv.org/html/2604.19887#bib.bib4),[3](https://arxiv.org/html/2604.19887#bib.bib20)\]\.

The value $S$ is mapped onto four severity levels, modeled after the clinical thresholds of PHQ\-9 and BDI\-II:

- $S = 0$–$1$: minimal or absent depression;
- $S = 2$–$4$: mild depression;
- $S = 5$–$6$: moderate depression;
- $S \geq 7$: severe depression \(high alert\)\.

The theoretical maximum score, reached when all eight emotions are present, is the sum of the weights in Eq\. \(1\): $S = 12$\.
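The index and its thresholds can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `severity_score` and `severity_level` and the short level names are assumptions, while the weights and cut-offs are taken from Eq. (1) and the list above.

```python
# Weights from Eq. (1); keys follow the JSON field names used in Section 3.2.
WEIGHTS = {
    "anger": 1,
    "cognitive_dysfunction": 1,
    "emptiness": 1,
    "hopelessness": 2,
    "loneliness": 1,
    "sadness": 1,
    "suicide_intent": 3,
    "worthlessness": 2,
}


def severity_score(labels: dict[str, bool]) -> int:
    """Sum the weights of the emotions marked present (missing keys count as absent)."""
    return sum(w for emotion, w in WEIGHTS.items() if labels.get(emotion, False))


def severity_level(score: int) -> str:
    """Map S onto the four severity levels (level names are illustrative)."""
    if score <= 1:
        return "minimal"
    if score <= 4:
        return "mild"
    if score <= 6:
        return "moderate"
    return "severe"


# Hypothetical post labelled with three emotions: 2 + 1 + 1 = 4.
example = {"hopelessness": True, "sadness": True, "loneliness": True}
print(severity_score(example), severity_level(severity_score(example)))  # 4 mild
```

Because the labels are binary, a single occurrence of suicide intent together with hopelessness and worthlessness already reaches the "severe" band ($3 + 2 + 2 = 7$).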

### 3\.2 Prompt Engineering

Emotional analysis is performed via prompt engineering: each post is paired with a structured textual instruction that guides the LLM toward a consistent and machine\-readable classification output\. The base prompt for multi\-label classification is:

    Analyze the sentiment of the following
    comment from Reddit: "{post}".
    Classify which of the following emotions
    apply: {emotions}.
    Answer with a JSON object, with True or
    False for each emotion.

where \{post\} is the comment text and \{emotions\} is the list of eight emotion labels\. To incorporate direct score computation, the prompt is extended as follows:

    Analyze the sentiment of the following
    comment from Reddit: "{post}".
    Classify which of the following emotions
    apply: {emotions}.
    Answer with a JSON object, with True or
    False for each emotion. Then, compute a
    severity score as follows: assign weights
    to emotions (suicide_intent=3,
    hopelessness=2, worthlessness=2,
    cognitive_dysfunction=1, sadness=1,
    emptiness=1, loneliness=1, anger=1).
    Return also the field "severity_score"
    with the sum of the weights for the
    emotions classified as True.

An example of the model output is:

    {
      "anger": false,
      "cognitive_dysfunction": true,
      "emptiness": false,
      "hopelessness": true,
      "loneliness": true,
      "sadness": true,
      "suicide_intent": false,
      "worthlessness": true,
      "severity_score": 7
    }

All models are used in zero\-shot mode: no additional examples are provided beyond the prompt instructions\. This choice allows us to assess the intrinsic capabilities of LLMs without any domain\-specific training\[[12](https://arxiv.org/html/2604.19887#bib.bib17),[20](https://arxiv.org/html/2604.19887#bib.bib39)\]\.
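The prompt-and-parse loop described in this section could be wired up roughly as follows. This is a sketch under stated assumptions rather than the paper's code: `build_prompt` and `parse_reply` are hypothetical helpers, the model call itself is left out, and recomputing `severity_score` locally instead of trusting the model's arithmetic is a design choice of this example, not something the source prescribes.

```python
import json

# Weights as listed in the extended prompt.
WEIGHTS = {
    "anger": 1, "cognitive_dysfunction": 1, "emptiness": 1, "hopelessness": 2,
    "loneliness": 1, "sadness": 1, "suicide_intent": 3, "worthlessness": 2,
}
EMOTIONS = list(WEIGHTS)

PROMPT_TEMPLATE = (
    'Analyze the sentiment of the following comment from Reddit: "{post}". '
    "Classify which of the following emotions apply: {emotions}. "
    "Answer with a JSON object, with True or False for each emotion."
)


def build_prompt(post: str) -> str:
    """Fill the base prompt; braces inside the post text are left untouched."""
    return PROMPT_TEMPLATE.format(post=post, emotions=", ".join(EMOTIONS))


def parse_reply(reply: str) -> dict:
    """Extract the outermost {...} span from a model reply and normalise it.

    Only a JSON `true` counts as present; anything else (missing key,
    string, null) is treated as absent. The score is recomputed locally.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    raw = json.loads(reply[start:end + 1])
    labels = {e: raw.get(e) is True for e in EMOTIONS}
    labels["severity_score"] = sum(WEIGHTS[e] for e in EMOTIONS if labels[e])
    return labels


# A reply shaped like the example output above, with some surrounding chatter.
reply = (
    'Here is the result:\n{"anger": false, "cognitive_dysfunction": true, '
    '"emptiness": false, "hopelessness": true, "loneliness": true, '
    '"sadness": true, "suicide_intent": false, "worthlessness": true, '
    '"severity_score": 7}'
)
parsed = parse_reply(reply)
```

Note that `json.loads` rejects Python-style `True`/`False`, so a production pipeline would need to decide how to handle replies that follow the prompt's capitalisation literally; that case is deliberately left unhandled here.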

## 4 Experiments

We evaluated our approach across two distinct phases: a controlled benchmark and a large\-scale, in\-the\-wild analysis\. This section outlines the complete experimental framework used for these evaluations\. First, we detail the datasets and experimental setup, which include the annotated *DepressionEmo* corpus and a custom collection of nearly 470,000 recent mental\-health\-related Reddit posts\. Next, we present the evaluated models, introducing the nine locally hosted large language models and the fine\-tuned baselines selected for comparison\. Finally, we define the evaluation metrics, specifically the micro\- and macro\-averaged variants of precision, recall, and F1\-score used to comprehensively quantify multi\-label classification performance\.

### 4\.1 Datasets and Experimental Setup

##### Controlled benchmark: DepressionEmo\.

The *DepressionEmo* dataset\[[21](https://arxiv.org/html/2604.19887#bib.bib36)\] contains ≈6,000 Reddit posts manually annotated with the eight emotions described in Section [2\.3](https://arxiv.org/html/2604.19887#S2.SS3) under a multi\-label scheme\. We adopt the original 80/20 train/validation split proposed by the authors and evaluate all LLMs exclusively on the validation partition, without any fine\-tuning\.

##### In\-the\-wild benchmark: Reddit 2024–2025\.

Using a custom\-built web scraper, we collected posts from four mental\-health\-themed subreddits: r/anxiety, r/depression, r/depression\_partners, and r/mentalhealth, the same communities used to construct the DepressionEmo dataset\[[21](https://arxiv.org/html/2604.19887#bib.bib36)\]\. The collection spans January 2024 to May 2025, yielding a total of **469,692** posts \(318,000 in 2024 and 151,692 in the first half of 2025\)\. The per\-subreddit distribution is reported in Table [1](https://arxiv.org/html/2604.19887#S4.T1)\.

![Refer to caption](https://arxiv.org/html/2604.19887v1/x1.png)

Figure 1: Comparison between LLMs \(zero\-shot\) and fine\-tuned models in terms of precision, recall, and F1\-score\. The two best models in each category are highlighted\. LLMs trained on general\-purpose data achieve competitive results relative to models specifically fine\-tuned on depressive\-domain data\.

Table 1: Post distribution per subreddit in the in\-the\-wild dataset\.

### 4\.2 Evaluated Models

In the controlled benchmark phase, we compare nine LLMs of varying sizes, run locally via the Ollama framework, against literature baselines \(SVM, LightGBM, XGBoost, GAN\-BERT, BERT, BART\) fine\-tuned on the 80% training split of DepressionEmo\. The evaluated LLMs are: qwen3:0\.6b, phi3:mini, gemma2:2b, llama3\.2:3b, phi4:14b, mistral:7b, samantha\-mistral:7b, qwen3:14b, and gemma3:27b\. For the in\-the\-wild analysis, we employ the best\-performing model identified in the benchmark, namely gemma3:27b\.

### 4\.3 Evaluation Metrics

Multi\-label classification performance is evaluated using precision, recall, and F1\-score, computed in both the micro variant \(globally aggregated counts across all classes\) and the macro variant \(arithmetic mean of per\-class metrics\)\. Their joint use enables assessment of both overall performance and behavior on less frequent classes, a critical consideration in the mental health domain\[[21](https://arxiv.org/html/2604.19887#bib.bib36)\]\.
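The difference between the two averaging schemes is easiest to see on a toy case. The sketch below is an illustrative pure-Python implementation (the helper names `prf` and `micro_macro` and the three-post data are made up for this example); it represents each post as a set of label strings and treats undefined ratios as 0.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when a ratio is undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(y_true, y_pred, labels):
    """Return ((micro P, R, F1), (macro P, R, F1)) for multi-label predictions."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per class
    for gold, pred in zip(y_true, y_pred):
        for label in labels:
            if label in pred and label in gold:
                counts[label][0] += 1
            elif label in pred:
                counts[label][1] += 1
            elif label in gold:
                counts[label][2] += 1
    # Micro: pool the counts over all classes, then compute the metrics once.
    micro = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute the metrics per class, then take the unweighted mean.
    per_class = [prf(*c) for c in counts.values()]
    macro = tuple(sum(m[i] for m in per_class) / len(per_class) for i in range(3))
    return micro, macro


# Made-up gold labels and predictions for three posts.
labels = ["sadness", "hopelessness", "anger"]
y_true = [{"sadness", "hopelessness"}, {"sadness"}, {"anger"}]
y_pred = [{"sadness"}, {"sadness", "anger"}, set()]
micro, macro = micro_macro(y_true, y_pred, labels)
```

In this toy case the frequent class (sadness) is predicted perfectly while the two rare classes are missed, so micro-F1 (4/7) stays well above macro-F1 (1/3), which is the behavior on infrequent classes that the macro variant is meant to expose.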

## 5 Results and Discussion

### 5\.1 Controlled Benchmark

Full results of the controlled benchmark are reported in Table [2](https://arxiv.org/html/2604.19887#S5.T2) and visualized in Figure [1](https://arxiv.org/html/2604.19887#S4.F1)\.

Table 2: Comparison of zero\-shot LLMs and fine\-tuned literature models on the DepressionEmo dataset \(micro and macro precision, recall, F1\-score\)\. Best results per metric in bold\. ✓ = fine\-tuned; ✗ = zero\-shot\.

Results show that the gap between zero\-shot LLMs and fine\-tuned models is modest: gemma3:27b achieves micro\-F1 = 0\.75 and macro\-F1 = 0\.70, matching fine\-tuned GAN\-BERT and falling only 0\.05 F1 points below BERT and BART\. Overall, it is the most balanced model among the LLMs\. The performance decrease relative to fine\-tuned models is expected, yet remains small given the complete absence of domain\-specific training\.

##### Per\-model analysis\.

gemma3:27b is the best\-performing LLM overall, exhibiting the lowest divergence between precision \(0\.73\) and recall \(0\.77\), indicative of balanced behavior\. qwen3:0\.6b achieves very high precision \(0\.83\) but very low recall \(0\.33\), reflecting a conservative labeling strategy\. Conversely, samantha\-mistral:7b, a psychologically\-adapted variant of Mistral, maximizes recall \(0\.90\) at the expense of precision \(0\.56\)\. This trade\-off is consistent with prevention scenarios where the cost of false negatives is high, but the precision is particularly low; as a result, its F1 score is lower than that of more powerful models \(e\.g\., gemma3:27b\)\.
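The effect of these trade-offs on F1 can be checked directly from the harmonic-mean definition. The worked values below assume that the precision and recall figures quoted above are the micro-averaged ones (the text does not say which variant they are); only the first result, 0.75, is reported in the source, and the other two are derived here for illustration.

$$
F_1 = \frac{2PR}{P + R}
$$

$$
\begin{aligned}
\text{gemma3:27b:} \quad & \frac{2 \cdot 0.73 \cdot 0.77}{0.73 + 0.77} = \frac{1.1242}{1.50} \approx 0.75 \\
\text{qwen3:0.6b:} \quad & \frac{2 \cdot 0.83 \cdot 0.33}{0.83 + 0.33} = \frac{0.5478}{1.16} \approx 0.47 \\
\text{samantha-mistral:7b:} \quad & \frac{2 \cdot 0.56 \cdot 0.90}{0.56 + 0.90} = \frac{1.008}{1.46} \approx 0.69
\end{aligned}
$$

Because the harmonic mean is pulled toward the smaller of the two values, a model with a large precision-recall gap is penalized even when one of the two metrics is very high.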

##### Clinical implications\.

In screening and prevention contexts, minimizing false negatives \(i\.e\., undetected critical cases\) is the primary objective, making high recall a desirable property\. Larger models generally show better precision–recall balance, suggesting that model scale is the primary predictor of performance, consistent with\[[20](https://arxiv.org/html/2604.19887#bib.bib39)\]\.

##### Advantage of LLMs over fine\-tuned models\.

A key advantage is the ability of LLMs to operate without labeled data: fine\-tuned models require domain\-specific annotations and risk becoming obsolete as online language evolves\. LLMs, having been exposed to vast quantities of general text, generalize more naturally to novel expressions of depressive distress\[[12](https://arxiv.org/html/2604.19887#bib.bib17)\]\.

### 5\.2 In\-the\-Wild Analysis

#### 5\.2\.1 Emotional Structure of Communities

![Refer to caption](https://arxiv.org/html/2604.19887v1/x2.png)

Figure 2: Spearman correlation matrix of detected emotions across the full in\-the\-wild dataset \($p < 0.001$\)\. Emotions belonging to the depressive core \(hopelessness, worthlessness, sadness, emptiness\) exhibit significant positive correlations \($\rho = 0.28$–$0.53$\)\.

Figure [2](https://arxiv.org/html/2604.19887#S5.F2) displays the Spearman correlation matrix of detected emotions\. Hopelessness, worthlessness, sadness, and emptiness show moderate to high positive correlations \($\rho \approx 0.28$–$0.53$, $p < 0.001$\), indicating high co\-occurrence within the same posts\. This pattern supports the hypothesis that these emotions are observable manifestations of a shared latent construct corresponding to depressive distress\.

Conversely, anger and cognitive dysfunction show near\-zero correlations with most other categories, suggesting weaker internal coherence with the depressive core and higher discriminant validity relative to the primary domain\.
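For readers who want to reproduce this kind of co-occurrence analysis, Spearman's $\rho$ is the Pearson correlation of the rank-transformed variables, with tied values sharing their average rank. The following is a small standard-library sketch; the function names and the six-post indicator vectors are invented for illustration and are not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up binary indicators (1 = emotion detected) for six posts.
hopelessness = [1, 1, 0, 0, 1, 0]
sadness = [1, 1, 0, 1, 1, 0]
rho = spearman(hopelessness, sadness)
```

For two binary indicators the rank transform is an affine map of the 0/1 values, so $\rho$ coincides with the phi coefficient of the 2×2 co-occurrence table.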

![Refer to caption](https://arxiv.org/html/2604.19887v1/x3.png)

Figure 3: Emotion detection rates across the four subreddits\. In r/depression, sadness and hopelessness exceed 80–90% of posts; in r/anxiety, the same emotions appear at significantly lower frequencies\.

Figure [3](https://arxiv.org/html/2604.19887#S5.F3) shows how emotional profiles vary substantially across communities\. In r/depression, sadness and hopelessness are detected in the vast majority of posts \(often \>80–90%\), whereas in r/anxiety the same emotions are considerably less frequent, outlining a distinct affective profile consistent with the differential nature of the two disorders\.

#### 5\.2\.2 Risk Score Distributions

![Refer to caption](https://arxiv.org/html/2604.19887v1/x4.png)

Figure 4: Distributions of risk scores $S$ \(Eq\. [1](https://arxiv.org/html/2604.19887#S3.E1)\) per subreddit\. In r/depression the distribution is shifted toward higher values, with mean and median nearly coinciding \($S \approx 7$\); in r/anxiety the distribution is strongly concentrated at low values \(median $\approx 2$\)\.

![Refer to caption](https://arxiv.org/html/2604.19887v1/x5.png)

Figure 5: Box plots of risk scores per subreddit\. Significant differences emerge across communities; r/depression exhibits both a higher median and greater variability at the extreme values\.

Figures [4](https://arxiv.org/html/2604.19887#S5.F4) and [5](https://arxiv.org/html/2604.19887#S5.F5) demonstrate that $S$ effectively discriminates between communities\. In r/depression, the distribution is shifted toward high values \(mean $\approx 7$, approximately 43% of posts with $S \geq 7$\), suggesting a structurally elevated and homogeneously distributed risk\. In r/anxiety, the distribution is strongly concentrated at low values \(median $\approx 2$, only $\approx 3$% of posts with $S \geq 7$\), consistent with anxiety rather than depressive profiles\. The subreddits r/depression\_partners and r/mentalhealth exhibit intermediate profiles with greater variability\.
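The per-community statistics quoted above (mean, median, and the share of posts with $S \geq 7$) are simple to compute once each post has a score. The sketch below uses a hypothetical `summarize` helper and invented toy scores; the numbers are not the paper's data and only illustrate the computation.

```python
from statistics import mean, median


def summarize(scores, threshold=7):
    """Mean, median, and share of posts at or above the high-alert threshold."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "share_high_risk": sum(s >= threshold for s in scores) / len(scores),
    }


# Made-up severity scores per subreddit, for illustration only.
toy_scores = {
    "r/depression": [7, 8, 6, 9, 5, 7],
    "r/anxiety": [1, 2, 2, 3, 0, 7],
}
summary = {name: summarize(scores) for name, scores in toy_scores.items()}
```

The default threshold of 7 matches the "severe" band of the severity index; passing a different `threshold` lets the same helper report the share of posts in any other band.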

#### 5\.2\.3 High\-Risk Post Analysis

![Refer to caption](https://arxiv.org/html/2604.19887v1/x6.png)

Figure 6: Comparison of emotion detection rates between the full post set and the high\-risk subset \($S > 7$\)\. In high\-risk posts, sadness and hopelessness are detected in nearly all cases; a marked increase in worthlessness, emptiness, and suicide intent is also observed\.

Figure [6](https://arxiv.org/html/2604.19887#S5.F6) confirms that high values of $S$ are not attributable to random fluctuations, but reflect the systematic convergence of multiple critical affective indicators\. In posts with $S > 7$, suicide intent, which carries the maximum weight of 3, is markedly overrepresented relative to the global distribution, validating the index’s ability to isolate the most critical cases\.

#### 5\.2\.4 Temporal Analysis

![Refer to caption](https://arxiv.org/html/2604.19887v1/x7.png)

Figure 7: Monthly average risk score from January 2024 to July 2025\. Values remain stable over time per subreddit, with structural inter\-community differences persisting throughout the observation period\. A moderate upward trend is observable in the first half of 2025\.

![Refer to caption](https://arxiv.org/html/2604.19887v1/x8.png)

Figure 8: Fine\-grained temporal evolution of the mean risk score\. The absence of erratic oscillations indicates that the system does not overreact to episodic language variation\.

Figures [7](https://arxiv.org/html/2604.19887#S5.F7) and [8](https://arxiv.org/html/2604.19887#S5.F8) reveal substantial temporal stability of risk scores throughout 2024–2025, with no marked discontinuities\. This stability is a relevant property for a monitoring system: it indicates that the model responds to structured linguistic variation rather than episodic noise\.

A moderate increase in mean scores during the first half of 2025 is observable in several subreddits\. While causal inference is not possible, this trend may reflect an intensification of depressive language associated with broader social or contextual factors\. This reading is consistent with the value of a longitudinal monitoring system capable of detecting drifts in collective language that may carry psychologically meaningful signals\[[4](https://arxiv.org/html/2604.19887#bib.bib14),[17](https://arxiv.org/html/2604.19887#bib.bib13)\]\.
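A monthly trend of this kind amounts to grouping scored posts by community and calendar month and averaging. The sketch below shows one way to do that with the standard library; the `monthly_mean` helper, the record layout `(date, subreddit, score)`, and the toy records are assumptions of this example, not the authors' pipeline.

```python
from collections import defaultdict
from datetime import date


def monthly_mean(records):
    """Average score per (subreddit, 'YYYY-MM') from (date, subreddit, score) rows."""
    sums = defaultdict(lambda: [0, 0])  # key -> [running total, post count]
    for day, subreddit, score in records:
        key = (subreddit, day.strftime("%Y-%m"))
        sums[key][0] += score
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sorted(sums.items())}


# Made-up records, for illustration only.
toy_records = [
    (date(2024, 1, 5), "r/anxiety", 2),
    (date(2024, 1, 20), "r/anxiety", 4),
    (date(2024, 2, 1), "r/anxiety", 1),
    (date(2024, 1, 9), "r/depression", 7),
]
trend = monthly_mean(toy_records)
```

Sorting the keys keeps each subreddit's months in chronological order, which is convenient when the result is passed to a plotting routine.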

#### 5\.2\.5 Limitations

This work has several methodological limitations\. First, the analysis relies exclusively on public text data and does not account for individual or clinical variables of the post authors, limiting the generalizability of estimates at the individual level\. Second, emotion interpretation depends on the semantic capabilities of the LLM, which, however advanced, remain a simplification of the underlying psychological process\. Finally, the self\-referential nature of Reddit content introduces potential self\-selection bias: the monitored communities tend to attract users with already\-manifest distress, limiting the representativeness of the sample with respect to the general population\.

From a system perspective, the gap of approximately 0\.05–0\.06 F1 points relative to fine\-tuned models leaves room for improvement via hybrid approaches \(a few labeled examples for few\-shot prompting\) or lightweight fine\-tuning \(LoRA/QLoRA\) on smaller models\. Ethical considerations regarding privacy and informed consent require careful attention before deploying such tools in real\-world applications\.

## 6 Conclusions

We have presented an LLM\-based system for automated depression risk assessment in social media, centered on a clinically\-grounded weighted severity index derived from eight emotionally relevant dimensions\. Evaluation on the *DepressionEmo* benchmark demonstrates that large locally\-run models, in particular gemma3:27b, achieve performance competitive with purpose\-built fine\-tuned models \(micro\-F1 = 0\.75 vs\. 0\.80 for BART\), operating in zero\-shot mode without any domain\-specific training\.

The in\-the\-wild analysis on 469,692 Reddit posts confirms the ecological validity of the approach: the system produces coherent and interpretable risk profiles, successfully differentiating semantically distinct communities \(e\.g\., r/depression vs\. r/anxiety\) and maintaining stable estimates over time\. The proposed index enables automatic identification of high\-risk posts through the systematic convergence of multiple critical affective indicators, providing a scalable psychological triage tool\.

The main contributions of this work are: \(1\) the definition of a composite depressive severity index inspired by standardized clinical scales but adapted for automated text analysis; \(2\) the demonstration that medium\-to\-large LLMs, runnable locally without specialized hardware, can achieve meaningful performance in depressive emotion recognition; \(3\) the first large\-scale longitudinal analysis of depression risk in major mental\-health subreddits over the 2024–2025 period\.

Our findings suggest that LLMs, by virtue of their domain\-agnostic training and ability to generalize to evolving linguistic expressions, represent a promising resource for large\-scale psychological monitoring, with infrastructure costs substantially lower than proprietary API\-based models\. The proposed system is designed to complement rather than replace clinical judgment, serving as a first\-pass filter in continuous monitoring scenarios\. Future work will include extension to non\-English languages, integration of behavioral signals \(post frequency and timing\), and validation against clinical samples to calibrate index thresholds against certified diagnoses\.

## References

- \[1\] \(2018\) In an Absolute State: Elevated Use of Absolutist Words Is a Marker Specific to Anxiety, Depression, and Suicidal Ideation\. Clinical Psychological Science 6\(4\), pp\. 529–542\.
- \[2\] E\. Alsentzer, J\. Murphy, W\. Boag, W\. Weng, D\. Jin, T\. Naumann, and M\. McDermott \(2019\) Publicly available clinical bert embeddings\. Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp\. 72–78\.
- \[3\] A\. P\. Association \(2022\) Diagnostic and statistical manual of mental disorders \(5th ed\., text rev\.\)\. American Psychiatric Publishing, Washington, DC\.
- \[4\] S\. Chancellor and M\. De Choudhury \(2020\) Methods in Predictive Techniques for Mental Health Status on Social Media: A Critical Review\. NPJ Digital Medicine 3\(43\), pp\. 1–11\.
- \[5\] M\. Conway and D\. O’Connor \(2016\) Social Media, Big Data, and Mental Health: Current Advances and Ethical Implications\. Current Opinion in Psychology 9, pp\. 77–82\.
- \[6\] M\. De Choudhury, M\. Gamon, S\. Counts, and E\. Horvitz \(2013\) Predicting depression via social media\. In Proceedings of the international AAAI conference on web and social media, Vol\. 7, pp\. 128–137\.
- \[7\] J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\) BERT: pre\-training of deep bidirectional transformers for language understanding\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics \(NAACL\-HLT\), pp\. 4171–4186\.
- \[8\] J\. C\. Eichstaedt, R\. J\. Smith, R\. M\. Merchant, L\. H\. Ungar, P\. Crutchley, D\. Preoţiuc\-Pietro, D\. A\. Asch, and H\. A\. Schwartz \(2018\) Facebook Language Predicts Depression in Medical Records\. Proceedings of the National Academy of Sciences 115\(44\), pp\. 11203–11208\.
- \[9\] W\. A\. Gadzama, D\. Gabi, M\. S\. Argungu, and H\. U\. Suru \(2024\) The use of machine learning and deep learning models in detecting depression on social media: a systematic literature review\. Personalized Medicine in Psychiatry 45, pp\. 100125\.
- \[10\] S\. C\. Guntuku, D\. B\. Yaden, M\. L\. Kern, L\. H\. Ungar, and J\. C\. Eichstaedt \(2017\) Detecting Depression and Mental Illness on Social Media: An Integrative Review\. Current Opinion in Behavioral Sciences 18, pp\. 43–49\.
- \[11\] S\. L\. James, D\. Abate, K\. H\. Abate, S\. M\. Abay, C\. Abbafati, N\. Abbasi, H\. Abbastabar, F\. Abd\-Allah, J\. Abdela, A\. Abdelalim, et al\. \(2018\) Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: a systematic analysis for the global burden of disease study 2017\. The Lancet 392\(10159\), pp\. 1789–1858\.
- \[12\] Y\. Jin, J\. Liu, P\. Li, B\. Wang, Y\. Yan, H\. Zhang, C\. Ni, J\. Wang, Y\. Li, Y\. Bu, and Y\. Wang \(2025\) The Applications of Large Language Models in Mental Health: Scoping Review\. Journal of Medical Internet Research 27\.
- \[13\] S\. W\. Kelley et al\. \(2022\) Machine learning of language use on twitter reveals weak generalisability of models of depression symptoms\. NPJ Digital Medicine 5\(1\), pp\. 51\.
- \[14\] S\. K\. Lho, S\. Park, H\. Lee, D\. Y\. Oh, H\. Kim, S\. Jang, H\. Y\. Jung, S\. Y\. Yoo, S\. M\. Park, and J\. Lee \(2025\) Large Language Models and Text Embeddings for Detecting Depression and Suicide in Patient Narratives\. JAMA Network Open 8\(5\)\.
- \[15\]D\. Liu, Z\. Zhang, and Y\. Li\(2022\)Detecting and measuring depression on social media using machine learning: systematic review\.JMIR Mental Health9\(3\)\.Cited by:[§1](https://arxiv.org/html/2604.19887#S1.p2.1)\.
- \[16\]K\. Milintsevich, K\. Sirts, and G\. Dias\(2023\)Towards automatic text\-based estimation of depression through symptom prediction\.Brain informatics10\(1\),pp\. 4\.Cited by:[§2\.1](https://arxiv.org/html/2604.19887#S2.SS1.p1.1)\.
- \[17\]J\. A\. Naslund, K\. A\. Aschbrenner, G\. J\. McHugo, J\. Unützer, L\. A\. Marsch, and S\. J\. Bartels\(2019\)Exploring Opportunities to Support Mental Health Care Using Social Media: A Survey of Social Media Users with Mental Illness\.Early Intervention in Psychiatry13\(4\),pp\. 405–413\.Cited by:[§1](https://arxiv.org/html/2604.19887#S1.p2.1),[§5\.2\.4](https://arxiv.org/html/2604.19887#S5.SS2.SSS4.p2.1)\.
- \[18\]D\. Q\. Nguyen, T\. Vu, and A\. Tuan Nguyen\(2020\)BERTweet: a pretrained language model for english tweets\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,pp\. 9–14\.Cited by:[§2\.2](https://arxiv.org/html/2604.19887#S2.SS2.p2.1)\.
- \[19\]J\. Oh, T\. Lee, E\. S\. Chung, H\. Kim, K\. Cho, H\. Kim, J\. Choi, H\. Sim, J\. Lee, I\. Y\. Choi,et al\.\(2024\)Development of depression detection algorithm using text scripts of routine psychiatric interview\.Frontiers in psychiatry14\.Cited by:[§2\.1](https://arxiv.org/html/2604.19887#S2.SS1.p1.1)\.
- \[20\]J\. Ohse, B\. Hadžić, P\. Mohammed, N\. Peperkorn, M\. Danner, A\. Yorita, N\. Kubota, M\. Rätsch, and Y\. Shiban\(2024\)Zero\-shot strike: testing the generalisation capabilities of out\-of\-the\-box llm models for depression detection\.Computer Speech & Language88\.Cited by:[§1](https://arxiv.org/html/2604.19887#S1.p4.1),[§2\.2](https://arxiv.org/html/2604.19887#S2.SS2.p2.1),[§3\.2](https://arxiv.org/html/2604.19887#S3.SS2.p6.1),[§5\.1](https://arxiv.org/html/2604.19887#S5.SS1.SSS0.Px2.p1.1)\.
- \[21\]A\. B\. S\. Rahman, H\. Ta, L\. Najjar, A\. Azadmanesh, and A\. S\. Gönul\(2024\)DepressionEmo: a novel dataset for multilabel classification of depression emotions\.Journal of Affective Disorders366,pp\. 445–458\.Cited by:[item 3](https://arxiv.org/html/2604.19887#S1.I1.i3.p1.1),[§2\.3](https://arxiv.org/html/2604.19887#S2.SS3.p1.1),[§4\.1](https://arxiv.org/html/2604.19887#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2604.19887#S4.SS1.SSS0.Px2.p1.3),[§4\.3](https://arxiv.org/html/2604.19887#S4.SS3.p1.1),[Table 2](https://arxiv.org/html/2604.19887#S5.T2.4.3.3.1),[Table 2](https://arxiv.org/html/2604.19887#S5.T2.4.4.4.1),[Table 2](https://arxiv.org/html/2604.19887#S5.T2.4.5.5.1),[Table 2](https://arxiv.org/html/2604.19887#S5.T2.4.6.6.1),[Table 2](https://arxiv.org/html/2604.19887#S5.T2.4.7.7.1),[Table 2](https://arxiv.org/html/2604.19887#S5.T2.4.8.8.1)\.
- \[22\]J\. D\. Ribeiro, X\. Huang, K\. R\. Fox, and J\. C\. Franklin\(2018\)Depression and hopelessness as risk factors for suicide ideation, attempts and death: meta\-analysis of longitudinal studies\.The British Journal of Psychiatry212\(5\),pp\. 279–286\.Cited by:[§2\.1](https://arxiv.org/html/2604.19887#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2604.19887#S3.SS1.p3.3)\.
- \[23\]D\. Shin, H\. Kim, S\. Lee, Y\. Cho, and W\. Jung\(2024\)Using Large Language Models to Detect Depression From User\-Generated Diary Text Data as a Novel Approach in Digital Mental Health Screening: Instrument Validation Study\.Journal of Medical Internet Research26\.Cited by:[§2\.2](https://arxiv.org/html/2604.19887#S2.SS2.p2.1)\.
- \[24\]A\. M\. Tackman, D\. A\. Sbarra, A\. L\. Carey, M\. B\. Donnellan, A\. B\. Horn, N\. S\. Holtzman, T\. S\. Edwards, J\. W\. Pennebaker, and M\. R\. Mehl\(2019\)Depression, negative emotionality, and self\-referential language: a multi\-lab, multi\-measure, and multi\-language\-task research synthesis\.\.Journal of personality and social psychology116\(5\),pp\. 817\.Cited by:[§2\.3](https://arxiv.org/html/2604.19887#S2.SS3.p1.1)\.
- \[25\]World Health Organization\(2017\)Depression and other common mental disorders: global health estimates\.World Health Organization,Geneva\.Cited by:[§1](https://arxiv.org/html/2604.19887#S1.p1.1)\.