Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

arXiv cs.CL Papers

Summary

This paper presents a method for fine-tuning LLMs to predict PHQ-9 depression severity scores directly from transcripts of conversations with an AI mental health application. Trained on an augmented dataset of 6,283 users, the best model reaches Pearson r = 0.80 and AUC = 0.91 at the PHQ-9 >= 10 clinical threshold on a held-out test set of 842 users.

arXiv:2606.17973v1 Announce Type: new Abstract: Depression is the leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention. Validated instruments such as the Patient Health Questionnaire-9 (PHQ-9) support symptom monitoring at scale, but real-world completion rates are low, introducing response bias and systematic missingness. Passive approaches that infer severity from routinely generated data could close this gap. We address this by predicting PHQ-9 total scores directly from transcripts of conversations between users and an AI mental health application, requiring only conversation text and no additional clinical data. We fine-tune a Qwen3.5-27B backbone with a regression head, augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus) and iteratively trained intermediate models, for a combined dataset of 6,283 users. On a held-out test set of 842 users, our best model achieves MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 >= 10 clinical threshold. We also find AUC > 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24, demonstrating that the model captures depression severity across the full clinical spectrum. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self-report measures.

Cached at: 06/17/26, 05:42 AM

# Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue
Source: [https://arxiv.org/html/2606.17973](https://arxiv.org/html/2606.17973)
###### Abstract

Depression is the leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention\. Validated instruments such as the Patient Health Questionnaire\-9 \(PHQ\-9\) support symptom monitoring at scale, but real\-world completion rates are low, introducing response bias and systematic missingness\. Passive approaches that infer severity from routinely generated data could close this gap\. We address this by predicting PHQ\-9 total scores directly from transcripts of conversations between users and an AI mental health application, requiring only conversation text and no additional clinical data\. We fine\-tune a Qwen3\.5\-27B backbone with a regression head, augment 3,111 ground\-truth labels with pseudo\-labels generated by a reasoning model \(Claude Opus\) and iteratively trained intermediate models, for a combined dataset of 6,283 users\. On a held\-out test set of 842 users, our best model achieves MAE = 2\.6, RMSE = 4\.0, Pearson r = 0\.80, and AUC = 0\.91 at the PHQ\-9 ≥ 10 clinical threshold\. We also find AUC \> 0\.87 at every severity threshold from PHQ\-9 ≥ 3 to PHQ\-9 ≥ 24, demonstrating that the model captures depression severity across the full clinical spectrum\. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self\-report measures\.

## 1 Introduction

Depression is the leading cause of disability worldwide \(Friedrich, [2017](https://arxiv.org/html/2606.17973#bib.bib23)\), affecting more than 280 million people globally \(WHO, [2023](https://arxiv.org/html/2606.17973#bib.bib21)\), and earlier detection of worsening symptoms is critical for enabling timely intervention\. A growing number of these individuals now interact with AI\-powered mental health platforms that deliver psychoeducation, cognitive\-behavioral techniques, and supportive counseling through multi\-turn text conversations \(Rousmaniere et al\., [2026](https://arxiv.org/html/2606.17973#bib.bib26)\)\. These interactions generate rich, fully transcribed dialogue that captures users’ self\-reported experiences, emotional language, and behavioral patterns, all signals plausibly related to depressive symptom severity\. However, the architecture of these platforms often centers on exploiting existing knowledge, with the training of LLMs on massive text corpora being a prime example\. As a result, they are very counselor\-focused, at the risk of ignoring the user\. Understanding users better, whether through user modelling \(Zhu et al\., [2025](https://arxiv.org/html/2606.17973#bib.bib17)\) or by predicting standard clinical constructs as we and the cited prior work do, is just as crucial for ensuring a positive impact\. We seek to enable reliable estimates of depression severity from such conversations\. If successfully implemented in production systems, this technique could enable passive symptom monitoring, early detection of deterioration, adaptive treatment planning, and scalable screening without requiring repeated questionnaire completion \(Teferra et al\., [2024](https://arxiv.org/html/2606.17973#bib.bib24)\)\.

The PHQ\-9 \(Patient Health Questionnaire\-9\) is a standardized, self\-administered questionnaire that is widely used by healthcare professionals to screen for, diagnose, and measure the severity of depression\. It consists of nine questions covering a range of symptoms such as anhedonia, fatigue, self\-esteem, psychomotor changes, and suicidal ideation\. Existing work on predicting PHQ\-9 scores from text is fragmented across settings, with data sources ranging from structured clinical interviews \(Gratch et al\., [2014](https://arxiv.org/html/2606.17973#bib.bib8); Ringeval et al\., [2019](https://arxiv.org/html/2606.17973#bib.bib15)\) and clinical notes \(Alves, P\., et al\., [2025](https://arxiv.org/html/2606.17973#bib.bib4)\), to user\-generated diary text \(Shin et al\., [2024](https://arxiv.org/html/2606.17973#bib.bib19)\) and even text message data \(Stamatis et al\., [2022](https://arxiv.org/html/2606.17973#bib.bib25)\)\. These studies often involve small samples \(n=89 to n=335\) and rely on bespoke methods and models that may not generalize to other settings, which has stymied progress; assessment instruments, populations, and interaction formats differ enough across studies to make direct numerical comparison challenging\. The scale of data generated from a commercially available mental health AI tool \(i\.e\., millions of transcripts of conversations between a user and AI\) presents a unique opportunity to remedy these issues\. However, no prior work on PHQ\-9 prediction has targeted naturalistic AI mental health conversation transcripts at scale\.

Our approach consists of the following stages:

1. Pseudo\-label augmentation\. Using Claude Opus to label unlabeled conversations, we balance out the initial training set, which skewed heavily towards high PHQ\-9 scores \(severe depression\)\.
2. Intermediate model pseudo\-labelling\. Claude Opus performs above noise level, but is not perfect\. A model trained on the ground\-truth training set enriched with Claude’s pseudo\-labels does better than one trained on the ground\-truth set alone, and better than Claude itself\. We use that model to nearly double the effective training set, from 3,111 to 6,283 users, enabling more robust regression on limited ground\-truth data\.
3. Iterative re\-labelling\. The two steps of pseudo\-labelling result in an even better model, as measured on the held\-out set\. We use that model to re\-label all pseudo\-labelled users, and train a new model\.
4. Ensembling\. PHQ\-9 prediction is a noisy task, and adapter initializations can end up picking up on various spurious signals\. We find that the best performance is obtained by averaging the predictions of an ensemble of models trained on the re\-pseudo\-labelled dataset combined with the ground\-truth labels\.

Our model achieves AUC \> 0\.87 at every severity threshold from PHQ\-9 ≥ 3 to PHQ\-9 ≥ 24, enabling clinically meaningful distinctions between mild, moderate, moderately severe, and severe depression, not just case detection\. This cross\-severity discrimination is itself a contribution: even widely cited prior models degrade at the upper end of the range, discriminating more poorly among higher PHQ\-9 scores relative to lower score bands \(Alves, P\., et al\., [2025](https://arxiv.org/html/2606.17973#bib.bib4)\)\. At the standard PHQ\-9 ≥ 10 threshold our model achieves an AUC of 0\.91\.

Our work differs from prior research in scale \(∼4,000 users vs\. 89–275 in prior transcript\-based corpora\), ecological validity \(naturalistic help\-seeking dialogue rather than structured assessment\), and the use of the complete PHQ\-9 including item 9 on suicidal ideation, which is absent from the PHQ\-8 target used in most prior transcript\-based work, despite the full 9\-item version being the one used for screening in clinical settings\. Critically, our model infers severity from unguided therapeutic dialogue rather than from language elicited by a clinician administering a structured instrument\. This is a harder inference problem, and the primary setting in which automated estimation is of practical use, since contexts that already deliver a structured assessment have no need for it\.

## 2 Related Work

### 2\.1 PHQ\-9 and Depression Prediction from Text

A substantial body of NLP work predicts depressive symptom severity from text\. This literature divides along the kind of language it draws on: language *elicited* within a structured clinical assessment, and naturalistic language produced outside any assessment context\. The distinction is consequential for the present work, because only the latter setting is one in which an automated severity estimate adds information that is not already being captured\.

#### Clinical interview settings\.

The most benchmarked setting for PHQ prediction is the DAIC\-WOZ corpus \(Gratch et al\., [2014](https://arxiv.org/html/2606.17973#bib.bib8)\) and its extension E\-DAIC \(Ringeval et al\., [2019](https://arxiv.org/html/2606.17973#bib.bib15)\), scripted virtual\-interviewer sessions with 189 and 275 participants respectively, using PHQ\-8 as the regression target\. The strongest text\-only models reach MAE 3\.55–3\.85 \(Schmidt et al\., [2025](https://arxiv.org/html/2606.17973#bib.bib13)\)\. In a related design, Weber et al\. \([2025](https://arxiv.org/html/2606.17973#bib.bib20)\) predicted individual MADRS items from structured clinical\-interview text using a fine\-tuned BERT regression head over 126 sessions\. While these results are encouraging, Burdisso et al\. \([2024](https://arxiv.org/html/2606.17973#bib.bib11)\) showed that many DAIC models achieve their accuracy by exploiting the interviewer’s scripted prompts rather than genuinely inferring depression from patient speech\. This confound surfaces a more fundamental limitation: in each of these settings, the language exists only because a clinician, or a scripted proxy, is already administering a structured depression assessment\. A model that infers PHQ severity from such an interview presupposes the very instrument it is meant to stand in for, and generalizes only to contexts in which a structured assessment is already being delivered, which are precisely the contexts in which passive inference is least needed\. Naturalistic help\-seeking dialogue carries no such interviewer signal to exploit, and represents the setting in which an automated severity estimate would be most useful\.

#### Naturalistic language\.

A second line of work predicts depression from language produced outside any assessment context\. Early approaches used handcrafted linguistic features, such as first\-person pronoun usage \(De Choudhury et al\., [2013](https://arxiv.org/html/2606.17973#bib.bib5)\), negative\-emotion word frequency \(Resnik et al\., [2015](https://arxiv.org/html/2606.17973#bib.bib14)\), and absolutist language \(Al\-Mosaiwi and Johnstone, [2018](https://arxiv.org/html/2606.17973#bib.bib1)\), in social media, clinical notes, and text\-message data, with more recent work substituting pretrained transformer encoders \(BERT, RoBERTa\) for hand\-engineered features \(Jiang et al\., [2020](https://arxiv.org/html/2606.17973#bib.bib10); Lau et al\., [2023](https://arxiv.org/html/2606.17973#bib.bib3)\)\. Closer to our setting, Shin et al\. \([2024](https://arxiv.org/html/2606.17973#bib.bib19)\) predicted binary PHQ\-9 depression status \(≥ 10\) from mental\-health\-app diary entries using GPT\-3\.5 on 91 users, Alves, P\., et al\. \([2025](https://arxiv.org/html/2606.17973#bib.bib4)\) estimated PHQ\-9 from prescriber clinical notes, and Stamatis et al\. \([2022](https://arxiv.org/html/2606.17973#bib.bib25)\) predicted depressive symptom severity from naturalistic text\-message language\. Psychotherapy\-transcript studies \(Althoff et al\., [2016](https://arxiv.org/html/2606.17973#bib.bib2); Ewbank et al\., [2020](https://arxiv.org/html/2606.17973#bib.bib6); Lalk et al\., [2024](https://arxiv.org/html/2606.17973#bib.bib7)\) link conversational features to clinical symptom severity but report modest associations, with correlations around r = 0\.45 or lower\. Across this literature, samples are relatively small \(n = 89 to n = 335\), with heterogeneous targets and instruments; no prior work has modeled PHQ\-9 severity from naturalistic AI\-therapy dialogue at scale\.

### 2\.2 Semi\-Supervised and Pseudo\-Label Approaches

The scarcity of labeled data that constrains the studies above is a recurring obstacle in clinical NLP, and semi\-supervised methods offer one route around it\. Pseudo\-labeling \(using a teacher model to generate labels for unlabeled data\) has been widely used in language modelling settings where limited labelled data is available \(Xie et al\., [2020](https://arxiv.org/html/2606.17973#bib.bib22); He et al\., [2020](https://arxiv.org/html/2606.17973#bib.bib9)\) and applied to medical record classification and symptom extraction in clinical NLP\. We use it here to expand labeled training data via a reasoning model \(Claude Opus\) and iterative self\-labeling\.

## 3 Data

### 3\.1 Study Population

Our dataset is drawn from Ash, an AI\-powered mental health platform developed by Slingshot AI\. Users engage in multi\-turn text\-based conversations with an AI mental health tool trained to draw from evidence\-based strategies such as cognitive behavioral therapy \(CBT\) and acceptance and commitment therapy \(ACT\)\. In the present study, a random cohort of new users was selected to complete the PHQ\-9 at baseline, prior to their first conversation with Ash\. We include users who \(a\) completed the full PHQ\-9 at baseline and \(b\) had at least one conversation with Ash during the first seven days of use, with a minimum of 10 messages exchanged\. After these exclusions, our final sample comprises 3,953 users\.

### 3\.2 PHQ\-9 Assessments

The Patient Health Questionnaire\-9 \(PHQ\-9; Kroenke et al\., [2001](https://arxiv.org/html/2606.17973#bib.bib12)\) is among the most widely used self\-report instruments for screening and monitoring depressive symptom severity\. Each of the nine items maps to a DSM\-5 criterion for major depressive disorder and is scored from 0 \(“not at all”\) to 3 \(“nearly every day”\), yielding a total score between 0 and 27\. Standard clinical severity thresholds classify scores of 0–4 as minimal, 5–9 as mild, 10–14 as moderate, 15–19 as moderately severe, and 20–27 as severe depression \(Kroenke et al\., [2001](https://arxiv.org/html/2606.17973#bib.bib12)\)\.
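
The scoring and banding rules above are simple enough to express directly. The following is a minimal sketch; the function and constant names (`phq9_total`, `severity_band`, `PHQ9_BANDS`) are illustrative and do not come from the paper.

```python
# Standard PHQ-9 severity bands as listed in the text (Kroenke et al., 2001).
PHQ9_BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def phq9_total(item_scores):
    """Sum nine item scores, each in 0..3, into a total in 0..27."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 has exactly nine items")
    if any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("each item is scored 0, 1, 2, or 3")
    return sum(item_scores)


def severity_band(total):
    """Map an integer PHQ-9 total to its standard severity band name."""
    for low, high, name in PHQ9_BANDS:
        if low <= total <= high:
            return name
    raise ValueError("PHQ-9 total must be between 0 and 27")
```

For example, a total of 16 (the median reported for this cohort) falls in the "moderately severe" band.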

PHQ\-9 scores in our sample are substantially elevated relative to the general population, as expected for a help\-seeking cohort: mean 15\.87 \(SD 6\.42\), median 16\. Table [1](https://arxiv.org/html/2606.17973#S3.T1) shows the distribution using the standard clinical severity bands\.

Table 1: PHQ\-9 severity distribution using standard clinical severity bands \(Kroenke et al\., [2001](https://arxiv.org/html/2606.17973#bib.bib12)\)\.

This distribution is markedly right\-shifted relative to population\-based samples, likely reflecting the help\-seeking nature of the cohort\. The most frequently endorsed items are guilt/worthlessness \(item 5, mean 2\.14\), fatigue \(item 3, mean 2\.13\), and depressed mood \(item 2, mean 2\.01\)\. Psychomotor disturbance \(item 7, mean 1\.21\) and suicidal ideation \(item 9, mean 1\.12\) are the least endorsed on average, though suicidal ideation is still endorsed at a non\-trivial rate, which is relevant given that PHQ\-8 benchmarks omit this item entirely\.

### 3\.3 Conversation Data

Each user’s input is constructed from their Ash conversation messages during the first seven days of use\. Messages are formatted as alternating user and assistant turns\. Users with fewer than 10 messages were excluded\. When a user’s message history exceeded 300 messages, we subsampled to exactly 300 by drawing at evenly\-spaced indices across the full first\-week window, preserving the temporal arc of the conversation rather than truncating\. In the final dataset, conversation length ranges from 10 to 300 messages \(mean 125\.6, median 86\)\.
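
A minimal sketch of this kind of evenly spaced subsampling follows. The exact index rule used by the authors is not given, so the rounding scheme here (and the name `subsample_evenly`) is an assumption chosen to keep the first and last messages and preserve order.

```python
def subsample_evenly(messages, max_messages=300):
    """Keep at most max_messages items, spread evenly over the sequence.

    Assumed rule: pick max_messages positions evenly spaced between the
    first and the last index, rounded to integers. Requires max_messages >= 2.
    """
    n = len(messages)
    if n <= max_messages:
        return list(messages)
    # With n > max_messages the spacing is greater than 1, so consecutive
    # rounded indices are strictly increasing and never repeat.
    step = (n - 1) / (max_messages - 1)
    indices = [round(i * step) for i in range(max_messages)]
    return [messages[i] for i in indices]
```

Compared with truncating to the first 300 messages, this keeps material from the whole first week, at the cost of dropping some adjacent turns.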

### 3\.4 Train/Test Split

We partition users with ground\-truth PHQ\-9 labels at the user level to prevent leakage: 3,111 users are allocated to training and 842 to the held\-out test set\. The split is stratified by PHQ\-9 severity category\. An additional 3,172 unlabeled users are included in training via pseudo\-labeling \(Section [4\.1](https://arxiv.org/html/2606.17973#S4.SS1)\), bringing the total training set to 6,283 users\.
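
A user-level split stratified by severity band could be sketched as below. This is an illustration, not the authors' code: the `test_fraction` default, the seed, and the helper names are assumptions (the reported split of 842 out of 3,953 labeled users corresponds to a test fraction of roughly 0.21).

```python
import random
from collections import defaultdict


def band_index(total):
    """Map an integer PHQ-9 total to band 0..4 (0-4, 5-9, 10-14, 15-19, 20-27)."""
    return min(total // 5, 4)


def stratified_user_split(user_scores, test_fraction=0.2, seed=0):
    """Split user ids into train and test lists, stratified by severity band.

    user_scores maps user id -> PHQ-9 total. Each user appears on exactly
    one side, so no conversation from a test user can leak into training.
    """
    by_band = defaultdict(list)
    for user_id, total in user_scores.items():
        by_band[band_index(total)].append(user_id)

    rng = random.Random(seed)
    train, test = [], []
    for band in sorted(by_band):
        users = sorted(by_band[band])  # sort first so the shuffle is reproducible
        rng.shuffle(users)
        n_test = round(len(users) * test_fraction)
        test.extend(users[:n_test])
        train.extend(users[n_test:])
    return train, test
```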

## 4 Methods

The study was reviewed by the Biomedical Research Alliance of New York \(BRANY\) independent Institutional Review Board \(IRB\) and determined to be exempt under category 4\(ii\) \(BRANY IRB File \# 26\-081\-2393\)\.

### 4\.1 Pseudo\-Label Generation

To supplement the 3,111 labeled users in our training set, and to correct the imbalance that stems from the help\-seeking nature of the population, we generate pseudo\-labels for an additional 3,172 unlabeled Ash conversations, bringing the total training set to 6,283 samples\. Pseudo\-labels are generated from two sources: Claude Opus 4\.6, with a few\-shot \+ reasoning instructions prompt, and an intermediate model\.

#### Claude Opus pseudo\-labels\.

To obtain extra training data, we instructed Anthropic’s Claude Opus 4\.6 to estimate the PHQ\-9 score for unlabelled conversation transcripts\. We sampled 3,000 random users, filtered out those who had exchanged fewer than 10 messages with Ash, and retained those estimated to have PHQ\-9 scores below 10, which yielded pseudolabels for 605 extra users\. The choice to only retain low\-severity pseudolabels here was made to balance out their significant underrepresentation in our ground\-truth dataset\. Details are presented in Appendix [A](https://arxiv.org/html/2606.17973#A1)\.
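
The selection step described above can be sketched as a simple filter. The record layout (dicts with `user_id`, `n_messages`, and `phq9_estimate` keys) is hypothetical and used only for illustration.

```python
def select_low_severity_pseudolabels(candidates, min_messages=10, score_cutoff=10):
    """Keep candidates with enough messages and an estimated score below the cutoff.

    candidates: iterable of dicts with hypothetical keys
    'user_id', 'n_messages', and 'phq9_estimate'.
    """
    kept = []
    for c in candidates:
        if c["n_messages"] < min_messages:
            continue  # too little conversation to label
        if c["phq9_estimate"] >= score_cutoff:
            continue  # only low-severity estimates are retained in this round
        kept.append(c)
    return kept
```

Retaining only estimates below 10 deliberately biases the added labels toward the underrepresented low end of the scale, so these pseudo-labels act as a rebalancing step and do not form a representative sample.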

#### Iterative model pseudo\-labels\.

After an initial round of supervised fine\-tuning on the ground\-truth and Opus\-labeled data \(Section [4\.2](https://arxiv.org/html/2606.17973#S4.SS2)\), we use the trained model itself to generate pseudo\-labels for another 2,567 unlabeled users, sampled and filtered in the same way as for the Opus round\. We include this set in full \(i\.e\., not just the low\-severity users\), yielding a combined pseudo\-labeled set of 3,172 users and a total training set of 6,283\. In Table [2](https://arxiv.org/html/2606.17973#S5.T2) we refer to this set as "iterative pseudo\-labels v1"\.

Training a new model on iterative pseudo\-labels v1 significantly improves performance\. Re\-labelling the same set of pseudo\-labelled users with that improved model yields a modest further improvement \(iterative pseudo\-labels v2\)\. Finally, an ensemble of four models trained on the same dataset, simply averaging their predicted scores, gives the best test set performance we have seen\.
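
The relabel-and-retrain loop and the final averaging step can be sketched generically. This is a simplified illustration under stated assumptions: `train_fn` is a hypothetical callable that takes `(text, score)` pairs and returns a prediction function, and the sketch does not reproduce the separate Opus round or the exact composition of each dataset version.

```python
def relabel_rounds(train_fn, labeled, unlabeled, rounds=2):
    """Alternate between pseudo-labelling the unlabeled pool and retraining.

    labeled: list of (text, score) pairs with trusted labels.
    unlabeled: list of texts.
    train_fn(pairs) -> predict, where predict(text) returns a score.
    """
    predict = train_fn(labeled)
    for _ in range(rounds):
        pseudo = [(text, predict(text)) for text in unlabeled]
        predict = train_fn(labeled + pseudo)
    return predict


def ensemble_mean(per_model_predictions):
    """Average predictions across models; each inner list is one model's scores."""
    n_models = len(per_model_predictions)
    n_users = len(per_model_predictions[0])
    if any(len(p) != n_users for p in per_model_predictions):
        raise ValueError("all models must score the same users")
    return [
        sum(p[i] for p in per_model_predictions) / n_models
        for i in range(n_users)
    ]
```

Averaging is the simplest way to combine models that differ only in initialization; it reduces variance from any single run without adding trainable parameters.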

### 4\.2 PHQ\-9 Regression Model

We attach a learned regression head to the Qwen3\.5\-27B backbone\. The regression head takes the hidden state of the final transformer layer and maps it through a linear layer to a single continuous PHQ\-9 total score prediction\.
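
In symbols, one way to write this head is shown below. The notation is ours, not the paper's: $h_i \in \mathbb{R}^d$ denotes the final-layer hidden state passed to the head for user $i$ (the text does not specify which token position or pooling is used), $w$ and $b$ are the head's weight vector and bias, $y_i$ is the ground-truth or pseudo-label PHQ-9 total, and $\theta$ stands for the trainable adapter parameters. The loss shown is the mean squared error on the total-score target.

$$
\hat{y}_i = w^\top h_i + b, \qquad
\mathcal{L}(w, b, \theta) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
$$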

Each user’s input consists of their first\-week conversation history, formatted as alternating user and assistant turns \(with the 300\-message subsampling described in Section [3\.3](https://arxiv.org/html/2606.17973#S3.SS3)\)\. The sequence is tokenized with the model’s chat template \(apply\_chat\_template\) with a maximum sequence length of 16,384 tokens\. A LoRA adapter is trained for 5 epochs on the adapted backbone using MSE loss on the total\-score regression target, with a learning rate of 5e\-5, a cosine schedule, a warm\-up fraction of 0\.1, and an update batch size of 8 \(using gradient accumulation\)\.
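
To make the schedule concrete, here is a minimal sketch of a cosine learning-rate schedule with warm-up using the stated peak rate and warm-up fraction. The exact shape is an assumption: linear warm-up followed by cosine decay to zero is a common choice, but the paper does not say which variant was used.

```python
import math


def learning_rate(step, total_steps, peak_lr=5e-5, warmup_fraction=0.1):
    """Learning rate at a given optimizer step (0-indexed).

    Assumed shape: linear warm-up to peak_lr over the first
    warmup_fraction of steps, then cosine decay toward zero.
    """
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With an update batch size of 8 reached through gradient accumulation, `step` here counts optimizer updates, not individual forward passes.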

### 4\.3 Evaluation Metrics

We evaluate our model using the following metrics on the held\-out test set:

- Mean Absolute Error \(MAE\): Average absolute difference between predicted and true PHQ\-9 scores\.
- Root Mean Squared Error \(RMSE\): Square root of the mean squared prediction error, more sensitive to large errors\.
- Pearson correlation \(r\): Linear correlation between predicted and true scores\.
- Normalized MAE \(NMAE\): MAE divided by the PHQ\-9 score range \(27\), allowing comparison with PHQ\-8 benchmarks \(range 24\)\.
- Binary classification: We report the area under the receiver operating characteristic curve \(AUROC, hereafter AUC\), sensitivity, specificity, and F1 at the standard PHQ\-9 ≥ 10 clinical threshold\. We also sweep all thresholds from ≥ 3 to ≥ 24 to assess discrimination across the full severity range\.
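
The metrics listed above can be sketched in a few lines of dependency-free Python. This is an illustration of the definitions, not the authors' evaluation code; the AUC is computed in its pairwise (Mann-Whitney) form, with the true score binarized at a threshold and the continuous prediction used as the ranking score.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(y_true, y_pred):
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def auc_at_threshold(y_true, y_pred, threshold):
    """AUROC for the binary target 'true score >= threshold'."""
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    if not pos or not neg:
        raise ValueError("both classes must be present at this threshold")
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def auc_sweep(y_true, y_pred, thresholds=range(3, 25)):
    """AUC at every threshold from >= 3 to >= 24."""
    return {k: auc_at_threshold(y_true, y_pred, k) for k in thresholds}
```

NMAE is then `mae(y_true, y_pred) / 27` for PHQ-9 totals. The pairwise AUC is quadratic in sample size, which is acceptable for a test set of a few hundred users.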

## 5 Results

Table [2](https://arxiv.org/html/2606.17973#S5.T2) shows PHQ\-9 prediction performance across model variants on the held\-out test set\.

Table 2: PHQ\-9 prediction performance on the held\-out test set \(N = 842\)\.

### 5\.1 Contextual Comparison with Related Work

Table [3](https://arxiv.org/html/2606.17973#S5.T3) places our results alongside those from related settings\. Because the datasets, populations, clinical instruments, and task definitions differ across studies, the numbers are not directly comparable; we include them to situate our work within the broader literature and discuss the differences in Section [6\.2](https://arxiv.org/html/2606.17973#S6.SS2)\.

Table 3: Contextual comparison with related work\. Settings, instruments, and evaluation tasks differ across rows; see text for discussion\.

### 5\.2 Binary Screening and Multi\-Threshold Discrimination

Table [4](https://arxiv.org/html/2606.17973#S5.T4) shows depression screening performance at the PHQ\-9 ≥ 10 and ≥ 15 thresholds\. Note that at the traditional clinical threshold of 10, the sensitivity\-specificity trade\-off is heavily influenced by the occurrence rates in our ecological sample, which simply did not contain many users with scores below 10\.

Table 4: Binary depression screening performance \(PHQ\-9 below vs above threshold\)\.

A key advantage of our approach is that strong discrimination extends across the full PHQ\-9 severity range\. Table [5](https://arxiv.org/html/2606.17973#S5.T5) shows AUC at every threshold from ≥ 3 to ≥ 24; AUC remains above 0\.871 at every point\. This contrasts with binary screening models optimised and evaluated at a single cutoff\. The model can therefore support clinical decisions at multiple severity levels, for example identifying users who cross from mild into moderate \(PHQ\-9 ≥ 10\) or from moderate into moderately severe \(PHQ\-9 ≥ 15\), each of which may warrant a different clinical response\.

Table 5: AUC scores \(PHQ\-9 below vs above threshold\)\.

### 5\.3 Error Analysis

Table [6](https://arxiv.org/html/2606.17973#S5.T6) shows a 5×5 confusion matrix with rows representing true severity band and columns representing predicted severity band \(predicted score binned into the same five standard ranges\)\. Per\-band accuracy and MAE are shown alongside\.

Table 6: 5×5 confusion matrix by severity band \(ensemble, N = 842\)\. Rows = true band; columns = predicted band \(predicted total score binned into standard ranges\)\. Per\-band in\-band accuracy and MAE are shown in the rightmost columns\.

Prediction recall increases with severity: the model correctly bands 69% of severe cases but only 22% of minimal cases\. Most errors are adjacent\-band misclassifications \(e\.g\., minimal predicted as mild, moderate predicted as moderately severe\), with few large cross\-band errors: the within\-1\-band accuracy is 92%\. The low recall for minimal cases reflects the fact that the model, trained predominantly on a high\-severity cohort \(81% cases\), tends to predict scores in the moderate range even for users with few symptoms\. Errors for moderate and moderately\-severe users are smallest in absolute terms \(MAE 2\.45 and 1\.69\), where the training data is densest\.
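
The band-level quantities discussed here (confusion matrix, per-band recall, within-1-band accuracy) can be computed as sketched below. How continuous predictions are binned is not specified in the text, so rounding to the nearest integer and clipping to 0..27 before banding is an assumption of this sketch.

```python
def band_index(score):
    """Map a (possibly fractional) score to band 0..4.

    Assumption: round to the nearest integer and clip to 0..27 first.
    """
    s = min(max(round(score), 0), 27)
    return min(s // 5, 4)


def confusion_by_band(y_true, y_pred):
    """5x5 matrix: rows are true bands, columns are predicted bands."""
    matrix = [[0] * 5 for _ in range(5)]
    for t, p in zip(y_true, y_pred):
        matrix[band_index(t)][band_index(p)] += 1
    return matrix


def within_k_band_accuracy(matrix, k=1):
    """Share of users whose predicted band is at most k bands from the true one."""
    total = sum(sum(row) for row in matrix)
    hits = sum(matrix[i][j] for i in range(5) for j in range(5) if abs(i - j) <= k)
    return hits / total


def per_band_recall(matrix):
    """In-band accuracy per true band; None for bands with no users."""
    return [row[i] / sum(row) if sum(row) else None for i, row in enumerate(matrix)]
```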

![Refer to caption](https://arxiv.org/html/2606.17973v1/phq9-calibration-4ens.png)Figure 1: Predicted vs\. true scores\. A clear dominant band of correct predictions appears, as well as an error pattern that shows the classic reversion to the mean\.

## 6 Discussion

### 6\.1 Summary of Findings

We demonstrate that PHQ\-9 depression severity can be predicted from naturalistic AI mental health conversations with MAE 2\.6, RMSE 4\.0, Pearson r = 0\.80, and AUC 0\.91 at the clinical screening threshold of PHQ\-9 ≥ 10\. We found that the creation of pseudolabels to expand and balance the training data helped increase the accuracy of the final model\. The predictions discriminate depression severity across the full PHQ\-9 range, with AUC above 0\.87 at every threshold from ≥ 3 to ≥ 24, enabling graded clinical distinctions rather than binary case detection\.

### 6\.2 Contextual Comparison with Related Work

Because no prior work uses the same setting, instrument, and evaluation protocol, the numbers in Table [3](https://arxiv.org/html/2606.17973#S5.T3) are contextual rather than competitive\. Several points are worth noting\.

Compared to transcript\-regression models on DAIC\-WOZ and E\-DAIC \(Schmidt et al\., [2025](https://arxiv.org/html/2606.17973#bib.bib13)\), our raw MAE \(2\.62 vs\. 3\.55–3\.85\) is lower, and the comparison becomes more favourable when normalised by score range: our NMAE of 0\.097 \(MAE/27\) versus 0\.148–0\.160 \(MAE/24\) for PHQ\-8 benchmarks, a 34–39% improvement\.
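
The normalisation arithmetic can be checked directly from the figures quoted above. The variable names are illustrative.

```python
def nmae(mae, score_range):
    """Normalized MAE: mean absolute error divided by the instrument's score range."""
    return mae / score_range


ours = nmae(2.62, 27)       # PHQ-9, range 0..27
daic_low = nmae(3.55, 24)   # PHQ-8, range 0..24, best reported MAE
daic_high = nmae(3.85, 24)  # PHQ-8, range 0..24, upper reported MAE

# Relative reduction in NMAE against each end of the PHQ-8 range.
rel_low = 1 - ours / daic_low
rel_high = 1 - ours / daic_high

print(f"NMAE ours={ours:.3f}, PHQ-8 benchmarks={daic_low:.3f}-{daic_high:.3f}")
print(f"relative reduction: {rel_low:.1%} to {rel_high:.1%}")
```

Computed from the unrounded ratios, the reduction is about 34.4% to 39.5%; the quoted 34–39% matches what one obtains from the rounded NMAE values (0.097, 0.148, 0.160).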

Compared to Shin et al\. \([2024](https://arxiv.org/html/2606.17973#bib.bib19)\), who predict PHQ\-9 from diary entries in a mental health app, our sensitivity \(0\.966 vs\. 0\.615\) is substantially higher at the cost of lower specificity \(0\.627 vs\. 0\.955\)\. For depression screening, missing a case \(false negative\) is generally considered more harmful than a false alarm, so this trade\-off is clinically appropriate\. Our lower specificity partly reflects calibration toward the help\-seeking cohort \(81\.5% cases in the test set\); raising the decision threshold above 10 yields more balanced sensitivity/specificity without retraining\. The AUC of 0\.91 in our setting vs\. the absence of a reported AUC in theirs also reflects that we are solving the harder continuous regression problem, not only binary classification\.
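
The effect of moving the decision threshold without retraining can be illustrated with a small helper. This is a sketch with made-up toy scores; `screening_stats` and its parameter names are not from the paper.

```python
def screening_stats(y_true, y_pred, true_cutoff=10, decision_threshold=10.0):
    """Sensitivity and specificity for 'true score >= true_cutoff'.

    A user is flagged when the predicted score is >= decision_threshold,
    which can be tuned independently of the clinical cutoff.
    """
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        actual = t >= true_cutoff
        flagged = p >= decision_threshold
        if actual and flagged:
            tp += 1
        elif actual:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity
```

On the toy data `y_true = [4, 8, 12, 18]` and `y_pred = [9.0, 11.0, 12.5, 17.0]`, flagging at 10 gives sensitivity 1.0 and specificity 0.5, while flagging at 12 gives 1.0 and 1.0. This mirrors, in miniature, how over-prediction for low scorers depresses specificity at the nominal cutoff.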

Compared to Weber et al\. \([2025](https://arxiv.org/html/2606.17973#bib.bib20)\), both approaches fine\-tune LLMs on clinical text and attach a regression head over the backbone’s hidden state\. The decisive difference is the source of the language: their MADRS items are elicited by a clinician through structured prompts, whereas our PHQ\-9 score must be inferred from unguided therapeutic dialogue\. This is not only a harder inference problem but the one that matters in practice: a model that depends on clinician\-administered prompts cannot operate where no clinician is present, which is precisely where passive, scalable screening is needed\.

### 6\.3 Cross\-Severity Granularity

Standard binary screening tools are calibrated at a single severity threshold and cannot distinguish between, for example, a user with PHQ\-9 8 \(mild\) and PHQ\-9 18 \(moderately severe\)\. Our model’s AUC exceeds 0\.87 at every threshold across the clinical spectrum, meaning it can support graded clinical decisions: triaging users into severity bands, tracking within\-user changes over time, or triggering escalation when a user crosses from one severity category into the next\. This granularity is a direct consequence of predicting the full continuous score rather than optimising for a single binary cutoff\.

### 6\.4 Clinical Implications

If PHQ\-9 prediction from AI mental health support transcripts achieves sufficient accuracy, it could enable several clinical applications:

- Passive symptom monitoring\. Users could receive ongoing symptom severity estimates without completing questionnaires, reducing assessment burden and increasing measurement frequency, particularly given the known biases in who chooses to complete self\-report questionnaires in real\-world settings\.
- Early deterioration detection\. Sudden increases in predicted severity could trigger alerts to human clinicians or prompt the AI tool to recommend professional consultation\.
- Treatment personalization\. Severity estimates could inform adaptive treatment algorithms that adjust the AI’s intervention strategy if the user’s predicted PHQ scores suggest they are not responding to the original strategy\.
- Population\-level screening\. Aggregated predictions could identify cohorts at elevated risk, informing public health interventions\. Here it is important to note a potential limitation of this work: people using Ash may be self\-selected in ways that are hard to measure reliably, and those who fill out the PHQ\-9 questionnaire in the app may be a subpopulation that is not fully representative of all Ash users\. Hence, generalisation to the population at large will require further confirmation, and potentially further tuning of the training procedure\.

## 7 Conclusion

We present, to our knowledge, the first large\-scale study of PHQ\-9 score prediction from AI\-mediated therapy conversations\. By fine\-tuning Qwen3\.5\-27B with a regression head, performing domain adaptation via user simulation training, and augmenting labeled data through pseudo\-labeling \(605 conversations from Claude Opus and 2,567 from an iteratively trained model\), we achieve MAE 2\.6, RMSE 4\.0, Pearson r = 0\.80, and AUC 0\.91 on a held\-out test set of 842 users, with AUC above 0\.87 at every PHQ\-9 severity threshold from ≥ 3 to ≥ 24\. Our results suggest that conversational text generated in AI mental health sessions contains sufficient signal to estimate depressive symptom severity, opening the door to passive monitoring, adaptive treatment, and scalable mental health screening without requiring users to complete questionnaires at all\. Future work in this direction could evaluate generalization across platforms, predict other clinical constructs, and explore temporal dynamics of predicted severity over the course of treatment\.

## Appendix A: Opus pseudolabelling

The prompt for Opus to produce the pseudolabels is quite minimal:

> You are an expert clinical psychologist specialising in depression assessment using the Patient Health Questionnaire\-9 \(PHQ\-9\), a validated 9\-item scale scored 0–27\.
>
> Severity bands:
>
> - 0–4 Minimal depression
> - 5–9 Mild depression
> - 10–14 Moderate depression
> - 15–19 Moderately severe depression
> - 20–27 Severe depression
>
> Your task: read a conversation between a user and an AI therapist, then estimate the user’s current PHQ\-9 score\.
>
> Consider:
>
> - Symptom language \(hopelessness, worthlessness, sleep, appetite, energy\)
> - Emotional tone and affect intensity
> - Functional impairment described
> - Clinical vocabulary and self\-awareness
> - Persistence and pervasiveness of depressive themes
>
> Calibration matters — avoid compressing scores toward the middle\. Below are calibration examples with clinician\-verified PHQ\-9 scores\.
>
> \[FEW\-SHOT EXAMPLES\]
>
> For each new conversation provided, respond with ONLY valid JSON: "phq9\_score": <number 0\-27\>, "reasoning": "<1\-2 sentences\>"
>
> Do not include any other text outside the JSON object\.
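
A response in the requested format could be parsed and validated as sketched below. This assumes the model returns a single JSON object with the two keys named in the prompt (`phq9_score` and `reasoning`); the function name and error handling are illustrative, and the paper does not describe its parsing code.

```python
import json


def parse_phq9_response(raw):
    """Parse a pseudo-label response and validate the score range.

    Returns (score, reasoning). Raises ValueError on malformed output so
    that bad responses can be dropped or retried instead of silently used.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("response is not valid JSON") from exc
    if not isinstance(data, dict) or "phq9_score" not in data:
        raise ValueError("missing 'phq9_score'")
    score = data["phq9_score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'phq9_score' must be a number")
    if not 0 <= score <= 27:
        raise ValueError("'phq9_score' must be between 0 and 27")
    return float(score), str(data.get("reasoning", ""))
```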
