# Predicting Psychological Well-Being from Spontaneous Speech using LLMs
Source: [https://arxiv.org/html/2605.11303](https://arxiv.org/html/2605.11303)
Erfan Loweimi¹,²† ([ORCID](https://orcid.org/0000-0002-8761-021X)) and Saturnino Luz ([ORCID](https://orcid.org/0000-0001-8430-7875))

†This work was conducted while the first author was with the Centre for Medical Informatics (CMI), Usher Institute, University of Edinburgh.
###### Abstract
We investigate the use of Large Language Models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using a few minutes of voice recordings from 111 participants in the PsyVoiD database, we evaluated 12 instruction-tuned LLMs, including Llama-3 (8B, 70B), Ministral, Mistral, Gemma-2-9B, Gemma-3 (1B, 4B, 27B), Phi-4, DeepSeek (Qwen and Llama), and QwQ-Preview. A domain-informed prompt was developed in collaboration with experts in clinical psychology and linguistics. Results show that LLMs can extract semantically meaningful cues from spontaneous speech, achieving Spearman correlations of up to 0.8 on 80% of the data. Additionally, to enhance explainability, we conducted statistical analyses to characterise prediction variability and systematic biases, alongside keyword-based word cloud analyses to highlight the linguistic features driving the models' predictions.
## I. Introduction
Psychological well-being is central to overall health, resilience to adverse events, and daily functioning. Recent global trends highlight its importance, with mental health challenges affecting millions annually [[30](https://arxiv.org/html/2605.11303#bib.bib130)]. For example, the COVID-19 pandemic and its wide-ranging effects on health, the economy, and other domains substantially intensified psychological distress worldwide [[20](https://arxiv.org/html/2605.11303#bib.bib131), [12](https://arxiv.org/html/2605.11303#bib.bib8)]. Timely identification and longitudinal monitoring of well-being are therefore essential.
Conventional assessment of psychological well-being relies on clinical interviews [[35](https://arxiv.org/html/2605.11303#bib.bib132), [31](https://arxiv.org/html/2605.11303#bib.bib133)] and self-report questionnaires. Although informative, these methods are subjective, resource-intensive, and face significant scalability challenges [[18](https://arxiv.org/html/2605.11303#bib.bib135), [19](https://arxiv.org/html/2605.11303#bib.bib134)]. Because spoken language inherently encodes internal states through both acoustic and linguistic cues [[37](https://arxiv.org/html/2605.11303#bib.bib7)], recent advances in AI have created new opportunities to harness these signals for non-invasive, low-cost screening [[6](https://arxiv.org/html/2605.11303#bib.bib6), [26](https://arxiv.org/html/2605.11303#bib.bib157), [14](https://arxiv.org/html/2605.11303#bib.bib136), [4](https://arxiv.org/html/2605.11303#bib.bib168), [36](https://arxiv.org/html/2605.11303#bib.bib169)].
Large language models (LLMs) enhance speech-based assessment by extracting psychological markers from spontaneous language. They have been shown to approximate clinical scoring systems such as the Hospital Anxiety and Depression Scale (HADS) [[39](https://arxiv.org/html/2605.11303#bib.bib43)], achieving reasonable agreement with human-coded assessments [[15](https://arxiv.org/html/2605.11303#bib.bib165)]. Similar approaches have been applied to depression detection from speech using multimodal LLM architectures [[13](https://arxiv.org/html/2605.11303#bib.bib172), [22](https://arxiv.org/html/2605.11303#bib.bib173)]. With carefully designed prompts, LLMs can serve as rapid and scalable proxies for traditional screening tools, especially where annotated data are limited or costly to obtain. Despite this promise in mental health monitoring [[9](https://arxiv.org/html/2605.11303#bib.bib120), [10](https://arxiv.org/html/2605.11303#bib.bib170)], the prediction of psychological well-being from spontaneous speech in *zero-shot* [[21](https://arxiv.org/html/2605.11303#bib.bib160), [3](https://arxiv.org/html/2605.11303#bib.bib161)] settings, that is, without task-specific fine-tuning, remains underexplored.
Building on this, we extend the zero-shot evaluation paradigm from predicting symptom-focused measures such as HADS to Ryff's Psychological Well-Being (PWB) framework [[28](https://arxiv.org/html/2605.11303#bib.bib1), [27](https://arxiv.org/html/2605.11303#bib.bib3)]. Unlike clinical tools that primarily assess distress or dysfunction, Ryff's framework provides a holistic, eudaimonic account of well-being, making it a rigorous testbed for evaluating whether LLMs can infer higher-order psychological constructs from spontaneous speech [[29](https://arxiv.org/html/2605.11303#bib.bib166)].
Recent work has questioned whether human-centric frameworks like Ryff's align with how LLMs conceptualise well-being. For instance, [[11](https://arxiv.org/html/2605.11303#bib.bib167)] analysed LLM responses to open-ended prompts on "flourishing" and introduced the PAPERS framework (Purposeful Contribution, Adaptive Growth, Positive Relationality, Ethical Integrity, Robust Functionality, Self-Actualised Autonomy). Their findings suggest that LLMs generate internally coherent but machine-oriented accounts of well-being, emphasising effectiveness and compliance with instructions over autonomy or existential meaning. This raises a central question: do LLMs capture genuine markers of human psychological states, or merely computational analogues that approximate them?
To address this question, we predict Ryff's scales to evaluate how well LLMs can infer PWB from unstructured personal narratives. Specifically, we test whether instruction-tuned LLMs can estimate Ryff PWB scores [[28](https://arxiv.org/html/2605.11303#bib.bib1), [27](https://arxiv.org/html/2605.11303#bib.bib3)] from short spontaneous speech recordings collected during the COVID-19 lockdown as part of the PsyVoiD dataset [[5](https://arxiv.org/html/2605.11303#bib.bib5)]. Participants provided brief monologues describing their daily experiences under lockdown. Performance is reported using both the Pearson correlation coefficient (PCC) and the Spearman correlation coefficient (SCC), complemented by additional statistical tests.
Our contributions are threefold:
- **Zero-shot well-being prediction:** We evaluate twelve instruction-tuned large language models (LLMs), namely Meta-Llama [[8](https://arxiv.org/html/2605.11303#bib.bib149)] (3.1-8B [[2](https://arxiv.org/html/2605.11303#bib.bib150)] and 3.3-70B [[8](https://arxiv.org/html/2605.11303#bib.bib149)]), Microsoft Phi-4 [[1](https://arxiv.org/html/2605.11303#bib.bib152)], Google Gemma-2-9B [[32](https://arxiv.org/html/2605.11303#bib.bib153)], Google Gemma-3 (1B, 4B, 27B) [[33](https://arxiv.org/html/2605.11303#bib.bib154)], Ministral-2410 [[16](https://arxiv.org/html/2605.11303#bib.bib155)], Mistral-NeMo-2407 [[17](https://arxiv.org/html/2605.11303#bib.bib156)], QwQ-32B-Preview [[34](https://arxiv.org/html/2605.11303#bib.bib146), [38](https://arxiv.org/html/2605.11303#bib.bib147)], DeepSeek-R1-Distill-Qwen-32B (DeepSeek Qwen) [[7](https://arxiv.org/html/2605.11303#bib.bib148)], and DeepSeek-R1-Distill-Llama-70B (DeepSeek Llama) [[7](https://arxiv.org/html/2605.11303#bib.bib148)], for zero-shot prediction of Ryff's Psychological Well-Being (PWB) dimensions from spontaneous speech transcripts.
- **Psychologically informed prompt design:** We develop and evaluate prompts that integrate established prompt engineering strategies with domain knowledge from psychological well-being research to guide LLM reasoning and output structure.
- **Model behaviour analysis and interpretability:** We conduct extensive statistical analyses and linguistic profiling of LLM outputs to characterise behavioural patterns and linguistic cues associated with well-being prediction.
The rest of this paper is structured as follows. After describing the data and the Ryff scale in Section [II](https://arxiv.org/html/2605.11303#S2), Section [III](https://arxiv.org/html/2605.11303#S3) presents the workflow, including the prompt engineering approach (Section [III-B](https://arxiv.org/html/2605.11303#S3.SS2)); Section [IV](https://arxiv.org/html/2605.11303#S4) presents results along with discussion, statistical analysis, and keyword visualisation; Section [V](https://arxiv.org/html/2605.11303#S5) concludes the paper.
## II. Psychological Assessment
### II-A. PsyVoiD Dataset
The PsyVoiD dataset [[5](https://arxiv.org/html/2605.11303#bib.bib5)] was collected through a large-scale, anonymous survey and comprises 111 participants (70 female, 41 male), aged 21–86, all residing in Scotland during the COVID-19 lockdown. Of these, 34 participants (31%) reported a history of depression. Each recording lasts one to two minutes, containing an average of 150 words and 92 unique words per sample, with an average articulation rate of approximately 2 words per second. Table [I](https://arxiv.org/html/2605.11303#S2.T1) presents descriptive statistics (mean, median, standard deviation (STD), minimum, and maximum) for some dataset attributes.
### II-B. Psychological Well-being Measurement
The reference measure for psychological assessment in this study is the Ryff Psychological Well-Being (PWB) scales [[28](https://arxiv.org/html/2605.11303#bib.bib1), [27](https://arxiv.org/html/2605.11303#bib.bib3)], a widely validated self-report instrument. The PWB framework comprises six dimensions: *autonomy*, *environmental mastery*, *personal growth*, *positive relations with others*, *purpose in life*, and *self-acceptance*. Items are rated on a Likert-type scale, with higher values reflecting greater well-being (with reverse-keyed items scored accordingly). Subscale scores are obtained by aggregating item responses for each dimension; an overall index can also be derived following standard scoring practice. Descriptive statistics for the Ryff PWB scale are also reported in Table [I](https://arxiv.org/html/2605.11303#S2.T1).
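The aggregation with reverse-keyed items can be sketched as follows. This is a minimal illustration, not the scoring code used in the study: the 7-point Likert range and the item identifiers are assumptions, since the paper does not specify the administration details.

```python
def score_subscale(responses, reverse_keyed, scale_max=7):
    """Aggregate Likert item responses (1..scale_max) into a subscale score.

    Reverse-keyed items are flipped before summing (r -> scale_max + 1 - r),
    so that higher subscale scores always reflect greater well-being.
    responses: dict mapping item id -> rating; reverse_keyed: set of item ids.
    """
    total = 0
    for item, rating in responses.items():
        total += (scale_max + 1 - rating) if item in reverse_keyed else rating
    return total

# Hypothetical example: three items on a 1-7 scale, one of them reverse-keyed.
ratings = {"item1": 6, "item2": 2, "item3": 5}
autonomy = score_subscale(ratings, reverse_keyed={"item2"})  # 6 + (8 - 2) + 5 = 17
```

Summing the six subscale scores then yields the overall PWB index used as the prediction target later in the paper.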
TABLE I: Descriptive statistics on PsyVoiD's 111 subjects
Figure 1: Workflow for zero-shot Ryff well-being estimation: ASR front-end, prompt stage, and LLM inference.
TABLE II: WER on PsyVoiD for various Whisper models
## III. Workflow
Fig. [1](https://arxiv.org/html/2605.11303#S2.F1) presents the system architecture, comprising a speech-to-text (automatic speech recognition) front-end, a prompt engineering module, and an LLM-based decision back-end.
### III-A. Speech-to-Text Conversion
Speech recordings can be converted into text either through manual annotation or by using automatic speech recognition (ASR) systems. Recent state-of-the-art ASR models, such as OpenAI's Whisper [[25](https://arxiv.org/html/2605.11303#bib.bib162)], achieve strong performance and are robust against noise, speaker variability, and spontaneous speech. Nevertheless, transcription errors remain non-negligible.
As shown in Table [II](https://arxiv.org/html/2605.11303#S2.T2), even Whisper Large-v3 exhibits a Word Error Rate (WER) of approximately 9.2% on the PsyVoiD data, meaning that on average, one out of every eleven words is transcribed incorrectly. In addition to typical substitution, deletion, and insertion errors, ASR models can also produce hallucinations, such as generating repetitive or extraneous phrases that are not present in the original speech. These transcription errors, including both distortions of linguistic content and hallucinations, can mislead the language model and compromise the analysis of psychologically relevant features such as hesitation markers, self-referential language, and affective expressions.
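For concreteness, WER is the word-level Levenshtein distance (substitutions, deletions, insertions) normalised by reference length. The sketch below is a minimal re-implementation for illustration, not the evaluation code used in the paper:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / #reference words,
    computed via dynamic-programming edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion over a six-word reference: WER = 2/6
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

At a WER of 9.2%, roughly one word in eleven is wrong in this sense, which is why hesitation markers and short affective phrases are particularly vulnerable.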
To remove the confounding effect of ASR errors and ensure that the input text accurately reflects the original speech, we therefore rely on manually annotated transcripts in this study. This choice allows us to isolate the performance of the downstream large language models and ensures that the observed effects are attributable to linguistic modelling rather than transcription noise.
### III-B. LLM Prompt Engineering
We feed the manual transcripts to LLMs to estimate psychological well-being (i.e., Ryff scores) from spontaneous speech. In contrast to conventional supervised approaches, which require task-specific training, LLMs can operate in a *zero-shot* regime [[21](https://arxiv.org/html/2605.11303#bib.bib160), [3](https://arxiv.org/html/2605.11303#bib.bib161)], leveraging broad linguistic and psychosocial priors learnt during pre-training to infer relevant constructs [[9](https://arxiv.org/html/2605.11303#bib.bib120)]. To enhance reliability and interpretability, we design a domain-informed prompt co-developed with input from clinical psychology and linguistics, framing the task as a structured assessment conducted by a clinician–linguist team. This role-based prompting strategy has been shown to better align model outputs with expert reasoning in domain-specific tasks [[9](https://arxiv.org/html/2605.11303#bib.bib120), [24](https://arxiv.org/html/2605.11303#bib.bib171)].
The prompt (Fig. [2](https://arxiv.org/html/2605.11303#S3.F2)) is designed to guide the LLM as an expert clinical psychologist evaluating psychological well-being from spontaneous speech. The model analyses transcripts of participants describing their typical day during the COVID-19 lockdown, mapping content to the six Ryff PWB dimensions: Autonomy, Environmental Mastery, Personal Growth, Positive Relations with Others, Purpose in Life, and Self-Acceptance. For each dimension, the LLM assigns a score (3–21), interprets the meaning of low, moderate, and high well-being, and provides supporting evidence by extracting indicative keywords and relevant transcript excerpts. The predicted overall Ryff score is calculated as the sum of all six dimension scores. The output is structured in JSON format, with scores, keywords, and evidence for all six dimensions.
By combining this psychologically grounded role-based prompting with structured linguistic analysis and explicit justifications, the approach enhances robustness, explainability, and transparent interpretability in zero-shot LLM-based assessments of multidimensional well-being.
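The prompt and output contract described above can be sketched as follows. The wording here is a hypothetical simplification (the deployed prompt appears in Fig. 2); only the dimension names, the 3–21 score range, the JSON fields, and the sum-based overall score are taken from the paper:

```python
import json

RYFF_DIMENSIONS = [
    "Autonomy", "Environmental Mastery", "Personal Growth",
    "Positive Relations with Others", "Purpose in Life", "Self-Acceptance",
]

def build_prompt(transcript: str) -> str:
    """Assemble a role-based, zero-shot prompt requesting structured JSON output."""
    return (
        "You are an expert clinical psychologist collaborating with a linguist.\n"
        "Assess the speaker's psychological well-being from the transcript of a\n"
        "monologue about their typical day during the COVID-19 lockdown.\n"
        f"Score each Ryff dimension from 3 to 21: {', '.join(RYFF_DIMENSIONS)}.\n"
        "For each dimension return the score, indicative keywords, and supporting\n"
        "transcript excerpts, as JSON of the form:\n"
        '{"<dimension>": {"score": <int>, "keywords": [...], "evidence": [...]}}\n\n'
        f"Transcript:\n{transcript}"
    )

def overall_ryff(llm_output: str) -> int:
    """Overall Ryff score = sum of the six dimension scores (range 18-126)."""
    scores = json.loads(llm_output)
    return sum(scores[d]["score"] for d in RYFF_DIMENSIONS)
```

Requesting JSON with per-dimension evidence is what makes the downstream keyword and statistical analyses in Section IV possible without any post-hoc annotation.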
Figure 2: Prompt design for Ryff PWB inference via LLMs; each colour corresponds to a different aspect of the prompt.
## IV. Experimental Results and Discussion
### IV-A. Performance Evaluation
Table [III](https://arxiv.org/html/2605.11303#S4.T3) reports the Pearson and Spearman correlations between the ground-truth Ryff scores, derived from standard questionnaires completed by participants, and the Ryff scores predicted by the LLMs. The highest Pearson correlation coefficients are achieved by Meta-Llama-3.3-70B and DeepSeek-Qwen, while the highest Spearman correlations are observed for DeepSeek-Llama and Meta-Llama-3.3-70B.¹

¹ As a practical note, in our experiments the DeepSeek-R1-Distill-Llama-70B and Llama-3.3-70B-Instruct models, both very large models with 70B parameters, were quantised to 8-bit precision due to GPU memory constraints, while all other models were run using bfloat16 precision. Consequently, the reported results for these two models may not fully reflect their performance under higher-precision settings.
Although model rankings are broadly similar across both correlation metrics, some differences highlight the need to consider which metric is more appropriate. Pearson correlation assumes a linear relationship and Gaussian-distributed data; however, these assumptions are violated for Ryff scores, which are bounded and skewed (as shown later in Fig. [5](https://arxiv.org/html/2605.11303#S4.F5)). In contrast, Spearman correlation, being rank-based, is robust to these issues, providing a more reliable measure of agreement between predicted and actual Ryff scores in this context.
Are all of these correlations statistically significant? To answer this, we computed p-values for both Pearson and Spearman correlations using the `scipy.stats` library [[23](https://arxiv.org/html/2605.11303#bib.bib164)]. Specifically, `scipy.stats.pearsonr` returns the Pearson correlation coefficient (r) along with a two-tailed p-value under the null hypothesis H₀: r = 0, assuming the data are drawn from a bivariate normal distribution. Similarly, `scipy.stats.spearmanr` computes the SCC (ρ) and a corresponding two-tailed p-value under H₀: ρ = 0.
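These significance tests take only a few lines with `scipy`; the score values below are made up for illustration and are not from the dataset:

```python
from scipy import stats

# Hypothetical ground-truth and LLM-predicted overall Ryff scores
y_true = [95, 88, 102, 76, 110, 91, 84, 99]
y_pred = [90, 85, 100, 80, 105, 88, 86, 95]

r, p_pearson = stats.pearsonr(y_true, y_pred)      # H0: r = 0, two-tailed
rho, p_spearman = stats.spearmanr(y_true, y_pred)  # H0: rho = 0, two-tailed
print(f"PCC={r:.3f} (p={p_pearson:.3g}), SCC={rho:.3f} (p={p_spearman:.3g})")
```

With only eight points the p-values are driven heavily by sample size, which is exactly why they must be reported alongside the coefficients.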
As shown in Table [III](https://arxiv.org/html/2605.11303#S4.T3), the p-values for most models are well below 0.01, indicating statistically significant correlations. Exceptions include Gemma-3-1B, for which the Pearson and Spearman p-values are 0.0285 and 0.0119, respectively; although higher, these values remain below the 0.05 threshold. In contrast, for DeepSeek-Llama, which achieved the highest SCC, the p-values are very high (0.885 for Pearson, 0.667 for Spearman), indicating that these correlations are not statistically significant. While this occurs for only one LLM, it highlights the importance of reporting p-values alongside correlation coefficients in healthcare applications. High correlation coefficients alone can be misleading if sample size, variance, or distributional assumptions prevent statistical significance. Therefore, all results in Table [III](https://arxiv.org/html/2605.11303#S4.T3) should be interpreted together with their p-values to ensure conclusions are statistically meaningful.
### IV-B. Cumulative Correlation Analysis
The best correlation, 0.444 for Meta-Llama-3.3-70B, is statistically significant but modest, raising the question of whether limitations stem from the data or the LLM itself. It is important to note that estimating Ryff scores from only a few minutes of spontaneous speech is inherently difficult. As shown in Fig. [3](https://arxiv.org/html/2605.11303#S4.F3), roughly 20% of transcripts contain fewer than 60 unique words. While word count is only a coarse proxy for informational richness, it illustrates the scarcity of linguistic cues in some recordings. The overall Spearman correlation therefore reflects not only model performance but also these cases, where limited linguistic and contextual information constrains prediction accuracy. Such limitations pose challenges even for human experts, not just for the LLMs.
To further assess model reliability, we analysed Spearman correlations under *progressive data retention*. Files were sorted by the absolute difference between LLM-predicted and ground-truth Ryff scores, and cumulative correlations were computed iteratively as more data were included (n = 2, …, N, with N the total number of files). Data retention is defined as in Equation ([1](https://arxiv.org/html/2605.11303#S4.E1)), reaching 100% when all files are included.

Data Retention (%) = (n / N) × 100    (1)

Fig. [4](https://arxiv.org/html/2605.11303#S4.F4) shows cumulative correlations as a function of data retention. As lower-quality samples are added, the correlation gradually declines, reaching the full-retention values reported in Table [III](https://arxiv.org/html/2605.11303#S4.T3). Notably, when considering only the top 75% of samples, the Spearman correlation for the best model (Meta-Llama-3.3-70B) rises to 0.8, demonstrating the strong potential of LLMs for psychological inference when sufficient information is available.
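The sorting and retention procedure can be sketched as a minimal re-implementation of Equation (1) and the error-based ranking; this is an illustration under the description above, not the authors' code:

```python
import numpy as np
from scipy.stats import spearmanr

def cumulative_scc(y_true, y_pred):
    """Cumulative Spearman correlation under progressive data retention.

    Files are ranked by absolute prediction error; for each n = 2..N the SCC
    is recomputed on the n best-predicted files, with retention = 100 * n / N.
    Returns (retention_percentages, scc_values).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(np.abs(y_pred - y_true))  # smallest error first
    N = len(y_true)
    retention, sccs = [], []
    for n in range(2, N + 1):
        idx = order[:n]
        rho, _ = spearmanr(y_true[idx], y_pred[idx])
        retention.append(100.0 * n / N)
        sccs.append(rho)
    return retention, sccs
```

At full retention the final value coincides with the SCC computed over all files, which is why the curves in Fig. 4 end at the values reported in Table III.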
TABLE III: Correlation between LLM predictions and ground-truth scores. #Param is in billions, and PV denotes the p-value.
### IV-C. Statistical Analysis
To better understand LLM performance beyond overall correlations, we examined the distribution of predicted Ryff scores. Table [IV](https://arxiv.org/html/2605.11303#S4.T4) reports descriptive statistics (mean, median, standard deviation, minimum, and maximum) for all models alongside the ground-truth scores. While correlation metrics capture general predictive accuracy, they do not reveal the nature of prediction errors, such as systematic overestimation, underestimation, or limited variability. Examining the range, minimum, and maximum values helps assess whether predicted scores reflect the diversity of the ground-truth data or are instead constrained within a narrower interval.
Figure 3: Word and unique word counts for different recordings in the PsyVoiD dataset. 40% of the recordings contain fewer than 100 words, and 55% contain fewer than 100 unique words.
Figure 4: Cumulative analysis of the Spearman correlation coefficient (SCC) for various LLMs. For 80% of the data, the SCC exceeds 0.7.
TABLE IV: Statistics of Ryff score predictions via LLMs

The statistics in Table [IV](https://arxiv.org/html/2605.11303#S4.T4), computed over all 111 participants, provide insight into prediction variability and error patterns. A dataset of this size, with spontaneous speech paired with clinically validated psychological questionnaires, is both rare and valuable, given the challenges of ethical approval, participant recruitment, and privacy. Statistically, it is sufficient to detect moderate-to-strong correlations at the standard significance threshold (α = 0.05), ensuring that our analysis of the LLMs' performance remains valid and meaningful.
As shown in Table [IV](https://arxiv.org/html/2605.11303#S4.T4), most models tend to underestimate Ryff scores, though the degree varies. For instance, QwQ (median 92) is close to the ground truth (median 95), while Llama-3.1 has a much lower median of 60. Variability also differs across models: the ground-truth standard deviation is approximately 17, with DeepSeek-Qwen overestimating at 25.1, Phi-4 underestimating at 9.4, and Meta-Llama-3.3-70B closely matching the ground truth at 16.4. Interestingly, DeepSeek-Llama, despite its statistically insignificant correlations, produces seemingly reasonable descriptive statistics: mean 82, median 90, and range 53–103, compared to ground-truth values of 91.7, 95, and 51–123. These metrics alone, however, do not capture the lack of statistical significance for this LLM.
Fig. [5](https://arxiv.org/html/2605.11303#S4.F5) shows histograms of predicted Ryff scores. DeepSeek-Llama exhibits a bimodal distribution with small standard deviations around each mode, explaining why its mean and median appear reasonable despite an insignificant correlation. Overall, most LLMs underestimate the Ryff scores and show non-bell-shaped distributions, highlighting the limitations of using Pearson correlation, which assumes Gaussianity.
Figure 5: Histogram of the predicted Ryff scores by various LLMs vs ground truth.
Figure 6: Word cloud of keywords extracted by Meta-Llama-3.3.
### IV-D. Keyword Analysis via Word Clouds
Interpretability is essential in healthcare applications, as performance metrics alone are insufficient for practical adoption. Word clouds, generated from keywords identified by the LLMs, provide an intuitive visualisation of the cues influencing predictions. To ensure reliability, we applied post-filtering to retain only words actually present in the transcripts. This step ensures that the highlighted terms genuinely reflect participants' speech rather than model hallucinations, thereby enhancing the trustworthiness of the analysis.
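The post-filtering step can be sketched as below; the tokenisation and the `wordcloud` package mentioned in the final comment are assumptions, as the paper does not name its implementation:

```python
import re
from collections import Counter

def filter_keywords(llm_keywords, transcript):
    """Keep only LLM-extracted keywords that actually occur in the transcript,
    guarding against hallucinated terms, and count their frequencies."""
    tokens = set(re.findall(r"[a-z']+", transcript.lower()))
    return Counter(k.lower() for k in llm_keywords if k.lower() in tokens)

freqs = filter_keywords(
    ["family", "resilience", "routine"],   # keywords proposed by the LLM
    "My routine kept me close to family",  # participant transcript (hypothetical)
)
# "resilience" is dropped because it never appears in the transcript.
```

The surviving counts could then feed a renderer such as `wordcloud.WordCloud().generate_from_frequencies(freqs)` to produce figures like Fig. 6.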
As seen in Fig. [6](https://arxiv.org/html/2605.11303#S4.F6), the most salient keywords are *manage*, *family*, *learning*, *routine*, *social*, *control*, and *positive*. These terms map closely onto Ryff's six dimensions of psychological well-being. For example, *manage* and *control* directly relate to environmental mastery, reflecting an individual's ability to handle daily demands and maintain agency. *Family* and *social* connect to positive relations with others, highlighting the importance of interpersonal support networks. *Learning* signals personal growth, consistent with Ryff's emphasis on self-development and openness to new experiences. *Routine* reflects both autonomy and environmental mastery, as maintaining daily structure is a marker of coping during stressful contexts like lockdown. Finally, *positive* aligns with self-acceptance and purpose in life, as expressions of optimism and self-evaluation often indicate higher well-being.
## V. Conclusion
To our knowledge, this study is the first to demonstrate that LLMs, guided by a clinically grounded prompt, can estimate Ryff Psychological Well-Being (PWB) scores from spontaneous speech while providing interpretable justifications. Findings indicate that LLMs effectively extract psychologically meaningful cues, achieving Spearman correlations of up to 0.8 on information-rich transcripts. Statistical analysis revealed systematic underestimation and offered insight into prediction variability, while keyword-based word cloud analysis highlighted the specific linguistic features driving model decisions.
Future research directions include the multimodal fusion of acoustic and linguistic features to capture non-verbal affective signals overlooked by text, longitudinal trajectory analysis for the early detection of subtle mental health shifts, and the development of culturally adaptive prompts to address diverse expressions of well-being across global populations.
## VI. Acknowledgment
This work was supported by UKRI Grant No. 10102226 (University of Edinburgh) as part of the Horizon Europe Guarantee funding for participation in the INT-ACT project, supported by REA under the Horizon Europe Programme, Grant No. 101132719. We would like to thank Angela Chitzanidi and the Research Services team at the University of Edinburgh for facilitating access to the Eddie Research Computing Cluster.
## References
- [1] M. Abdin et al. (2024) Phi-4 technical report. [arXiv:2412.08905](https://arxiv.org/abs/2412.08905).
- [2] Meta AI (2024) Llama 3.1 8B Instruct. [Model card](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). Accessed: 2025-02-19.
- [3] T. Brown et al. (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901.
- [4] F. Busch, L. Hoffmann, C. Rueger, E. H. C. van Dijk, R. Kader, E. Ortiz-Prado, M. R. Makowski, L. Saba, M. Hadamitzky, J. N. Kather, D. Truhn, R. Cuocolo, L. C. Adams, and K. K. Bressem (2025) Current applications and challenges in large language models for patient care: a systematic review. Communications Medicine 5(1), pp. 1–13.
- [5] S. de la Fuente Garcia and S. Luz (2023) PsyVoiD: investigating the relationship between spontaneous speech features and psychology in the context of the COVID-19 pandemic and lockdown: personality, wellbeing, coping strategies and affect, 2020–2021 [dataset]. University of Edinburgh, Clinical Psychology. [doi:10.7488/ds/7532](https://doi.org/10.7488/ds/7532).
- [6] S. de la Fuente Garcia, C. W. Ritchie, and S. Luz (2020) Artificial intelligence, speech, and language processing approaches to monitoring Alzheimer's disease: a systematic review. Journal of Alzheimer's Disease 78(4), pp. 1547–1574.
- [7] DeepSeek-AI et al. (2025) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. [arXiv:2501.12948](https://arxiv.org/abs/2501.12948).
- [8] A. Grattafiori et al. (2024) The Llama 3 herd of models. [arXiv:2407.21783](https://arxiv.org/abs/2407.21783).
- [9] Z. Guo et al. (2024) Large language models for mental health applications: systematic review. JMIR Mental Health 11(1), e57400.
- [10] Y. Hua, H. Na, Z. Li, F. Liu, X. Fang, D. Clifton, and J. Torous (2025) A scoping review of large language models for generative tasks in mental health care. NPJ Digital Medicine 8(1), 230.
- [11] G. R. Lau and W. Y. Low (2025) From human to machine psychology: a conceptual framework for understanding well-being in large language model. arXiv preprint arXiv:2506.12617.
- [12] K. Le and M. Nguyen (2022) The psychological consequences of COVID-19 lockdowns. In The Political Economy of COVID-19, pp. 39–55.
- [13] Y. Li, S. Shao, M. Milling, and B. W. Schuller (2025) Large language models for depression recognition in spoken language integrating psychological knowledge. Frontiers in Computer Science 7.
- [14] D. M. Low, K. H. Bentley, and S. S. Ghosh (2020) Automated assessment of psychiatric disorders using speech: a systematic review. Laryngoscope Investigative Otolaryngology 5(1), pp. 96–116.
- [15] E. Loweimi, S. de la Fuente Garcia, and S. Luz (2025) Zero-shot speech-based depression and anxiety assessment with LLMs. In Proc. Interspeech 2025, pp. 489–493.
- [16] Mistral AI Team (2024) Ministral-8B-Instruct-2410. [Model card](https://huggingface.co/mistralai/Ministral-8B-Instruct-2410). Accessed: 2024.
- [17] Mistral AI Team (2024) Mistral NeMo. [https://mistral.ai/news/mistral-nemo](https://mistral.ai/news/mistral-nemo). Accessed: 2024.
- [18] J. J. Newson, D. Hunter, and T. C. Thiagarajan (2020) The heterogeneity of mental health assessment. Frontiers in Psychiatry 11, 76.
- [19] J. Nordgaard, L. A. Sass, and J. Parnas (2013) The psychiatric interview: validity, structure, and subjectivity. European Archives of Psychiatry and Clinical Neuroscience 263, pp. 353–364.
- [20] World Health Organization (2022) Mental health and COVID-19: early evidence of the pandemic's impact: scientific brief, 2 March 2022. Technical report, World Health Organization.
- [21] M. Palatucci et al. (2009) Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems (NIPS), Vol. 22.
- [22] S. V. Patapati (2024) Integrating large language models into a tri-modal architecture for automated depression classification on the DAIC-WOZ. arXiv preprint arXiv:2407.19340.
- [23] F. Pedregosa et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- [24] Y. H. P. P. Priyadarshana, A. Senanayake, Z. Liang, and I. Piumarta (2024) Prompt engineering for digital mental health: a short review. Frontiers in Digital Health 6, 1410947.
- [25] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever (2022) Robust speech recognition via large-scale weak supervision. [arXiv:2212.04356](https://arxiv.org/abs/2212.04356).
- [26] K. Roy et al. (2025) Large language models for mental health diagnostic assessments: exploring the potential of large language models for assisting with mental health diagnostic assessments – the depression and anxiety case. [arXiv:2501.01305](https://arxiv.org/abs/2501.01305).
- [27] C. D. Ryff and C. L. Keyes (1995) The structure of psychological well-being revisited. Journal of Personality and Social Psychology 69(4), 719.
- [28] C. D. Ryff (1989) Happiness is everything, or is it? Explorations on the meaning of psychological well-being. Journal of Personality and Social Psychology 57(6), 1069.
- [29] C. D. Ryff (2014) Self-realisation and meaning making in the face of adversity: a eudaimonic approach to human resilience. Journal of Psychology in Africa 24(1), pp. 1–12.
- \[30\]N\. Salariet al\.\(2020\)Prevalence of stress, anxiety, depression among the general population during the covid\-19 pandemic: a systematic review and meta\-analysis\.Globalization and health16,pp\. 1–11\.Cited by:[§I](https://arxiv.org/html/2605.11303#S1.p1.1)\.
- \[31\]E\. C\. Stadeet al\.\(2023\)Depression and anxiety have distinct and overlapping language patterns: results from a clinical interview\.\.Journal of psychopathology and clinical science\.Cited by:[§I](https://arxiv.org/html/2605.11303#S1.p2.1)\.
- \[32\]G\. Teamet al\.\(2024\)Gemma 2: improving open language models at a practical size\.External Links:2408\.00118,[Link](https://arxiv.org/abs/2408.00118)Cited by:[1st item](https://arxiv.org/html/2605.11303#S1.I1.i1.p1.1)\.
- \[33\]G\. Team\(2025\)Gemma 3\.Kaggle\.External Links:[Link](https://goo.gle/Gemma3Report)Cited by:[1st item](https://arxiv.org/html/2605.11303#S1.I1.i1.p1.1)\.
- \[34\]Q\. Team\(2024\-11\)QwQ: Reflect Deeply on the Boundaries of the Unknown\.External Links:[Link](https://qwenlm.github.io/blog/qwq-32b-preview/)Cited by:[1st item](https://arxiv.org/html/2605.11303#S1.I1.i1.p1.1)\.
- \[35\]M\. Von Korffet al\.\(1987\)Anxiety and depression in a primary care clinic: comparison of diagnostic interview schedule, general health questionnaire, and practitioner assessments\.Archives of General Psychiatry44\(2\),pp\. 152–156\.Cited by:[§I](https://arxiv.org/html/2605.11303#S1.p2.1)\.
- \[36\]M\. Wagner, C\. Stephenson, J\. Jagayat, A\. Kumar, A\. Shirazi, N\. Alavi, and M\. Omrani\(2025\)Using large language models as a scalable mental status evaluation technique\.NPP—Digital Psychiatry and Neuroscience3\(1\),pp\. 1–11\.Cited by:[§I](https://arxiv.org/html/2605.11303#S1.p2.1)\.
- \[37\]S\. Wu, T\. H\. Falk, and W\. Y\. Chan\(2011\)Automatic speech emotion recognition using modulation spectral features\.Speech communication53\(5\),pp\. 768–785\.Cited by:[§I](https://arxiv.org/html/2605.11303#S1.p2.1)\.
- \[38\]A\. Yanget al\.\(2024\)Qwen2 technical report\.arXiv preprint arXiv:2407\.10671\.Cited by:[1st item](https://arxiv.org/html/2605.11303#S1.I1.i1.p1.1)\.
- \[39\] A. S. Zigmond and R. P. Snaith (1983). The hospital anxiety and depression scale. Acta Psychiatrica Scandinavica, 67(6).