DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods
Summary
This paper presents DreamerNLplus, a hybrid framework combining LLMs, DeBERTa, Random Forest, rule-based methods, and RAG to model mental health dynamics from social media timelines for the CLPsych 2026 shared task, achieving top rankings in subtasks for temporal summarization and change detection.
# DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods
Source: [https://arxiv.org/html/2605.23052](https://arxiv.org/html/2605.23052)
Maryia Zhyrko 1†, Daisy Monika Lal 2†, Erik van Mulligen 3, Lifeng Han 1,4, on behalf of the 4D PICTURE consortium
1 Leiden Institute of Advanced Computer Science \(LIACS\), Leiden University, NL; 2 School of Computing and Communications \(SCC\), Lancaster University, UK; 3 Department of Medical Informatics, Erasmus University Medical Center Rotterdam, NL; 4 Biomedical Data Sciences, Leiden University Medical Center, NL
† co\-first authors; corresponding author: l\.han@lumc\.nl
###### Abstract
We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task\. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence\-level summarization\. For Task 1, we combine LLM\-based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction\. For Task 2, we use few\-shot prompting with a locally deployed Llama 3\.1 model to detect Switch and Escalation events using short\-term temporal context\. For Task 3\.1, we explore both a deterministic rule\-based summarization pipeline and a few\-shot LLM\-based approach, ranking 2nd officially\. Our RAG\-based method achieves strong performance in Task 3\.2, ranking 1st for Improvement and 3rd for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines\. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity\-based evaluation metrics\. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks\. We share our code and prompts at [https://github\.com/4dpicture/CLPsych2026](https://github.com/4dpicture/CLPsych2026)\.
## 1 Introduction and Background
We describe our system submissions to the CLPsych 2026 Shared Task Ali et al\. \([2026](https://arxiv.org/html/2605.23052#bib.bib2)\): Capturing and Characterizing Mental Health Changes through Social Media Timeline Dynamics, in which we participated in all sub\-tasks\. Earlier reviews of NLP for mental health research Le Glaz et al\. \([2021](https://arxiv.org/html/2605.23052#bib.bib10)\); Malgaroli et al\. \([2023](https://arxiv.org/html/2605.23052#bib.bib9)\) provide an overview of traditional NLP and ML methods, including rule\-based methods, statistical models \(TF\-IDF, decision trees, SVMs, CRFs, random forests, NNs\), pretrained LMs \(BERT\-like encoder models, sequence\-to\-sequence BART models, domain\-specific MentalBERT\), and early generative decoders \(e\.g\., GPT\-2\)\. They also point out limitations of existing work, such as low reproducibility\. The resources used include electronic health records \(EHRs\), psychological evaluation reports, social media, and interview data\. For this study, we used the social media post data from the shared task\. To improve model interpretability and to investigate newer open\-source models, we applied a hybrid combination of a rule\-based method, a Random Forest regressor, DeBERTa models, RAG, and open\-source LLMs\.
Recent work related to ours includes RAG Kermani et al\. \([2025](https://arxiv.org/html/2605.23052#bib.bib8)\); Bogdanova et al\. \([2026](https://arxiv.org/html/2605.23052#bib.bib14)\), prompt engineering and perspective\-aware iterative self\-prompting \(PA\-ISP\) Chan et al\. \([2025](https://arxiv.org/html/2605.23052#bib.bib7)\); Romero et al\. \([2025](https://arxiv.org/html/2605.23052#bib.bib13)\); Ren et al\. \([2025](https://arxiv.org/html/2605.23052#bib.bib12)\), and CLPsych systems and datasets Tseriotou et al\. \([2025](https://arxiv.org/html/2605.23052#bib.bib3)\); Atzil\-Slonim \([2025](https://arxiv.org/html/2605.23052#bib.bib4)\); Tsakalidis et al\. \([2022](https://arxiv.org/html/2605.23052#bib.bib5)\); Atzil\-Slonim \([2026](https://arxiv.org/html/2605.23052#bib.bib6)\)\.
The shared task comprises three tasks:
- Task 1: Predict adaptive and maladaptive ABCD element combinations: \(1\.1\) Post\-level identification of dominant ABCD subelements and self\-state composition; \(1\.2\) Self\-state presence rating\.
- Task 2: Identify moments of change\.
- Task 3: Summary of change: \(3\.1\) Summarizing sequences surrounding change events; \(3\.2\) Identifying recurrent dynamic signatures of change across timelines\.
Figure 1: Overview of the DreamerNLplus system across the CLPsych 2026 shared tasks\. Task 1 models psychological state representations, Task 2 detects temporal moments of change, and Task 3 summarizes and generalizes change dynamics across sequences\. Diagram content: social media timelines \(posts, temporal order, well\-being signals\) feed Task 1 state modeling \(Prompt2Predict\-DeBERTa: augmentation \+ DeBERTa \+ Random Forest; outputs ABCD subelements and adaptive / maladaptive scores\), Task 2 change detection \(few\-shot LLM prompting with OS\-LLMs and a context window, plus XGBoost; outputs post\-level Switch / Escalation predictions with LLM justifications\), and Task 3 summarization \(rule\-based and LLM / RAG methods: MINDTRACE, Gemma, RAG; outputs sequence summaries and dynamic signatures\)\. The unified modeling goal is interpretable, privacy\-conscious modeling of mental health dynamics across state, change, and narrative levels\.
## 2 DreamerNLplus Methods
Figure [1](https://arxiv.org/html/2605.23052#S1.F1) summarizes the overall DreamerNLplus framework\. Across the three shared tasks, our systems follow a unified modeling perspective: Task 1 identifies psychological state representations, Task 2 detects transitions between states, and Task 3 converts these dynamics into sequence\-level summaries and recurrent signatures\.
Figure 2: Prompt2Predict\-DeBERTa pipeline, including data preprocessing, for Task 1 \(predict adaptive and maladaptive ABCD element combinations\)\. Stages: ABCD\-labelled training data → corpus construction → LLM augmentation via Ollama \(definitions \+ evidence\) → DeBERTa fine\-tuning for subelement prediction \(Task 1\.1\) → one\-hot encoding → Random Forest regressor for \(mal\)adaptive scoring \(Task 1\.2\) → combined Task 1 output \(ABCD labels \+ presence scores\)\. The three phases are LLM\-based augmentation, transformer classification, and vector\-based regression\.

Tasks 1 and 2 on State Modeling and Change Detection\. For Task 1, we propose Prompt2Predict\-DeBERTa, a simple multi\-stage framework for predicting psychological sub\-elements and presence scores, as shown in Figure [2](https://arxiv.org/html/2605.23052#S2.F2)\. First, we extract the evidence for each label from the original training data and augment it with synthetic data for model training\. To do this, we use an LLM served through Ollama to expand the dataset by generating new examples from prompts that include the label definitions and annotated evidence for every ABCD category\. The augmented data is used to fine\-tune a DeBERTa model for subelement prediction \(Task 1\.1\)\. The predicted ABCD labels are encoded as one\-hot vectors and fed to a Random Forest regressor, which predicts the adaptive and maladaptive presence scores \(Task 1\.2\)\. Finally, we combine both outputs so that each prediction contains the ABCD labels and their presence ratings\. We also deployed a rule\-based approach for Task 1, shown in Figure [3](https://arxiv.org/html/2605.23052#S2.F3)\.
Figure 3: Task 1 rule\-based pattern\-matching approach using n\-gram collocations to classify post sentences into ABCD sub\-categories\.

The Prompt2Predict\-DeBERTa pipeline combines LLM\-based augmentation \(data preprocessing\), transformer\-based classification, and lightweight vector\-based regression to produce structured and interpretable predictions, and it is compared against the well\-defined rules of the rule\-based approach\.
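The hand\-off between the classification and regression stages can be sketched as below\. This is a minimal illustration under stated assumptions: the label inventory `SUBELEMENTS` and the function `encode_labels` are hypothetical names, and the real ABCD subelement set comes from the shared task annotation scheme\. When a post receives several labels, the vector is multi\-hot rather than strictly one\-hot\.

```python
from typing import Iterable, List

# Hypothetical label inventory, for illustration only; the real ABCD
# subelement set is defined by the shared task annotation scheme.
SUBELEMENTS: List[str] = [
    "A-adaptive", "A-maladaptive",
    "B-adaptive", "B-maladaptive",
    "C-adaptive", "C-maladaptive",
    "D-adaptive", "D-maladaptive",
]


def encode_labels(predicted: Iterable[str],
                  inventory: List[str] = SUBELEMENTS) -> List[int]:
    """Turn the classifier's predicted labels into a fixed-length 0/1 vector."""
    predicted_set = set(predicted)
    unknown = predicted_set - set(inventory)
    if unknown:
        raise ValueError(f"labels not in inventory: {sorted(unknown)}")
    return [1 if label in predicted_set else 0 for label in inventory]


# One row per post. A regressor (a Random Forest in the paper) would take
# these rows as features and predict adaptive / maladaptive presence scores.
features = [
    encode_labels(labels)
    for labels in (["A-adaptive", "C-maladaptive"], [])
]
```

A fixed inventory order keeps the feature columns stable between training and inference, which the downstream regressor depends on\.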
Figure 4: Task 2 few\-shot prompting pipeline \(identify moments of change\)\. Stages: timeline JSON → context window formatting \(current post \+ up to 5 preceding posts\) → 5\-shot prompt construction \(Switch \+ Escalation examples\) → local LLM inference \(Llama 3\.1 via Ollama\) → post\-level prediction \(binary Switch / Escalation labels\) → structured output \(labels \+ justification\) → submission formatting\. Side notes in the diagram: temporal context \(local change signal\), privacy\-preserving local deployment, and targets \(Switch, Escalation\)\.

Task 2: Few\-Shot LLM Prompting and XGBoost\. We first design a framework that supports multiple LLM backends \(Ollama, HuggingFace\) and runs locally for privacy, with Llama 3\.1 8B as our submitted model \(Figure [4](https://arxiv.org/html/2605.23052#S2.F4)\)\. In this framework, we build a local context window by passing up to 5 preceding timeline posts to the LLM as context for predicting the targets \(Switch and Escalation\)\. The prompt defines Switch and Escalation and includes few\-shot examples covering Switch\-only, Escalation\-only, both, and neither, plus a first\-post case where no change is possible \(no preceding context\)\. The LLM is asked to return its answer in a fixed JSON format with Switch, Escalation, and a short justification, and the framework retries the prompt if the response does not match this format\.
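The context window and the format\-checked retry loop described above can be sketched as follows\. This is an illustrative sketch, not the authors' implementation: the JSON key names, the boolean value types, the retry limit, and the `llm` callable \(any function mapping a prompt string to a reply string\) are assumptions\.

```python
import json
from typing import Callable, Dict, List, Optional

CONTEXT_SIZE = 5  # number of preceding posts, as stated in the text


def build_context(posts: List[str], index: int,
                  size: int = CONTEXT_SIZE) -> Dict[str, object]:
    """Return the current post plus up to `size` preceding posts."""
    start = max(0, index - size)
    return {"previous": posts[start:index], "current": posts[index]}


def parse_response(raw: str) -> Optional[Dict[str, object]]:
    """Accept only a JSON object with the three expected fields.

    The key names and types here are assumptions made for illustration.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not isinstance(data.get("Switch"), bool):
        return None
    if not isinstance(data.get("Escalation"), bool):
        return None
    if not isinstance(data.get("justification"), str):
        return None
    return data


def predict_with_retry(prompt: str, llm: Callable[[str], str],
                       max_attempts: int = 3) -> Optional[Dict[str, object]]:
    """Call `llm` until the reply matches the format or attempts run out."""
    for _ in range(max_attempts):
        parsed = parse_response(llm(prompt))
        if parsed is not None:
            return parsed
    return None
```

For the first post in a timeline, `build_context` returns an empty `previous` list, which corresponds to the first\-post case in which no change is possible\.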
As a non\-LLM comparison, we build a classifier using the XGBoost library\. In preprocessing, we represent the posts with TF\-IDF and sentence\-transformer embeddings \(all\-mpnet\-base\-v2\)\. As temporal difference features, we use 1\) the embedding difference between the current and previous posts, $h_t - h_{t-1}$, 2\) their element\-wise product, and 3\) the timeline position\. We also add 14 linguistic features related to sentiment, punctuation, and length \(Table [1](https://arxiv.org/html/2605.23052#A2.T1)\)\. Two binary classifiers then predict Switch and Escalation, with the rare positive examples given more weight during training so that the model does not simply default to predicting no change everywhere\. \(Footnote 2: Positive examples are up\-weighted via XGBoost’s scale\_pos\_weight parameter, set to $\min\left(\frac{n_{\text{neg}}}{n_{\text{pos}}},\ 20\right)$ independently for each classifier\.\)
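The temporal features and the class weighting rule can be written out in a few lines\. This is a sketch in plain Python rather than the authors' code: the function names are hypothetical, the normalisation of the timeline position to the range 0 to 1 is an assumption, and so is the fallback to the cap when a classifier sees no positive examples\.

```python
from typing import List, Sequence


def temporal_features(prev: Sequence[float], curr: Sequence[float],
                      position: int, length: int) -> List[float]:
    """Concatenate h_t - h_{t-1}, the element-wise product, and the position."""
    if len(prev) != len(curr):
        raise ValueError("embeddings must have the same dimension")
    diff = [c - p for p, c in zip(prev, curr)]
    prod = [c * p for p, c in zip(prev, curr)]
    # Assumption: position is expressed relative to the timeline length.
    rel_pos = position / max(1, length - 1)
    return diff + prod + [rel_pos]


def scale_pos_weight(labels: Sequence[int], cap: float = 20.0) -> float:
    """min(n_neg / n_pos, cap), the rule given in the footnote."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        return cap  # assumption: use the cap when there are no positives
    return min(n_neg / n_pos, cap)
```

The returned weight would be passed as the `scale_pos_weight` argument when each binary classifier is constructed\. The cap keeps extremely rare positives from dominating the loss\.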
Figure 5: Task 3\.1 summary generation methods from DreamerNLplus: \(a\) MINDTRACE\-SUMMARY, a deterministic rule\-based pipeline \(left\), vs\. \(b\) a few\-shot LLM prompting pipeline using OS\-LLMs \(right\)\. Pipeline \(a\): MIND\-annotated test posts → n\-gram / keyword ABCD extraction → structured state representation → presence\-based aggregation → temporal segmentation \(early / middle / late\) → Switch and Escalation detection → dominance scoring \(adaptive vs maladaptive\) → rule\-based theme inference → template\-driven narrative generation → structured summary\. Pipeline \(b\): sequence posts → input formatting \(text \+ Switch / Escalation \+ well\-being\) → dynamic few\-shot selection → prompt construction \(MIND \+ ABCD \+ relational\) → inference with Gemma 2 9B, Gemma 4 E4B, or Llama 3\.1 8B → summary output\. In \(b\), no Task 1 ABCD annotations are used and ABCD dynamics are inferred from text\.

Figure 6: Overview of the RAG\-LLM Signature Mining framework for Task 3\.2\. Stages: training sequences → well\-being trajectory classification \(Deterioration / Improvement\) → ABCD formatting \(subelements \+ scores \+ summaries\) → sequence batching → LLM inference \(Llama 3\.1 via Ollama\) to extract recurring ABCD dynamics \(intra\-batch patterns\) → LLM aggregation to synthesize cross\-batch patterns \(cross\-batch generalization\) → signature generation \(90\-word signature per direction\) → structured output \(signatures \+ 5–10 exemplar sequences\)\.

Task 3 on Change Summarization and Pattern Mining\. For Task 3\.1, “summarizing sequences surrounding change events”, we explore two distinct summary generation strategies \(Figure [5](https://arxiv.org/html/2605.23052#S2.F5)\)\.
MINDTRACE\-SUMMARY is a fully deterministic, multi\-stage framework for generating sequence\-level psychological signatures from MIND\-annotated posts\. The model generates summaries by following a fixed structure and converting ABCD annotations into natural language\. A full sample template is provided in Appendix [D\.3](https://arxiv.org/html/2605.23052#A4.SS3)\. It first determines whether adaptive or maladaptive states dominate, then builds a narrative with five parts: central theme, initial state, interaction dynamics, transition \(switch or escalation\), and outcome\. ABCD labels are rewritten into fluent psychological descriptions, and the summary emphasizes how states reinforce or shift over time\. The method prioritizes consistency and structure over free\-form generation, ensuring summaries clearly reflect change dynamics across the sequence\.
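To make the five\-part structure concrete, the sketch below fills a fixed template from per\-post presence scores\. It is a toy illustration, not the MINDTRACE\-SUMMARY rules themselves: the sentence wording, the `adaptive` and `maladaptive` dictionary keys, and both function names are hypothetical, and the actual template is the one in the appendix\.

```python
from typing import Dict, List


def dominant_valence(posts: List[Dict[str, float]]) -> str:
    """Compare summed presence scores; ties default to adaptive (assumption)."""
    adaptive = sum(p["adaptive"] for p in posts)
    maladaptive = sum(p["maladaptive"] for p in posts)
    return "adaptive" if adaptive >= maladaptive else "maladaptive"


def summarize(posts: List[Dict[str, float]], theme: str, transition: str) -> str:
    """Fill a five-part template: theme, initial state, dynamics, transition, outcome."""
    if not posts:
        raise ValueError("sequence must contain at least one post")
    overall = dominant_valence(posts)
    initial = dominant_valence(posts[:1])
    final = dominant_valence(posts[-1:])
    dynamics = "reinforce each other" if initial == final else "shift over time"
    parts = [
        f"The sequence focuses on {theme}.",
        f"It opens in a predominantly {initial} state.",
        f"Across posts, {overall} states dominate and {dynamics}.",
        f"The main change event is marked as: {transition}.",
        f"The sequence ends in a predominantly {final} state.",
    ]
    return " ".join(parts)
```

Because every sentence comes from a fixed template and the inputs are plain numbers, the same sequence always produces the same summary, which is the reproducibility property the rule\-based approach aims for\.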
For Few\-Shot LLM Prompting, the prompt specifies the MIND framework, ABCD abbreviation conventions, relational dynamics vocabulary, and the required summary structure covering the pre\-change phase, within/between\-state dynamics, and explicit change event identification\. Task 1 ABCD annotations are not used; the model infers ABCD dynamics from the post text, guided by the prompt\.
For Task 3\.2, “identifying recurrent dynamic signatures of change across timelines”, we propose RAG\-LLM Signature Mining, a two\-stage LLM\-based framework for identifying recurrent dynamic signatures of psychological change\. As illustrated in Figure [6](https://arxiv.org/html/2605.23052#S2.F6), our framework separates intra\-batch pattern extraction from cross\-batch signature synthesis, enabling the identification of recurrent psychological dynamics across timelines\. Sequences are batched and fed to an open\-source LLM \(Llama 3\.1 via Ollama\) with per\-post ABCD subelements, well\-being scores, and gold summaries\. Stage 1 extracts recurring ABCD dynamics per batch; Stage 2 synthesizes these into one 90\-word signature per direction \(deterioration/improvement\), with 5–10 exemplar sequences as evidence\.
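The two\-stage control flow can be sketched with a pluggable model call\. Everything below is an assumption for illustration: the prompt strings are placeholders, `batch_size=10` is an arbitrary default, `llm` stands for any function from prompt to reply, and truncating to 90 words is just one simple way to enforce the length\.

```python
from typing import Callable, List, Sequence


def make_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split formatted sequences into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]


def mine_signature(sequences: Sequence[str], direction: str,
                   llm: Callable[[str], str],
                   batch_size: int = 10, max_words: int = 90) -> str:
    """Stage 1: per-batch pattern extraction. Stage 2: cross-batch synthesis."""
    # Stage 1: one call per batch (placeholder prompt wording).
    batch_patterns = [
        llm(f"Extract recurring ABCD dynamics ({direction}):\n"
            + "\n---\n".join(batch))
        for batch in make_batches(sequences, batch_size)
    ]
    # Stage 2: a single call over all batch-level patterns.
    signature = llm(
        f"Synthesize one signature of at most {max_words} words "
        f"({direction}):\n" + "\n".join(batch_patterns)
    )
    # Assumption: enforce the word limit by simple truncation.
    return " ".join(signature.split()[:max_words])
```

Calling `mine_signature` once per direction \(deterioration and improvement\) yields the two signatures\. Exemplar selection is omitted from this sketch\.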
## 3 Results and Analysis
#### Task 1: State modeling
For Task 1, our system ranks 22nd in Task 1\.1 \(subelement classification\) and 20th in Task 1\.2 \(presence estimation\), as shown in Table [2](https://arxiv.org/html/2605.23052#A6.T2)\. To further understand this discrepancy, we analyze the relationship between classification performance and regression performance across teams in Figure [7](https://arxiv.org/html/2605.23052#S3.F7)\. The results show a moderate negative correlation between Task 1\.1 F1 and Task 1\.2 RMSE \($r=-0.486$, $p=0.00354$\)\. Because the relationship is only moderate, stronger subelement classification performance does not necessarily translate into better presence estimation\. This observation reflects the fundamental difference between the two subtasks: Task 1\.1 requires fine\-grained categorical prediction of psychological subelements, while Task 1\.2 evaluates continuous intensity estimation\. Our system, which combines DeBERTa\-based classification with Random Forest regression, appears more robust in the regression setting, suggesting that downstream aggregation mitigates upstream classification errors\.
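For readers who want to reproduce this kind of analysis on the leaderboard, the Pearson coefficient can be computed as below\. This is a standard\-formula sketch in plain Python; the function name is hypothetical, and the per\-team F1 and RMSE values are not included in this section, so no real data is shown\.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

With F1 on one axis \(higher is better\) and RMSE on the other \(lower is better\), a negative coefficient means the two scores tend to improve together, so the sign has to be read with the metric directions in mind\.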
#### Task 2: Change Detection
Our submitted LLM system reaches a combined F1 of 0\.442, ranking 11th overall \(Figure [10](https://arxiv.org/html/2605.23052#A6.F10)\), while the XGBoost variant scores 0\.327\. The two approaches show opposite error patterns\. The LLM has high recall but low precision \(recall/precision: Switch 0\.762/0\.302, Escalation 0\.917/0\.393\), meaning it picks up most change cues but tends to over\-predict them\. XGBoost has higher precision but low recall on the rare positive classes \(Switch precision 0\.455, recall 0\.238\), reflecting how hard it is to learn these labels from only 30 gold timelines\. This shows that accurately capturing subtle transitions remains challenging across paradigms, and that errors on weak or ambiguous change signals can propagate across the sequence\. Despite not relying on task\-specific fine\-tuning, the LLM method remains competitive while providing interpretable predictions with textual justifications\. This highlights the effectiveness of few\-shot prompting for modeling temporal dynamics in low\-resource settings\.
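The precision and recall pairs quoted above can be turned into per\-class F1 values with the usual harmonic mean\. The sketch below only does that arithmetic\. It is not the official scorer, and since the text does not say whether the quoted pairs are post\-level or timeline\-level, the results are not expected to reproduce the combined scores of 0\.442 and 0\.327\.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision / recall pairs as reported in the text.
llm_switch = f1(precision=0.302, recall=0.762)
llm_escalation = f1(precision=0.393, recall=0.917)
xgb_switch = f1(precision=0.455, recall=0.238)
```

The harmonic mean is pulled toward the smaller of the two values, so over\-prediction \(low precision\) and under\-prediction \(low recall\) are both penalised\.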
#### Task 3: Summarization and Pattern Mining
For Task 3\.1, our own analysis keeps only the best average\-ranked submission from each of the other teams\. In this comparison, our best\-performing submission \(ID 693964, prompting\-w\-ablation\) ranks 4th overall, based on the average ranking across multiple metrics \(Figure [11](https://arxiv.org/html/2605.23052#A6.F11)\)\. However, a notable observation is the strong disagreement between evaluation metrics\.
Interestingly, as shown in Figure [11](https://arxiv.org/html/2605.23052#A6.F11), the rule\-based submission \(694142\) achieves the best CT and strong CS performance, yet ranks worst overall because of poor ROUGE and BERTScore results \(a detailed discussion of the evaluation is provided in Appendix [E\.1](https://arxiv.org/html/2605.23052#A5.SS1)\)\. This highlights a clear disagreement between semantic coherence metrics and similarity\-based metrics: CS and CT emphasize semantic coherence and psychological consistency, whereas ROUGE and BERTScore prioritize surface\-level similarity to reference summaries\. As a result, optimizing for one set of metrics may degrade performance on the other\. In the official Task 3\.1 ranking, however, DreamerNLplus ranks 2nd overall with submission 693964, based on the average of metric\-specific ranks across CS, CT, ROUGE\-L Recall, and BERTScore Recall\.
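The effect of averaging metric\-specific ranks can be illustrated with a small sketch\. The function names are hypothetical and tie handling is simplified to sort order\. Because this section does not state the direction of each metric, `higher_is_better` is left as an explicit per\-metric argument\.

```python
from typing import Dict


def ranks(scores: Dict[str, float], higher_is_better: bool = True) -> Dict[str, int]:
    """Rank systems on one metric; rank 1 is best (ties follow sort order)."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}


def average_rank(metric_scores: Dict[str, Dict[str, float]],
                 higher_is_better: Dict[str, bool]) -> Dict[str, float]:
    """Average each system's rank over all metrics."""
    per_metric = [ranks(scores, higher_is_better[metric])
                  for metric, scores in metric_scores.items()]
    systems = list(per_metric[0])
    return {s: sum(r[s] for r in per_metric) / len(per_metric) for s in systems}
```

A system that is first on one metric but last on the others can end up with a worse average rank than a system that is consistently second, which is the kind of outcome described for the rule\-based submission\.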
For Task 3\.2, as shown in Table [5](https://arxiv.org/html/2605.23052#A6.T5), our RAG\-based approach ranks 1st on Improvement and 3rd on Deterioration\. In addition, our system achieves the highest score on Specificity and the second\-highest on Recurrence for Improvement, and the second\-highest Specificity for Deterioration\. These results suggest that the proposed RAG\-based framework is particularly effective at identifying precise and recurring dynamic patterns across timelines\. By combining batch\-level pattern extraction with cross\-batch synthesis, the approach is able to capture higher\-level psychological change signatures that generalize across individuals\. However, the performance gap between Improvement and Deterioration also indicates that modeling deterioration patterns remains more challenging, potentially due to greater variability or ambiguity in negative self\-state dynamics\. This highlights the importance of modeling both intra\-sequence consistency and inter\-sequence generalization when identifying recurrent psychological dynamics\.
#### Cross\-Task Analysis
Across all tasks, a consistent pattern emerges: different evaluation metrics capture distinct and sometimes conflicting aspects of performance\. Task 1 reveals a trade\-off between classification accuracy and regression stability, Task 2 highlights the difficulty of balancing local and temporal consistency, and Task 3 exposes a mismatch between semantic coherence and lexical similarity metrics\. These findings suggest that modeling mental health dynamics requires balancing multiple objectives, including structured prediction, temporal reasoning, and semantic generation\. Our hybrid approach demonstrates that combining interpretable representations with flexible LLM\-based methods can provide robust performance across tasks, while also revealing important limitations in current evaluation frameworks\.
Figure 7: Relation between Task 1\.1 and Task 1\.2 results\. Each point is one team\. X\-axis: Task 1\.1 \(F1\), higher is better\. Y\-axis: Task 1\.2 \(RMSE\), lower is better\.
## 4 Conclusions and Future Work
We presented DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines across three tasks: psychological state modeling, temporal change detection, and sequence\-level summarization\. By combining structured representations, few\-shot prompting, and rule\-based generation, our approach provides both interpretability and flexibility across different modeling paradigms\. Future work should aim to design unified evaluation frameworks that reconcile semantic fidelity, temporal consistency, and textual similarity, enabling clinically meaningful assessment of mental health modeling systems\.
## Limitations
For data augmentation in Task 1, we extracted the evidence from the original training data and asked LLMs only to generate more evidence; the DeBERTa model was then trained on the augmented evidence data to predict sub\-element classes\. This is similar to the dense prescription generation work in Belkadi et al\. \([2025](https://arxiv.org/html/2605.23052#bib.bib11)\), which was done for NER and engineering purposes only, without generating full clinical letters\. Ideally, we would go further and explore generating similar data at the post level, not only evidence\.
Task 3 template\-based summarization achieved high CS and low CT scores through predefined linguistic rules and structured feature\-to\-text mappings, enabling stable and interpretable summaries\. However, the approach is task\-specific and less flexible than human summarization, limiting its ability to capture nuanced contextual and emotional variations in narratives\. Despite this, it is well\-suited for formal or high\-stakes settings where standardized and reproducible documentation is prioritized over linguistic diversity\. Additionally, the framework relies on intermediate computations such as switch and escalation scores derived from upstream predictions, meaning errors in earlier stages may propagate into the final summaries\.
## Ethics
The shared task data used in this paper was anonymized and annotated by the CLPsych 2026 organizers\. We processed the data only with secure methods and models, namely rule\-based methods, locally hosted open\-source LLMs, and locally trained encoders, and did not release the data to any third party, following best practice for privacy protection\.
## Acknowledgments
We thank David Lindevelt for the help on this project, including the automated prompt pipeline and framework adapted from another project Han et al\. \([2026](https://arxiv.org/html/2605.23052#bib.bib1)\)\. We thank Prof\. Suzan Verberne for editing the camera\-ready version of this paper\. We thank the reviewers for their valuable comments on our work\. We thank the organizers for preparing this shared task, and we are grateful for their communication during our registration and submissions, especially Talia Tseriotou and Iqra Ali from Queen Mary University of London\. The 4D PICTURE consortium is funded by the European Union under Horizon Europe Work Programme 101057332\. Views and opinions expressed are however those of the author\(s\) only and do not necessarily reflect those of the European Union or the European Health and Digital Executive Agency \(HaDEA\)\. Neither the European Union nor the granting authority can be held responsible for them\. The UK team is funded under the Innovate UK Horizon Europe Guarantee Programme, UKRI Reference Number: 10041120\.
## References
- I\. Ali, T\. Tseriotou, G\. Dvir, C\. Chan, Y\. Zhou, J\. A\. Lossio\-Ventura, A\. Klein, A\. Shamir, D\. Sayda, A\. Hills, A\. Zirikly, D\. Inkpen, D\. Atzil\-Slonim, and M\. Liakata \(2026\) Overview of the CLPsych 2026 shared task: capturing and characterizing mental health changes through social media timeline dynamics\. In Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- Multimodal intrapersonal and interpersonal dynamics \(MIND\): a transtheoretical coding manual\. Open Science Framework\. External Links: [Document](https://dx.doi.org/10.17605/OSF.IO/SJE8C), [Link](https://osf.io/sje8c)\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- D\. Atzil\-Slonim \(2026\) Leveraging theoretical and technological innovations to study the mechanisms that underlie therapeutic change in psychotherapy\. In Practice\-Based Evidence in the Psychological Therapies: Toward Policy Implications for Research, Training, and Clinical Guidelines, L\. G\. Castonguay, D\. Atzil\-Slonim, M\. Barkham, and W\. Lutz \(Eds\.\)\. External Links: [Link](https://global.oup.com/academic/product/practice-based-evidence-in-the-psychological-therapies-9780197748305AP)\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- S\. Belkadi, N\. Micheletti, L\. Han, W\. Del\-Pinto, and G\. Nenadic \(2025\) LT3: generating medication prescriptions with conditional transformer\. In Proceedings of the Second Workshop on Patient\-Oriented Language Processing \(CL4Health\), pp\. 205–218\. Cited by: [Limitations](https://arxiv.org/html/2605.23052#Sx1.p1.1)\.
- L\. Bogdanova, S\. Sun, L\. Han, N\. A\. Lefort, and F\. M\. Plaza\-del\-Arco \(2026\) FLANS at SemEval\-2026 task 7: RAG with open\-sourced smaller LLMs for everyday knowledge across diverse languages and cultures\. arXiv preprint arXiv:2603\.01910\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- C\. Chan, S\. Khunkhun, D\. Inkpen, and J\. A\. Lossio\-Ventura \(2025\) Prompt engineering for capturing dynamic mental health self states from social media posts\. In Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology \(CLPsych 2025\), A\. Zirikly, A\. Yates, B\. Desmet, M\. Ireland, S\. Bedrick, S\. MacAvaney, K\. Bar, and Y\. Ophir \(Eds\.\), Albuquerque, New Mexico, pp\. 256–267\. External Links: [Link](https://aclanthology.org/2025.clpsych-1.22/), [Document](https://dx.doi.org/10.18653/v1/2025.clpsych-1.22), ISBN 979\-8\-89176\-226\-8\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- L\. Han, D\. Lindevelt, S\. Puts, E\. van Mulligen, and S\. Verberne \(2026\) Dutch metaphor extraction from cancer patients’ interviews and forum data using LLMs and human in the loop\. CL4Health WS at LREC2026, Palma, Spain\. Cited by: [Acknowledgments](https://arxiv.org/html/2605.23052#Sx3.p1.1)\.
- A\. Kermani, V\. Perez\-Rosas, and V\. Metsis \(2025\) A systematic evaluation of LLM strategies for mental health text analysis: fine\-tuning vs\. prompt engineering vs\. RAG\. In Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology \(CLPsych 2025\), A\. Zirikly, A\. Yates, B\. Desmet, M\. Ireland, S\. Bedrick, S\. MacAvaney, K\. Bar, and Y\. Ophir \(Eds\.\), Albuquerque, New Mexico, pp\. 172–180\. External Links: [Link](https://aclanthology.org/2025.clpsych-1.14/), [Document](https://dx.doi.org/10.18653/v1/2025.clpsych-1.14), ISBN 979\-8\-89176\-226\-8\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- A\. Le Glaz, Y\. Haralambous, D\. Kim\-Dufor, P\. Lenca, R\. Billot, T\. C\. Ryan, J\. Marsh, J\. Devylder, M\. Walter, S\. Berrouiguet, et al\. \(2021\) Machine learning and natural language processing in mental health: systematic review\. Journal of Medical Internet Research 23\(5\), pp\. e15708\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- M\. Malgaroli, T\. D\. Hull, J\. M\. Zech, and T\. Althoff \(2023\) Natural language processing for mental health interventions: a systematic review and research framework\. Translational Psychiatry 13\(1\), pp\. 309\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- L\. Ren, Y\. M\. Ng, and L\. Han \(2025\) MaLei at MultiClinSum: summarisation of clinical documents using perspective\-aware iterative self\-prompting with LLMs\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- P\. Romero, L\. Ren, L\. Han, and G\. Nenadic \(2025\) The Manchester Bees at PerAnsSumm 2025: iterative self\-prompting with Claude and o1 for perspective\-aware healthcare answer summarisation\. In Proceedings of the Second Workshop on Patient\-Oriented Language Processing \(CL4Health\), pp\. 340–348\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- A\. Tsakalidis, J\. Chim, I\. M\. Bilal, A\. Zirikly, D\. Atzil\-Slonim, F\. Nanni, P\. Resnik, M\. Gaur, K\. Roy, B\. Inkster, J\. Leintz, and M\. Liakata \(2022\) Overview of the CLPsych 2022 shared task: capturing moments of change in longitudinal user posts\. In Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology, A\. Zirikly, D\. Atzil\-Slonim, M\. Liakata, S\. Bedrick, B\. Desmet, M\. Ireland, A\. Lee, S\. MacAvaney, M\. Purver, R\. Resnik, and A\. Yates \(Eds\.\), Seattle, USA, pp\. 184–198\. External Links: [Link](https://aclanthology.org/2022.clpsych-1.16/), [Document](https://dx.doi.org/10.18653/v1/2022.clpsych-1.16)\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
- T\. Tseriotou, J\. Chim, A\. Klein, A\. Shamir, G\. Dvir, I\. Ali, C\. Kennedy, G\. Singh Kohli, A\. Hills, A\. Zirikly, D\. Atzil\-Slonim, and M\. Liakata \(2025\) Overview of the CLPsych 2025 shared task: capturing mental health dynamics from social media timelines\. In Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology \(CLPsych 2025\), A\. Zirikly, A\. Yates, B\. Desmet, M\. Ireland, S\. Bedrick, S\. MacAvaney, K\. Bar, and Y\. Ophir \(Eds\.\), Albuquerque, New Mexico, pp\. 193–217\. External Links: [Link](https://aclanthology.org/2025.clpsych-1.16/), [Document](https://dx.doi.org/10.18653/v1/2025.clpsych-1.16), ISBN 979\-8\-89176\-226\-8\. Cited by: [§1](https://arxiv.org/html/2605.23052#S1.p1.1)\.
## Appendix A CLPsych 2026 Shared Tasks
CLPsych 2026 is affiliated with the Workshop on Computational Linguistics and Clinical Psychology, a workshop series founded in 2014\. CLPsych 2026 will be held at ACL in San Diego, July 4th, 2026\.
### A\.1 Task Evaluations
#### Unified Evaluation Framework
The CLPsych 2026 shared tasks evaluate complementary aspects of modeling mental health dynamics from social media timelines, spanning classification, regression, temporal change detection, and summarization\. Across Tasks 1–3, the evaluation framework progressively moves from local, structured predictions to global, sequence\-level reasoning and natural language generation\.
Task 1 \(State modeling\)\. Task 1 evaluates the ability to model psychological self\-states at the post level\. Task 1\.1 focuses on discrete classification of ABCD elements and subelements, using macro\-averaged F1 scores across adaptive and maladaptive categories\. Task 1\.2 evaluates continuous presence estimation on a 1–5 scale, using regression metrics such as RMSE, with ranking based on the mean RMSE across valences\. Together, these subtasks capture the challenge of jointly modeling structured categorical representations and continuous mental state intensity\.
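The Task 1\.2 ranking quantity, the mean RMSE across valences, can be sketched as follows\. This is an illustration rather than the official scorer: the function names and the dictionary layout keyed by valence are assumptions\.

```python
import math
from typing import Dict, Sequence


def rmse(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Root mean squared error between gold and predicted presence ratings."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty, equal-length sequences")
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))


def mean_rmse_across_valences(gold: Dict[str, Sequence[float]],
                              pred: Dict[str, Sequence[float]]) -> float:
    """Average the per-valence RMSE values (e.g. adaptive and maladaptive)."""
    return sum(rmse(gold[v], pred[v]) for v in gold) / len(gold)
```

Averaging per valence rather than pooling all ratings gives the adaptive and maladaptive scales equal weight, whichever has more rated posts\.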
Task 2 \(Change Detection\)\. Task 2 evaluates the detection of temporal change signals, specifically Switch \(sudden change\) and Escalation \(gradual change\)\. Performance is measured using F1 scores at both post\-level and timeline\-level, with final ranking based on the average of these two perspectives\. This design rewards systems that can both detect local transitions and maintain consistency across sequences\.
Task 3 \(Change Summarization and Pattern Mining\)\. Task 3 evaluates sequence\-level understanding and generation\. Task 3\.1 assesses the quality of generated summaries using a multi\-metric framework including semantic consistency \(CS\), contradiction \(CT\), and similarity\-based metrics \(ROUGE\-L and BERTScore\), with final ranking based on averaged metric ranks\. Task 3\.2 focuses on identifying recurrent dynamic patterns across timelines, emphasizing generalization and abstraction of psychological change signatures\.
Cross\-Task Perspective\. Taken together, the evaluation framework reflects a hierarchy of modeling challenges: Task 1 focuses on what psychological states are present, Task 2 focuses on when meaningful changes occur, and Task 3 focuses on how these changes can be summarized and generalized\. Notably, the metrics across tasks capture partially complementary qualities, including classification accuracy, regression stability, temporal consistency, and semantic coherence, highlighting the inherent trade\-offs in modeling mental health dynamics\.
This unified evaluation highlights the difficulty of aligning discrete classification, continuous estimation, temporal reasoning, and semantic generation within a single modeling framework\.
## Appendix B Methods and Experimental Development \- Task 1 \(More Details\)
We adopt a multi\-paradigm modeling strategy that combines rule\-based, transformer\-based, and LLM\-assisted components across the shared task, so that we can assess how well each approach suits a given task\.
Figure 8: Targeted data augmentation strategy using Ollama\.

### B\.1 Task 1\.1: ABCD Element & Subelement Classification
Task 1\.1 focuses on identifying fine\-grained ABCD self\-state elements and their corresponding subelements within each post\. The fine\-grained structure of the ABCD schema poses two inherent challenges\. First, the large number of subelements per category leads to a highly imbalanced and sparse label space, where certain subelements are significantly under\-represented\. Second, there exists substantial semantic overlap between closely related subelements, making boundary definition non\-trivial even for pretrained language models\.
Figure 9: Data augmentation examples \(paraphrased to preserve privacy in accordance with shared task guidelines\)\.

#### Data Augmentation:
To address data sparsity, we employ targeted data augmentation using Ollama \(see Fig\.[8](https://arxiv.org/html/2605.23052#A2.F8)\)\. Augmentation is guided by structured ABCD element definitions and subelement descriptions, enabling controlled expansion of training data while preserving label semantics \(see Fig\.[9](https://arxiv.org/html/2605.23052#A2.F9)\)\. However, augmentation introduces additional challenges\. While LLM\-based generation improves data volume and diversity, it may also introduce distributional shifts, as synthetic samples often lack the contextual richness and temporal grounding of real social media posts\. This is particularly critical in longitudinal mental health modeling, where self\-state interpretation depends on subtle linguistic, emotional, and contextual cues\.
#### Transformer\-based Approach \(DeBERTa\):
Initially, we considered a transformer\-based sequence classifier, DeBERTa, for direct multi\-label prediction of subelements\. The model is fine\-tuned to jointly learn element presence and subelement classification in a supervised setting, leveraging contextual embeddings to capture nuanced linguistic signals\. However, due to extreme label granularity, semantic overlap between subelements \(e\.g\., self\-care vs\. self\-improvement, anxiety vs\. despair\), and limited training examples for several classes, purely supervised learning was found to be insufficient for robust generalisation\.
#### Rule\-based Approach \(Augmented n\-gram Modeling\):
Consequently, we introduce a rule\-based pipeline that leverages label\-conditioned augmented data for n\-gram extraction and pattern\-based classification \(see Fig\.[3](https://arxiv.org/html/2605.23052#S2.F3)\)\. Augmented samples generated via Ollama are used to construct label\-specific lexical signatures in the form of n\-grams, which serve as interpretable indicators of subelement presence\. Specifically, we compute top\-k bigrams and trigrams from cleaned, stopword\-removed text and rank them using likelihood ratio scores to identify statistically salient phrases\. This approach is particularly effective in low\-resource settings, where explicit lexical cues correlate strongly with specific psychological states\.
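The n\-gram ranking step can be sketched as below\. The paper does not give the exact scoring formula, so this sketch assumes Dunning's log\-likelihood ratio \(G²\) over a 2×2 contingency table; the stopword list, tokenizer, and function names are illustrative assumptions, and only bigrams are shown \(trigrams are omitted for brevity\)\.

```python
import math
import re
from collections import Counter

# Toy stopword list (assumption); a fuller list would be used in practice.
STOPWORDS = {"i", "a", "the", "to", "and", "of", "my", "am", "is", "so", "it"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def llr(k11, k12, k21, k22):
    """Dunning's G^2 for a 2x2 table of bigram co-occurrence counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for observed, r, c in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            total += observed * math.log(observed * n / (rows[r] * cols[c]))
    return 2.0 * total

def top_bigrams(texts, k=5):
    """Return the k bigrams with the highest likelihood-ratio score."""
    bigrams = Counter()
    for text in texts:
        tokens = tokenize(text)
        bigrams.update(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        first[w1] += count
        second[w2] += count
    scored = []
    for (w1, w2), k11 in bigrams.items():
        k12 = first[w1] - k11   # w1 followed by something other than w2
        k21 = second[w2] - k11  # w2 preceded by something other than w1
        k22 = n - k11 - k12 - k21
        scored.append((llr(k11, k12, k21, k22), (w1, w2)))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [bigram for _, bigram in scored[:k]]
```

Running `top_bigrams` separately on the augmented samples of each subelement would yield one lexical signature per label, which a pattern matcher can then look for in unseen posts\.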
### B\.2 Task 1\.2: Presence Rating
Task 1\.2 focuses on estimating the overall presence rating \(1–5\) of adaptive and maladaptive states for each post, reflecting the psychological centrality of each state within the narrative\. To address this task, we explore both regression\-based and LLM\-based approaches\. We observed a clear relationship between the presence of specific subelements and the final presence ratings, but also noted that simple frequency\-based counting of elements did not reliably correspond to the assigned scores\. In particular, ratings were influenced not just by the number of subelements but by their specific types within the same ABCD category\.
#### Regression\-based Approach:
Given the relationship observed between ABCD subelements and presence ratings, we adopt a regression framework using RandomForestRegressor\. We construct structured binary feature vectors that serve as inputs to two separate regression models, with adaptive and maladaptive states modelled independently\. This approach directly depends on the outputs of Task 1\.1 during inference, where detected subelements are aggregated to form the input feature representation\. Finally, the continuous outputs are rounded to discrete levels \(1–5\), aligning with the original annotation scheme\.
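The inference path around the regressor \(binary feature construction, then rounding to the 1–5 scale\) can be sketched without external dependencies\. The paper uses RandomForestRegressor; here a toy weighted function stands in for the fitted model's prediction step, and the vocabulary, weights, and round\-half\-up convention are all hypothetical\.

```python
import math

def encode(detected, vocabulary):
    """Binary vector: 1 where a vocabulary subelement was detected, else 0."""
    return [1 if name in detected else 0 for name in vocabulary]

def to_rating(value, low=1, high=5):
    """Round half up (an assumption) and clip to the 1-5 annotation scale."""
    return max(low, min(high, math.floor(value + 0.5)))

def predict_presence(detected, vocabulary, regressor):
    """regressor: any callable mapping a feature vector to a float."""
    return to_rating(regressor(encode(detected, vocabulary)))

# Hypothetical vocabulary and weights, standing in for a fitted model.
MALADAPTIVE_VOCAB = ["anxiety", "despair"]
TOY_WEIGHTS = [0.8, 2.6]

def toy_regressor(features):
    """Stand-in for a trained regressor's prediction on one feature vector."""
    return 1.0 + sum(w * x for w, x in zip(TOY_WEIGHTS, features))
```

A second vocabulary and regressor would be used for the adaptive valence, mirroring the two independent models described above\. The unequal toy weights reflect the observation that the type of subelement, not just the count, drives the rating\.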
### B\.3 Task 2 Features
Table 1: Hand\-crafted linguistic features used in Task 2 \(Switch/Escalation detection\)\. Features are extracted per post and concatenated with the temporal embedding differences\.

| \# | Group | Feature | Description |
| --- | --- | --- | --- |
| 1 | Lexical | log\_len | $\log(1+\text{word count})$; compresses post length onto a continuous scale |
| 2 | Lexical | n\_sentences | Number of sentences, split on \[\.\!?\]\+ |
| 3 | Lexical | avg\_word\_len | Mean character length across words |
| 4 | Lexical | frac\_upper | Fraction of characters that are uppercase |
| 5 | Lexical | frac\_punct | Fraction of characters that are punctuation \(\!?\.,;:\) |
| 6 | Punctuation / prosodic signals | n\_exclaim | Raw count of exclamation marks \(\!\) |
| 7 | Punctuation / prosodic signals | n\_question | Raw count of question marks \(?\) |
| 8 | Punctuation / prosodic signals | n\_ellipsis | Raw count of ellipses \(\.\.\.\) |
| 9 | Punctuation / prosodic signals | emo\_punct | $\min(\texttt{n\_exclaim}+\texttt{n\_question},\ 10)$; capped emotional punctuation |
| 10 | Sentiment lexicon | frac\_neg | Fraction of words matching a negative lexicon \(*hate, depressed, pain, alone*, …\) |
| 11 | Sentiment lexicon | frac\_pos | Fraction of words matching a positive lexicon \(*happy, hope, proud, grateful*, …\) |
| 12 | Sentiment lexicon | sentiment\_balance | $\texttt{frac\_neg}-\texttt{frac\_pos}$; net sentiment polarity |
| 13 | Structural | words\_per\_sent | $\text{word count}/\text{sentence count}$; average sentence length |
| 14 | Structural | has\_removed | Binary flag for \[removed\] or \[deleted\] posts |
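The features in Table 1 can be computed with the standard library alone\. The sketch below is one possible reading of the table: the word tokenizer, the handling of empty posts, and the lexicons \(limited to the example words the table lists\) are assumptions\.

```python
import math
import re

# Toy subsets: the table lists only a few example words per lexicon.
NEG_WORDS = {"hate", "depressed", "pain", "alone"}
POS_WORDS = {"happy", "hope", "proud", "grateful"}

def extract_features(text):
    """Return the 14 per-post features of Table 1 as a dict."""
    words = re.findall(r"[A-Za-z']+", text)  # assumed tokenizer
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    n_sents = len(sentences)
    n_chars = len(text)
    n_exclaim = text.count("!")
    n_question = text.count("?")

    def frac(count, total):
        # Assumption: ratios default to 0.0 for empty posts.
        return count / total if total else 0.0

    frac_neg = frac(sum(w.lower() in NEG_WORDS for w in words), n_words)
    frac_pos = frac(sum(w.lower() in POS_WORDS for w in words), n_words)
    return {
        "log_len": math.log(1 + n_words),
        "n_sentences": n_sents,
        "avg_word_len": frac(sum(len(w) for w in words), n_words),
        "frac_upper": frac(sum(c.isupper() for c in text), n_chars),
        "frac_punct": frac(sum(c in "!?.,;:" for c in text), n_chars),
        "n_exclaim": n_exclaim,
        "n_question": n_question,
        "n_ellipsis": text.count("..."),
        "emo_punct": min(n_exclaim + n_question, 10),
        "frac_neg": frac_neg,
        "frac_pos": frac_pos,
        "sentiment_balance": frac_neg - frac_pos,
        "words_per_sent": frac(n_words, n_sents),
        "has_removed": int("[removed]" in text or "[deleted]" in text),
    }
```

The resulting dict can be flattened in a fixed key order to form the per\-post vector that is concatenated with the temporal embedding differences\.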
## Appendix C Cross\-task Analysis
Our methods differ across tasks: unlike Task 1, which learns explicit ABCD representations, Task 2 relies on contextual prompting to identify dynamic changes directly from timelines\. Both approaches nonetheless share an emphasis on interpretability, temporal sensitivity, and privacy\-preserving deployment\. Task 1 models what psychological states are present, while Task 2 models when meaningful changes occur\.
### C\.1 Stratified Sampling and K\-fold
We experimented with both stratified sampling and K\-fold splitting of the training data during model development, and ultimately adopted the K\-fold approach\.
## Appendix D Prompts and Rule\-based Templates
### D\.1 Task 2 Prompts
### D\.2 Task 3\.1 Prompts
### D\.3 Task 3\.1 Summary Generation Template
We generate the final narrative using a rule\-based template conditioned on extracted discourse features\. Let $M$ denote the total maladaptive score and $A$ the total adaptive score\. Let $\Delta$ denote the structural change type and $D$ the trajectory direction\.
#### Feature Sets
Adaptive \($\mathcal{F}_a$\) and maladaptive \($\mathcal{F}_m$\) feature labels\.
#### Initial Phase
If $M \geq A$, maladaptive processes are dominant:
> Initially, maladaptive self\-state processes are more dominant, characterized by elements such as $\mathcal{F}_m$, while adaptive processes remain less prominent\.
Otherwise, adaptive processes are dominant:
> Initially, adaptive self\-state processes are more dominant, characterized by elements such as $\mathcal{F}_a$, buffering against maladaptive tendencies\.
#### Temporal Dynamics
If $M > A$, maladaptive dynamics intensify over time:
> Maladaptive dynamics intensify over time through reinforcing cycles of negative affect, self\-critical cognition, and behavioral withdrawal, suppressing adaptive functioning\.
Otherwise, adaptive processes strengthen over time:
> Adaptive processes strengthen over time through increasing self\-compassion, relational engagement, and constructive coping that counter maladaptive tendencies\.
#### Structural Transition
If $\Delta = \text{switch}$:
> A transition point emerges within the sequence, reflecting a shift in the balance between adaptive and maladaptive self\-states\.
If $\Delta = \text{escalation}$:
> An escalation unfolds across the sequence, reflecting progressive intensification of emotional, cognitive, and behavioural processes over time\.
#### Trajectory Direction
If $D = \text{deterioration}$:
> In the later phase, maladaptive self\-state dynamics dominate, reinforcing sustained distress and hopelessness\.
If $D = \text{improvement}$:
> In the later phase, adaptive self\-state dynamics become dominant, supporting resilience and psychological recovery\.
If $D = \text{fluctuation}$:
> In the later phase, adaptive and maladaptive self\-states remain in tension, reflecting ongoing fluctuation between distress and coping\.
#### Global Template \(Always Included\)
Across all sequences, the following integrative statements are included:
> The central psychological theme across the sequence reflects an evolving interaction between maladaptive distress and adaptive coping processes expressed through affect, cognition, behavior, and desire\.
> As the sequence progresses, adaptive and maladaptive self\-states increasingly interact, creating periods of internal conflict, reflection, and shifting psychological balance\.
> Across the sequence, adaptive and maladaptive self\-states alternate in dominance and suppression, shaping the overall trajectory of psychological change\.
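The template rules above can be assembled into a single deterministic function\. In this sketch the sentences are copied from the templates, while the function and argument names, the comma\-joined feature list, and the sentence order \(following the order of the subsections\) are assumptions\.

```python
# Keyed by "maladaptive dominant?" (M >= A for the initial phase, M > A for dynamics).
INITIAL = {
    True: ("Initially, maladaptive self-state processes are more dominant, "
           "characterized by elements such as {features}, while adaptive "
           "processes remain less prominent."),
    False: ("Initially, adaptive self-state processes are more dominant, "
            "characterized by elements such as {features}, buffering against "
            "maladaptive tendencies."),
}
DYNAMICS = {
    True: ("Maladaptive dynamics intensify over time through reinforcing cycles "
           "of negative affect, self-critical cognition, and behavioral "
           "withdrawal, suppressing adaptive functioning."),
    False: ("Adaptive processes strengthen over time through increasing "
            "self-compassion, relational engagement, and constructive coping "
            "that counter maladaptive tendencies."),
}
TRANSITION = {
    "switch": ("A transition point emerges within the sequence, reflecting a "
               "shift in the balance between adaptive and maladaptive self-states."),
    "escalation": ("An escalation unfolds across the sequence, reflecting "
                   "progressive intensification of emotional, cognitive, and "
                   "behavioural processes over time."),
}
DIRECTION = {
    "deterioration": ("In the later phase, maladaptive self-state dynamics "
                      "dominate, reinforcing sustained distress and hopelessness."),
    "improvement": ("In the later phase, adaptive self-state dynamics become "
                    "dominant, supporting resilience and psychological recovery."),
    "fluctuation": ("In the later phase, adaptive and maladaptive self-states "
                    "remain in tension, reflecting ongoing fluctuation between "
                    "distress and coping."),
}
GLOBAL = [
    ("The central psychological theme across the sequence reflects an evolving "
     "interaction between maladaptive distress and adaptive coping processes "
     "expressed through affect, cognition, behavior, and desire."),
    ("As the sequence progresses, adaptive and maladaptive self-states "
     "increasingly interact, creating periods of internal conflict, reflection, "
     "and shifting psychological balance."),
    ("Across the sequence, adaptive and maladaptive self-states alternate in "
     "dominance and suppression, shaping the overall trajectory of "
     "psychological change."),
]

def build_summary(m, a, delta, direction, mal_features, ada_features):
    """Assemble the summary from scores M and A, change type, and direction."""
    mal_first = m >= a
    features = mal_features if mal_first else ada_features
    parts = [INITIAL[mal_first].format(features=", ".join(features)),
             DYNAMICS[m > a]]
    if delta in TRANSITION:      # no transition sentence for other values
        parts.append(TRANSITION[delta])
    if direction in DIRECTION:   # no direction sentence for other values
        parts.append(DIRECTION[direction])
    parts.extend(GLOBAL)         # always included
    return " ".join(parts)
```

Note that the two comparisons differ: with $M = A$ the initial phase uses the maladaptive sentence \($M \geq A$\) while the temporal dynamics use the adaptive one \(since $M > A$ fails\)\. Because every branch is a fixed string, identical inputs always yield identical summaries\.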
### D\.4 Task 3\.2 Prompts
## Appendix E Discussion
In this section, we discuss the performance characteristics, advantages, and limitations of the proposed methods\.
### E\.1 Consistency vs\. Lexical Similarity in Template\-Based Summarization
The template\-based summarization method achieved the highest CS and lowest CT scores, highlighting its strength in producing stable and logically coherent outputs\. This behavior is expected, as the generation process is strictly governed by predefined linguistic rules and structured feature\-to\-text mappings, which reduce variability and minimize the risk of semantic drift or internally inconsistent statements\. Such control is particularly important in sensitive domains such as palliative care narrative analysis, where reliability and interpretability of generated summaries are critical\.
However, this same constrained generation process leads to lower ROUGE and BERTScore performance compared to more flexible neural baselines\. Both metrics reward lexical overlap and semantic similarity to reference summaries, which are typically written in a more natural and varied style\. Template\-based outputs, while semantically faithful and structurally consistent, do not exhibit the paraphrastic richness or lexical alignment required to maximize these scores\. As a result, there is an inherent trade\-off between consistency\-oriented generation and similarity\-based evaluation metrics\.
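The sensitivity of overlap\-based metrics to wording can be seen in a simplified ROUGE\-L recall, computed from the longest common subsequence \(LCS\) of tokens\. This is an illustrative sketch only: whitespace tokenization and lowercasing are assumptions, and the official scorer may tokenize or stem differently\.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(reference, candidate):
    """Share of reference tokens covered by the LCS with the candidate."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0
```

A candidate that conveys the same idea with different words shares few tokens with the reference and scores low, which is consistent with fixed template phrasing being penalized on ROUGE\-L even when it is semantically faithful\.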
This trade\-off suggests that template\-based summarization is particularly well\-suited for applications where factual stability, interpretability, and controllability are prioritized over surface\-level similarity to reference texts\. In the context of psychological trajectory modeling from patient and caregiver narratives, such methods are advantageous for producing reproducible summaries of adaptive and maladaptive self\-state dynamics\. Conversely, neural summarization models may be preferred in settings where linguistic expressiveness and alignment with human\-written references are more important than strict structural consistency\.
## Appendix F Details on rankings and evaluation scores
This section lists our official rankings and scores alongside those of other teams\.
Figure 10: Task 2 ranking\.

Table 2: Task 1\.1 and Task 1\.2 rankings\. Task 1\.1 is ranked by Subelement Average Macro F1, while Task 1\.2 is ranked by RMSE, where lower is better\.

Table 3: The official ranking of Task 3\.1 based on selected submissions of all teams\. CS, ROUGE\-L Recall, and BERTScore Recall are higher\-is\-better; CT is lower\-is\-better\. Final rank is based on the average of metric\-specific ranks\. We rank 2nd best overall\.

Table 4: Summary of DreamerNLplus submissions on Task 3\.1 across evaluation metrics \(↑ higher is better, ↓ lower is better\)\. For detailed scores, see Fig\.[11](https://arxiv.org/html/2605.23052#A6.F11)\.

Figure 11: Task 3\.1 evaluation on the test set, all teams \(our filtering: the best averaged\-rank submission from each of the other teams\)\.

Table 5: Task 3\.2 official rankings for recurrent dynamic signatures of Improvement and Deterioration\. Best scores in each metric column are shown in bold, and second\-best scores are underlined\. We won the Improvement category and ranked 2nd best on Specificity for the Deterioration category\.