WiseMind: a knowledge-guided multi-agent framework for accurate and empathetic psychiatric diagnosis
Summary
WiseMind is a knowledge-guided multi-agent framework that uses LLMs for psychiatric diagnosis by combining a "Reasonable Mind" agent for evidence-based logic with an "Emotional Mind" agent for empathetic communication, achieving 85.6% diagnostic accuracy on simulated and real patient interactions. The framework leverages DSM-5 structured knowledge graphs to reduce hallucinations and outperforms single-agent baselines by 15-54 percentage points while maintaining clinical soundness and psychological support.
Cached at: 04/20/26, 08:32 AM
# WiseMind: A Knowledge-Guided Multi-Agent Framework for Accurate and Empathetic Psychiatric Diagnosis

Source: https://arxiv.org/html/2502.20689

Jingjing Li [4], Yanbo Zhang [5], Jie Chen [1,2]

These authors contributed equally to this work.

[1] College of Biomedical Engineering, Fudan University, Shanghai 200433, China
[2] Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta T6G 2V4, Canada
[3] School of Data Science, University of Virginia, Charlottesville, Virginia 22903, USA
[4] McIntire School of Commerce, University of Virginia, Charlottesville, Virginia 22903, USA
[5] Department of Psychiatry, University of Alberta, Edmonton, Alberta T6G 2R3, Canada

###### Abstract

Large Language Models (LLMs) offer promising opportunities to support mental healthcare workflows, yet they often lack the structured clinical reasoning needed for reliable diagnosis and may struggle to provide the emotionally attuned communication essential for patient trust. Here, we introduce WiseMind, a novel multi-agent framework inspired by the theory of Dialectical Behavior Therapy designed to facilitate psychiatric assessment. By integrating a “Reasonable Mind” Agent for evidence-based logic and an “Emotional Mind” Agent for empathetic communication, WiseMind effectively bridges the gap between instrumental accuracy and humanistic care.
Our framework utilizes a Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5)-guided Structured Knowledge Graph to steer diagnostic inquiries, significantly reducing hallucinations compared to standard prompting methods. Using a combination of virtual standard patients, simulated interactions, and real human interaction datasets, we evaluate WiseMind across three common psychiatric conditions. WiseMind outperforms state-of-the-art LLM methods in both identifying critical diagnostic nodes and establishing accurate differential diagnoses. Across 1206 simulated conversations and 180 real user sessions, the system achieves 85.6% top-1 diagnostic accuracy, approaching reported diagnostic performance ranges of board-certified psychiatrists and surpassing knowledge-enhanced single-agent baselines by 15–54 percentage points. Expert review by psychiatrists further validates that WiseMind generates responses that are not only clinically sound but also psychologically supportive, demonstrating the feasibility of empathetic, reliable AI agents to conduct psychiatric assessments under appropriate human oversight.

###### keywords: Large Language Models, Psychiatry, Differential Diagnosis, Conversational Diagnosis

## Introduction

Psychiatric assessment and diagnosis are among the most demanding tasks in clinical medicine, requiring clinicians to synthesize verbal symptom histories, behavioral observations, and contextual factors within an empathic, adaptive interview [Carlat2023,nordgaard2013psychiatric]. Clinicians must differentiate overlapping symptom clusters, apply temporal qualifiers and exclusion rules, monitor for self-harm or crisis risk, and simultaneously maintain rapport and respond to emotional cues [First2024DSM5TR,Demazeux2015DSM5,Carlat2023,nordgaard2013psychiatric].
These challenges are compounded by escalating clinical demand [Kessler2005PrevalenceNCS], workforce shortages [thomas2006continuing,butryn2017shortage], long training pipelines [das2023graduate,button2025clinical], and expectations for culturally attuned, equitable care [topol2019high,moon2023ethical]. As these pressures intensify, there is growing interest in whether large language models (LLMs) could assist intake, triage, and early-stage diagnostic reasoning.

Although LLMs have demonstrated promising performance on medical board examinations [Smith2023LLMReview,Johnson2023HealthcareLLM], most clinical Natural Language Processing (NLP) systems are not yet fully aligned with the cognitive, interpersonal, and ethical demands of psychiatric assessment. Many rely on pretraining, fine-tuning, in-context learning (ICL), or retrieval-augmented generation (RAG) [wu2023systematic,xue2024ai,demetriou2020machine,wu2023automatic,freidel2025knowledge], yet they are evaluated mainly with technical metrics such as F1-score or Bilingual Evaluation Understudy (BLEU). As highlighted in recent clinical NLP surveys [Wu2022UKClinicalNLP,Croxford2025LLMEval], these systems often perform well in silico but are still insufficient to meet real-world expectations for systematic diagnostic reasoning, flexible interviewing, empathic engagement, and stringent safety supervision [tu2025towards]. Three gaps restrict their clinical utility, as summarized in Table 1 (https://arxiv.org/html/2502.20689#Sx1.T1).

Table 1: Interdisciplinary contextualization framework showing gaps in current approaches and WiseMind’s corresponding solutions for psychiatric differential diagnosis.

First, a domain knowledge gap limits diagnostic reliability.
Psychiatric diagnosis is governed by highly structured decision pathways defined in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [american2013dsm5,regier2013dsm,First2024DSM5TR,Demazeux2015DSM5] and the International Classification of Diseases, Eleventh Revision (ICD-11) [who2019icd]. These frameworks organize disorders into hierarchical symptom clusters, supplemented by temporal qualifiers, exclusion criteria, and specifiers that guide clinicians in distinguishing among more than 150 frequently overlapping conditions. This structured clinical reasoning process, known as psychiatric differential diagnosis (DDx), is essential for determining the most accurate and safe diagnostic formulation [Demazeux2015DSM5]. Because DDx relies almost entirely on verbal histories rather than laboratory or imaging findings, effective diagnostic-support systems must adhere to these structured pathways to maintain clinical coherence [Carlat2023,nordgaard2013psychiatric]. Yet current LLM knowledge-integration methods represent information as flat or lightly tagged text [yoran2024making], overlooking this tree-like structure and producing reactive, prompt-driven behavior that increases the risk of omissions, disorganized dialogue, and missed exclusion criteria [lu2024multimodal,meurisch2020exploring,10.1145/3560815].

Second, a process gap prevents empathic and adaptive interviewing. Effective psychiatric assessment requires balancing analytic reasoning and empathic engagement [nordgaard2013psychiatric]. Clinicians build rapport, attune to affect, sense hesitation, respond to emotional cues, and dynamically adjust questioning. These relational skills strongly shape disclosure, including suicidality, trauma, mania, psychosis, or substance use [savander2024take].
This balance mirrors Dialectical Behavior Therapy (DBT)’s distinction between the “reasonable mind” (cognitive, rule-based) and the “emotional mind” (affective, validating) [linehan1993cognitive]. Current LLMs excel at chain-of-thought reasoning [singhal2025toward,chung2024scaling,singhal2023large] but can sometimes lack thoughtful emotional responses. By contrast, empathy-oriented systems such as Woebot, Wysa, and EmoGPT foster rapport but may not fully reflect clinical reasoning [fitzpatrick2017delivering,inkster2018empathy,lan2024depressiondiagnosisdialoguesimulation]. Emerging multi-agent systems [kim2024mdagents,tu2025towards,mcduff2025towards] begin to integrate these capabilities but still rely on heuristic or opaque coordination mechanisms rather than psychologically grounded theories and practices.

Third, an evaluation gap limits safe deployment. High-stakes psychiatric AI requires robust assessment of empathy, conversational quality, trustworthiness, fairness, and crisis-response behavior, not only numerical accuracy [aggarwal2024cultural,kanjee2023accuracy,Yu2022EmpathyDevelopment,Licciardone2024EmpathyChronicPain]. Yet fewer than 15% of mental health NLP studies incorporate user-centered or ethical metrics [aggarwal2024cultural,kanjee2023accuracy]. Few systematically probe resilience to adversarial prompts or suicidal ideation [robertson2023diverse,kerz2023toward,pashak2022build]. Ethical evaluations are particularly challenging because they require interacting with vulnerable populations, raising risks of harm and participatory injustice [ferrara2022MLpsychosis,Pozzi2025ParticipatoryInjustice,Meadi2025ConversationalAI] and requiring intensive human oversight [BearDontWalk2022ClinicalNLPEthics,Zhang2022MentalIllnessNLP]. Existing approaches therefore lack the breadth and depth of evaluation required for clinical adoption [tu2024multiple].

Addressing these gaps, we introduce WiseMind, a multi-agent LLM framework inspired by the “reasonable mind” (rational, cognitive) and “emotional mind” (affective, intuitive) constructs of DBT [linehan1993cognitive]. WiseMind operationalizes a “contextualization trio” across knowledge, process, and evaluation layers and integrates the complementary strengths of analytic reasoning and empathic communication through a theory-informed dual-agent architecture. WiseMind is explicitly designed as an assistive technology for intake and triage support, rather than a stand-alone diagnostic system. It operationalizes this “contextualization trio” through three coordinated components: (i) Structured Knowledge–Guided Proactive Reasoning, which encodes the full DSM-5 decision graph as a state-transition knowledge graph to steer criterion-aligned question sequencing; (ii) Theory-Informed Dual-Agent Architecture, which deploys a DBT-aligned workflow wherein a *Reasonable-Mind Agent (RA)* consults the graph to determine the next diagnostic action while an *Emotional-Mind Agent (EA)* re-expresses that intent in empathic, trust-building language; and (iii) Multi-Faceted Evaluation Strategy, which couples technical testing with a three-tier validation pipeline (simulated patients, lay-user studies, and expert clinician review), further strengthened by ethical stress-testing for self-harm scenarios and bias audits across age and gender subgroups.

We evaluate WiseMind on three common mental health conditions: depressive mood (depression), elevated mood (hypomania or mania), and anxious mood (anxiety). Across 1206 simulated conversations and 180 real user sessions, the system achieves 85.6% top-1 diagnostic accuracy, surpassing knowledge-enhanced single-agent baselines by 15–54 percentage points (p < 0.01).
In parallel, blinded raters judged WiseMind’s empathic quality 13%–54% higher and its advice precision 12%–30% higher than competing models (ICC2 > 0.75). Ethical audits further show that the system refuses unsafe instructions, flags suicidal ideation, and reduces hallucinated medication recommendations.

This study makes several contributions to digital psychiatry and the design of clinically reliable LLM systems. First, we encode the DSM-5 differential-diagnosis pathways into a structured state-transition knowledge graph, enabling proactive, criterion-aligned interviewing rather than reactive text retrieval. Second, we introduce a DBT-informed dual-agent architecture that cleanly separates analytic diagnostic reasoning from empathic, patient-centered communication, a requirement often unmet by single-agent LLMs. Third, we develop a comprehensive human-centered evaluation pipeline that spans virtual patient interactions, real-user feedback, clinician review, and ethical stress-testing for suicidality and demographic bias. Finally, we demonstrate that WiseMind delivers clinically meaningful improvements in diagnostic accuracy, empathetic performance, and medical proficiency, illustrating how deep domain contextualization across knowledge, process, and evaluation can inform future work on assistive LLM systems in psychiatric and other high-stakes domains.

## Results

### System Overview

To address the design gaps identified in the Introduction, we propose WiseMind. Fig. 1 (https://arxiv.org/html/2502.20689#Sx2.F1) illustrates the overall WiseMind workflow, demonstrating how the system transforms patient input into clinically validated diagnostic inquiries.

Figure 1: Overview of the WiseMind system. A DSM-5–derived structured knowledge graph (SKG) guides the Reasonable Mind Agent (RA) in performing structured knowledge retrieval (tool usage), LLM-based reasoning (planning), and action prediction.
The Emotional Mind Agent (EA) translates (planning) this diagnostic intent into empathic, patient-facing responses based on retrieved medical information (tool usage). Both agents share a short-term memory that tracks conversation state, symptom evolution, and risk signals. A risk-management layer filters unsafe or contradictory content. System performance is evaluated through a multi-tier framework combining simulated interviews, human-participant interactions, and psychiatrist ratings.

WiseMind employs two complementary agents, the Reasonable Mind Agent (RA) and the Emotional Mind Agent (EA), which synchronize through a Shared Short-Term Memory module to track conversation state and symptom evolution. The workflow proceeds step by step, corresponding to the numbered flow in the figure.

The process begins when the patient reports symptoms (1). This input is screened by the Risk Management layer to detect high-risk language (e.g., self-harm) before entering the diagnostic loop. In steps 2–5, the RA serves as the primary interface. It processes the input and uses the SKG Retrieval Tool (2) to query (3) and obtain (4) relevant medical context from the SKG. Based on this, the RA generates an initial Action Prediction (5) (i.e., whether to ask for more details or validate criteria) within the defined Action Space.

Steps 6–9 consist of Emotional Processing and Question Formulation. The action intent from the RA is passed to the EA to generate patient-facing communication (step 5). The EA validates the intent against DSM-5 clinical logic by using the SKG tool (6) to query (7) and obtain (8) medical knowledge. It then applies therapeutic language guidelines (such as validation, reflective phrasing, and non-judgmental tone) to maintain empathy. Finally, it integrates these clinical and emotional cues to formulate the next interview question, completing the Ask Next Interview Question (9) step.
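
The numbered workflow above can be pictured with a small toy sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the two-node `SKG` dictionary, the `RISK_TERMS` keyword list, the `Memory` fields, and all function names are hypothetical, the node labels are placeholders and not DSM-5 criteria, and simple rules stand in for the LLM calls that the RA and EA would make. The EA's own SKG lookup (steps 6–8) is omitted for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical toy state-transition graph: each node names a topic to ask
# about and maps a yes/no answer to the next node or to an END outcome.
SKG = {
    "criterion_a": {"topic": "the first screening item",
                    "yes": "criterion_b", "no": "END:pattern_not_met"},
    "criterion_b": {"topic": "how long this has lasted",
                    "yes": "END:pattern_met", "no": "END:pattern_not_met"},
}

# Hypothetical keyword list standing in for the risk-management layer.
RISK_TERMS = ("hurt myself", "end my life")


@dataclass
class Memory:
    """Shared short-term memory: current graph node, answers, risk flag."""
    node: str = "criterion_a"
    answers: dict = field(default_factory=dict)
    risk_flag: bool = False


def screen_risk(text: str) -> bool:
    """Return True if the patient text contains a high-risk phrase."""
    lowered = text.lower()
    return any(term in lowered for term in RISK_TERMS)


def reasonable_agent(memory: Memory, patient_text: str) -> dict:
    """RA stand-in: record the answer, follow the graph edge, pick an action."""
    answer = "yes" if patient_text.strip().lower().startswith("yes") else "no"
    memory.answers[memory.node] = answer
    nxt = SKG[memory.node][answer]
    if nxt.startswith("END:"):
        return {"type": "conclude", "outcome": nxt[len("END:"):]}
    memory.node = nxt
    return {"type": "ask", "topic": SKG[nxt]["topic"]}


def emotional_agent(action: dict) -> str:
    """EA stand-in: re-express the RA's intent in validating language."""
    if action["type"] == "conclude":
        return ("Thank you for sharing this with me. "
                f"Summary for clinician review: {action['outcome']}.")
    return ("Thank you for telling me that. "
            f"Could you say a little more about {action['topic']}?")


def turn(memory: Memory, patient_text: str) -> str:
    """One conversational turn: risk screen, then RA, then EA."""
    if screen_risk(patient_text):
        memory.risk_flag = True
        return "ESCALATE: high-risk language detected; hand off to a clinician."
    return emotional_agent(reasonable_agent(memory, patient_text))
```

Calling `turn(Memory(), "Yes, most days")` advances the shared memory to the second node and returns an empathically phrased follow-up question, while a message containing a listed risk phrase bypasses both agents and sets `risk_flag`. Keeping the graph traversal in the RA and the wording in the EA mirrors the separation the text describes, though a real system would replace the rule-based stubs with model calls.
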

Both agents operate over the DSM-5-adapted SKG, ensuring that every conversational turn is grounded in established psychiatric criteria while maintaining interactional fluidity. For detailed examples of conversations and a step-by-step breakdown of how different parts of the agentic workflow contribute to the clinical dialogue, see Supplementary Section 1.6.

Figure 2: Multi-Faceted Evaluation Framework for WiseMind. The system’s performance is holistically validated across three distinct tiers, ensuring the system meets standards for technical accuracy, patient experience, and medical proficiency: (a) Tier 1 (Simulated Interaction Evaluation): Simulated Clinical Interview assesses Diagnostic Performance by comparing WiseMind’s Predicted Diagnosis against the Ground Truth Diagnosis of a Virtual Standardized Patient. Metrics quantified include Differential Diagnosis Accuracy (DDx-ACC) and Critical Node Recall (CN-Recall). (b) Tier 2 (Real Interaction Evaluation): User Experience Evaluation assesses Empathetic Performance
Similar Articles
LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
Introduces LingxiDiagBench, a large-scale multi-agent benchmark for evaluating LLMs on Chinese psychiatric consultation and diagnosis. Key findings show high accuracy on binary classification but poor performance on multi-way differential diagnosis, highlighting a decoupling between conversational quality and diagnostic accuracy.
Knowledge-augmented Agentic AI for Mental Health Medication Information Seeking
This paper presents a provenance-aware, knowledge-graph-based multi-agent framework that integrates patient narratives from Reddit and WebMD with FDA adverse event reports for nine antidepressants, using an LLM entity-recognition pipeline to achieve high accuracy and enabling traceable safety information for psychiatric medications.
Mental-R1: Aligning LLM Reasoning for Mental Health Assessment
Proposes Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework for aligning LLM reasoning in mental health assessment, achieving an average improvement of 10.4 percentage points in weighted F1-score over existing baselines.
MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments
The paper introduces MedExAgent, a framework that formalizes clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) to handle noisy and incomplete information. It proposes a two-stage training pipeline combining supervised finetuning and reinforcement learning to improve diagnostic accuracy and cost-efficiency in medical LLMs.
Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications
This paper proposes a multi-agent framework using deterministic orchestration and neuro-symbolic state tracking to mitigate premature diagnostic handoff and silent hallucinations in healthcare LLM applications.