Knowledge-augmented Agentic AI for Mental Health Medication Information Seeking
Summary
This paper presents a provenance-aware, knowledge-graph-based multi-agent framework that integrates patient narratives from Reddit and WebMD with FDA adverse event reports for nine antidepressants. An LLM entity-recognition pipeline benchmarked against physician annotations reaches F1 scores of up to 0.969 for medications and 0.973 for conditions, and the knowledge graph keeps psychiatric medication safety information traceable to its source.
# Introduction

Source: [https://arxiv.org/html/2606.26205](https://arxiv.org/html/2606.26205)

Knowledge\-augmented Agentic AI for Mental Health Medication Information Seeking

Huizi Yu1†, Jian Liu2†, Wenkong Wang3†, Lingyao Li4, Jiayan Zhou5, Zhaoqian Xue6, Xiang Li2, Xinxin Lin2, Zhiying Liang2, Zhuoru Wu2, Siyuan Ma7, Xin Ma3, and Lizhou Fan2,8\*

1Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
2Department of Psychiatry, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
3School of Control Science and Engineering, Shandong University, Ji’nan, Shandong, China
4College of Information Science, University of Arizona, Tucson, Arizona, USA
5Department of Medicine, Stanford University School of Medicine, Stanford University, Palo Alto, California, USA
6Perelman School of Medicine, The University of Pennsylvania, Philadelphia, Pennsylvania, USA
7Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, USA
8Li Ka Shing Institute of Health Sciences, Faculty of Medicine, The Chinese University of Hong Kong, Shatin, N\.T\., Hong Kong SAR, China

†These authors contributed equally\.
\*Correspondence to: Lizhou Fan \(leofan@cuhk\.edu\.hk\)

ABSTRACT

Patients increasingly seek medication information online, yet safety knowledge for psychiatric drugs is split between regulatory adverse\-event records, which are authoritative but abstract, and patient narratives, which are experience\-near but unvalidated\. Integrating them without conflating evidence and anecdote is especially consequential in psychiatry, where poorly contextualised information can amplify fear, nocebo responses, and non\-adherence\. Here we develop a provenance\-aware, knowledge\-graph\-based multi\-agent framework unifying 466,525 Reddit posts, 60,782 WebMD reviews, and twenty years of U\.S\. 
FDA Adverse Event Reporting System records for nine antidepressants\. A large\-language\-model entity\-recognition pipeline benchmarked against physician annotations reached peak F1 scores of 0\.969 for medications and 0\.973 for conditions\. The two community platforms were far more concordant with each other \(Jaccard similarity of up to 0\.905\) than with regulatory reports, indicating that patient\-generated data form a partly independent safety signal\. For sertraline, many adverse events appeared in community sources hundreds of days before the corresponding FDA date\. A Neo4j knowledge graph grounded in ATC\-N, ICD\-10, and MedDRA vocabularies preserves provenance, keeping every claim traceable and regulatory facts distinct from patient experience\. These results establish source\-aware integration as a route to more auditable psychiatric medication information, with usefulness and patient benefit to be tested prospectively\.

Psychiatric pharmacotherapy is often long term, individualized, and dynamically adjusted over the course of illness, which makes patient understanding of medication benefits, adverse effects, and uncertainty especially important\. Yet in routine care, medication counseling is frequently compressed into short clinical encounters, leaving many patients with unanswered questions after the visit\. 
This gap matters in a digital environment where seeking health information online has become routine across regions: in Europe, 55% of people aged 16–74 sought health\-related information online in 2020\[[1](https://arxiv.org/html/2606.26205#bib.bib1)\]; in the United States, the proportion of adults who used the internet first for their most recent health information search increased from 61\.2% in 2008 to 74\.4% in 2017\[[2](https://arxiv.org/html/2606.26205#bib.bib2)\]; and in Asia, a 10\-country survey found that 71\.6% of smartphone users sought health information on their smartphones at least a few times per month\[[3](https://arxiv.org/html/2606.26205#bib.bib3)\]\. Prior syntheses suggest that online health information seeking is closely tied to medication\-related decision\-making, even if its relationship with adherence is complex and heterogeneous across settings\[[4](https://arxiv.org/html/2606.26205#bib.bib4),[5](https://arxiv.org/html/2606.26205#bib.bib5)\]\. In parallel, recent work shows that patients’ knowledge of newly prescribed medications remains incomplete, particularly for practical details such as administration and side effects, underscoring the need for clearer and more accessible medication education tools\[[6](https://arxiv.org/html/2606.26205#bib.bib6)\]\. The challenge may be particularly acute in psychiatric care\. Patients prescribed antidepressants, antipsychotics, mood stabilizers, or anxiolytics often seek explanations not only about formal adverse\-effect lists, but also about how medication\-related experiences unfold in daily life: changes in sleep, appetite, concentration, affect, weight, motivation, or discontinuation symptoms\. 
Peer forums and online communities can provide validation, practical language, and a sense of not being alone; recent mixed\-methods work in mental health forums suggests that users often derive emotional support, normalization, and practical benefit from these spaces\[[7](https://arxiv.org/html/2606.26205#bib.bib7)\]\. At the same time, peer narratives are highly heterogeneous in accuracy and representativeness\. In medication contexts, expectations can themselves shape the experience of side effects, and negative or poorly contextualized information may amplify anxiety and nocebo responses\[[8](https://arxiv.org/html/2606.26205#bib.bib8),[9](https://arxiv.org/html/2606.26205#bib.bib9)\]\. For psychiatric medications, where adherence is already vulnerable to fear, stigma, and uncertainty, this creates a communication problem: patients need information that is understandable and experience\-near, but they also need it to be proportionate, evidence\-aware, and safe\. Existing information channels each address only part of this need\. Regulatory and professional sources offer standardized, evidence\-based descriptions of indications, warnings, and reported adverse events, but they are often experienced by patients as abstract, decontextualized, or difficult to map onto lived experience\. FAERS, for example, is a major postmarketing safety resource explicitly designed to support the FDA’s postmarketing surveillance of marketed drugs and biologics\[[10](https://arxiv.org/html/2606.26205#bib.bib10)\]\. Meanwhile, social media and consumer\-generated narratives can surface patient\-reported experiences that may be absent, delayed, or underemphasized in formal channels\. A recent scoping review concluded that social media analysis may serve as a useful supplementary source for adverse\-event detection and pharmacovigilance, while also emphasizing the need for careful validation and source\-aware interpretation\[[11](https://arxiv.org/html/2606.26205#bib.bib11)\]\. 
The central problem, therefore, is not whether authoritative data or patient narratives are “better,” but how to integrate them so that official evidence remains the factual backbone while real\-world experience adds context rather than distortion\. Large language models offer a potentially powerful interface for this integration because they can translate complex medical information into fluent, user\-facing explanations and support conversational querying\. However, the same properties that make LLMs accessible also make them risky in health settings\. Systematic reviews show rapid growth in LLM applications for patient care and chatbot\-based health advice, but also document recurring concerns around factual reliability, transparency, evaluation quality, and governance\[[12](https://arxiv.org/html/2606.26205#bib.bib12),[13](https://arxiv.org/html/2606.26205#bib.bib13),[14](https://arxiv.org/html/2606.26205#bib.bib14)\]\. More pointedly, benchmark work in clinical decision\-making shows that current LLMs are not ready for autonomous clinical use\[[15](https://arxiv.org/html/2606.26205#bib.bib15)\], and workflow\-level evaluations further suggest that clinical readiness depends on how models gather, interpret, and communicate information across multi\-turn consultation processes, not only final\-answer accuracy\[[16](https://arxiv.org/html/2606.26205#bib.bib16)\]\. 
Recent safety\-focused research has called for clinician\-centered frameworks to assess hallucinations and clinical harm rather than relying on surface plausibility alone\[[17](https://arxiv.org/html/2606.26205#bib.bib17)\], while evidence\-guided alignment frameworks have also been proposed to improve structured psychiatric clinical reasoning in lighter\-weight LLMs\[[18](https://arxiv.org/html/2606.26205#bib.bib18)\]\. These concerns are especially salient in psychiatry, where poorly framed answers about adverse effects, suicidality, withdrawal, or vulnerable populations could increase fear, disrupt adherence, or undermine trust\. As a result, any AI system for psychiatric medication education should be explicitly constrained, source\-grounded, and evaluated as an educational aid rather than a substitute for professional judgment\[[19](https://arxiv.org/html/2606.26205#bib.bib19)\]\.

A promising way to achieve this is to combine LLMs with knowledge graphs and multi\-agent orchestration\. Knowledge graphs provide a structured representation of entities, relations, and provenance across heterogeneous information sources, improving interpretability and supporting evidence\-aware retrieval\[[20](https://arxiv.org/html/2606.26205#bib.bib20)\]\. In healthcare, retrieval\-augmented and graph\-grounded LLM workflows are increasingly used to reduce hallucination and improve factual consistency by separating evidence retrieval from answer generation\[[17](https://arxiv.org/html/2606.26205#bib.bib17)\]\. For the present application, this architecture is particularly attractive because it can encode different evidence roles within the same system: regulatory or professional sources can anchor factual claims, while community narratives can be surfaced as contextualized, secondary experience signals\. 
A multi\-agent design can then distribute functions such as query understanding, source selection, retrieval, synthesis, and validation, thereby reinforcing internal checks and clearer safety boundaries\[[17](https://arxiv.org/html/2606.26205#bib.bib17),[19](https://arxiv.org/html/2606.26205#bib.bib19),[21](https://arxiv.org/html/2606.26205#bib.bib21),[22](https://arxiv.org/html/2606.26205#bib.bib22),[23](https://arxiv.org/html/2606.26205#bib.bib23),[24](https://arxiv.org/html/2606.26205#bib.bib24)\]\.

Against this background, the present study constructs and characterises a knowledge\-graph\-driven multi\-agent AI system for psychiatric drug adverse event information using multi\-source health data\. The system integrates community and regulatory information from Reddit, WebMD, and FDA/FAERS into a unified knowledge framework designed to support educational question answering about medication use, side effects, withdrawal, and risk communication, rather than diagnosis or treatment recommendation\. The core contributions of this work are: \(i\) a scalable LLM\-based NER pipeline benchmarked across nine state\-of\-the\-art models on physician\-annotated data; \(ii\) a multi\-source comparative analysis of adverse\-event profiles across regulatory and community\-derived data streams, including cross\-source overlap, frequency structure, and temporal lead\-time analysis for nine antidepressants; and \(iii\) a Neo4j knowledge graph and multi\-agent chatbot architecture providing provenance\-aware, safety\-constrained retrieval of psychiatric medication information\.

## Methods

### Drug List Construction \(ATC\-N\)

We built the drug dictionary from the WHO Anatomical Therapeutic Chemical \(ATC\) system, which hierarchically classifies medicines by anatomical system, therapeutic/pharmacologic class, and generic name \(i\.e\., chemical substance\)\. 
From the ATC Index, we programmatically retrieved the hierarchy and filtered to the N \(Nervous system\) branch, which contained 626 generic drug names\. To improve recall during downstream text mining, we expanded each generic name with brand\-name synonyms using a constrained LLM prompt that returned brand names in a simple structured format \(see Appendix[A](https://arxiv.org/html/2606.26205#A1)\)\. The resulting ATC\-N dictionary, containing ATC code, generic drug name, and curated brand names, serves as the basis for keyword generation in data collection and for medication normalization across community and agency sources\.

### Data Collection \(Community and Agency\)

We collected community\-based discourse posts from Reddit using the Academic Torrents public dataset, restricting the analysis window to June 2005 through April 2025\. Reddit is a large, topic\-organized social platform where users share timestamped narratives within public forums\. We included submissions whose title or body contained any ATC\-N generic or curated brand keyword\. To enhance textual quality and specificity, we removed entries with missing content, concatenated titles and body text, excluded records with fewer than 10 words, and deduplicated exact \(title, body\) pairs\. Posts triggered solely by ambiguous keyword matches \(e\.g\., “lithium”, “cocaine”\) were also excluded\. Finally, we applied automated language detection to retain only English\-language posts\. This process resulted in 1,138,331 unique, English, keyword\-positive Reddit posts\.

Upon closer inspection, we found that a substantial portion of Reddit posts retained after keyword\- and rule\-based filtering still lacked meaningful clinical information\. To further enhance the overall data quality, we introduced a secondary filtering stage designed to differentiate information\-rich from information\-poor content \(see Appendix[B](https://arxiv.org/html/2606.26205#A2)\)\. The final Reddit post collection contains 466,525 posts\. 
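The rule-based part of the Reddit filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the keyword set, the field names, and the reading of "missing content" as "no usable text" are all assumptions. Language detection, ambiguous-keyword exclusion, and the model-based secondary filter are left out.

```python
import re

# Hypothetical keyword set; the paper derives its keywords from the ATC-N dictionary.
KEYWORDS = {"sertraline", "zoloft", "fluoxetine", "prozac"}
MIN_WORDS = 10  # records with fewer than 10 words are excluded


def filter_posts(posts, keywords=KEYWORDS, min_words=MIN_WORDS):
    """Apply the rule-based steps to an iterable of {'title', 'body'} dicts."""
    seen = set()
    kept = []
    for post in posts:
        title = (post.get("title") or "").strip()
        body = (post.get("body") or "").strip()
        if not title and not body:
            continue  # assumption: "missing content" means no usable text at all
        if (title, body) in seen:
            continue  # exact (title, body) duplicate
        seen.add((title, body))
        text = f"{title} {body}".strip()  # concatenate title and body
        if len(text.split()) < min_words:
            continue  # too short to carry clinical detail
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        if not tokens & set(keywords):
            continue  # no single-token keyword match in title or body
        kept.append({"title": title, "body": body, "text": text})
    return kept
```

Matching on single lowercase tokens keeps the sketch short; multi-word brand names would need phrase matching instead.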
For the second community\-based discourse source, we collected structured medication reviews from WebMD, a consumer health platform where patients and caregivers submit rated, time\-stamped narratives about prescribed drugs\. For each ATC\-N generic name, we identified the corresponding review page\(s\) and scraped all available entries, capturing users’ age group, gender, duration of use, reviewer role \(patient or caregiver\), overall rating, and free\-text review\. Compared with Reddit, WebMD provides semi\-structured demographic and usage fields alongside narrative text, enabling complementary community\-based signal characterization and stratified analyses\. We applied the same rule\-based and model\-based filtering as for the Reddit data, and the remaining WebMD corpus contained 60,782 reviews\.

As the agency source, we analyzed the U\.S\. FDA’s Adverse Event Reporting System \(FAERS\), a large, long\-standing national pharmacovigilance database with millions of post\-marketing reports and broad regulatory prestige\. We set the analytic window to 2005 Quarter 2 through 2025 Quarter 1 to correspond with the community data\. Using the open\-source faers\-toolkit \(GitHub\), we parsed quarterly releases into SQLite\. We then filtered FAERS to the ATC\-N drug set by matching each report’s drug names against our ATC\-N dictionary\. Finally, we restricted records by FDA receipt date and applied light normalization \(trimming whitespace, converting empty fields to null\)\. All reporter types \(manufacturer, clinician, consumer\) were retained to maximize coverage\.

### LLM\-based Named Entity Recognition \(NER\)

To extract structured clinical information from unstructured Reddit and WebMD posts, we employed a large language model \(LLM\)\-based NER pipeline\. 
Each post was processed using a single\-pass structured extraction prompt \(see Appendix[D](https://arxiv.org/html/2606.26205#A4)\) instructing the model to identify and return a JSON object containing: \(i\) medications with dosage, dosage form, duration of use, and continuation status; \(ii\) primary psychiatric condition with severity and diagnostic status; \(iii\) comorbid conditions; and \(iv\) side effects with associated drug, severity, frequency, and duration\. Relations between entities \(TREATS, CAUSES, CAUSES\_BY\_WITHDRAW, COMORBID\_WITH\) were also extracted in the same pass\. We benchmarked nine state\-of\-the\-art LLMs on a physician\-annotated gold\-standard NER dataset \(annotation interface shown in Appendix[C](https://arxiv.org/html/2606.26205#A3); model comparison reported in the Results\)\. GPT\-4\.1\-mini was selected as the pipeline default based on its favourable balance of extraction accuracy, throughput, and cost \(Appendix[F](https://arxiv.org/html/2606.26205#A6)\)\.

### Entity Mapping and Knowledge Graph Construction

#### Entity canonicalization\.

Raw entity strings extracted by the NER pipeline were mapped to controlled biomedical vocabularies to ensure consistency across posts\. Medications were aligned to ATC\-N ingredient\-level identifiers using our ATC\-N dictionary\. Conditions \(primary and comorbid\) were mapped to International Classification of Diseases 10th Revision \(ICD\-10\) terms, and side effects were mapped to Medical Dictionary for Regulatory Activities \(MedDRA\) Preferred Terms\. Mapping was performed by embedding\-based nearest\-neighbour retrieval using text\-embedding\-3\-small, with entity\-type\-specific cosine\-similarity thresholds calibrated to maximize Youden’s J statistic on a physician\-annotated gold standard \(see Appendix[E](https://arxiv.org/html/2606.26205#A5)\)\.

#### Graph schema\.

The knowledge graph was implemented in Neo4j with four main node types: Post, Medication, Condition, and SideEffect\. 
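The threshold calibration mentioned under entity canonicalization can be illustrated with a short Python sketch. It assumes that each gold-standard item supplies the cosine similarity of its nearest-neighbour match and a binary label for whether a physician judged that match correct. The paper does not spell out this data layout, so treat it as a hypothetical setup rather than the authors' procedure.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    scores: cosine similarities of nearest-neighbour matches (assumed input).
    labels: 1 if the match was judged correct, 0 otherwise (assumed input).
    A match is accepted when its score is >= the threshold.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect match")
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):  # every observed score is a candidate cut-off
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and not y)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

Running the function once per entity type (medication, condition, side effect) would yield the entity-type-specific thresholds the text refers to.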
Post nodes are lightweight anchor nodes containing only a unique identifier and minimal metadata; full post text is stored in a linked SQLite sidecar database with full\-text search indexing \(see Appendix[G](https://arxiv.org/html/2606.26205#A7)\)\. Domain entity nodes store canonicalized ontology\-level identifiers and are deduplicated via Neo4j uniqueness constraints, so all lexical variants of the same clinical concept collapse into a single node\.

#### Typed edges\.

Relations were materialized as four typed edge classes: TREATS \(medication–condition\), CAUSES \(medication–side effect\), CAUSES\_BY\_WITHDRAW \(medication–side effect, discontinuation context\), and COMORBID\_WITH \(condition–condition\)\. In addition, MENTIONS edges link each Post node to all domain entities referenced in that post, preserving direct provenance linkage for evidence\-traced retrieval\. Where the same entity pair appeared across multiple posts, supporting post identifiers were accumulated as an edge property list\.

### Multi\-source Comparative Analysis

To compare AE patterns across data sources, we conducted a multi\-source analysis integrating FDA, WebMD, and Reddit data for a set of antidepressant drugs\. After source\-specific preprocessing and AE extraction, records from the three sources were aligned at the drug–AE level using harmonized AE terms so that comparable adverse effects could be evaluated across platforms\. This allowed us to examine both shared and source\-specific AE patterns across regulatory and community\-derived data\.

We assessed cross\-source similarity from several complementary perspectives\. First, we measured the overlap in AE profiles across source pairs to evaluate the extent to which the same adverse effects were represented in different datasets\. 
Second, we summarized how evenly AE information was distributed across the three sources, allowing us to distinguish drugs with relatively balanced cross\-source representation from those whose AE profiles were concentrated more strongly in one or two sources\. Third, we compared the relative prominence of AEs across sources to assess whether some adverse effects were emphasized more strongly in one source than another\.

We also examined temporal differences across sources using dated AE records from FDA, WebMD, and Reddit\. After aligning AE terms across the three datasets, we identified when each adverse effect first appeared in each source and compared the timing of emergence across regulatory and community\-derived data\. This temporal analysis was used to assess whether some adverse effects tended to appear earlier in community reporting streams than in FDA, or vice versa\.

Overall, this comparative framework allowed us to evaluate similarity and divergence across FDA, WebMD, and Reddit at multiple levels, including AE overlap, relative prominence, source balance, and timing of first appearance\. The goal was not to treat these data streams as interchangeable, but to characterize how they converge and differ in representing antidepressant adverse effects\.

### Multi\-Agent Adverse Event Chatbot

Figure 1: Multi\-agent adverse\-event information\-seeking architecture\. The system receives a user query and routes it through two parallel analysis streams: \(i\) a NER Agent that extracts medication entities and maps them to canonical identifiers via Entity Mapping, producing Extraction Specifications; and \(ii\) a User Intent Agent that parses the clinical question type\. Both outputs converge at the KG Query Generation Agent, which formulates structured queries against three parallel knowledge graphs \(Reddit KG, FAERS KG, and WebMD KG\)\. Extracted information from each source is then passed to the Summarization Agent, which produces source\-specific summaries\. 
The Comparison Agent synthesizes these into a cross\-source comparison, which is reviewed by the Validation Agent before a final evidence\-grounded response is returned to the user\.

The chatbot follows a multi\-agent pipeline designed to decompose information seeking into specialist sub\-tasks while enforcing source attribution throughout \(Fig\.[1](https://arxiv.org/html/2606.26205#Sx2.F1)\), consistent with recent role\-structured agentic workflows in healthcare, mental health, emergency medical services, and simulated\-patient systems\[[22](https://arxiv.org/html/2606.26205#bib.bib22),[23](https://arxiv.org/html/2606.26205#bib.bib23),[24](https://arxiv.org/html/2606.26205#bib.bib24)\]\. When a user submits a query, two agents process it in parallel: a NER Agent identifies all medication names and maps them to canonical ATC\-N identifiers via the entity\-mapping module, producing a structured extraction specification; and a User Intent Agent classifies the clinical question type \(e\.g\., general adverse event inquiry, demographic\-stratified question, or longitudinal trend query\)\. The extraction specification and the parsed user intent are combined by the KG Query Generation Agent, which selects the appropriate knowledge graphs and formulates Cypher queries against each source \(Reddit KG, FAERS KG, WebMD KG\)\. Source selection is intent\-driven: demographic or epidemiological questions route primarily to the FAERS and WebMD graphs, which carry structured age, sex, and temporal metadata; experiential or context\-rich questions route preferentially to the Reddit graph\. Retrieved evidence from each graph is passed to the Summarization Agent, which produces independent source\-specific summaries preserving provenance tags\. The Comparison Agent then synthesizes these summaries into a cross\-source comparison that highlights consensus and divergence across community and regulatory data\. 
Finally, the Validation Agent reviews the synthesized response against a predefined safety ruleset before the answer is returned to the user\. The architecture is designed to constrain generation to retrieved graph context, thereby reducing hallucination risk while preserving user\-facing fluency, a property particularly important for psychiatric medication information, where unsupported claims about adverse effects, discontinuation, or drug interactions could amplify nocebo responses or disrupt adherence\.

## Results

### Evaluation of NER Performance

Across all nine LLMs evaluated on the physician\-annotated NER benchmark, performance varied substantially by entity category and attribute type\. For medication\-related entities \(Table[1](https://arxiv.org/html/2606.26205#Sx3.T1)\), GPT\-4\.1\-mini achieved the highest F1 for medication name extraction \(0\.969\), followed by Claude\-Sonnet\-4 \(0\.952\) and Gemini\-2\.5\-Flash \(0\.947\)\. Dosage attribute extraction was markedly harder across all models, with scores ranging from 0\.523 \(GPT\-5\-nano\) to 0\.751 \(Claude\-Sonnet\-4\), likely reflecting the heterogeneous ways in which patients report drug doses in unstructured narratives\. Dosage form extraction was near\-ceiling for most premium\-tier models \(\>0\.98\), consistent with its relatively stereotyped surface forms in patient text\. For condition\-related entities \(Table[2](https://arxiv.org/html/2606.26205#Sx3.T2)\), Claude\-Sonnet\-4 achieved the highest primary\-condition F1 \(0\.973\), with GPT\-4\.1\-mini and GPT\-4o\-mini both reaching 0\.966\. Comorbid condition extraction was uniformly lower than primary\-condition extraction across all models, reflecting the greater syntactic ambiguity and higher clinical density of posts describing multiple co\-occurring diagnoses\. 
Duration of illness showed the widest cross\-model variability \(range 0\.222–0\.562\), indicating that temporal expressions in patient narratives remain a persistent challenge for current LLMs\. For side\-effect entities \(Table[3](https://arxiv.org/html/2606.26205#Sx3.T3)\), Deepseek\-V3 achieved the highest side\-effect name F1 \(0\.912\), followed by Claude\-Sonnet\-4 \(0\.879\) and Gemini\-2\.5\-Flash \(0\.846\)\. Attribute\-level extraction for side effects, particularly duration \(range 0\.188–0\.476\) and frequency \(range 0\.125–0\.750\), was substantially lower than name\-level F1, highlighting the tendency of patients to describe adverse events qualitatively rather than with explicit temporal or recurrence characterization\.

Balancing NER accuracy across all three entity categories against deployment throughput and cost \(see Appendix[F](https://arxiv.org/html/2606.26205#A6)\), GPT\-4\.1\-mini was selected as the pipeline default for full\-corpus extraction\. The consistent gap between name\-level and attribute\-level extraction across entity types underscores a fundamental challenge in clinical NLP from patient\-generated text: while models reliably detect that an adverse event occurred, precise characterization of its severity, timing, and recurrence requires either richer contextual framing or supplementary annotation\.

Table 1: Named\-entity recognition performance of LLMs on medication\-related entity extraction \(F1 scores\)\.

Table 2: Named\-entity recognition performance of LLMs on condition\-related entity extraction \(F1 scores\)\.

Table 3: Named\-entity recognition performance of LLMs on side\-effect entity extraction \(F1 scores\)\.

### Knowledge Graph

The Reddit\-derived mental health knowledge graph integrates ontology\-grounded representations of psychiatric pharmacotherapy from 466,525 information\-rich posts \(Fig\.[2](https://arxiv.org/html/2606.26205#Sx3.F2)\)\. 
Medications were canonicalized to ATC\-N ingredient\-level identifiers, conditions to ICD\-10 terms, and side effects to MedDRA Preferred Terms using embedding\-based nearest\-neighbour matching with entity\-type\-specific cosine\-similarity thresholds calibrated to maximize Youden’s J statistic on a physician\-annotated gold standard \(see Appendix[E](https://arxiv.org/html/2606.26205#A5)\)\. The graph encodes four main typed edge classes \(TREATS, CAUSES, CAUSES\_BY\_WITHDRAW, and COMORBID\_WITH\) along with MENTIONS edges that preserve direct provenance linkage between entity nodes and the Reddit posts in which they appear, enabling evidence\-traced retrieval for downstream question answering\. The design deliberately separates entity\-level semantics from post\-level content: full post text is stored in a linked SQLite sidecar database with full\-text search indexing, keeping the core knowledge graph compact and privacy\-preserving while preserving the ability to trace any inferred relation back to its source posts for downstream verification and retrieval\-augmented generation\. A sertraline\-centred subgraph illustrates how medication nodes connect to co\-occurring conditions, adverse events, and supporting Reddit posts \(Fig\.[3](https://arxiv.org/html/2606.26205#Sx3.F3)\)\.

Figure 2: Overview of the Reddit\-derived mental health knowledge graph\. Nodes represent canonicalized clinical entities \(Medication, Condition, SideEffect; color\-coded by type\) or source Posts\. Typed edges encode treatment relationships \(TREATS\), adverse\-effect associations \(CAUSES, CAUSES\_BY\_WITHDRAW\), comorbid\-disease co\-occurrence \(COMORBID\_WITH\), and source\-post provenance \(MENTIONS\)\. The layout highlights hub nodes corresponding to commonly prescribed antidepressants and frequently reported psychiatric diagnoses\.

Figure 3: Sertraline\-centred subgraph linking the medication to co\-occurring ICD\-10 conditions, MedDRA adverse effects and supporting Reddit posts\. 
Edge thickness is proportional to the number of supporting posts; TREATS, CAUSES and COMORBID\_WITH denote treatment, adverse\-event and comorbidity relations\.

### Similarity Across Sources

We first compared adverse\-event \(AE\) similarity across FDA, WebMD, and Reddit for nine antidepressants using pairwise Jaccard indices and source\-balance metrics \(Fig\.[4](https://arxiv.org/html/2606.26205#Sx3.F4); Table[4](https://arxiv.org/html/2606.26205#Sx3.T4)\)\. Across drugs, WebMD and Reddit generally showed the greatest overlap, with WebMD–Reddit Jaccard values reaching 0\.905 for desvenlafaxine, 0\.875 for duloxetine, 0\.833 for fluoxetine, 0\.805 for paroxetine, and 0\.786 for sertraline\. In contrast, overlap between FDA and the two community\-derived sources was typically lower, although the extent of this difference varied by drug\. Desvenlafaxine showed high concordance across all three source pairs, whereas vilazodone showed marked asymmetry, with high FDA–Reddit overlap \(0\.756\) but much lower FDA–WebMD and WebMD–Reddit overlap \(0\.286 and 0\.293, respectively\)\. Together, these findings suggest that a shared AE core exists across sources, but the degree of cross\-source agreement is drug\-specific\.

Source\-balance metrics showed that overlap in AE identity and balance in AE volume were not identical properties\. Fluoxetine, phenelzine, sertraline, and desvenlafaxine had relatively high composition entropy and evenness, indicating a more balanced distribution of AE counts across sources\. By contrast, duloxetine and amitriptyline showed lower entropy and evenness, suggesting that their AE profiles were more concentrated in one or two sources\.

We next performed a focused analysis of sertraline\. At the set level, sertraline followed the same overall trend as the broader drug panel, with the greatest overlap between WebMD and Reddit \(Jaccard = 0\.786\), compared with FDA–WebMD \(0\.577\) and FDA–Reddit \(0\.558\)\. 
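For readers who want to see the overlap and balance measures concretely, the following Python sketch computes a Jaccard index between two AE sets, along with an entropy and evenness score over per-source AE counts. The Jaccard definition is standard. Shannon entropy with Pielou-style normalization is an assumption here, because the text names "composition entropy and evenness" without giving formulas.

```python
import math


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of AE terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def entropy_evenness(counts):
    """Shannon entropy (nats) of per-source AE counts and its normalized form.

    Evenness divides entropy by log(number of sources), so 1.0 means the AE
    volume is spread equally across sources (assumed Pielou-style definition).
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0, 0.0
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h, h / math.log(len(counts))
```

With three sources, counts such as (FDA, WebMD, Reddit) = (10, 10, 10) give an evenness of 1, while a drug whose AEs come from a single source gives 0.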
Frequency\-based comparisons were consistent with this pattern \(Fig\.[5](https://arxiv.org/html/2606.26205#Sx3.F5)\)\. In the sertraline scatterplots, the correlation was moderate for FDA versus Reddit \(r = 0\.593\) and FDA versus WebMD \(r = 0\.588\), but substantially higher for Reddit versus WebMD \(r = 0\.847\)\. The corresponding dot\-whisker plot showed the same ordering, indicating that the two community\-derived sources were more closely aligned not only in which AEs appeared, but also in their relative prominence\.

Inspection of the sertraline scatterplots suggests that broadly recognizable, patient\-salient AEs \(such as nausea, headache, sleepiness, rash, and sexual dysfunction\) were shared across sources and tended to appear among the more prominent signals\. However, several points deviated from the diagonal, indicating source\-specific emphasis\. In general, FDA\-based comparisons appeared to give relatively greater weight to more formally reported or medically coded events, whereas Reddit and WebMD more strongly reflected symptoms that are directly felt in daily life and readily discussed in lay narratives\. This pattern is consistent with the stronger Reddit–WebMD concordance observed in both overlap and frequency analyses\.

To quantify source\-specific enrichment in more detail, we examined pairwise volcano plots for sertraline \(Fig\.[6](https://arxiv.org/html/2606.26205#Sx3.F6)\)\. These plots showed that most AEs clustered near the null region, consistent with a shared core safety profile across sources, but a smaller subset showed clear source\-preferential enrichment\. In the FDA\-versus\-Reddit comparison, several AEs were strongly shifted away from the center, indicating substantial differences in relative representation between regulatory and community reporting\. In the FDA\-versus\-WebMD comparison, the degree of divergence was still evident but generally less extreme\. 
By contrast, the Reddit-versus-WebMD comparison showed fewer large-effect outliers, again supporting the view that the two community-derived platforms are more similar to one another than either is to FDA. Notably, some experiential symptoms such as dry mouth, sexual dysfunction, and panic appeared more prominent in community-based comparisons, whereas several medically framed events showed stronger enrichment in FDA, suggesting that source differences reflect not only noise, but also distinct reporting incentives, symptom salience, and coding practices.

We further assessed temporal concordance using lead-time analysis for sertraline (Fig. [7](https://arxiv.org/html/2606.26205#Sx3.F7)), defined as the difference between the first FDA report date and the earliest first mention in the community sources. Negative values therefore indicate that an AE was mentioned earlier in Reddit or WebMD than in FDA, whereas positive values indicate earlier appearance in FDA. The lead-time distribution was skewed toward negative values, with many sertraline AEs first appearing in community data substantially earlier than in FDA, in some cases by several hundred days. This pattern suggests that community platforms may capture a subset of patient-experienced AEs earlier than formal pharmacovigilance channels. At the same time, a smaller set of AEs showed positive lead times, indicating earlier appearance in FDA. These later-emerging community mentions may reflect events that become discussed only after they are clinically recognized, formally coded, or amplified through broader public awareness. Thus, the temporal analysis supports a complementary interpretation: community and regulatory data do not simply duplicate one another, but may contribute different surveillance timing for different types of AEs.

Taken together, these analyses indicate that FDA, WebMD, and Reddit capture a shared but non-identical AE signal for antidepressants.
Across drugs, the strongest similarity was usually observed between the two community-derived platforms, whereas FDA remained related but systematically distinct. The sertraline case study further showed that this distinction is evident at multiple levels: AE set overlap, relative frequency structure, differential enrichment, and time of first appearance. These findings support the use of a multi-source integration strategy, in which regulatory data provide formal safety context and community sources add patient-centered salience, narrative detail, and potentially earlier visibility for some adverse effects.

Table 4: Pairwise adverse-event profile similarity and source balance across nine antidepressants.

Figure 4: Pooled adverse-event profile correlations across FDA, WebMD, and Reddit for antidepressants. Points show Fisher-pooled correlations across the nine-drug panel; horizontal bars indicate uncertainty intervals from the pooled estimates.

Figure 5: Pairwise scatterplots of normalized adverse-event frequencies for sertraline across FDA, Reddit, and WebMD. Each point represents one adverse event; the diagonal indicates equal relative frequency across the two compared sources.

Figure 6: Volcano analysis of adverse-event frequency differences across data sources for sertraline. Each panel compares one source pair; the x-axis shows $\log_2$ odds ratios and the y-axis shows $-\log_{10}$ false-discovery-rate-adjusted significance.

Figure 7: Longitudinal lead time of adverse-event first mention across surveillance streams for sertraline. Lead time (days) = first FDA date − min(first WebMD date, first Reddit date). Negative values indicate earlier mention in a community source; positive values indicate earlier appearance in FDA.
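The lead-time definition in the Figure 7 caption can be sketched as below. Taken literally, "first FDA date minus earliest community date" is positive when the community mention comes first, which is the opposite of the stated reading (negative means community earlier). This sketch therefore computes the earliest community date minus the first FDA date so that the sign matches the interpretation in the text. That ordering is an assumption, not a confirmed detail of the authors' implementation, and the function name and dates are hypothetical.

```python
from datetime import date
from typing import Optional


def lead_time_days(
    first_fda: date,
    first_webmd: Optional[date],
    first_reddit: Optional[date],
) -> Optional[int]:
    """Signed lead time in days for one adverse event.

    Assumed sign convention: negative when the earliest community mention
    precedes the first FDA report, positive when FDA comes first.
    Returns None when the AE never appears in either community source.
    """
    community_dates = [d for d in (first_webmd, first_reddit) if d is not None]
    if not community_dates:
        return None
    return (min(community_dates) - first_fda).days


# Illustrative dates only (not study data).
print(lead_time_days(date(2015, 6, 1), date(2014, 6, 1), None))
print(lead_time_days(date(2010, 1, 1), date(2010, 1, 31), date(2010, 3, 1)))
```

Returning `None` for AEs absent from both community sources keeps them out of the lead-time distribution instead of silently coding them as zero.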
## Discussion

The multi-source pharmacovigilance and knowledge graph framework described here has three broad implications for psychiatric medication information: it clarifies how community and regulatory sources differ, it shows where community data may provide earlier contextual signals, and it defines a retrieval architecture that keeps source provenance visible during response generation.

#### Community data as a complementary pharmacovigilance signal.

FDA, WebMD, and Reddit did not provide interchangeable views of antidepressant AEs. The stronger concordance between WebMD and Reddit (Jaccard similarity up to 0.905 for desvenlafaxine) than between either community source and FAERS suggests that patient-generated data form a coherent, partly independent signal. This does not make community data a substitute for regulatory pharmacovigilance. Rather, it indicates that community platforms capture symptom salience, everyday language, and reporting incentives that are less visible in formally coded safety records. The source-balance results further show that this complementarity is compound-specific. Some drugs showed relatively even cross-source representation, whereas others were concentrated in one or two sources. Multi-source monitoring should therefore be interpreted at the drug-specific level rather than treated as a uniform property of all antidepressants.

#### Temporal lead times and early signal context.

The sertraline lead-time analysis suggests that community sources can contain AE mentions before the corresponding FAERS receipt date, sometimes by several hundred days. This pattern supports the use of social media and consumer reviews as supplementary signal contexts, but it should not be read as proof that community platforms detect risk earlier in a causal or regulatory sense. Lead time may reflect administrative delay, differences in coding practice, or the timing with which patients choose to discuss symptoms publicly.
The presence of positive lead times for some AEs reinforces this boundary: community data are not uniformly earlier or more complete. Their value lies in adding a patient-centred temporal layer to formal pharmacovigilance, especially when interpreted alongside regulatory records rather than in isolation.

#### Knowledge graph architecture as provenance-aware retrieval.

The knowledge graph addresses a practical problem for health-facing LLM systems: fluent answers are not enough when the underlying evidence cannot be inspected. By grounding medications, conditions, and adverse events in ATC-N, ICD-10, and MedDRA terms, and by preserving post-level provenance through `MENTIONS` edges, the framework makes each retrieved claim traceable to source-specific evidence. This design choice is also consistent with emerging evidence that clinical LLM evaluation should move beyond isolated final-answer accuracy toward workflow-level, process-based assessment of information gathering, reasoning, communication, and safety\[[16](https://arxiv.org/html/2606.26205#bib.bib16),[22](https://arxiv.org/html/2606.26205#bib.bib22)\]. The multi-agent chatbot is therefore framed as an educational retrieval and synthesis interface, not as an autonomous clinical decision system. This boundary is especially important for psychiatric medication information, where unsupported statements about discontinuation, suicidality, drug interactions, or vulnerable populations could increase fear or distort treatment decisions. The architecture is designed to reduce hallucination risk by separating evidence retrieval, cross-source comparison, and response validation, although its actual safety and usefulness require prospective human evaluation.

#### Patient-facing interpretation.

A central motivation for integrating community and regulatory data is that patients often need both formal safety context and language that resembles lived experience.
Reddit contributes broad coverage of patient-described experiences, WebMD contributes semi-structured review context, and FAERS contributes a formal post-marketing safety record. Keeping these roles separate within the same retrieval framework may help future systems answer questions without flattening patient narratives into regulatory facts or treating anecdotal reports as clinical incidence estimates. This distinction is essential: community reports can make experiences easier to name and discuss, but they cannot by themselves establish causality, prevalence, or individualized treatment risk.

#### Limitations.

The NER pipeline achieved strong name-level F1 across entity types, but attribute-level extraction was weaker, particularly for side-effect duration and frequency. Some adverse-event characterizations in the graph may therefore be incomplete even when the entity name is correct. The corpus is restricted to English-language posts and reviews, which may underrepresent psychiatric medication experiences from non-English-speaking communities. Cross-source AE harmonization relies on embedding-based matching to MedDRA and ICD-10. Although thresholds were calibrated on physician-annotated data, residual mapping errors could affect similarity estimates. The nine-antidepressant focus enables detailed multi-source comparison but limits immediate generalizability to other drug classes. Finally, the current work evaluates data integration, entity extraction, cross-source signal structure, and system architecture. It does not yet establish clinical utility, response safety, patient usability, or effects on adherence or decision-making.
Future studies should extend the framework to broader medication classes, incorporate multilingual corpora, and evaluate chatbot responses prospectively with clinicians and patients before deployment claims are made, ideally using workflow-level and safety-aware evaluation designs rather than relying only on automated metrics or vignette-based testing\[[16](https://arxiv.org/html/2606.26205#bib.bib16),[22](https://arxiv.org/html/2606.26205#bib.bib22)\].

## References

- [1] Eurostat. One in two EU citizens look for health information online. Eurostat News, 2021. [https://ec.europa.eu/eurostat/web/products-eurostat-news/-/edn-20210406-1](https://ec.europa.eu/eurostat/web/products-eurostat-news/-/edn-20210406-1)
- [2] Finney Rutten, L. J. et al. Online health information seeking among US adults: Measuring progress toward a Healthy People 2020 objective. Public Health Rep. 134, 617–625 (2019).
- [3] Wong, D. K.-K. & Cheung, M.-K. Online health information seeking and eHealth literacy among patients attending a primary care clinic in Hong Kong. J. Med. Internet Res. 21, e10831 (2019).
- [4] Wang, X. & Cohen, R. A. Health Information Technology Use among Adults: United States, July–December 2022. CDC (2023). [https://doi.org/10.15620/cdc:133700](https://doi.org/10.15620/cdc:133700)
- [5] Lim, H. M. et al. Association between online health information-seeking and medication adherence. Digit. Health 8, 20552076221097784 (2022).
- [6] Sieling, C. et al. What do patients know about their newly prescribed medication? Patient Educ. Couns. 133, 108645 (2025).
- [7] Lobban, F. et al. Impacts of using peer online forums in mental health. J. Med. Internet Res. 27, e79289 (2025).
- [8] Faasse, K. & Petrie, K. J. The nocebo effect: patient expectations and medication side effects. Postgrad. Med. J. 89, 540–546 (2013).
- [9] Nestoriuc, Y. et al.
Informing about the nocebo effect affects patients’ need for information about antidepressants. Front. Psychiatry 12, 587122 (2021).
- [10] Center for Drug Evaluation & Research. FDA Adverse Event Reporting System (FAERS) Database. U.S. Food and Drug Administration (2024).
- [11] Golder, S. et al. The value of social media analysis for adverse events detection and pharmacovigilance: Scoping review. JMIR Public Health Surveill. 10, e59167 (2024).
- [12] Busch, F. et al. Current applications and challenges in large language models for patient care. Commun. Med. (Lond.) 5, 26 (2025).
- [13] Huo, B. et al. Large language models for chatbot health advice studies: A systematic review. JAMA Netw. Open 8, e2457879 (2025).
- [14] Yu, H. et al. Large language models in biomedical and health informatics: A review with bibliometric analysis. J. Healthc. Inform. Res. 8, 658–711 (2024).
- [15] Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
- [16] Niu, J. et al. AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows. arXiv preprint arXiv:2606.17474 (2026).
- [17] Asgari, E. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digit. Med. 8, 274 (2025).
- [18] Lin, X. et al. Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning. arXiv preprint arXiv:2602.06449 (2026).
- [19] Stade, E. C. et al. Large language models could change the future of behavioral healthcare. Npj Ment. Health Res. 3, 12 (2024).
- [20] Rajabi, E. & Etminani, K. Knowledge-graph-based explainable AI: A systematic review. J. Inf. Sci. 50, 1019–1029 (2024).
- [21] Miao, Y. et al.
Improving large language model applications in medical and nursing domains with retrieval-augmented generation. J. Med. Internet Res. 27, e80557 (2025).
- [22] Zhu, L. et al. Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis. medRxiv 2026.04.21.26351365 (2026).
- [23] Li, X. et al. DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services. BMC Emerg. Med. 26, 78 (2026).
- [24] Yu, H. et al. Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education. Commun. Med. 6, 27 (2026).

## Appendices

## Appendix A LLM Prompt for Expanding Generic Names to Brand Synonymy

The brand-name expansion prompt was designed to return structured XML rather than free text in order to simplify downstream parsing: a fixed tag schema (`<generic>`/`<brand>`) ensures that each API response can be split without regular-expression heuristics and is robust to variation in model phrasing. The phrase “intended for nervous system related illnesses” was included as a domain constraint to prevent the model from returning homonymous brand names from other ATC classes (e.g., returning a cardiovascular trade name for a shared generic string). A `<brand>None</brand>` sentinel was specified to distinguish true negatives (no marketed brand found) from model refusals or off-topic outputs. The resulting XML was post-processed by a tag-parser that extracted all `<brand>` values per generic, deduplicated across API calls, and merged into the ATC-N dictionary as a curated synonym list used in all downstream keyword-matching and entity-normalization steps.

Table S1: LLM prompt for expanding generic drug names to brand synonyms.

Brand-name expansion prompt

You are a drug brand name identifier.
For each input generic drug name, return:
`<brand_names> <generic>GENERIC_NAME</generic> <brand>BRAND_NAME_1</brand> <brand>BRAND_NAME_2</brand> </brand_names>`
If no brand names found, return:
`<brand_names> <generic>GENERIC_NAME</generic> <brand>None</brand> </brand_names>`
Only return the XML. No explanations. The drugs should be intended for nervous system related illnesses.
Input drug generic name: `{generic_name}`

## Appendix B Binary Classification Model for Post Information Richness

During exploratory inspection, we observed that many Reddit posts still lacked substantive content after initial keyword-based and rule-based filtering. To further improve data quality, we implemented a secondary data cleaning stage aimed at distinguishing *information-rich* from *information-poor* posts. We first randomly sampled 3,500 posts from the cleaned Reddit corpus and manually labeled them according to three criteria:

1. The post is about a Nervous System drug (e.g., antidepressants, antipsychotics, mood stabilizers, anxiolytics).
2. The post mentions a side effect or adverse event related to the drug. A *side effect* is a common or expected reaction (e.g., fatigue, weight gain); an *adverse event* is a severe or unexpected reaction (e.g., seizures, suicidal thoughts).
3. The post is *information-rich*, meaning it includes at least one of the following: specific symptom(s) or effect(s); details about timing, duration, dosage, or sequence of events; or how the reaction impacted daily life or required medical attention.

Approximately 23.3% of posts were labeled as information-rich. Using these labeled data, we then fine-tuned a `BERT-base-uncased` model for binary classification using the Hugging Face Transformers (v4.52.3) library and PyTorch. The model architecture comprised 12 transformer layers, each with 12 self-attention heads and a hidden size of 768, totaling approximately 110 million parameters.
Key hyperparameters included a hidden dropout probability of 0.1, intermediate layer size of 3,072, GELU activation, and maximum sequence length of 512 tokens. Training used a batch size of 16, learning rate of $2\times 10^{-5}$, and three epochs, optimized with AdamW and float32 precision. After three epochs, the classifier achieved weighted accuracy of 0.866, with precision 0.875, recall 0.866, and F1 score 0.869 on the test set. The trained model was then used to automatically filter the remaining Reddit corpus, retaining only posts predicted as information-rich for subsequent analyses. Using the trained classifier, we retained 466,525 information-rich posts from the original 1,138,331 posts. To assess post-filtering quality, we randomly sampled 1,000 retained posts and manually verified that the majority contained substantive narratives regarding psychiatric medication use and side effects.

Table S2: Binary classification performance for information richness filtering.

## Appendix C NER Annotation Interface

Figure S1: Mental-health entity annotation interface for medical doctors. The interface shows a full Reddit post on the left with model-highlighted spans. The right panel presents the structured extraction as a JSON preview. Medical doctors either confirm the extraction or flag errors, and can pinpoint specific subfields via checkboxes before submitting corrections.

## Appendix D LLM Prompt for Named Entity Recognition

The NER prompt was designed as a single-pass structured extraction: all entity types (drugs, condition, comorbid conditions, side effects) and their pairwise relations are extracted in one model call per post, reducing API cost compared with sequential entity-specific prompts. A JSON output schema was selected over XML because it maps directly to Python dictionaries and natively supports the nested objects and arrays required for multi-drug and multi-side-effect posts.
A strict null-handling rule (“all missing fields *must* be null”) was specified to prevent the model from hallucinating absent attributes; this was especially important for dosage, duration, and severity fields, which patients rarely state explicitly. Sentiment (positive/negative/neutral) and effectiveness were added as auxiliary labels to enable post-level stratification in downstream analyses without requiring a separate classification call. The instruction “Respond with ONLY the JSON output” and the prohibition on paraphrasing were included to suppress the conversational preamble that LLMs otherwise produce, which would break automated JSON parsing.

Prompt used for named entity recognition

`prompt = f"""`

You are a clinical language evaluator and medical NLP assistant. Your task is to extract structured mental health-related information from user-generated Reddit content to support knowledge graph construction in psychiatry and psychology. From the given post, extract the following information in JSON format.

Required entities and attributes:

- drugs: a list of drug objects, each with: name, dosage, dosage_form, continued_use (yes/no), alternative_drug_considered (yes/no), duration_on_drug
- condition: name, severity (if explicitly mentioned), duration (if explicitly mentioned), diagnosed (yes/no/unknown)
- comorbid_conditions: list of other explicitly mentioned psychiatric or psychological conditions (e.g., ADHD, anxiety, PTSD)
- side_effects: name, severity, frequency, duration, associated_drug

Strict guidelines:

1. Medication Identification: Extract psychiatric or psychological medications only if explicitly mentioned (e.g., SSRIs like Prozac, antipsychotics like Seroquel, mood stabilizers like Lamictal). Infer dosage_form as "oral" unless explicitly stated otherwise (e.g., injection). Set continued_use to "yes" if the user mentions they are still taking it, and "no" if they have stopped or switched.
Set alternative_drug_considered to "yes" if another drug is discussed for switching or comparing; otherwise set it to "no". Extract duration_on_drug only when the time taking the drug is clearly stated (e.g., "for 3 days", "took it for 2 months"); otherwise set it to null.
2. Condition Identification: Only extract explicitly named psychiatric or psychological conditions (e.g., depression, bipolar disorder, PTSD). Extract severity (e.g., "mild", "severe", "moderate") only if mentioned. Set diagnosed to "yes" if the post mentions a formal diagnosis, "no" if the user says self-diagnosed, and "unknown" if unclear. Extract duration only if a time period is clearly described (e.g., "struggled for 3 years").
3. Comorbid Conditions: Extract all other named mental or behavioral health conditions in the post beyond the primary one. Only extract conditions explicitly stated (e.g., "I have ADHD and anxiety").
4. Side Effects: Only extract when clearly described and attributed to a drug. All fields (severity, frequency, duration) must be explicitly mentioned; otherwise return null. Associate side effects to drugs if stated (e.g., "Lexapro gave me nausea"). If a side effect is clearly caused by stopping or tapering a drug, the relationship is "causes_by_withdraw"; otherwise the relationship is "causes".
5. Null Handling: All missing or unmentioned fields MUST be null. Use empty arrays if no side effects, therapies, or comorbid conditions are mentioned. Omit entire objects (e.g., drug, side effects) if nothing is mentioned.
6. Output Format: Strictly adhere to the JSON format below. No explanations or inferred data.
7. Sentiment and Effectiveness: Overall sentiment should be "positive", "negative", or "neutral". Only categorize effectiveness or sentiment if explicitly stated.
JSON output format:

`{ "structured_info": { "drugs": [{ "name": "", "dosage": "", "dosage_form": "", "continued_use": "", "alternative_drug_considered": "", "duration_on_drug": "" }], "condition": { "name": "", "severity": "", "duration": "", "diagnosed": "" }, "comorbid_conditions": [], "side_effects": [{ "name": "", "severity": "", "frequency": "", "duration": "", "associated_drug": "" }] }, "relations": [ { "start": { "label": "Medication", "properties": { "name": "" } }, "end": { "label": "Condition", "properties": { "name": "" } }, "relation": "treats", "properties": { "diagnosed": null, "off_label": null } }, { "start": { "label": "Medication", "properties": { "name": "" } }, "end": { "label": "SideEffect", "properties": { "name": "" } }, "relation": "causes", "properties": { "severity": null, "duration": null, "dosage": null } }, { "start": { "label": "Medication", "properties": { "name": "" } }, "end": { "label": "SideEffect", "properties": { "name": "" } }, "relation": "causes_by_withdraw", "properties": {} }, { "start": { "label": "Condition", "properties": { "name": "" } }, "end": { "label": "Condition", "properties": { "name": "" } }, "relation": "comorbid_with", "properties": {} } ], "sentiment": "", "effectiveness": "" }`

Text to Analyze: `{text}`

Respond with ONLY the JSON output. Do not paraphrase or reword any content; extract the information exactly as written in the original post.

`"""`

## Appendix E Embedding Mapping Details and Cut-off Points

To translate free-text entities into controlled vocabularies, we used embedding-based nearest-neighbour matching and selected operating thresholds empirically against a physician-annotated gold standard. Specifically, we first generated a gold NER set by dual human adjudication and consensus, then embedded each extracted entity string (lower-cased and trimmed) with `text-embedding-3-small`.
For each entity, we computed cosine similarity to all reference terms and retained the single most similar candidate (top-1). We labeled a match as “correct” when either (i) the top-1 canonical term was exactly equal to the normalized gold string or (ii) manual adjudication marked the pair as semantically the same. These binary labels provided ground truth for threshold selection.

Using these labels, we evaluated similarity thresholds by sweeping the decision boundary over the continuous similarity score and computing precision–recall (PR) curves, receiver operating characteristic (ROC) curves, and the area under the ROC curve (AUC). We then identified the operating point that maximized Youden’s J statistic (TPR − FPR), which balances sensitivity against specificity and does not depend on class prevalence. We confirmed that the Youden-selected cutoffs aligned with the “elbow” region of the PR curves, providing a practical trade-off between false merges (over-mapping) and misses (under-mapping).

Because the distribution of similarity scores differed by entity type, we tuned thresholds separately for side effects (mapped to MedDRA Preferred Terms) and for conditions and comorbid conditions (mapped to ICD-10 terms). The final operating thresholds used throughout the study were:

- Side effects → MedDRA PT: τ = 0.68
- Condition → ICD-10: τ = 0.56
- Comorbid condition → ICD-10: τ = 0.54

These cutoffs were chosen on the labeled validation subset (derived from the physician gold NER) and then applied to the full corpus. For transparency and reproducibility, Figs. [S2](https://arxiv.org/html/2606.26205#A5.F2)–[S4](https://arxiv.org/html/2606.26205#A5.F4) show ROC plots generated from the labeled data to illustrate performance across the full range of similarity values.

Figure S2: ROC threshold calibration for side-effect mapping.
The red point indicates the selected Youden operating point (τ = 0.68).

Figure S3: ROC threshold calibration for condition mapping. The red point indicates the selected Youden operating point (τ = 0.56).

Figure S4: ROC threshold calibration for comorbid-condition mapping. The red point indicates the selected Youden operating point (τ = 0.54).

## Appendix F Cost Analysis for Model Comparison

In this section, we compare multiple LLMs under a fixed NER prompt/schema to identify a configuration that scales to our full corpus with acceptable accuracy and cost. We quantify efficiency (tokens, latency, wall-clock time) and operational spend (per-question and total), and interpret these alongside the main-text NER quality metrics. The goal is a practical default model that minimizes downstream adjudication while remaining fast and budget-conscious for repeated runs.

With average inputs essentially constant (≈1.5k tokens), differences arise from output length, latency, and pricing. GPT-4.1-nano is the fastest (≈0.041 s/request; ≈5.4 h total) and among the cheapest ($0.0029/question; $188 total), but its lower capacity corresponds to weaker extractions on harder posts in our evaluation. GPT-5-nano is the least expensive overall ($0.0024/question; $156 total) but slower in aggregate (≈22.5 h). Premium models, notably Claude-Sonnet-4, are competitive on latency (≈0.069 s; ≈9 h total) but prohibitively costly for repeated runs ($0.0768/question; $4,991 total). Mid-tier options such as GPT-4o-mini ($0.0042/question; $270; ≈7.3 h) and Deepseek-V3 ($0.0054/question; $349; ≈18.1 h) offer moderate spend and speed. Gemini-2.5-Flash and GPT-5-mini land near GPT-4.1-mini on per-question price but are slower or costlier in total for our workload.
Balancing accuracy, speed, and cost, GPT-4.1-mini is the preferred default for this study. It maintains low latency (≈0.066 s/request; ≈8.6 h total) and a moderate budget ($0.012/question; $777 total), while delivering stronger entity and attribute extraction than the nano tier on long, multi-entity posts. In practice, this balance reduces manual correction without incurring the steep costs of premium models, making GPT-4.1-mini the most efficient and reproducible choice for our NER pipeline.

Table S3: Efficiency and cost comparison of LLMs for the NER pipeline.

| Model | Avg. Output Tokens | Avg. Time/Request (s) | Cost/Question ($) | Est. Total Cost ($) | Est. Total Time (h) |
|---|---|---|---|---|---|
| GPT-5-mini | 701.7 | 0.184 | 0.0125 | 811 | 23.9 |
| GPT-5-nano | 477.0 | 0.174 | 0.0024 | 156 | 22.5 |
| **GPT-4.1-mini** | **687.3** | **0.066** | **0.0120** | **777** | **8.6** |
| GPT-4.1-nano | 651.2 | 0.041 | 0.0029 | 188 | 5.4 |
| GPT-4o-mini | 614.7 | 0.057 | 0.0042 | 270 | 7.3 |
| Claude-Sonnet-4 | 720.2 | 0.069 | 0.0768 | 4,991 | 9.0 |
| Gemini-2.5-Flash | 815.6 | 0.158 | 0.0123 | 801 | 20.4 |
| Deepseek-V3 | 740.6 | 0.140 | 0.0054 | 349 | 18.1 |
| Qwen3-235b-A22b | 880.2 | 0.211 | 0.0071 | 459 | 27.3 |

Bold row = selected pipeline default (GPT-4.1-mini). Costs are estimated from published API pricing at the time of the study.

## Appendix G Knowledge Graph Construction Implementation Details

We constructed a Reddit-based mental health knowledge graph (KG) by ingesting a JSON corpus of posts previously processed by an NLP pipeline to extract entities and relations. Each JSON record corresponded to a single Reddit post and contained: (i) a unique `post_id` and optional index; (ii) the original post text; (iii) auxiliary model outputs (e.g., sentiment and effectiveness scores); and (iv) a `structured_info` object listing extracted medications, conditions, comorbid conditions, side effects, and pairwise relations between them.

### Schema and Canonicalization

The KG was implemented in Neo4j with four main node types: Post, Medication, Condition, and SideEffect.
We treated Post nodes as lightweight anchor nodes containing only a unique identifier (`post_id`) and minimal metadata (e.g., an index and creation timestamp); post text and model outputs were not stored in the graph to keep the KG compact and privacy-preserving. Domain entities were canonicalized before insertion. For medications, we required the presence of a canonical ingredient-level name (`level_5_name`) produced by an upstream mapping step; records without this field were discarded. The canonical name was stored as the node’s `name`, and temporary mapping fields (e.g., similarity scores) were removed. Conditions were required to have an `icd10-term` field and a non-missing similarity score indicating a successful mapping; the node `name` was set to this ICD-10 term and the similarity value was dropped from the stored properties. Side effects were processed analogously, retaining only entries with a non-null similarity score and a valid `meddra-term`, which became the node `name`. For each entity, we derived a deterministic, lowercased, whitespace-normalized unique identifier (`uid`) of the form `med::...`, `cond::...`, or `se::...`. Neo4j uniqueness constraints on `uid` (for `Medication`, `Condition`, and `SideEffect`) and on `post_id` (for `Post`) ensured that logically identical entities were merged across posts.

### Relations and Evidence Aggregation

We modeled three classes of edges. First, we created post-to-entity mention edges `(Post)-[:MENTIONS]->(Medication|Condition|SideEffect)` for every canonical entity detected in a post. These edges provide a direct trace from graph entities back to the user-generated content in which they appear. Second, we imported typed entity–entity relations derived from the NLP pipeline. Each relation in the JSON input specified a start node, end node, and a relation label (e.g., `treats`, `causes`, `causes_by_withdraw`, `comorbid_with`).
We mapped these labels to a fixed set of Neo4j relationship types: `TREATS`, `CAUSES`, `CAUSES_BY_WITHDRAW`, and `COMORBID_WITH`. For each endpoint, we reused the same canonicalization logic as for standalone entities; if a canonicalized entity did not yet exist in the batch, it was created and linked back to the corresponding posts via `MENTIONS`. Relation-level properties (such as severity, duration, dosage, and mapped name fields) were retained, whereas similarity scores used only for upstream mapping were stripped. To support evidence aggregation, each unique triple (`from_uid`, `relation_type`, `to_uid`) was treated as a single relationship instance in the graph and attached to a list of supporting `post_ids`. During ingestion, if a relation between two entities already existed, the current `post_id` was appended to this list only if it was not already present, thereby accumulating the set of posts that mention the same entity pair and relation type.

Third, we automatically constructed comorbidity edges between the primary condition and each comorbid condition mentioned in a post. Specifically, when a post contained a primary condition plus one or more comorbid conditions in the structured information, we added `COMORBID_WITH` edges from the primary condition to each comorbid condition, regardless of whether an explicit comorbidity relation had been generated by the NLP model. This ensured consistent representation of comorbid patterns even when relation extraction was conservative.

### Batched Ingestion and Sidecar Database

To scale to the full Reddit corpus, we implemented a batched ingestor in Python using the Neo4j Bolt driver. The script streams the JSON file either as an array or as JSONL, groups records into batches (typically $10^{3}$ posts), and, for each batch, constructs in-memory collections of posts, canonicalized entities, mention edges, and relations.
These are written to Neo4j in a single transactional `UNWIND` call per node/edge type, using `MERGE` semantics so that entities and relationships are deduplicated across batches while their evidence (`post_ids`) is incrementally updated.

Because we intentionally keep the graph compact, we maintain a separate SQLite "sidecar" database that stores the full post text, sentiment and effectiveness outputs, and per-post evidence tables linking posts to entity UIDs and relation instances. The sidecar uses a full-text search index over post text and indexed foreign keys to entity and relation tables, enabling efficient retrieval of raw textual context and model annotations for any subset of graph nodes or edges without polluting the core KG with unstructured text.
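
The identifier derivation and evidence aggregation described in this appendix can be illustrated with a minimal Python sketch. It is not the authors' ingestor: the record layout (`relations`, `start`, `end`, `type`), the function names `make_uid` and `aggregate_relations`, and the Cypher template `MERGE_CAUSES` are assumptions introduced for illustration only. The sketch shows how lowercased, whitespace-normalized identifiers make differently written mentions collapse onto one triple, and how supporting post identifiers accumulate without duplicates.

```python
import re

# Prefixes follow the med::/cond::/se:: pattern described in the text.
UID_PREFIX = {"Medication": "med", "Condition": "cond", "SideEffect": "se"}

# Fixed mapping from pipeline relation labels to Neo4j relationship types.
REL_TYPES = {
    "treats": "TREATS",
    "causes": "CAUSES",
    "causes_by_withdraw": "CAUSES_BY_WITHDRAW",
    "comorbid_with": "COMORBID_WITH",
}


def make_uid(label, name):
    """Collapse whitespace, lowercase, and prefix by entity type."""
    normalized = re.sub(r"\s+", " ", name).strip().lower()
    return f"{UID_PREFIX[label]}::{normalized}"


def aggregate_relations(records):
    """Collapse relations into unique triples with supporting post_ids.

    Each record is assumed (hypothetically) to look like
    {"post_id": str,
     "relations": [{"start": (label, name), "end": (label, name), "type": str}]}.
    """
    edges = {}
    for rec in records:
        for rel in rec["relations"]:
            key = (
                make_uid(*rel["start"]),
                REL_TYPES[rel["type"]],
                make_uid(*rel["end"]),
            )
            post_ids = edges.setdefault(key, [])
            if rec["post_id"] not in post_ids:  # append only if not present
                post_ids.append(rec["post_id"])
    return edges


# Illustrative Cypher for one relationship type (an assumed query, not taken
# from the paper). Relationship types cannot be passed as query parameters in
# plain Cypher, so one template per type is assumed here.
MERGE_CAUSES = """
UNWIND $rows AS row
MATCH (a:Medication {uid: row.from_uid})
MATCH (b:SideEffect {uid: row.to_uid})
MERGE (a)-[r:CAUSES]->(b)
ON CREATE SET r.post_ids = row.post_ids
ON MATCH SET r.post_ids = r.post_ids +
    [p IN row.post_ids WHERE NOT p IN r.post_ids]
"""
```

Under these assumptions, each key of the dictionary returned by `aggregate_relations` would become one row of the `$rows` parameter for the matching per-type template, so that a batch is written in a single `UNWIND` call while the `ON MATCH` branch extends the evidence list across batches.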