MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models


Summary

This paper introduces MHGraphBench, a knowledge-graph-grounded benchmark for evaluating large language models on mental health knowledge, including entity recognition, relation judgment, and multi-hop reasoning. Experiments across 15 LLMs reveal a gap between recognition and judgment capabilities.

arXiv:2605.15589v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.

Cached at: 05/18/26, 06:33 AM

# MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
Source: [https://arxiv.org/html/2605.15589](https://arxiv.org/html/2605.15589)

Weixin Liu¹, Congning Ni², Shelagh A. Mulvaney¹, Susannah L. Rose², Murat Kantarcioglu³, Bradley A. Malin¹,², Zhijun Yin¹,²

¹Vanderbilt University, Nashville, TN, USA; ²Vanderbilt University Medical Center, Nashville, TN, USA; ³Virginia Tech, Blacksburg, VA, USA

{weixin.liu, shelagh.mulvaney}@vanderbilt.edu; {congning.ni.1, susannah.rose, b.malin, zhijun.yin}@vumc.org; muratk@vt.edu

###### Abstract

Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.


## 1 Introduction

Mental health disorders impose a large and growing burden worldwide GBD 2019 Mental Disorders Collaborators ([2022](https://arxiv.org/html/2605.15589#bib.bib1)). Clinical care and translational research in mental health often require integrating heterogeneous biomedical evidence, including disorder relationships, phenotypes and exposures, medication-use boundaries (e.g., *indication* vs. *contraindication* vs. *off-label use*), and disease-associated biological signals Freidel and Schwarz ([2025](https://arxiv.org/html/2605.15589#bib.bib9)); Gao et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib10)); Kyrios et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib11)); Rosland et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib12)). These characteristics make mental-health applications particularly sensitive not only to whether models can recognize relevant biomedical entities, but also to whether they can correctly apply knowledge to clinically salient structured judgments.

Large language models (LLMs) have shown strong performance in biomedical and clinical tasks and have attracted growing interest in healthcare applications Singhal et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib2)); Saab et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib13)); Iqbal et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib6)); Li et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib14)), including mental-health settings Volkmer et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib15)); Obradovich et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib21)). Most existing evaluations still report aggregate accuracy on broad biomedical or clinical benchmarks, offering limited insight into two questions that are especially important in mental health: (i) how broadly an LLM covers mental-health biomedical knowledge, and (ii) whether it can reliably distinguish clinically salient and safety-sensitive relation boundaries Arora et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib7)); Cai et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib8)).

At the same time, mental-health-specific evaluation is evolving rapidly. Recent benchmarks have begun to assess psychiatric diagnostic decision-making, realistic counseling and help-seeking interactions, and trustworthiness in mental-health settings Song et al. ([2026](https://arxiv.org/html/2605.15589#bib.bib40)); Xiong et al. ([2026](https://arxiv.org/html/2605.15589#bib.bib41)); Li et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib42)). These efforts broaden the scope of mental-health LLM evaluation, but they primarily emphasize diagnosis, counseling quality, or trustworthiness rather than verifiable structured biomedical knowledge and knowledge-graph (KG)-grounded relation reasoning. Evaluation is further complicated by growing evidence that multiple-choice LLM assessment can itself be sensitive to option ordering, prompt formatting, constrained answer formats, and output parsing rules Wang et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib18)); Pezeshkpour and Hruschka ([2024](https://arxiv.org/html/2605.15589#bib.bib19)); Zheng et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib22)). Accordingly, our goal is not to evaluate real-world clinical decision-making or clinical safety directly, but rather to evaluate KG-grounded structured discrimination and short-path reasoning with respect to a curated mental-health graph under a constrained multiple-choice interface.

KGs provide curated biomedical facts in a structured and machine-verifiable form. They are well-suited to benchmark construction because they support automatic QA generation from factual triples, systematic negative sampling, and interpretable task design over entities, relations, and paths Chandak et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib3)); Sun et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib16)); Salnikov et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib17)); Markowitz et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib20)). Biomedical KGs such as PrimeKG also illustrate the value of graph-structured resources for downstream biomedical reasoning and analysis Chandak et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib3)). For mental health, KG-grounded evaluation is especially useful because it enables controlled benchmarking over clinically salient relation families rather than relying only on open-ended prompting. In addition, benchmark items derived directly from KG facts are verifiable against the underlying graph, enabling analysis not only of task accuracy but also of graph-wide knowledge coverage.

In this paper, we introduce MHGraphBench, a KG-grounded benchmark for evaluating mental-health biomedical knowledge in LLMs using a curated mental-health subgraph of PrimeKG. We define the benchmark domain with 42 psychiatric seed disease nodes, extract a clinically focused subgraph, and transform it into nine standardized multiple-choice task families spanning entity recognition, relation judgment, and short disease-mediated reasoning. All benchmark items are derived from KG-backed facts with controlled negatives, making MHGraphBench a structured and reproducible benchmark for evaluating mental-health biomedical knowledge with respect to a curated graph rather than a direct measure of broader clinical reasoning or real-world clinical safety. Beyond benchmark accuracy, we also quantify graph-wide coverage over entities, relations, and triples, and provide fine-grained entity- and relation-centric analyses to localize where models succeed or fail. Figure [1](https://arxiv.org/html/2605.15589#S1.F1) summarizes the overall pipeline, including psychiatric seed selection, mental-health subgraph extraction, KG-to-QA generation, task construction, and evaluation.

#### Design principle: verifiable KG-grounded evaluation

Our central design principle is that benchmark items should be automatically derived from KG facts, paired with explicit negative sampling, and remain verifiable against the underlying mental-health subgraph. This makes the evaluation scalable and reproducible while also allowing us to analyze which entities, relations, and graph regions models handle well or poorly, rather than summarizing performance only with a single overall accuracy number.

Using MHGraphBench, we ask four main questions: (1) Do models that perform well on entity typing and on the small relation-typing subset also perform well on clinically meaningful relation judgment? (2) How difficult is short disease-mediated reasoning relative to simpler recognition tasks? (3) What additional insight do graph-wide coverage and fine-grained analyses provide beyond average task accuracy? (4) When short KG-derived evidence is added, does it consistently help model performance?

Our experiments across 15 models yield three main takeaways. First, even the strongest models are near ceiling on entity typing and on the small relation-typing subset but remain substantially weaker on relation prediction and two-hop reasoning, revealing a persistent recognition-to-judgment gap. Second, clinically sensitive relation families, especially *contraindication*, remain difficult across models, and open-source models lag well behind the strongest GPT-series systems on the overall benchmark. Third, graph-wide coverage and evidence augmentation provide complementary insight: coverage rankings do not fully match average task rankings, and short KG-derived evidence helps some models but degrades others.

#### Contributions

- We construct a mental-health benchmark from a curated PrimeKG subgraph defined by 42 psychiatric seed disease nodes. It includes nine standardized multiple-choice task families spanning entity recognition, relation judgment, and short two-hop reasoning, all with KG-supported ground truth and controlled negatives.
- We introduce graph-wide coverage metrics and fine-grained entity- and relation-centric analyses to complement raw task accuracy and localize model strengths and weaknesses within the mental-health graph.
- We present empirical results across 15 LLMs showing a persistent recognition-to-judgment gap, mixed effects of evidence augmentation, and the importance of response-format reliability in constrained benchmark evaluation.

![Refer to caption](https://arxiv.org/html/2605.15589v1/new_model317.png)

Figure 1: Overview of the KG-grounded mental-health benchmark framework. Starting from 42 final psychiatric seed disease nodes in PrimeKG, we extract a clinically focused mental-health subgraph, transform the resulting knowledge graph into a multiple-choice question-answering (QA) benchmark with nine task families, and evaluate models using accuracy, coverage, and fine-grained analyses.

## 2 Related Work

Several studies evaluate LLMs on biomedical question answering, clinical reasoning, factuality, expert-style exam tasks, and broader healthcare use cases Singhal et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib2)); Saab et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib13)); Iqbal et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib6)); Li et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib14)). These benchmarks provide useful broad capability signals, but they often report aggregate scores over heterogeneous tasks and therefore offer limited insight into mental-health-specific knowledge or failure modes Singhal et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib2)); Saab et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib13)); Arora et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib7)). In addition, prior work has shown that multiple-choice LLM evaluation can itself be sensitive to factors such as option ordering, prompt formatting, constrained answer formats, and output parsing rules Pezeshkpour and Hruschka ([2024](https://arxiv.org/html/2605.15589#bib.bib19)); Zheng et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib22)); Wang et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib18)).

Knowledge graphs (KGs) have been used to probe factual knowledge, generate verifiable benchmarks, and study model reasoning behavior under controlled perturbations Chandak et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib3)); Sun et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib16)); Salnikov et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib17)); Markowitz et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib20)). Biomedical KG resources such as the Drug Repurposing Knowledge Graph (DRKG) and PrimeKG further illustrate the value of structured graph representations for integrating heterogeneous biomedical evidence Ioannidis et al. ([2020](https://arxiv.org/html/2605.15589#bib.bib4)); Chandak et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib3)). KG-grounded benchmarks are especially appealing because they support scalable question generation, controlled negative sampling, explicit gold labels, and interpretable evaluation over entities, relations, and paths. However, relatively little prior work has focused on mental-health-centered KG benchmarking with clinically salient relation boundaries, short-path reasoning tasks, and graph-level coverage analysis.

Mental-health biomedical knowledge spans disorder relationships, medication-use boundaries, phenotypes, exposures, and biological associations Freidel and Schwarz ([2025](https://arxiv.org/html/2605.15589#bib.bib9)); Gao et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib10)). Recent mental-health-specific benchmarks extend evaluation beyond broad biomedical or clinical QA by targeting psychiatric diagnostic decision-making, counseling and help-seeking quality, and trustworthiness in safety-sensitive settings Song et al. ([2026](https://arxiv.org/html/2605.15589#bib.bib40)); Xiong et al. ([2026](https://arxiv.org/html/2605.15589#bib.bib41)); Li et al. ([2025](https://arxiv.org/html/2605.15589#bib.bib42)). These benchmarks broaden the scope of mental-health LLM evaluation, but they primarily emphasize diagnosis, counseling quality, or trustworthiness rather than verifiable structured biomedical knowledge and KG-grounded relation reasoning. Our benchmark complements these efforts by focusing on a curated mental-health slice of PrimeKG and evaluating entity recognition, relation judgment, short reasoning behavior, and graph-wide coverage in a unified KG-grounded framework.

## 3 Benchmark Construction and KG-to-QA Generation

Figure [1](https://arxiv.org/html/2605.15589#S1.F1) summarizes the end-to-end pipeline of the proposed framework. Psychiatric seed diseases define the target domain, subgraph extraction yields a curated mental-health slice of PrimeKG, KG-to-QA generation converts graph facts into benchmark items, and evaluation reports aggregate accuracy, graph-wide coverage, and fine-grained analyses.

### 3.1 Mental-Health Subgraph

PrimeKG is a large, publicly available biomedical knowledge graph that integrates curated associations across drugs, diseases, genes/proteins, pathways, and other biomedical entities, in which typed nodes are connected by semantically defined relation edges Chandak et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib3)). In PrimeKG, disease nodes are encoded using terms from the Mondo Disease Ontology (MONDO) and grouped into clinically meaningful disease nodes during graph construction Chandak et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib3)). Building on this disease layer, we manually curated a high-precision candidate seed list of 44 PrimeKG disease nodes with psychiatric relevance. We then excluded two candidates during post-curation: *X-linked intellectual disability-psychosis-macroorchidism syndrome*, which was considered outside the intended benchmark scope, and *multiple personality disorder*, which was considered outdated terminology. This yielded 42 final psychiatric seed disease nodes (see Appendix [B](https://arxiv.org/html/2605.15589#A2)). This final seed set defines the benchmark’s mental-health domain boundary and provides a reproducible basis for subgraph extraction. Starting from these 42 final psychiatric seed disease nodes, we extracted all 1-hop seed-touching edges and then retained only a fixed set of clinically salient relation families, including drug–disease usage relations (*indication*, *contraindication*, *off-label use*), disease–disease links, and related biomedical associations such as *disease_protein*, *disease_phenotype_positive*, and *exposure_disease*.

This procedure yields 9,242 raw edges connecting to the seed disease nodes. We canonicalized each retained relation to a consistent head/tail type signature, with symmetric handling for *disease_disease*, and deduplicated triples after canonicalization. The resulting mental-health subgraph contains 4,621 unique triples over 1,847 entities and 7 retained relation types, serving as the sole source of benchmark ground truth.
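
The canonicalization and deduplication step can be sketched as follows. This is a minimal illustration rather than the authors' code: the `Node` tuple, the `TYPE_SIGNATURE` table, and the placeholder entities are hypothetical, and the only behavior taken from the text is orienting each relation to a fixed head/tail type signature, treating *disease_disease* as symmetric, and deduplicating afterwards.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    type: str


# Hypothetical canonical (head_type, tail_type) signature per relation.
TYPE_SIGNATURE = {
    "indication": ("drug", "disease"),
    "disease_protein": ("disease", "gene/protein"),
    "disease_disease": ("disease", "disease"),
}
SYMMETRIC = {"disease_disease"}


def canonicalize(head: Node, relation: str, tail: Node):
    """Orient an edge so that it matches the relation's canonical signature."""
    if relation in SYMMETRIC:
        # Symmetric relation: order the endpoints deterministically by name.
        head, tail = sorted((head, tail), key=lambda n: n.name)
    elif (head.type, tail.type) != TYPE_SIGNATURE[relation]:
        head, tail = tail, head
    return (head.name, relation, tail.name)


def deduplicate(edges):
    """Canonicalize every raw edge, then keep each triple once (order-preserving)."""
    return list(dict.fromkeys(canonicalize(h, r, t) for h, r, t in edges))


# Placeholder edges: each fact appears once per direction.
drug_a = Node("drug_a", "drug")
disease_x = Node("disease_x", "disease")
disease_y = Node("disease_y", "disease")
raw_edges = [
    (drug_a, "indication", disease_x),
    (disease_x, "indication", drug_a),
    (disease_x, "disease_disease", disease_y),
    (disease_y, "disease_disease", disease_x),
]
triples = deduplicate(raw_edges)
```

In this toy case four raw edges collapse to two triples. The reported reduction from 9,242 raw edges to 4,621 unique triples is exactly a halving, which would be consistent with each retained fact being stored once per direction, although the text does not state this explicitly.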

### 3.2 KG-to-QA Task Suite

From the curated PrimeKG mental-health subgraph, we generate nine standardized multiple-choice tasks with letter-only outputs and KG-grounded answers:

- Entity Typing (ET) asks the model to identify the type of a target entity, using the entity type in the subgraph as the gold label.
- Entity Clustering (EC) presents an “odd-one-out” problem formed by sampling four entities of the same type and one entity of a different type.
- Fact Checking (FC) asks whether a candidate triple is supported by the subgraph. Negative examples are generated by replacing the head or tail entity with a type-matched alternative under the same relation and retaining only perturbed triples that are unsupported by the extracted subgraph. FC instances are balanced per relation so that each relation contributes equal numbers of “Yes” and “No” examples.
- Relation Typing (RT) asks the model to identify the correct head→tail type-pair schema of a relation, based on the dominant type signature observed in the subgraph.
- Relation Prediction (RP) classifies a drug–disease pair into one of four categories: *indication*, *contraindication*, *off-label use*, or *none*. Positive pairs are drawn from subgraph triples, while *none* examples are sampled from drug–disease pairs that do not appear in the subgraph.
- Two-hop Verification (R1) and Two-hop Selection (R2) are constructed from 2-hop contexts of the form Drug A → Disease B and Disease B → Disease C. Positive instances are created such that the queried Drug A → Disease C edge already exists in the subgraph. Negative instances preserve the same 2-hop scaffold but select a Disease C such that the queried edge is unsupported. R1 labels are sampled to achieve an approximately balanced (≈50%) “Yes” rate, and R2 uses the same underlying 2-hop contexts.
- Evidence-augmented Two-hop Verification (R1+E) and Evidence-augmented Two-hop Selection (R2+E) extend the corresponding two-hop tasks by attaching short PrimeKG feature-table snippets for the involved entities. Controlled sanitization is applied to redact lexical forms overlapping with relation answer options, thereby reducing potential answer leakage.
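
The sanitization applied to evidence snippets in R1+E and R2+E can be illustrated with a short sketch. The exact redaction rules are not given in this section, so everything below is an assumption: the `[REDACTED]` mask, the case-insensitive matching, the tolerance for hyphen or space variants, and the suffix handling are hypothetical choices intended only to show what "redacting lexical forms that overlap with answer options" might look like.

```python
import re

# Relation answer options named in the task description.
RELATION_OPTIONS = ["indication", "contraindication", "off-label use"]


def sanitize_snippet(snippet: str, options=RELATION_OPTIONS, mask: str = "[REDACTED]") -> str:
    """Mask occurrences of answer-option strings inside an evidence snippet."""
    # Longest option first, so "contraindication" is masked before "indication".
    for option in sorted(options, key=len, reverse=True):
        # Accept a space or a hyphen wherever the option contains one.
        parts = [re.escape(p) for p in re.split(r"[\s-]+", option)]
        # Trailing \w* also catches simple suffixes such as a plural "s".
        pattern = r"[\s-]+".join(parts) + r"\w*"
        snippet = re.sub(pattern, mask, snippet, flags=re.IGNORECASE)
    return snippet


example = "Listed indication; Contraindications apply; off label use reported."
cleaned = sanitize_snippet(example)
```

Processing longer options first matters here because *indication* is a substring of *contraindication*; masking in the opposite order would leave the prefix "contra" behind as a cue.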

The ground-truth answers for these questions are strictly defined by triples in the extracted PrimeKG mental-health subgraph. Negative options are constructed in a task-specific but KG-consistent manner: FC negatives are created by type-matched head or tail replacement under the same relation and retained only when the perturbed triple is unsupported by the extracted subgraph; RP uses *none* examples drawn from drug–disease pairs that do not appear in the subgraph; and negative R1/R2 instances preserve the same 2-hop scaffold but query a drug–disease edge that is unsupported by the subgraph. The final benchmark comprises 1,847 ET items, 2,000 EC items, 4,000 FC items, 7 RT items, 1,634 RP items, and 1,200 items each for R1, R1+E, R2, and R2+E.
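
The FC negative construction described above can be sketched as a type-matched corruption with a support check. This is an illustrative sketch under assumptions: the function name, the coin flip between head and tail, the one-negative-per-positive policy, and the toy entities are hypothetical, while the type matching, the unchanged relation, and the rejection of supported triples follow the text.

```python
import random


def fc_negatives(triples, entity_type, rng):
    """Corrupt the head or tail with a type-matched entity; keep unsupported triples."""
    supported = set(triples)
    by_type = {}
    for entity, etype in entity_type.items():
        by_type.setdefault(etype, []).append(entity)

    negatives = []
    for head, rel, tail in triples:
        corrupt_head = rng.random() < 0.5
        target = head if corrupt_head else tail
        # Candidate replacements share the type of the entity being replaced.
        pool = [e for e in by_type[entity_type[target]] if e != target]
        rng.shuffle(pool)
        for candidate in pool:
            perturbed = (candidate, rel, tail) if corrupt_head else (head, rel, candidate)
            if perturbed not in supported:  # reject triples the subgraph supports
                negatives.append(perturbed)
                break
    return negatives


# Placeholder data, not taken from PrimeKG.
entity_type = {
    "drug_a": "drug", "drug_b": "drug",
    "disease_x": "disease", "disease_y": "disease",
}
triples = [
    ("drug_a", "indication", "disease_x"),
    ("drug_b", "contraindication", "disease_y"),
]
negatives = fc_negatives(triples, entity_type, random.Random(0))
```

Drawing one negative per positive under the same relation is one simple way to obtain the per-relation balance of "Yes" and "No" items that the FC task requires.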

## 4 Experiments

### 4.1 Models

We evaluate 15 models spanning both closed- and open-source families: GPT-4.1, GPT-5.2-chat, GPT-4o, GPT-5-mini, GPT-5.1-chat, Qwen2.5-32B-Instruct, Mistral-7B-Instruct-v0.3, Qwen2.5-7B-Instruct, BioMistral-7B, Llama3-Med42-8B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-32B, Llama3.1-8B-Instruct, Meditron-7B, and Llama3-OpenBioLLM-8B. This set includes frontier GPT-series models, general-purpose open-source instruction-tuned models, and biomedical-domain variants. Representative technical reports and model cards for several evaluated families include GPT-4.1 OpenAI ([2025](https://arxiv.org/html/2605.15589#bib.bib33)), GPT-5-mini OpenAI ([2026a](https://arxiv.org/html/2605.15589#bib.bib36)), GPT-5.1-chat OpenAI ([2026b](https://arxiv.org/html/2605.15589#bib.bib34)), GPT-5.2-chat OpenAI ([2026c](https://arxiv.org/html/2605.15589#bib.bib35)), GPT-4o OpenAI ([2024](https://arxiv.org/html/2605.15589#bib.bib28)), Qwen2.5 Qwen Team ([2024](https://arxiv.org/html/2605.15589#bib.bib31)), Mistral-7B-Instruct-v0.3 Mistral AI ([2024](https://arxiv.org/html/2605.15589#bib.bib26)), BioMistral Labrak et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib23)), Med42 Christophe et al. ([2024](https://arxiv.org/html/2605.15589#bib.bib24)), DeepSeek-R1 DeepSeek-AI ([2025](https://arxiv.org/html/2605.15589#bib.bib37)), Meditron Chen et al. ([2023](https://arxiv.org/html/2605.15589#bib.bib38)), OpenBioLLM Pal and Sankarasubbu ([2024](https://arxiv.org/html/2605.15589#bib.bib39)), and Llama 3 Llama Team AI @ Meta ([2024](https://arxiv.org/html/2605.15589#bib.bib27)).

### 4.2 Evaluation Protocol

All tasks are evaluated under the same letter-only multiple-choice interface. For API-based models, we instruct the model to return a single option letter and apply strict answer parsing to recover one valid choice from the response. For local models, we use the same option-letter scoring setup as in the rest of the evaluation pipeline. Binary tasks use A/B labels rather than literal Yes/No to reduce lexical answer bias.

Because benchmark scoring under this setup depends on recovering a valid option letter, response validity is itself part of the evaluation problem in addition to raw task accuracy. We therefore treat output-format reliability as an important evaluation caveat in constrained multiple-choice assessment. Additional implementation details for benchmark construction, evidence sanitization, forced-choice local evaluation, API answer parsing, and randomness control are provided in Appendix [D](https://arxiv.org/html/2605.15589#A4).
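
To make the dependence on response validity concrete, the sketch below shows one possible strict parser and a scorer that reports validity alongside accuracy. The parsing rule is an assumption (the actual rules are deferred to the appendix): a response counts as valid only if exactly one distinct uppercase option letter appears as a standalone token, and invalid responses are scored as incorrect. The function names are hypothetical.

```python
import re


def parse_choice(response: str, valid_letters: str = "ABCD"):
    """Return the single option letter found in `response`, or None if invalid."""
    pattern = rf"\b([{re.escape(valid_letters)}])\b"
    found = set(re.findall(pattern, response.strip()))
    # Zero letters or several different letters both count as invalid.
    return found.pop() if len(found) == 1 else None


def score(responses, gold, valid_letters: str = "ABCD"):
    """Return (accuracy %, valid-response rate %); invalid responses count as wrong."""
    parsed = [parse_choice(r, valid_letters) for r in responses]
    correct = sum(p == g for p, g in zip(parsed, gold))
    valid = sum(p is not None for p in parsed)
    return 100.0 * correct / len(gold), 100.0 * valid / len(gold)


responses = ["A", "The answer is B", "A and B", "C"]
gold = ["A", "B", "A", "D"]
accuracy, validity = score(responses, gold)
```

Under this rule the third response is discarded as ambiguous even though it mentions the gold letter, so measured accuracy reflects both what the model knows and whether it follows the output format.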

### 4.3 Metrics

We report task-level accuracy (%) and grouped averages for four benchmark dimensions (E, R, S, and S+E), together with overall averages; starred variants exclude RT:

$$\mathrm{Avg}_{E} = \mathrm{mean}(\mathrm{ET}, \mathrm{EC}), \tag{1}$$

$$\mathrm{Avg}_{R} = \mathrm{mean}(\mathrm{FC}, \mathrm{RT}, \mathrm{RP}), \tag{2}$$

$$\mathrm{Avg}_{R}^{\ast} = \mathrm{mean}(\mathrm{FC}, \mathrm{RP}), \tag{3}$$

$$\mathrm{Avg}_{S} = \mathrm{mean}(\mathrm{R1}, \mathrm{R2}), \tag{4}$$

$$\mathrm{Avg}_{S+E} = \mathrm{mean}(\mathrm{R1{+}E}, \mathrm{R2{+}E}), \tag{5}$$

$$\mathrm{Avg}_{All} = \mathrm{mean}\ \text{over all nine tasks}, \tag{6}$$

$$\mathrm{Avg}_{All}^{\ast} = \mathrm{mean}\ \text{over the eight tasks excluding RT}. \tag{7}$$

Because RT contains only one question per retained relation (7 items total), its score should be interpreted cautiously. In the results discussion below, we therefore emphasize the starred averages when drawing overall comparisons that are less influenced by the small RT set.
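
The grouped averages in Eqs. (1)-(7) translate directly into code. The sketch below is a minimal illustration; the dictionary keys and the example accuracies are hypothetical placeholders, not values from the results table.

```python
from statistics import mean

ALL_TASKS = ["ET", "EC", "FC", "RT", "RP", "R1", "R2", "R1+E", "R2+E"]


def grouped_averages(acc: dict) -> dict:
    """Compute the block averages of Eqs. (1)-(7) from per-task accuracies (%)."""
    return {
        "Avg_E": mean([acc["ET"], acc["EC"]]),
        "Avg_R": mean([acc["FC"], acc["RT"], acc["RP"]]),
        "Avg_R*": mean([acc["FC"], acc["RP"]]),  # RT excluded
        "Avg_S": mean([acc["R1"], acc["R2"]]),
        "Avg_S+E": mean([acc["R1+E"], acc["R2+E"]]),
        "Avg_All": mean([acc[t] for t in ALL_TASKS]),
        "Avg_All*": mean([acc[t] for t in ALL_TASKS if t != "RT"]),  # RT excluded
    }


# Hypothetical per-task accuracies, chosen only to make the arithmetic easy to follow.
toy = {"ET": 90.0, "EC": 80.0, "FC": 60.0, "RT": 100.0, "RP": 50.0,
       "R1": 55.0, "R2": 65.0, "R1+E": 60.0, "R2+E": 70.0}
averages = grouped_averages(toy)
```

Each task contributes equally to its block regardless of item count, which is why a 7-item task such as RT can move `Avg_R` and `Avg_All` noticeably and why the starred variants are reported alongside them.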

Beyond task accuracy, we also report graph-oriented coverage over entities, relations, and triples in the curated mental-health slice of PrimeKG. Specifically, we compute mean and degree-weighted correctness over entities and relations, together with a triple-level aggregate derived from entity and relation correctness. The *none* option in RP is treated as a task-specific no-relation label rather than as a KG relation and is therefore excluded from relation coverage. Full metric definitions are provided in Appendix [C](https://arxiv.org/html/2605.15589#A3). In the current benchmark, all 1,847 entities and all 7 retained relations are measured for every model.
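
As a rough illustration of the difference between mean and degree-weighted correctness, consider the sketch below. The precise definitions are given in the appendix and are not reproduced here, so the formulas are assumptions: per-entity correctness is taken to be a score in [0, 1], and the weighted variant weights each entity by its degree in the subgraph. The triple-level aggregate is deliberately omitted because its form cannot be inferred from this section.

```python
def coverage(correctness: dict, degree: dict):
    """Return (mean, degree-weighted mean) of per-entity correctness in [0, 1]."""
    mean_cov = sum(correctness.values()) / len(correctness)
    total_degree = sum(degree[k] for k in correctness)
    weighted_cov = sum(correctness[k] * degree[k] for k in correctness) / total_degree
    return mean_cov, weighted_cov


# Hypothetical entities: one well-connected node answered correctly,
# one low-degree node answered incorrectly, one low-degree node answered correctly.
correctness = {"disease_x": 1.0, "drug_a": 0.0, "drug_b": 1.0}
degree = {"disease_x": 6, "drug_a": 1, "drug_b": 1}
mean_cov, weighted_cov = coverage(correctness, degree)
```

With these placeholder values the unweighted mean is 2/3 while the degree-weighted score is 0.875, showing how a model that handles hub entities well can look stronger under degree weighting than under a plain average.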

Table 1: Benchmark accuracy (%) on the PrimeKG mental-health KG-to-QA tasks. We group tasks into four levels and report block averages: $\mathrm{Avg}_{E}=\mathrm{mean}(\mathrm{ET},\mathrm{EC})$, $\mathrm{Avg}_{R}=\mathrm{mean}(\mathrm{FC},\mathrm{RT},\mathrm{RP})$, $\mathrm{Avg}_{R}^{\ast}=\mathrm{mean}(\mathrm{FC},\mathrm{RP})$, $\mathrm{Avg}_{S}=\mathrm{mean}(\mathrm{R1},\mathrm{R2})$, $\mathrm{Avg}_{S+E}=\mathrm{mean}(\mathrm{R1{+}E},\mathrm{R2{+}E})$, $\mathrm{Avg}_{All}=\mathrm{mean}$ over all nine tasks, and $\mathrm{Avg}_{All}^{\ast}=\mathrm{mean}$ over the eight tasks excluding RT. Because RT contains only 7 items, the starred summaries are often more informative for overall comparisons. Best values in each column are bolded; ties are jointly bolded.

- Model abbreviations: GPT-5.2 = GPT-5.2-chat; GPT-5.1 = GPT-5.1-chat; GPT-4o = GPT-4o; GPT-5-mini = GPT-5-mini; Qwen2.5-32B = Qwen2.5-32B-Instruct; Qwen2.5-7B = Qwen2.5-7B-Instruct; Mistral-7B = Mistral-7B-Instruct-v0.3; BioMistral = BioMistral-7B; Med42-8B = Llama3-Med42-8B; DeepSeek-R1-DQ-7B/32B = DeepSeek-R1-Distill-Qwen-7B/32B; Llama3.1-8B = Llama3.1-8B-Instruct; Meditron = Meditron-7B; OpenBioLLM-8B = Llama3-OpenBioLLM-8B.

## 5 Results

### 5.1 Overall Performance

Table [1](https://arxiv.org/html/2605.15589#S4.T1) reports the accuracy of the 15 selected LLMs on MHGraphBench. Because RT contains only 7 items, we focus first on the RT-excluded overall summary $\mathrm{Avg}_{All}^{\ast}$. Under this summary, the strongest models in this evaluation are all GPT-series models: GPT-4.1 ranks first with $\mathrm{Avg}_{All}^{\ast}=70.28\%$, followed by GPT-5.2-chat at 69.32% and GPT-4o at 69.10%. GPT-5-mini and GPT-5.1-chat follow closely at 68.38% and 68.33%, respectively. The RT-including summary $\mathrm{Avg}_{All}$ yields a similar top-level ordering, with GPT-4.1 achieving the highest score at 73.58%, followed by GPT-5.2-chat at 72.73% and GPT-4o at 72.54%.

Among open-source models, Qwen2.5-32B-Instruct is the strongest under both overall summaries, reaching 56.09% on $\mathrm{Avg}_{All}^{\ast}$ and 60.97% on $\mathrm{Avg}_{All}$. Under the RT-excluded summary, this still leaves a gap of more than 12 percentage points relative to the leading GPT models. Below Qwen2.5-32B-Instruct, performance drops markedly: Mistral-7B-Instruct-v0.3 reaches 42.10% on $\mathrm{Avg}_{All}^{\ast}$ and Qwen2.5-7B-Instruct reaches 43.61%. The remaining models cluster in the low- to mid-30s on the RT-excluded overall summary. Taken together, these results suggest that the benchmark is challenging not only for smaller biomedical models, but for most open-source models in general.

### 5.2 Recognition vs. Relation Judgment

A central pattern in Table [1](https://arxiv.org/html/2605.15589#S4.T1) is the gap between recognition-oriented tasks and relation-judgment tasks. For the top GPT-series models, recognition-oriented performance is very strong. GPT-5.1-chat achieves the highest $\mathrm{Avg}_{E}$ at 95.31%, closely followed by GPT-5-mini at 95.12%, GPT-4.1 at 94.73%, GPT-4o at 94.62%, and GPT-5.2-chat at 94.07%. ET is particularly strong, ranging from 97.40% to 98.48% across the top five models, while EC ranges from 90.10% to 92.35%. The RT subset is also saturated at 100.00% for these models, but this result should be interpreted cautiously because RT contains only 7 items and is therefore better treated as a small descriptive subset than as strong standalone evidence.
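
The influence of the 7-item RT subset on the overall score can be made explicit. Since $\mathrm{Avg}_{All}$ averages nine task scores and $\mathrm{Avg}_{All}^{\ast}$ averages the same scores without RT, the two are related by the identity below. The worked numbers use only the GPT-4.1 values quoted in the text and agree with the reported figure up to rounding:

$$\mathrm{Avg}_{All} = \frac{8\,\mathrm{Avg}_{All}^{\ast} + \mathrm{RT}}{9}, \qquad \frac{8 \times 70.28 + 100.00}{9} = \frac{662.24}{9} \approx 73.58.$$

A saturated RT score therefore lifts GPT-4.1's overall average by about 3.3 percentage points, even though RT accounts for only 7 benchmark items.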

However, this recognition strength does not translate into equally strong performance on judgment-related tasks. The best RP score is only 58.63%, achieved by GPT-5.2-chat, followed by 58.08% for GPT-5.1-chat and 57.28% for GPT-5-mini. Even for the strongest models, these values remain far below ET and EC. The same separation appears in the grouped relation summaries. On the RT-excluded summary, GPT-5.2-chat reaches the highest $\mathrm{Avg}_{R}^{\ast}$ at 60.99%, followed by GPT-5-mini at 60.81% and GPT-5.1-chat at 60.29%. The RT-including summary $\mathrm{Avg}_{R}$ shows a similar ordering, but it should be interpreted more cautiously because it includes the 7-item RT subset.

This pattern indicates that a model may correctly identify entity types and relation schemas while still struggling to distinguish whether a drug is indicated, contraindicated, used off-label, or unsupported for a target disorder.

### 5\.3 Short\-Chain Reasoning

Short disease\-mediated reasoning is one of the task categories that most clearly separates stronger models from weaker ones in the benchmark\. Even among the strongest models, subgraph reasoning scores remain well below entity\-level performance\. GPT\-4o achieves the highest R1 score at 62\.08%, while GPT\-5\.2\-chat achieves the highest R2 score at 65\.58%\. The grouped reasoning score $\mathrm{Avg}_{S}$ is highest for GPT\-4\.1 at 60\.79%, followed by GPT\-4o at 58\.16% and GPT\-5\.2\-chat at 57\.88%\.

These values are notable because the reasoning tasks are tightly controlled: the 2\-hop scaffold is explicitly provided, the answer space is constrained, and correctness is defined with respect to KG support\. Even under these conditions, short\-path composition remains substantially harder than ET or the small RT set\. Among open\-source models, the drop is sharper: Qwen2\.5\-32B\-Instruct reaches 54\.75% on $\mathrm{Avg}_{S}$, whereas most others remain in the high\-30s to mid\-40s\. This suggests that composing even two simple KG hops into a correct structured decision remains a major failure mode in constrained evaluation settings\.

### 5\.4 Evidence Augmentation Is Not Uniformly Helpful

Evidence augmentation affects models differently rather than providing a consistent benefit\. On the positive side, several strong models improve when short KG\-derived feature snippets are added\. GPT\-4\.1 improves from 57\.58% to 71\.42% on R1, and GPT\-4o improves from 62\.08% to 68\.83%\. On the selection side, GPT\-5\.2\-chat improves from 65\.58% to 67\.58% on R2, and GPT\-5\-mini improves from 59\.83% to 65\.42%\. In grouped terms, GPT\-4\.1 achieves the best evidence\-augmented reasoning score, with $\mathrm{Avg}_{S+E}=66.46\%$, followed by GPT\-4o at 65\.12% and GPT\-5\.2\-chat at 64\.33%\.

At the same time, evidence is not uniformly helpful across models\. Qwen2\.5\-32B\-Instruct, for example, improves strongly on R1, from 50\.50% to 61\.25%, but drops sharply on R2, from 59\.00% to 50\.08%, yielding only a modest evidence\-grouped score of 55\.66%\. Smaller models also show inconsistent behavior, with some improving on one evidence\-augmented task while remaining weak or deteriorating on the other\. These results suggest that evidence augmentation is better interpreted as a diagnostic probe of whether a model can integrate short structured cues than as a universally corrective prompting strategy\.
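The per-model shifts quoted above can be tabulated directly. The following is a minimal sketch that uses only the scores stated in this section; the dictionary layout and function name are illustrative, and treating the evidence-grouped score as an unweighted mean of the two evidence-augmented tasks is an assumption that happens to agree with the reported 55.66% for Qwen2.5-32B-Instruct up to rounding.

```python
# Scores (%) quoted in Section 5.4: (without evidence, with evidence).
scores = {
    ("GPT-4.1", "R1"): (57.58, 71.42),
    ("GPT-4o", "R1"): (62.08, 68.83),
    ("GPT-5.2-chat", "R2"): (65.58, 67.58),
    ("GPT-5-mini", "R2"): (59.83, 65.42),
    ("Qwen2.5-32B-Instruct", "R1"): (50.50, 61.25),
    ("Qwen2.5-32B-Instruct", "R2"): (59.00, 50.08),
}


def evidence_deltas(table):
    """Change in accuracy (percentage points) when evidence snippets are added."""
    return {key: round(after - before, 2) for key, (before, after) in table.items()}


deltas = evidence_deltas(scores)
for (model, task), delta in deltas.items():
    print(f"{model:22s} {task}: {delta:+.2f} pp")

# Assumption: the grouped evidence score is the unweighted mean of R1+E and R2+E.
qwen_grouped = (
    scores[("Qwen2.5-32B-Instruct", "R1")][1]
    + scores[("Qwen2.5-32B-Instruct", "R2")][1]
) / 2
```

Read this way, the same model gains on one evidence-augmented task and loses on the other, which is the non-uniformity the paragraph describes.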

### 5\.5 Response\-Format Reliability

Response\-format reliability is an important evaluation issue in MHGraphBench because all tasks use a constrained letter\-only multiple\-choice interface\. In this setting, measured performance depends not only on whether a model knows the correct answer, but also on whether it can reliably return a single valid option letter that can be unambiguously parsed\. This issue is especially relevant for API\-based models, whose outputs may include extra explanation, multiple candidate letters, or other text that does not strictly follow the requested response format\. As a result, benchmark accuracy can partially reflect output controllability in addition to underlying task knowledge, a broader concern that has also been noted in prior work on multiple\-choice LLM evaluation Wang et al\. \([2024](https://arxiv.org/html/2605.15589#bib.bib18)\); Pezeshkpour and Hruschka \([2024](https://arxiv.org/html/2605.15589#bib.bib19)\); Zheng et al\. \([2023](https://arxiv.org/html/2605.15589#bib.bib22)\)\.

This format issue also affects the interpretation of chance\-like scores on binary tasks such as FC or R1, where the positive and negative labels are approximately balanced\. Several weaker models remain close to 50% accuracy on FC or R1 while simultaneously performing poorly on ET, EC, or RP\. Such behavior could reflect weak but genuine reasoning ability, but it may also arise in part from unstable constrained outputs, response biases, or instruction\-following failures under the letter\-only evaluation setup\. For this reason, aggregate task accuracy alone can be misleading unless it is interpreted together with output\-validity checks and inspection of prediction distributions\.
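To make the output-validity check concrete, here is a minimal sketch of a strict letter-only parser together with a validity and prediction-distribution report. The parsing rule, the function names `parse_choice` and `validity_report`, and the default option set `ABCD` are assumptions chosen for illustration; they are not the parser used in the paper.

```python
import re
from collections import Counter


def parse_choice(response, valid_letters="ABCD"):
    """Return the single option letter in `response`, or None if it is invalid.

    Assumed rule: after stripping whitespace, the response must consist of one
    letter, optionally wrapped as "(B)" and optionally followed by "." or ":".
    Anything else (explanations, several letters, empty text) is rejected.
    """
    match = re.fullmatch(r"\(?([A-Za-z])\)?[.:]?", response.strip())
    if match is None:
        return None
    letter = match.group(1).upper()
    return letter if letter in valid_letters else None


def validity_report(responses, valid_letters="ABCD"):
    """Share of parseable responses and the distribution of parsed letters."""
    parsed = [parse_choice(r, valid_letters) for r in responses]
    valid = [p for p in parsed if p is not None]
    return {
        "valid_rate": len(valid) / len(parsed) if parsed else 0.0,
        "distribution": Counter(valid),
    }


report = validity_report(["A", "(b).", "A or B", "The answer is C"])
print(report)
```

A skewed `distribution` on a balanced binary task, or a low `valid_rate`, would indicate that near-chance accuracy may be driven by response bias or format failures rather than by task knowledge.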

### 5\.6 Model\-Family Observations

The results also offer a cautious perspective on biomedical\-domain models\. In this evaluation, biomedical or medically branded open\-source models do not consistently outperform general\-purpose instruction\-tuned alternatives\. BioMistral\-7B Labrak et al\. \([2024](https://arxiv.org/html/2605.15589#bib.bib23)\), Llama3\-Med42\-8B Christophe et al\. \([2024](https://arxiv.org/html/2605.15589#bib.bib24)\), Meditron\-7B Chen et al\. \([2023](https://arxiv.org/html/2605.15589#bib.bib38)\), and Llama3\-OpenBioLLM\-8B Pal and Sankarasubbu \([2024](https://arxiv.org/html/2605.15589#bib.bib39)\) all score below the strongest GPT models and below Qwen2\.5\-32B\-Instruct on the overall summaries\. Some of these models also exhibit unexpectedly weak entity\-level performance: for example, Llama3\-Med42\-8B reaches only 24\.76% on $\mathrm{Avg}_{E}$, and Llama3\-OpenBioLLM\-8B reaches 19\.96%\.

At the same time, these models are not uniformly weak across every dimension\. Llama3\-Med42\-8B reaches 55\.00% on R1, and BioMistral\-7B reaches 53\.92% on R1, despite their low entity scores\. This uneven profile suggests that biomedical adaptation alone does not guarantee robust structured evaluation performance\. However, this comparison should be interpreted cautiously, because the evaluated models also differ in parameter scale, base\-model capability, instruction tuning, and output\-format reliability\. Taken together, the results suggest that performance in this benchmark reflects not only domain adaptation, but also the interaction among general model capacity, instruction following, constrained answer formats, and short reasoning requirements\.

Table 2: Compact knowledge coverage \(%\) on the PrimeKG mental\-health subgraph\. We report mean entity coverage, degree\-weighted relation coverage, and triple coverage; full coverage metrics are provided in Appendix [E\.1](https://arxiv.org/html/2605.15589#A5.SS1)\. Models are sorted by $\mathrm{Cov}(T)$\. Model abbreviations follow Table [1](https://arxiv.org/html/2605.15589#S4.T1)\.

### 5\.7 Knowledge Coverage

Coverage provides a complementary graph\-wide view of model performance \(Table [2](https://arxiv.org/html/2605.15589#S5.T2); full metrics in Appendix [E\.1](https://arxiv.org/html/2605.15589#A5.SS1)\)\. GPT\-5\-mini achieves the strongest triple coverage, with $\mathrm{Cov}(T)=65.27\%$, even though GPT\-4\.1 remains the top model by average task accuracy\. This shows that benchmark averages and graph\-wide coverage are not interchangeable\. Coverage also changes the interpretation of open\-source models: Qwen2\.5\-32B\-Instruct is the best open\-source model by $\mathrm{Avg}_{All}^{\ast}$, but not by triple coverage, and GPT\-5\.2\-chat shows lower coverage than the other GPT models despite ranking near the top on the main task table\.

### 5\.8 Refined Entity and Relation Analysis

To better understand where models succeed or fail, we also compute fine\-grained entity\- and relation\-centric accuracy, with full tables reported in Appendix [E\.2](https://arxiv.org/html/2605.15589#A5.SS2)\. The fine\-grained relation results show that contraindication is by far the hardest retained relation on average, whereas indication is comparatively easier\. The fine\-grained entity results further show that high benchmark incidence does not guarantee ease: anxiety\-spectrum and psychotic\-spectrum entities remain difficult despite their prominence in the benchmark\. Taken together, these results suggest that model failures are concentrated in clinically important and diagnostically heterogeneous parts of the graph\.

## 6 Discussion and Conclusion

We introduced MHGraphBench, a KG\-grounded benchmark for evaluating mental\-health biomedical knowledge in LLMs using a curated 1\-hop mental\-health subgraph of PrimeKG\. The benchmark transforms KG\-backed facts into nine standardized multiple\-choice task families spanning entity recognition, relation judgment, and short two\-hop reasoning, and complements task accuracy with graph\-wide coverage and fine\-grained entity\- and relation\-centric analyses\.

Across 15 models, our results reveal a persistent recognition\-to\-judgment gap\. Leading models achieve near\-perfect performance on entity typing and very strong performance on the small relation\-typing subset, yet they remain substantially weaker on relation prediction and short\-chain reasoning\. Coverage analysis further shows that average task accuracy and graph\-wide coverage are not interchangeable: GPT\-4\.1 performs best on the main benchmark averages, whereas GPT\-5\-mini achieves the strongest triple coverage\. Fine\-grained analyses localize especially difficult graph regions, with clinically important relations such as contraindication and prominent entities such as *anxiety disorder* remaining challenging\. We also find that evidence augmentation is not uniformly helpful across models and that response\-format reliability can materially affect measured performance under constrained multiple\-choice evaluation\.

These findings suggest that broad biomedical competence should not be equated with reliable structured judgment in mental\-health settings\. They are consistent with recent evidence outside biomedicine\. In a recent expert\-led study on high\-temperature superconductivity, LLM systems grounded in curated literature outperformed more general systems, yet all evaluated systems still showed important limitations in expert\-level scientific question answering Guo et al\. \([2026](https://arxiv.org/html/2605.15589#bib.bib43)\)\. Taken together, these results suggest that current LLMs may read and organize scientific text fluently without consistently supporting the deeper judgment required for expert reasoning\.

More broadly, our results show that KG\-grounded benchmarking provides an interpretable and reproducible way to study what LLMs capture about mental\-health biomedical structure, while highlighting limitations in safety\-relevant relation distinctions and controlled reasoning\. Future work should move toward more challenging but still controlled benchmarks that better connect structured knowledge evaluation with clinically relevant mental\-health decision support\. MHGraphBench should therefore be interpreted as a structured evaluation of a curated KG slice rather than as a direct assessment of real\-world clinical safety\.

## Limitations

Our benchmark inherits the coverage limits and curation decisions of PrimeKG as well as those of our mental\-health subgraph extraction process\. As a result, the task suite is intentionally scoped and does not capture the full breadth of psychiatric care, longitudinal patient context, or individualized treatment decision\-making\.

All labels are defined with respect to the extracted PrimeKG mental\-health subgraph\. Because biomedical knowledge and clinical guidelines evolve over time, these KG\-based labels may be incomplete or may lag behind the most up\-to\-date evidence\. Accordingly, the benchmark measures agreement with a curated KG slice rather than absolute clinical truth\.

In addition, we do not perform manual or expert validation of sampled benchmark items, negative instances, or evidence snippets beyond the KG\-grounded construction pipeline itself\. This means that benchmark validity depends on the quality of the underlying graph, the extraction procedure, and the task\-generation rules\. In particular, an edge being unsupported in the extracted subgraph should not be interpreted as evidence that the corresponding claim is false in the real world; it indicates only that the queried relation is absent from the curated benchmark graph\. Similarly, although evidence snippets are sanitized to reduce direct answer leakage, they are not externally adjudicated by domain experts for completeness, clinical appropriateness, or real\-world decision support value\.

Although coverage and fine\-grained analyses provide a richer view than average task accuracy alone, they still depend on benchmark construction choices and on how task items involve particular graph components\. These analyses help localize strengths and weaknesses, but they should not be interpreted as exhaustive measurements of mental\-health biomedical knowledge\.

Finally, because evaluation relies on constrained multiple\-choice outputs, models that fail to follow answer\-format instructions may be penalized for reasons partly independent of their underlying biomedical reasoning ability\. This is both a limitation and an empirical finding of the benchmark: output controllability is entangled with measured performance\.

## Ethics Statement

This work does not evaluate clinical safety, real\-world mental\-health decision\-making, or patient\-specific treatment appropriateness\. The benchmark should not be used as a substitute for expert oversight, especially in settings involving treatment boundaries, contraindications, or crisis\-related decisions Agarwal et al\. \([2024](https://arxiv.org/html/2605.15589#bib.bib29)\); Zhu et al\. \([2025](https://arxiv.org/html/2605.15589#bib.bib30)\)\. More broadly, our results should be interpreted as a structured evaluation of KG\-grounded mental\-health biomedical knowledge with respect to a curated mental\-health subgraph, rather than as evidence of clinical validity, real\-world safety, or readiness for clinical deployment\.

## Acknowledgments

This research was, in part, funded by the Advanced Research Projects Agency for Health \(ARPA\-H\)\. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Government\.

## References

- Agarwal et al\. \(2024\) MedHalu: hallucinations in responses to healthcare queries by large language models\. arXiv preprint arXiv:2409\.19492\.
- R\. K\. Arora, J\. Wei, R\. S\. Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas, M\. Sharman, M\. Shah, A\. Vallone, A\. Beutel, J\. Heidecke, and K\. Singhal \(2025\) HealthBench: evaluating large language models towards improved human health\. arXiv preprint arXiv:2505\.08775\.
- Y\. Cai, L\. Wang, Y\. Wang, G\. de Melo, Y\. Zhang, Y\. Wang, and L\. He \(2024\) MedBench: a large\-scale Chinese benchmark for evaluating medical large language models\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 17709–17717\.
- P\. Chandak, K\. Huang, and M\. Zitnik \(2023\) Building a knowledge graph to enable precision medicine\. Scientific Data 10\(1\), pp\. 67\.
- Z\. Chen, A\. Hernández Cano, A\. Romanou, A\. Bonnet, K\. Matoba, F\. Salvi, M\. Pagliardini, S\. Fan, A\. Köpf, A\. Mohtashami, A\. Sallinen, A\. Sakhaeirad, V\. Swamy, I\. Krawczuk, D\. Bayazit, A\. Marmet, S\. Montariol, M\. Hartley, M\. Jaggi, and A\. Bosselut \(2023\) Meditron\-70B: scaling medical pretraining for large language models\. arXiv preprint arXiv:2311\.16079\.
- C\. Christophe, P\. K\. Kanithi, T\. Raha, S\. Khan, and M\. A\. Pimentel \(2024\) Med42\-v2: a suite of clinical LLMs\. arXiv preprint arXiv:2408\.06142\.
- DeepSeek\-AI \(2025\) DeepSeek\-R1: incentivizing reasoning capability in LLMs via reinforcement learning\. arXiv preprint arXiv:2501\.12948\.
- S\. Freidel and E\. Schwarz \(2025\) Knowledge graphs in psychiatric research: potential applications and future perspectives\. Acta Psychiatrica Scandinavica 151\(3\), pp\. 180–191\.
- S\. Gao, K\. Yu, Y\. Yang, S\. Yu, C\. Shi, X\. Wang, N\. Tang, and H\. Zhu \(2025\) Large language model powered knowledge graph construction for mental health exploration\. Nature Communications 16\(1\), pp\. 7526\.
- GBD 2019 Mental Disorders Collaborators \(2022\) Global, regional, and national burden of 12 mental disorders in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019\. The Lancet Psychiatry 9\(2\), pp\. 137–150\.
- H\. Guo, M\. Tikhanovskaya, P\. Raccuglia, A\. Vlaskin, C\. Co, D\. J\. Liebling, S\. Ellsworth, M\. Abraham, E\. Dorfman, N\. P\. Armitage, C\. Feng, A\. Georges, O\. Gingras, D\. Kiese, S\. A\. Kivelson, V\. Oganesyan, B\. J\. Ramshaw, S\. Sachdev, T\. Senthil, J\. M\. Tranquada, M\. P\. Brenner, S\. Venugopalan, and E\. Kim \(2026\) Expert evaluation of LLM world models: a high\-$T_{c}$ superconductivity case study\. Proceedings of the National Academy of Sciences 123\(11\), pp\. e2533676123\.
- V\. N\. Ioannidis, X\. Song, S\. Manchanda, M\. Li, X\. Pan, D\. Zheng, X\. Ning, X\. Zeng, and G\. Karypis \(2020\) DRKG: drug repurposing knowledge graph for COVID\-19\. [https://github\.com/gnn4dr/DRKG](https://github.com/gnn4dr/DRKG)\. Accessed: 2026\-03\-17\.
- U\. Iqbal, A\. Tanweer, A\. R\. Rahmanti, D\. Greenfield, L\. T\. Lee, and Y\. J\. Li \(2025\) Impact of large language model \(ChatGPT\) in healthcare: an umbrella review and evidence synthesis\. Journal of Biomedical Science 32\(1\), pp\. 45\.
- M\. Kyrios, J\. Levido, D\. Talbot, and A\. Harris \(2024\) Off\-label prescribing of psychotropics in a psychiatric patient population in Australia\. Australasian Psychiatry 32\(3\), pp\. 196–200\.
- Y\. Labrak, A\. Bazoge, E\. Morin, P\. Gourraud, M\. Rouvier, and R\. Dufour \(2024\) BioMistral: a collection of open\-source pretrained large language models for medical domains\. arXiv preprint arXiv:2402\.10373\.
- J\. Li, A\. Dada, B\. Puladi, J\. Kleesiek, and J\. Egger \(2024\) ChatGPT in healthcare: a taxonomy and systematic review\. Computer Methods and Programs in Biomedicine 245, pp\. 108013\.
- Y\. Li, J\. Yao, J\. B\. S\. Bunyi, A\. C\. Frank, A\. H\. Hwang, and R\. Liu \(2025\) CounselBench: a large\-scale expert evaluation and adversarial benchmarking of large language models in mental health question answering\. arXiv preprint arXiv:2506\.08584\.
- Llama Team AI @ Meta \(2024\) The Llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\.
- E\. Markowitz, K\. Galiya, G\. V\. Steeg, and A\. Galstyan \(2025\) KG\-LLM\-Bench: a scalable benchmark for evaluating LLM reasoning on textualized knowledge graphs\. arXiv preprint arXiv:2504\.07087\.
- Mistral AI \(2024\) mistralai/Mistral\-7B\-Instruct\-v0\.3 \(Hugging Face model card\)\. [https://huggingface\.co/mistralai/Mistral\-7B\-Instruct\-v0\.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)\. Accessed: 2026\-01\-13\.
- N\. Obradovich, S\. S\. Khalsa, W\. U\. Khan, J\. Suh, R\. H\. Perlis, O\. Ajilore, and M\. P\. Paulus \(2024\) Opportunities and risks of large language models in psychiatry\. NPP—Digital Psychiatry and Neuroscience 2\(1\), pp\. 8\.
- OpenAI \(2024\) GPT\-4o system card\. arXiv preprint arXiv:2410\.21276\.
- OpenAI \(2025\) Introducing GPT\-4\.1 in the API\. [https://openai\.com/index/gpt\-4\-1/](https://openai.com/index/gpt-4-1/)\. Accessed: 2026\-03\-17\.
- OpenAI \(2026a\) GPT\-5 mini model\. OpenAI API model documentation\. [https://developers\.openai\.com/api/docs/models/gpt\-5\-mini](https://developers.openai.com/api/docs/models/gpt-5-mini)\. Accessed: 2026\-03\-17\.
- OpenAI \(2026b\) GPT\-5\.1 chat model\. OpenAI API model documentation\. [https://developers\.openai\.com/api/docs/models/gpt\-5\.1\-chat\-latest](https://developers.openai.com/api/docs/models/gpt-5.1-chat-latest)\. Accessed: 2026\-03\-17\.
- OpenAI \(2026c\) GPT\-5\.2 chat model\. OpenAI API model documentation\. [https://developers\.openai\.com/api/docs/models/gpt\-5\.2\-chat\-latest](https://developers.openai.com/api/docs/models/gpt-5.2-chat-latest)\. Accessed: 2026\-03\-17\.
- A\. Pal and M\. Sankarasubbu \(2024\) OpenBioLLMs: advancing open\-source large language models for healthcare and life sciences\. Hugging Face\. [https://huggingface\.co/blog/aaditya/openbiollm](https://huggingface.co/blog/aaditya/openbiollm)\. Accessed: 2026\-03\-17\.
- P\. Pezeshkpour and E\. Hruschka \(2024\) Large language models sensitivity to the order of options in multiple\-choice questions\. In Findings of the Association for Computational Linguistics: NAACL 2024, pp\. 2006–2017\.
- Qwen Team \(2024\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\.
- H\. G\. Rosland, G\. J\. Wergeland, and L\. Holst \(2025\) Off\-label use of psychotropic drugs in youth\. BMC Psychiatry 25\(1\), pp\. 739\.
- K\. Saab, T\. Tu, W\. Weng, R\. Tanno, D\. Stutz, E\. Wulczyn, F\. Zhang, T\. Strother, C\. Park, E\. Vedadi, J\. Zambrano Chaves, S\. Hu, M\. Schaekermann, A\. Kamath, Y\. Cheng, D\. G\. T\. Barrett, C\. Cheung, B\. Mustafa, A\. Palepu, D\. McDuff, L\. Hou, T\. Golany, L\. Liu, J\. Alayrac, N\. Houlsby, N\. Tomasev, J\. Freyberg, C\. Lau, J\. Kemp, J\. Lai, S\. Azizi, K\. Kanada, S\. Man, K\. Kulkarni, R\. Sun, S\. Shakeri, L\. He, B\. Caine, A\. Webson, N\. Latysheva, M\. Johnson, P\. Mansfield, J\. Lu, E\. Rivlin, J\. Anderson, B\. Green, R\. Wong, J\. Krause, J\. Shlens, E\. Dominowska, S\. M\. A\. Eslami, K\. Chou, C\. Cui, O\. Vinyals, K\. Kavukcuoglu, J\. Manyika, J\. Dean, D\. Hassabis, Y\. Matias, D\. Webster, J\. Barral, G\. Corrado, C\. Semturs, S\. S\. Mahdavi, J\. Gottweis, A\. Karthikesalingam, and V\. Natarajan \(2024\) Capabilities of Gemini models in medicine\. arXiv preprint arXiv:2404\.18416\.
- M\. Salnikov, H\. Le, P\. Rajput, I\. Nikishina, P\. Braslavski, V\. Malykh, and A\. Panchenko \(2023\) Large language models meet knowledge graphs to answer factoid questions\. arXiv preprint arXiv:2310\.02166\.
- K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, M\. Amin, L\. Hou, K\. Clark, S\. R\. Pfohl, H\. Cole\-Lewis, D\. Neal, Q\. M\. Rashid, M\. Schaekermann, A\. Wang, D\. Dash, J\. H\. Chen, N\. H\. Shah, S\. Lachgar, P\. A\. Mansfield, S\. Prakash, B\. Green, E\. Dominowska, B\. Agüera y Arcas, N\. Tomašev, Y\. Liu, R\. Wong, C\. Semturs, S\. S\. Mahdavi, J\. K\. Barral, D\. R\. Webster, G\. S\. Corrado, Y\. Matias, S\. Azizi, A\. Karthikesalingam, and V\. Natarajan \(2025\) Toward expert\-level medical question answering with large language models\. Nature Medicine 31\(3\), pp\. 943–950\.
- H\. Song, M\. Kang, J\. Shin, J\. Kim, C\. Park, H\. Yoo, J\. An, A\. Oh, J\. Han, and K\. Lim \(2026\) MentalBench: a benchmark for evaluating psychiatric diagnostic capability of large language models\. arXiv preprint arXiv:2602\.12871\.
- J\. Sun, C\. Xu, L\. Tang, S\. Wang, C\. Lin, Y\. Gong, L\. M\. Ni, H\. Shum, and J\. Guo \(2023\) Think\-on\-graph: deep and responsible reasoning of large language model on knowledge graph\. arXiv preprint arXiv:2307\.07697\.
- S\. Volkmer, A\. Meyer\-Lindenberg, and E\. Schwarz \(2024\) Large language models in psychiatry: opportunities and challenges\. Psychiatry Research 339, pp\. 116026\.
- H\. Wang, S\. Zhao, Z\. Qiang, B\. Qin, and T\. Liu \(2024\) Beyond the answers: reviewing the rationality of multiple choice question answering for the evaluation of large language models\. CoRR\.
- Z\. Xiong, Z\. Wang, H\. Fan, X\. Zhang, and W\. Wang \(2026\) TrustMH\-Bench: a comprehensive benchmark for evaluating the trustworthiness of large language models in mental health\. arXiv preprint arXiv:2603\.03047\.
- C\. Zheng, H\. Zhou, F\. Meng, J\. Zhou, and M\. Huang \(2023\) Large language models are not robust multiple choice selectors\. arXiv preprint arXiv:2309\.03882\.
- Z\. Zhu, Y\. Zhang, X\. Zhuang, F\. Zhang, Z\. Wan, Y\. Chen, Q\. Long, Y\. Zheng, and X\. Wu \(2025\) Can we trust AI doctors? a survey of medical hallucination in large language and large vision\-language models\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 6748–6769\.

## Appendix A Subgraph Statistics

This appendix section summarizes the size and composition of the curated PrimeKG mental\-health subgraph used throughout the benchmark\.

### A\.1 Summary Statistics

Table 3: Summary statistics of the PrimeKG mental\-health subgraph used in this study\. HP = high\-precision; canon\./dedup\. = canonicalization and deduplication\.

### A\.2 Entity and Relation Breakdown

Table 4: Entity\-type and relation\-type counts in the PrimeKG mental\-health subgraph\.

## Appendix B Mental\-Health Seed Disease Nodes

This appendix section documents how the psychiatric seed disease list was defined and reports the final set of seed nodes used for subgraph extraction\.

### B\.1 Seed Selection Procedure

We began with a manually curated high\-precision candidate seed list of 44 PrimeKG disease nodes, stored in mental\_health\_seed\_diseases\_HP\.csv\. PrimeKG disease nodes are encoded using terms from the Mondo Disease Ontology \(MONDO\) and grouped into clinically meaningful disease nodes during PrimeKG construction Chandak et al\. \([2023](https://arxiv.org/html/2605.15589#bib.bib3)\)\. Candidate selection was restricted to psychiatric disorders and closely related conditions intended to define the benchmark scope\. We then applied two manual post\-curation exclusions: one out\-of\-scope entry \(*X\-linked intellectual disability\-psychosis\-macroorchidism syndrome*\), which fell outside the intended psychiatric benchmark scope, and one outdated entry \(*multiple personality disorder*\)\. Removing these two entries from the 44 candidates yielded the final set of 42 psychiatric seed disease nodes used for subgraph extraction\.

### B\.2 Final Seed List

The final seed list, stored in mental\_health\_seed\_diseases\_FINAL\.csv, is shown in Table [5](https://arxiv.org/html/2605.15589#A2.T5)\.

Table 5: Final set of 42 psychiatric seed disease nodes used to extract the PrimeKG mental\-health subgraph\.

## Appendix C Coverage Metric Definitions

This section defines the coverage metrics used in the main paper\. We first define per\-entity and per\-relation correctness and then derive entity\-, relation\-, and triple\-level coverage scores\.

### C\.1 Per\-Entity and Per\-Relation Correctness

Beyond task accuracy, we quantify how well a model covers the mental\-health slice of PrimeKG in terms of correctness over entities, relations, and triples\. Let the curated mental\-health graph be $G=(V,R,T)$, where $V$ is the entity set, $R$ is the relation set, and $T\subseteq V\times R\times V$ is the triple set\.

For each entity $e\in V$, let $\mathcal{Q}(e)$ denote the set of benchmark items whose gold annotation involves $e$\. We define empirical entity correctness as

$$a_{E}(e)=\frac{1}{|\mathcal{Q}(e)|}\sum_{q\in\mathcal{Q}(e)}\mathbf{1}[\hat{y}_{q}=y_{q}]. \tag{8}$$

Similarly, for each relation $r\in R$, let $\mathcal{Q}(r)$ denote the set of benchmark items whose gold annotation involves $r$, and define empirical relation correctness as

$$a_{R}(r)=\frac{1}{|\mathcal{Q}(r)|}\sum_{q\in\mathcal{Q}(r)}\mathbf{1}[\hat{y}_{q}=y_{q}]. \tag{9}$$

The *none* option is treated as a task\-specific no\-relation label rather than a KG relation\. It is therefore excluded from relation coverage and contributes only indirectly through entity\-level correctness\.
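As an illustration of Eqs. (8) and (9), the following minimal sketch computes empirical correctness from a list of scored items. The item schema (`pred`, `gold`, `entities`, `relations`), the function name, and the placeholder node names are assumptions made for this example; how an item "involves" an entity or relation is determined by the benchmark's own annotations, which are not reproduced here.

```python
from collections import defaultdict


def empirical_correctness(items, key):
    """Fraction of correctly answered items among those involving each element.

    `items` is a list of dicts with 'pred', 'gold', and a collection stored
    under `key` ('entities' or 'relations'). This layout is hypothetical.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item in items:
        correct = int(item["pred"] == item["gold"])  # indicator 1[y_hat = y]
        for element in set(item[key]):
            totals[element] += 1
            hits[element] += correct
    return {element: hits[element] / totals[element] for element in totals}


# Toy items with placeholder names; the third item is a "none" item, so it
# lists no KG relation and affects entity correctness only.
items = [
    {"pred": "A", "gold": "A", "entities": ["drug_x", "disorder_a"], "relations": ["indication"]},
    {"pred": "B", "gold": "C", "entities": ["drug_x", "disorder_b"], "relations": ["contraindication"]},
    {"pred": "A", "gold": "A", "entities": ["disorder_a"], "relations": []},
]

a_E = empirical_correctness(items, "entities")
a_R = empirical_correctness(items, "relations")
print(a_E)
print(a_R)
```

In this toy case `drug_x` appears in one correct and one incorrect item, so its entity correctness is 0.5, while the *none* item raises the count for `disorder_a` without contributing to any relation.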

### C\.2 Coverage Scores

We then define five coverage scores. Mean entity coverage is

$$\mathrm{CovAvg}(E)=\frac{1}{|V_{\mathrm{meas}}|}\sum_{e\in V_{\mathrm{meas}}}a_{E}(e). \tag{10}$$

For degree-weighted entity coverage, we define entity degree as the number of incident triples in $T$,

$$\deg(e)=\left|\{(h,r,t)\in T:h=e\ \text{or}\ t=e\}\right|, \tag{11}$$

with normalizing constant

$$Z_{E}=\sum_{e\in V}\deg(e). \tag{12}$$

The resulting degree-weighted entity coverage is

$$\mathrm{CovDeg}(E)=\frac{1}{Z_{E}}\sum_{e\in V}\deg(e)\,a_{E}(e). \tag{13}$$

For relations, mean relation coverage is

$$\mathrm{CovAvg}(R)=\frac{1}{|R_{\mathrm{meas}}|}\sum_{r\in R_{\mathrm{meas}}}a_{R}(r). \tag{14}$$

For degree-weighted relation coverage, we define relation degree as the number of triples in $T$ that use relation $r$,

$$\deg(r)=\left|\{(h,r^{\prime},t)\in T:r^{\prime}=r\}\right|. \tag{15}$$

The corresponding degree-weighted relation coverage is

$$\mathrm{CovDeg}(R)=\frac{1}{|T|}\sum_{r\in R}\deg(r)\,a_{R}(r). \tag{16}$$

Finally, for each triple $(h,r,t)\in T$, we define an auxiliary triple score

$$s(h,r,t)=\frac{a_{E}(h)+a_{R}(r)+a_{E}(t)}{3}, \tag{17}$$

and triple coverage as

$$\mathrm{Cov}(T)=\frac{1}{|T|}\sum_{(h,r,t)\in T}s(h,r,t). \tag{18}$$

Here, $V_{\mathrm{meas}}$ and $R_{\mathrm{meas}}$ denote the sets of measured entities and relations. In the current benchmark, all 1,847 entities and all 7 retained relations are measured for every model.
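As an illustration of Eqs. 8 to 18, the following is a minimal Python sketch of how the five coverage scores could be computed. It is not the authors' released code. The item layout (a `correct` flag plus lists of involved entities or relations) and the function names are assumptions made for this example, and the sketch assumes that every entity and relation appearing in the triple set has been measured.

```python
from collections import defaultdict


def empirical_correctness(items, key):
    """Per-entity or per-relation correctness (Eqs. 8-9).

    `items` is a list of dicts with a boolean "correct" flag and a list stored
    under `key` (hypothetical field names) naming the entities or relations
    that the gold annotation involves.
    """
    hits, counts = defaultdict(int), defaultdict(int)
    for item in items:
        for name in item[key]:
            counts[name] += 1
            hits[name] += int(item["correct"])
    return {name: hits[name] / counts[name] for name in counts}


def coverage_scores(triples, a_E, a_R):
    """Five coverage scores (Eqs. 10-18) for deduplicated (h, r, t) triples."""
    deg_E, deg_R = defaultdict(int), defaultdict(int)
    for h, r, t in triples:
        deg_R[r] += 1
        # Eq. 11 counts incident triples, so a self-loop contributes once.
        for e in {h, t}:
            deg_E[e] += 1
    z_E = sum(deg_E.values())
    n_T = len(triples)
    return {
        "CovAvg(E)": sum(a_E.values()) / len(a_E),
        "CovDeg(E)": sum(deg_E[e] * a_E[e] for e in deg_E) / z_E,
        "CovAvg(R)": sum(a_R.values()) / len(a_R),
        "CovDeg(R)": sum(deg_R[r] * a_R[r] for r in deg_R) / n_T,
        "Cov(T)": sum((a_E[h] + a_R[r] + a_E[t]) / 3 for h, r, t in triples) / n_T,
    }
```

Because $\sum_{r}\deg(r)=|T|$, the degree-weighted relation score is a proper weighted average, which is why the sketch divides by the number of triples rather than by a separate normalizer.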

## Appendix D Benchmark Construction, Evidence, and Evaluation Details

This appendix section provides implementation-level details that extend, rather than repeat, the benchmark overview in Section [3](https://arxiv.org/html/2605.15589#S3) and the evaluation description in Section [4.2](https://arxiv.org/html/2605.15589#S4.SS2). Unless otherwise noted, all statements in this section are derived directly from the benchmark-generation and evaluation code used in our experiments.

### D.1 Benchmark Construction Details

The benchmark is generated from PrimeKG using four input files: `kg.csv`, `mental_health_seed_diseases_FINAL.csv`, `disease_features.tab`, and `drug_features.tab`. The final seed file contains the 42 psychiatric seed disease nodes reported in Appendix [B](https://arxiv.org/html/2605.15589#A2). Starting from these seeds, we extract all 1-hop seed-touching edges from `kg.csv`, retain only a fixed set of seven clinically salient relations, canonicalize relation direction to predefined head/tail type signatures, and deduplicate the resulting triples.

The retained relations are `disease_protein`, `contraindication`, `indication`, `off-label use`, `disease_disease`, `disease_phenotype_positive`, and `exposure_disease`. Canonical relation signatures are fixed during preprocessing: `disease_protein` is canonicalized as disease → gene/protein; `contraindication`, `indication`, and `off-label use` as drug → disease; `disease_disease` as disease → disease; `disease_phenotype_positive` as disease → effect/phenotype; and `exposure_disease` as exposure → disease. For `disease_disease`, symmetric duplicates are additionally removed by lexicographic canonicalization of the two disease names.
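The filtering and canonicalization steps described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual generator: the edge tuple layout `(x_name, x_type, relation, y_name, y_type)` and the function name `canonicalize` are assumptions, while the relation names and type signatures are taken from the text.

```python
# Canonical (head_type, tail_type) signatures for the seven retained relations.
SIGNATURES = {
    "disease_protein": ("disease", "gene/protein"),
    "contraindication": ("drug", "disease"),
    "indication": ("drug", "disease"),
    "off-label use": ("drug", "disease"),
    "disease_disease": ("disease", "disease"),
    "disease_phenotype_positive": ("disease", "effect/phenotype"),
    "exposure_disease": ("exposure", "disease"),
}


def canonicalize(edges, seeds):
    """Return deduplicated, direction-canonical triples touching a seed node.

    `edges` holds (x_name, x_type, relation, y_name, y_type) tuples; this
    layout is assumed for the sketch.
    """
    triples = set()
    for x, x_type, rel, y, y_type in edges:
        if rel not in SIGNATURES:
            continue  # drop relations outside the retained set
        if x not in seeds and y not in seeds:
            continue  # keep only 1-hop seed-touching edges
        signature = SIGNATURES[rel]
        if (x_type, y_type) == signature:
            head, tail = x, y
        elif (y_type, x_type) == signature:
            head, tail = y, x  # flip the edge to the canonical direction
        else:
            continue  # node types do not fit the signature
        if rel == "disease_disease":
            # Symmetric relation: order the two names lexicographically.
            head, tail = sorted((head, tail))
        triples.add((head, rel, tail))
    return triples
```

Storing the result in a set makes deduplication a side effect of canonicalization, since both orientations of an edge collapse to the same tuple.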

All task instances are generated programmatically from fixed English templates and use a unified letter-only answer interface. Entity Typing (ET) is a 5-way multiple-choice question over entity types. Entity Clustering (EC) is constructed as an odd-one-out task with four entities of one type and one entity of another type. Relation Typing (RT) asks for the dominant head → tail type signature of a relation. Relation Prediction (RP) is a 4-way multiple-choice task over `indication`, `contraindication`, `off-label use`, and `none`. Two-hop Verification (R1) is a binary A/B task, and Two-hop Selection (R2) is a 4-way multiple-choice task. Evidence-augmented variants (R1+E and R2+E) use the same underlying task structure but append short feature-table evidence snippets to the question.

The two-hop tasks are constructed from contexts in which a drug is linked to disease A by one of the three drug–disease usage relations and disease A is linked to disease B by `disease_disease`. Positive R1 instances are those for which the queried drug–disease B relation already exists in the retained subgraph. Negative R1 instances preserve the same 2-hop scaffold but require that the queried drug–disease B relation be absent from the subgraph. The generator targets an approximately balanced R1 label distribution with a 50% Yes rate and does not allow the intermediate disease and queried disease to be identical. R2 instances are built from the same contexts and ask the model to select the most appropriate relation label for the queried drug–disease pair.
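One way to read the R1 construction is sketched below. The sketch assumes that the queried relation for disease B is the same usage relation that links the drug to disease A, which the text does not state explicitly, and it balances labels by downsampling the larger class. The function name `build_r1` and the item layout are hypothetical.

```python
import random
from collections import defaultdict

USAGE = ("indication", "contraindication", "off-label use")


def build_r1(triples, seed=42):
    """Build balanced two-hop verification items as ((drug, rel, A, B), label)."""
    usage = set()                  # {(drug, relation, disease)}
    neighbors = defaultdict(set)   # disease -> diseases linked by disease_disease
    for h, r, t in triples:
        if r in USAGE:
            usage.add((h, r, t))
        elif r == "disease_disease":
            neighbors[h].add(t)
            neighbors[t].add(h)

    positives, negatives = [], []
    for drug, rel, dis_a in sorted(usage):
        for dis_b in sorted(neighbors[dis_a]):
            if dis_b == dis_a:
                continue  # intermediate and queried disease must differ
            item = (drug, rel, dis_a, dis_b)
            # Assumption: the query reuses the first-hop relation label.
            if (drug, rel, dis_b) in usage:
                positives.append(item)
            else:
                negatives.append(item)

    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    items = [(x, "A") for x in rng.sample(positives, n)]   # A = Yes
    items += [(x, "B") for x in rng.sample(negatives, n)]  # B = No
    rng.shuffle(items)
    return items
```

Sorting before sampling keeps the output reproducible for a fixed seed regardless of set iteration order.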

The Fact Checking (FC) task is balanced *per relation*. For each retained relation $r$, the benchmark generator samples positive triples from that relation and constructs an equal number of negative examples under the same relation. FC negatives are created by type-matched head or tail replacement while preserving the original relation label, and the perturbed triple is kept only if it does not appear in the retained mental-health subgraph. This design avoids relation-replacement negatives and makes per-relation FC behavior easier to interpret.
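The negative-sampling rule for FC can be illustrated with a short sketch. The function name, the `entity_type` mapping, the 50/50 choice between head and tail replacement, and the retry limit are assumptions for the example. Only the type-matched replacement, the preserved relation label, and the rejection of triples already in the subgraph come from the text.

```python
import random


def fc_negatives(positives, triple_set, entity_type, seed=42, max_tries=50):
    """Create one type-matched negative per positive triple where possible."""
    rng = random.Random(seed)
    by_type = {}
    for entity, etype in sorted(entity_type.items()):
        by_type.setdefault(etype, []).append(entity)

    negatives = []
    for h, r, t in positives:
        for _ in range(max_tries):
            if rng.random() < 0.5:
                # Replace the head with an entity of the same type.
                candidate = (rng.choice(by_type[entity_type[h]]), r, t)
            else:
                # Replace the tail with an entity of the same type.
                candidate = (h, r, rng.choice(by_type[entity_type[t]]))
            # Reject anything that is a true triple, including the original.
            if candidate not in triple_set:
                negatives.append(candidate)
                break
    return negatives
```

If no valid replacement is found within `max_tries`, the positive is left without a negative, so a real generator would need to resample or drop that positive to keep each relation balanced.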

### D.2 Evidence Construction and Sanitization

Evidence-augmented tasks draw text snippets from `disease_features.tab` and `drug_features.tab`. For disease nodes, the pipeline attempts to use available fields such as `mondo_name`, `mondo_definition`, `umls_description`, `orphanet_clinical_description`, `mayo_symptoms`, `mayo_causes`, `mayo_risk_factors`, and `orphanet_management_and_treatment`. For drug nodes, it attempts to use fields such as `description`, `indication`, `mechanism_of_action`, `pharmacodynamics`, `half_life`, `state`, and `category`. When multiple rows are available for the same node, the generator keeps the first non-empty value for each field. Long text fields are truncated to at most 220 characters per field before question assembly.
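The "first non-empty value, then truncate" rule can be sketched in a few lines. The function name and the representation of feature-table rows as dictionaries of strings are assumptions for illustration. The 220-character limit is taken from the text.

```python
def collect_evidence(rows, fields, max_chars=220):
    """Pick the first non-empty value per field and truncate it.

    `rows` is a list of dicts for one node (assumed layout); fields with no
    non-empty value in any row are omitted from the result.
    """
    evidence = {}
    for field in fields:
        for row in rows:
            value = (row.get(field) or "").strip()
            if value:
                evidence[field] = value[:max_chars]
                break  # keep only the first non-empty value
    return evidence
```

For example, given two rows for the same drug where only the second has a `description`, the sketch takes `description` from the second row and every other field from the first row that provides it.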

To reduce answer leakage, evidence text is sanitized before insertion into the question. The sanitization step redacts lexical forms overlapping with relation answer options, including patterns matching `indication`, `contraindication`, and `off-label`, and replaces them with a neutral placeholder `[REL]`. In addition, the original drug-table field name `indication` is rendered with the more neutral display label *Clinical use* when evidence blocks are assembled. The evidence block is then attached in a fixed order: drug evidence first, followed by evidence for disease A and disease B.
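A possible form of the redaction step is shown below. The exact regular expressions used by the benchmark are not given in the text, so the pattern here is an assumption that covers common inflections of the three terms. The helper names `sanitize` and `render_field` are likewise hypothetical.

```python
import re

# Assumed pattern: the text names the redacted terms but not the exact regex.
# The longer "contraindicat..." alternative is listed first on purpose.
_RELATION_TERMS = re.compile(
    r"\b(?:contra[\s-]?indicat\w*|indicat\w*|off[\s-]?label(?:\s+use)?)\b",
    re.IGNORECASE,
)

# Display label override described in the text.
_DISPLAY_LABELS = {"indication": "Clinical use"}


def sanitize(text):
    """Replace relation-revealing terms with the neutral placeholder [REL]."""
    return _RELATION_TERMS.sub("[REL]", text)


def render_field(field, value):
    """Format one evidence line with a neutral label and sanitized text."""
    label = _DISPLAY_LABELS.get(field, field)
    return f"{label}: {sanitize(value)}"
```

Redacting inflected forms such as "indicated" is a design choice of this sketch. A narrower pattern would leak less often in one direction but miss more variants in the other.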

### D.3 Evaluation Protocol Details

All benchmark tasks are evaluated under the same letter-only interface. For binary tasks, the benchmark uses A/B labels rather than literal Yes/No strings, with A corresponding to Yes and B corresponding to No. The evaluation code records per-task accuracy for all tasks and additionally computes diagnostic quantities such as prediction-A rate and balanced accuracy for A/B tasks.
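The A/B diagnostics mentioned above can be computed as in the following sketch. It uses the standard definition of balanced accuracy (the mean of per-class recall). Whether the evaluation code uses exactly this definition is an assumption, as is the function name.

```python
def ab_diagnostics(gold, pred):
    """Accuracy, prediction-A rate, and balanced accuracy for A/B tasks.

    `gold` and `pred` are equal-length, non-empty lists of labels; an invalid
    prediction (for example None) simply counts as incorrect.
    """
    n = len(gold)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / n
    pred_a_rate = sum(p == "A" for p in pred) / n

    recalls = []
    for label in ("A", "B"):
        preds_for_label = [p for g, p in zip(gold, pred) if g == label]
        if preds_for_label:
            hits = sum(p == label for p in preds_for_label)
            recalls.append(hits / len(preds_for_label))
    balanced_accuracy = sum(recalls) / len(recalls)

    return {
        "accuracy": accuracy,
        "pred_a_rate": pred_a_rate,
        "balanced_accuracy": balanced_accuracy,
    }
```

A prediction-A rate far from the gold Yes rate signals a label bias that plain accuracy can hide, which is why the two diagnostics are useful together.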

For local Hugging Face models, evaluation is performed by forced-choice scoring rather than free-form generation. The prompt is constructed by appending the fixed anchor `\nAnswer:` to each benchmark question. If the tokenizer provides a chat template, the prompt is wrapped using the tokenizer's chat-template interface; otherwise, the raw question text is used directly. Candidate answer letters are scored through multi-token log-probability accumulation over several surface forms, including `(A)`, bare-letter forms with a trailing newline, parenthesized-letter forms with a trailing newline, and several short prefixes. Scores for multiple surface forms corresponding to the same letter are merged by taking the maximum score for that letter, and the highest-scoring allowed option is selected as the model prediction. This procedure makes local evaluation deterministic given fixed model weights, tokenizer behavior, and benchmark inputs.
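The merge-by-maximum step can be sketched independently of any model library. The sketch assumes the summed token log-probabilities for each surface form have already been computed elsewhere. The function name and the dictionary layout are hypothetical.

```python
def pick_letter(surface_scores, allowed):
    """Merge surface-form scores per letter by max and pick the best letter.

    `surface_scores` maps (letter, surface_form) to a summed token
    log-probability (assumed to be precomputed); `allowed` lists the option
    letters that are valid for the current question.
    """
    best = {}
    for (letter, _form), score in surface_scores.items():
        if letter in allowed:
            best[letter] = max(score, best.get(letter, float("-inf")))
    # Iterating letters in sorted order makes tie-breaking deterministic.
    return max(sorted(best), key=best.get)
```

For example, if `A` scores -2.0 as `(A)` and -0.5 as a bare letter with a trailing newline, its merged score is -0.5, so it beats a `B` whose best surface form scores -0.7.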

The local evaluation code uses batch size 1, maximum sequence length 4096, `bfloat16` model loading, and automatic device placement via `device_map="auto"`. Model loading is performed with `trust_remote_code=True`. For API-based models, the evaluation pipeline queries the model with temperature set to 0 and a maximum completion length of 120 tokens, then applies strict answer parsing to recover a single option letter from the returned text. The parser first searches for explicit answer patterns such as `Answer: (X)` and then falls back to leading-letter or in-text letter matching when necessary. Because some API-based models occasionally return outputs that do not perfectly follow the requested constrained format, response validity is itself an important part of the measured evaluation behavior.
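The three-stage parsing order described above (explicit pattern, leading letter, in-text letter) might look like the following. The precise rules of the benchmark's parser are not specified, so the regular expressions and the choice to reject ambiguous in-text matches are assumptions of this sketch.

```python
import re


def parse_answer(text, allowed="ABCD"):
    """Recover a single option letter from model output, or None if invalid."""
    cls = f"[{re.escape(allowed)}]"

    # 1) Explicit pattern such as "Answer: (X)" or "answer: X".
    explicit = re.search(rf"(?i:answer)\s*:\s*\(?({cls})\)?(?!\w)", text)
    if explicit:
        return explicit.group(1)

    # 2) Leading letter: "(X) ..." or a bare "X" followed by punctuation
    #    or the end of the line.
    leading = re.match(
        rf"\s*(?:\(({cls})\)|({cls})(?=\s*(?:[.:)]|$)))", text, re.MULTILINE
    )
    if leading:
        return leading.group(1) or leading.group(2)

    # 3) In-text fallback: accept only if exactly one distinct standalone
    #    option letter appears anywhere in the text.
    found = set(re.findall(rf"(?<!\w)({cls})(?!\w)", text))
    if len(found) == 1:
        return found.pop()
    return None
```

Returning `None` for ambiguous outputs means they are scored as invalid rather than guessed, which makes format failures visible in the measured accuracy.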

### D.4 Randomness and Deterministic Settings

Benchmark generation uses a fixed Python random seed of 42\. This seed controls the sampling operations used during task construction, including option shuffling, entity selection, negative sampling, and task\-instance ordering\. Local model evaluation by forced\-choice scoring does not sample from the model and is therefore deterministic given fixed model weights, tokenizer behavior, and benchmark inputs\. API\-based evaluation is configured with temperature 0 to reduce generation variability\.
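As a small illustration of seeded generation, the sketch below shuffles answer options with a seeded generator and records the gold letter. The text states only that a fixed Python random seed of 42 is used. The use of a local `random.Random` instance and the function name are assumptions.

```python
import random


def shuffle_options(options, gold, rng):
    """Shuffle options with the given RNG and return (letter map, gold letter)."""
    shuffled = list(options)
    rng.shuffle(shuffled)
    letters = "ABCDEFGH"[: len(shuffled)]
    return dict(zip(letters, shuffled)), letters[shuffled.index(gold)]


# Two generators created with the same seed produce the same ordering.
rng = random.Random(42)
options = ["indication", "contraindication", "off-label use", "none"]
lettered, gold_letter = shuffle_options(options, "none", rng)
```

Passing the generator explicitly keeps every sampling step tied to one seed, so regenerating the benchmark with the same inputs gives the same items in the same order.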

## Appendix E Extended Coverage and Fine-Grained Results

This appendix section reports the full coverage tables and the detailed entity\- and relation\-level analyses that complement the compact summaries in the main text\.

### E.1 Knowledge Coverage Results

Table 6: Knowledge coverage (%) on the PrimeKG mental-health subgraph. Higher is better. All models measure all 1,847 entities and all 7 retained relations. Models are sorted by $\mathrm{Cov}(T)$.

Table [6](https://arxiv.org/html/2605.15589#A5.T6) provides a complementary graph-wide view of performance. Because all 1,847 entities and all 7 retained relations are measured for every model, these metrics summarize behavior over the full benchmark graph rather than over a partial subset. The strongest triple coverage is achieved by GPT-5-mini, with $\mathrm{Cov}(T)=65.27\%$, followed by GPT-4o at 64.77% and GPT-4.1 at 63.57%. This ranking differs from the main task average, where GPT-4.1 is the top model. The discrepancy suggests that average task accuracy and graph-wide coverage capture different aspects of model behavior.

At the entity level, GPT-4.1 achieves the highest mean entity coverage, with $\mathrm{CovAvg}(E)=77.91\%$, closely followed by GPT-5.1-chat and GPT-5-mini. However, GPT-5-mini attains the highest degree-weighted relation coverage, with $\mathrm{CovDeg}(R)=63.30\%$, which helps explain why it leads on triple coverage. In other words, GPT-5-mini is especially strong on graph-central relation mass, even though GPT-4.1 remains slightly stronger on average benchmark accuracy.

Coverage also changes the interpretation of open-source models. Qwen2.5-32B-Instruct is the best open-source model by $\mathrm{Avg}_{All}$ in Table [1](https://arxiv.org/html/2605.15589#S4.T1), but it does not have the strongest open-source triple coverage. Mistral-7B-Instruct-v0.3 and Qwen2.5-7B-Instruct both slightly exceed Qwen2.5-32B-Instruct on $\mathrm{Cov}(T)$, suggesting that correctness on high-degree graph components can differ from overall task-level performance. GPT-5.2-chat is another notable case: despite ranking near the top on the main benchmark averages, its coverage scores are substantially lower than those of the other GPT models, indicating that its correctness is less evenly distributed across the graph.

### E.2 Fine-Grained Entity and Relation Analysis

For a model $m$ and relation $r$, we define

$$\mathrm{Acc}_{m}(r)=\frac{c_{m}(r)}{n(r)}, \tag{19}$$

where $n(r)$ is the number of benchmark items whose gold annotation involves relation $r$, and $c_{m}(r)$ is the number of those items answered correctly. Similarly, for an entity $e$, we define

$$\mathrm{Acc}_{m}(e)=\frac{c_{m}(e)}{n(e)}. \tag{20}$$

We report high-incidence entities and retained relations because these components exert a strong influence on observed benchmark behavior and help localize where model performance concentrates.

Table 7: Fine-grained accuracy (%) on retained relations in the mental-health subgraph. Relations are ordered by benchmark incidence in the evaluation set. The highest model score in each row is bolded. The final column reports the mean accuracy across models.

- Model abbreviations: BioM7B = BioMistral-7B; DS32 = DeepSeek-R1-Distill-Qwen-32B; DS7 = DeepSeek-R1-Distill-Qwen-7B; Med42 = Llama3-Med42-8B; OpenBio = Llama3-OpenBioLLM-8B; Meditron = Meditron-7B; Mistral7B = Mistral-7B-Instruct-v0.3; Q32 = Qwen2.5-32B-Instruct; Q7 = Qwen2.5-7B-Instruct; GPT4.1 = GPT-4.1; GPT4o = GPT-4o; GPT5m = GPT-5-mini; GPT5.1 = GPT-5.1-chat; GPT5.2 = GPT-5.2-chat; L3.1-8B = Llama3.1-8B-Instruct.

Table 8: Fine-grained accuracy (%) on Top-15 high-incidence mental-health entities in the benchmark. Entities are ordered by benchmark incidence in the evaluation set. The final column reports the mean accuracy across models.

- Model abbreviations are identical to Table [7](https://arxiv.org/html/2605.15589#A5.T7).

The fine-grained relation results in Table [7](https://arxiv.org/html/2605.15589#A5.T7) reveal substantial variation across clinically salient relation families. The easiest relation on average is `indication`, with a mean accuracy of 59.1%, followed by `disease_disease` at 56.6% and `exposure_disease` at 55.7%. In contrast, `contraindication` is by far the hardest relation, with a mean accuracy of only 35.8%. This is notable because `contraindication` marks one of the most safety-sensitive boundaries in mental-health pharmacotherapy. Its difficulty helps explain why RP remains much weaker than ET or the small RT set, even for the strongest models.

The relation-level results also show that overall model quality does not imply uniform strength across relation families. GPT-4.1 is strongest on `contraindication` at 43.6%, GPT-5-mini is strongest on `disease_protein` and `exposure_disease` at 65.9% and 74.6%, respectively, GPT-5.1-chat is strongest on `off-label use` at 70.9%, and Qwen2.5-7B-Instruct is strongest on `indication` at 83.9%. Qwen2.5-32B-Instruct achieves the best score on `disease_disease` and `disease_phenotype_positive`. These row-wise inversions reinforce the idea that performance is relation-dependent rather than uniformly ordered by overall benchmark average.

The fine-grained entity results in Table [8](https://arxiv.org/html/2605.15589#A5.T8) show that high benchmark incidence does not guarantee ease. Among the Top-15 high-incidence mental-health entities, the highest mean accuracies are observed for *major depressive disorder* (57.1%), *schizophrenia* (55.7%), and *unipolar depression* (54.7%). However, several prominent entities remain difficult, most notably *anxiety disorder*, which has the highest benchmark incidence in this subset but only 40.1% mean accuracy. *Psychotic disorder* (43.0%) and *schizoaffective disorder* (40.4%) are also challenging despite their high incidence. At the lower end, *anorexia nervosa* (37.8%) and *obsessive-compulsive disorder* (38.6%) are among the hardest entities in this set.

Taken together, the fine\-grained analyses suggest that model failures are concentrated in clinically important and diagnostically heterogeneous parts of the graph\. High\-incidence depressive disorders are often handled reasonably well by the stronger models, but anxiety\-spectrum, psychotic\-spectrum, and medication\-boundary distinctions remain much less stable\.
