Evidence-Based Intelligent Diagnostic and Therapeutic Visualization System with Large Language Models: Multi-Turn Interaction and Multimodal Treatment Plan Generation

arXiv cs.AI Papers

Summary

This paper proposes a knowledge-enhanced visual diagnostic system for traditional Chinese medicine that uses a Neo4j knowledge graph, a four-stage symptom matching pipeline, and an information gain-driven proactive questioning strategy to improve transparency and interpretability. Results demonstrate significant improvements in diagnostic trust and reduced cognitive load.

arXiv:2606.06869v1 Announce Type: new Abstract: Aim: Existing AI-assisted traditional Chinese medicine diagnostic tools suffer from opaque reasoning processes, passive interaction, and limited treatment plan presentation. This study proposes a knowledge-enhanced visual diagnostic system to improve the transparency and interpretability of syndrome differentiation and treatment. Methods: The system is built upon a Neo4j knowledge graph comprising 241 syndromes, 1,263 symptoms, and 2,485 relations. It incorporates a four-stage symptom matching pipeline (exact, semantic, fuzzy, and large language model verification), an information gain-driven proactive questioning strategy optimized with genetic algorithms, and a multimodal treatment presentation integrating artificial intelligence-generated illustrations, three-dimensional meridian-acupoint models, and evidence-based literature. Results: Knowledge graph constraints reduced non-standard outputs by 32%. Case studies validated the effectiveness of the interactive workflow across patient self-assessment, clinician-assisted diagnosis, and traditional Chinese medicine education. Automated paired-comparison evaluation across 30 cases further demonstrated significant improvements in diagnostic trust (Cohen's d = 1.82, p < 0.001), reduced cognitive load (improvements in four of five dimensions), and higher credibility of evidence-based references (4.21 vs. 2.95). Conclusions: The proposed system enhances the transparency of traditional Chinese medicine diagnostic reasoning and the interpretability of treatment plans through knowledge graph-driven visualization and multimodal interaction, offering a practical solution for trustworthy artificial intelligence-assisted traditional Chinese medicine applications.

Cached at: 06/08/26, 09:14 AM

# Evidence-Based Intelligent Diagnostic and Therapeutic Visualization System with Large Language Models: Multi-Turn Interaction and Multimodal Treatment Plan Generation
Source: [https://arxiv.org/html/2606.06869](https://arxiv.org/html/2606.06869)
Author contributions (CRediT):

- Conceptualization, Methodology, Software, Visualization, Writing – original draft
- Software, Validation, Visualization, Writing – original draft
- Methodology, Data curation, Formal analysis, Writing – review and editing
- Resources, Validation, Investigation, Writing – review and editing
- Resources, Investigation, Validation, Writing – review and editing
- Resources, Project administration, Writing – review and editing
- Supervision, Methodology, Project administration, Writing – review and editing
- Conceptualization, Funding acquisition, Methodology, Supervision, Project administration, Writing – review and editing (corresponding author)

1. Harbin Institute of Technology, Weihai; Weihai, Shandong, China
2. Harbin Institute of Technology (Weihai) Qingdao Research Institute; Qingdao, Shandong, China
3. Shandong Key Laboratory of Digital Service Computing Technology and Systems; China
4. Weihai Municipal Hospital; Weihai, Shandong, China
5. Shanghai Taizhu Technology Co., Ltd; Shanghai, China
6. Tianjin Zhifu Qihuang Medical Technology Co., Ltd; Tianjin, China

Yuda Wang, Zhiying Tu, Mingqiang Song, Li Song, Kun Li, Dianhui Chu, Bolin Zhang

Corresponding author: bolin@hit.edu.cn

###### Abstract

Aim: Existing AI-assisted traditional Chinese medicine diagnostic tools suffer from opaque reasoning processes, passive interaction, and limited treatment plan presentation. This study proposes a knowledge-enhanced visual diagnostic system to improve the transparency and interpretability of syndrome differentiation and treatment.

Methods: The system is built upon a Neo4j knowledge graph comprising 241 syndromes, 1,263 symptoms, and 2,485 relations. It incorporates a four-stage symptom matching pipeline (exact → semantic → fuzzy → large language model verification), an information gain–driven proactive questioning strategy optimized with genetic algorithms, and a multimodal treatment presentation integrating artificial intelligence-generated illustrations, three-dimensional meridian–acupoint models, and evidence-based literature.

Results: Knowledge graph constraints reduced non-standard outputs by 32%. Case studies validated the effectiveness of the interactive workflow across patient self-assessment, clinician-assisted diagnosis, and traditional Chinese medicine education. Automated paired-comparison evaluation across 30 cases further demonstrated significant improvements in diagnostic trust (Cohen’s $d = 1.82$, $p < 0.001$), reduced cognitive load (improvements in four of five dimensions), and higher credibility of evidence-based references (4.21 vs. 2.95).

Conclusions: The proposed system enhances the transparency of traditional Chinese medicine diagnostic reasoning and the interpretability of treatment plans through knowledge graph–driven visualization and multimodal interaction, offering a practical solution for trustworthy artificial intelligence-assisted traditional Chinese medicine applications.

###### Keywords

Clinical decision support systems; Knowledge graphs; Traditional Chinese medicine; Large language models; Medical visualization

###### Highlights

- KG-guided visualization makes TCM syndrome reasoning traceable.
- Multi-turn questioning narrows candidate syndrome patterns interactively.
- Multimodal plans reduce cognitive load in TCM treatment presentation.
- Structured references improve perceived evidence credibility.

## 1 Introduction

Traditional Chinese Medicine (TCM) diagnosis requires integrating inspection, auscultation-olfaction, inquiry, and palpation to map symptom combinations to standardized syndrome patterns \[[1](https://arxiv.org/html/2606.06869#bib.bib1)\]. Although AI-based diagnostic assistance has advanced rapidly in healthcare \[[2](https://arxiv.org/html/2606.06869#bib.bib2),[3](https://arxiv.org/html/2606.06869#bib.bib3),[4](https://arxiv.org/html/2606.06869#bib.bib4)\], existing TCM AI tools face three critical problems: diagnostic reasoning remains opaque, functioning as a “black box” \[[5](https://arxiv.org/html/2606.06869#bib.bib5),[6](https://arxiv.org/html/2606.06869#bib.bib6)\]; interaction is predominantly passive, lacking multi-turn probing capability \[[7](https://arxiv.org/html/2606.06869#bib.bib7),[8](https://arxiv.org/html/2606.06869#bib.bib8)\]; and treatment plans suffer from information overload due to monolithic text presentation \[[9](https://arxiv.org/html/2606.06869#bib.bib9),[10](https://arxiv.org/html/2606.06869#bib.bib10)\]. These deficiencies span knowledge representation, where the parametric knowledge of large language models (LLMs) is prone to hallucinations in medical domains \[[11](https://arxiv.org/html/2606.06869#bib.bib11),[12](https://arxiv.org/html/2606.06869#bib.bib12),[13](https://arxiv.org/html/2606.06869#bib.bib13)\]; interaction design, where single-turn question-answer modes lack progressive symptom clarification \[[14](https://arxiv.org/html/2606.06869#bib.bib14),[15](https://arxiv.org/html/2606.06869#bib.bib15),[16](https://arxiv.org/html/2606.06869#bib.bib16),[17](https://arxiv.org/html/2606.06869#bib.bib17)\]; and information presentation, where diagnostic reasoning lacks visualization and treatment plans rely on text alone \[[18](https://arxiv.org/html/2606.06869#bib.bib18),[19](https://arxiv.org/html/2606.06869#bib.bib19)\]. No existing system simultaneously integrates knowledge graph visualization, multi-turn interactive diagnosis, and multimodal treatment presentation.

![Refer to caption](https://arxiv.org/html/2606.06869v1/x1.png)

Figure 1: System overview. The left panel shows the multi-turn interactive consultation process: users describe symptoms in natural language, and the system progressively clarifies key differentiating information through knowledge-enhanced active questioning. The upper right panel illustrates the diagnostic reasoning process: four-layer progressive symptom matching combined with an information-gain-optimized questioning strategy performs candidate syndrome screening and convergence on a dynamic knowledge graph subgraph. The lower right panel shows multimodal diagnostic output: structured treatment plans, AI-generated treatment illustrations, evidence-based references, and an interactive three-dimensional acupoint model.

Addressing these gaps, this paper investigates the following three research questions \(RQs\):

- RQ1: How does interactive multi-turn questioning improve candidate syndrome convergence efficiency and differentiation accuracy? That is, can an active questioning strategy based on information gain effectively guide users to provide key differentiating symptoms while progressively narrowing the candidate syndrome space?
- RQ2: How does knowledge graph visualization enhance the transparency of TCM diagnostic reasoning? Specifically, can progressive subgraph display help users understand the reasoning path from symptoms to syndrome types, thereby increasing trust in the system’s diagnostic conclusions?
- RQ3: How does multimodal presentation improve patient comprehension of treatment plans? Do AI-generated treatment illustrations, three-dimensional acupoint models, and structured references significantly lower the barrier to understanding treatment information?

To answer these questions, we designed a knowledge-enhanced TCM visual diagnostic system (Figure [1](https://arxiv.org/html/2606.06869#S1.F1)), built around a Neo4j knowledge graph containing 241 syndrome types, 1,263 symptoms, and 2,485 relationships. It combines four-layer progressive symptom matching with an information-gain-driven questioning strategy, enabling end-to-end visual interaction from symptom input through diagnosis to treatment planning. The main contributions are:

1. Evidence-based diagnostic visualization (C1): A progressive knowledge graph visualization method that makes TCM syndrome differentiation reasoning transparent and verifiable by dynamically updating symptom-syndrome subgraphs after each interaction round, allowing users to trace how diagnostic conclusions form.
2. Multi-turn interactive diagnostic framework (C2): An interactive framework combining four-layer symptom matching (exact, vector semantic, fuzzy, and LLM verification) with information-gain-driven questioning to guide progressive symptom clarification within the knowledge-graph-defined syndrome space.
3. Multimodal treatment presentation (C3): A framework integrating AI-generated treatment illustrations, a three-dimensional acupoint model, and structured evidence-based references, transforming text-only recommendations into an intuitive multimodal format.

The system targets three user groups: patients seeking preliminary self\-assessment through guided visual interaction, clinicians using it as a “second opinion” tool with full reasoning path transparency, and TCM students learning syndrome differentiation through interactive diagnostic case replay\.

A comprehensive review of related work, including comparisons with representative systems, is provided in Supplementary Information Section S1\.

## 2 Methods

### 2.1 Design Challenges

Before designing the system, we identified core design challenges facing AI\-assisted TCM diagnostic systems through three channels: \(1\) a systematic review of existing AI\-assisted diagnosis systems and medical visualization literature; \(2\) informal discussions with two TCM practitioners; and \(3\) an analysis of the limitations of current AI chat tools in clinical scenarios\. These challenges, distilled from literature and practitioner input, guided the system design decisions below\.

##### DC1: Prior Dependency and Hallucination Risk

AI-assisted diagnosis faces two intertwined “black-box” risks: opaque reasoning paths that users cannot trace \[[5](https://arxiv.org/html/2606.06869#bib.bib5),[6](https://arxiv.org/html/2606.06869#bib.bib6)\] and LLM hallucinations that fabricate rationales or associations \[[12](https://arxiv.org/html/2606.06869#bib.bib12),[13](https://arxiv.org/html/2606.06869#bib.bib13)\]. We address both by grounding interaction in a progressively updated, case-specific KG subgraph: it provides structured visualization for interpretability \[[18](https://arxiv.org/html/2606.06869#bib.bib18),[20](https://arxiv.org/html/2606.06869#bib.bib20)\] while also constraining generation within verified symptom–syndrome relations \[[21](https://arxiv.org/html/2606.06869#bib.bib21),[22](https://arxiv.org/html/2606.06869#bib.bib22)\]. This constraint depends on KG coverage, label alignment, and prompt strategy.

##### DC2: Ambiguous Expressions and Syndrome Space

Colloquial and imprecise symptom descriptions can be misaligned with KG entities\[[23](https://arxiv.org/html/2606.06869#bib.bib23)\], and early matching errors can cascade into diagnostic drift\. Meanwhile, effective medical consultation requires multi\-turn proactive clarification when user information is incomplete\[[14](https://arxiv.org/html/2606.06869#bib.bib14),[15](https://arxiv.org/html/2606.06869#bib.bib15),[16](https://arxiv.org/html/2606.06869#bib.bib16),[17](https://arxiv.org/html/2606.06869#bib.bib17)\], yet many existing systems lack such capability\. We therefore combine robust, progressive symptom matching with information gain\-driven, multi\-round follow\-up\[[7](https://arxiv.org/html/2606.06869#bib.bib7),[8](https://arxiv.org/html/2606.06869#bib.bib8),[24](https://arxiv.org/html/2606.06869#bib.bib24)\], using user confirmations/denials as implicit feedback to iteratively narrow the KG\-defined candidate syndrome space\.

![Refer to caption](https://arxiv.org/html/2606.06869v1/x2.png)

Figure 2: Three-tier system architecture. The front-end presentation layer adopts a left-right split-screen layout, with conversational interaction on the left and knowledge visualization on the right; the back-end logic layer integrates core modules including symptom extraction, multi-layer matching, and LLM interaction; the data layer centers on a graph database supporting KG storage and querying.

##### DC3: Cognitive Threshold and Literacy Disparities

Text\-only treatment recommendations can overwhelm non\-expert users, especially for unfamiliar concepts such as formula composition and acupoint localization\[[9](https://arxiv.org/html/2606.06869#bib.bib9),[10](https://arxiv.org/html/2606.06869#bib.bib10)\]\. This is exacerbated by heterogeneous audiences \(patients, physicians, and students\) with different literacy and evidence needs\[[2](https://arxiv.org/html/2606.06869#bib.bib2),[4](https://arxiv.org/html/2606.06869#bib.bib4)\]\. We therefore use a multimodal presentation framework that integrates text with AI\-generated illustrations and interactive 3D acupoint visualization\[[19](https://arxiv.org/html/2606.06869#bib.bib19),[25](https://arxiv.org/html/2606.06869#bib.bib25)\], plus structured references, so users can access information at appropriate depth\.

### 2.2 System Design

We describe the three\-tier architecture and four functional modules \(M1–M4\), and summarize the diagnosis–treatment dual\-chain visualization spanning the workflow\.

System Overview

The system follows a three-tier architecture (Figure [2](https://arxiv.org/html/2606.06869#S2.F2)) with a split-screen interface (Figure [3](https://arxiv.org/html/2606.06869#S2.F3)). The front-end supports conversational consultation and interactive visualization; the back-end orchestrates symptom extraction, KG-grounded follow-up, and LLM interaction; and the data layer stores the KG and references for efficient retrieval. Standardized APIs enable modular iteration across tiers.

Module 1: Knowledge Graph Construction

Users begin by describing symptoms in the conversational panel (Figure [4](https://arxiv.org/html/2606.06869#S2.F4)). The system extracts candidate symptom entities, highlights matches, and asks users to confirm or deny them, providing implicit feedback that helps refine symptom alignment \[[7](https://arxiv.org/html/2606.06869#bib.bib7),[26](https://arxiv.org/html/2606.06869#bib.bib26)\]. In parallel, it renders an initial case-specific KG subgraph from confirmed symptoms to candidate syndrome types, making the starting hypothesis space visible \[[18](https://arxiv.org/html/2606.06869#bib.bib18)\].

![Refer to caption](https://arxiv.org/html/2606.06869v1/x3.png)

Figure 3: System interface overview. The interface integrates seven core functional areas: (A) Natural language conversational interaction area, where users describe symptoms as text and engage in multi-round consultation with the system; (B) Chain-of-thought reasoning display area, showing the system’s step-by-step diagnostic reasoning in real time; (C) Proactive interaction confirmation area, with system-generated symptom confirmation and follow-up prompts; (D) Knowledge graph browsing area, displaying the case-specific symptom-syndrome association subgraph; (E) Function navigation menu, providing access to diagnostic history, settings, and other auxiliary functions; (F) Multimodal treatment plan area, integrating text recommendations, AI-generated illustrations, and references; (G) 3D meridian-acupoint model, an interactive bronze figure model for precise localization of recommended acupoints.

![Refer to caption](https://arxiv.org/html/2606.06869v1/x4.png)

Figure 4: Overview of the four-module interaction workflow. This flowchart illustrates the complete workflow from symptom collection, progressive follow-up, syndrome differentiation diagnosis, to treatment plan presentation. The upper portion shows the core interaction logic and data flow of each module; the lower portion shows the knowledge graph and data service layer supporting the entire process.

Module 2: Knowledge-guided Proactive Asking

Given the current candidate syndrome space, the system proactively asks follow-up questions selected to maximize information gain and reduce uncertainty \[[7](https://arxiv.org/html/2606.06869#bib.bib7),[24](https://arxiv.org/html/2606.06869#bib.bib24)\]. After each response, confirmed and negated symptoms update the candidate space and the case-specific KG subgraph, which refreshes in real time (Figure [5](https://arxiv.org/html/2606.06869#S2.F5)). Figure [6](https://arxiv.org/html/2606.06869#S2.F6) shows an example multi-round session. The progressive KG both bounds generation to KG-defined associations and provides a verification surface for monitoring evidence accumulation \[[21](https://arxiv.org/html/2606.06869#bib.bib21),[22](https://arxiv.org/html/2606.06869#bib.bib22)\].

![Refer to caption](https://arxiv.org/html/2606.06869v1/x5.png)

Figure 5: Knowledge graph visualization. The figure shows a case-specific knowledge subgraph constructed around the diphtheria syndrome type, with the target syndrome type as the central node surrounded by associated symptom nodes and treatment method nodes. Edges represent semantic relations.

![Refer to caption](https://arxiv.org/html/2606.06869v1/x6.png)

Figure 6: Multi-round follow-up consultation interaction process. (a) Symptom extraction and KG construction: the user inputs symptoms (e.g., fever, fatigue), the system extracts keywords and renders the symptom relationship graph on the right; (b) Intelligent follow-up: the system generates targeted follow-up questions based on association relationships in the KG; (c) Multi-round dialogue continuation: the user and system engage in multiple rounds of symptom confirmation and supplementation; (d) Follow-up completion: the KG expands further, and after the user supplements symptoms such as pale lips, candidate syndrome types converge.

![Refer to caption](https://arxiv.org/html/2606.06869v1/x7.png)

Figure 7: Diagnostic results and treatment plan presentation. The left panel shows candidate syndrome type diagnostic cards ranked by matching degree; the right panel shows the personalized treatment recommendation overview generated for the confirmed syndrome type.

Module 3: In-depth Consultation and Syndrome Differentiation Diagnosis

The system performs final differentiation among remaining candidate syndromes and may ask targeted questions about discriminative signs \[[16](https://arxiv.org/html/2606.06869#bib.bib16),[27](https://arxiv.org/html/2606.06869#bib.bib27)\]. Results are presented as ranked diagnostic cards with matched symptom evidence and a visual matching degree (Figure [7](https://arxiv.org/html/2606.06869#S2.F7)), supporting progressive disclosure while preserving KG-based traceability \[[28](https://arxiv.org/html/2606.06869#bib.bib28),[7](https://arxiv.org/html/2606.06869#bib.bib7)\].

Module 4: Diagnostic Results and Multimodal Treatment Recommendations

After diagnosis confirmation, the system generates a personalized treatment plan grounded in KG associations, covering formulas, dietary therapy, lifestyle guidance, and acupoint massage, with evidence links for verifiability \[[29](https://arxiv.org/html/2606.06869#bib.bib29),[4](https://arxiv.org/html/2606.06869#bib.bib4)\]. In the treatment view, the left side combines an evidence-linked reference list with AI-generated recommended-action illustrations (Figure [8](https://arxiv.org/html/2606.06869#S2.F8)), while a separate 3D meridian–acupoint view remains available for massage guidance (Figure [9](https://arxiv.org/html/2606.06869#S2.F9)) \[[30](https://arxiv.org/html/2606.06869#bib.bib30)\]. This keeps the main text focused on the two presentation aids most relevant to comprehension, without unpacking every interface submodule \[[19](https://arxiv.org/html/2606.06869#bib.bib19)\].

![Refer to caption](https://arxiv.org/html/2606.06869v1/x8.png)

Figure 8: Treatment recommendation interface. The upper-left panel lists evidence-linked references supporting the recommendation, and the lower-left panel presents AI-generated recommended-action illustrations.

![Refer to caption](https://arxiv.org/html/2606.06869v1/x9.png)

Figure 9: 3D acupoint interactive display. The interface provides a rotatable bronze-figure meridian model for locating recommended acupoints and viewing associated massage guidance.

#### 2.2.1 Diagnosis-Treatment Dual-Chain Visualization

We conceptualize the workflow as a diagnosis–treatment dual-chain visualization architecture. The diagnostic chain (Modules 1–3) externalizes the evolving symptom–syndrome hypothesis space through progressive KG subgraphs and multi-round follow-up, enabling transparency and verification. The treatment chain (Module 4) transforms the confirmed syndrome type into an accessible, multimodal plan. Both chains share the KG as a unified backbone that links evidence accumulation during diagnosis to recommendation presentation during treatment.

### 2.3 Implementation

The system is implemented as a three\-tier web architecture with a separated front\-end, back\-end, and graph\-data layer\. The front\-end uses Next\.js 15 and React 19 to provide the interactive diagnostic and treatment interface\. The back\-end uses Flask REST APIs to host diagnostic reasoning, symptom matching, follow\-up question selection, treatment recommendation, and evaluation\-related services\. The data layer uses Neo4j to store the TCM knowledge graph, which contains 241 syndrome types, 1,263 symptom entities, and 2,485 symptom–syndrome relationships\. The LLM service is accessed through Gemini 3\.1 Pro \(Google\), and semantic matching uses BGE\-M3 embeddings\. The system runs on Python 3\.11 and Node\.js 18\+ and supports Docker\-based deployment\.

##### Four\-layer symptom matching

To align free-text symptom descriptions with standardized KG symptom entities, the system applies a progressive four-layer matching pipeline. Layer 1 performs case-insensitive exact matching against KG symptom labels. Layer 2 applies semantic vector matching with BGE-M3 embeddings. For a user symptom mention $\nu$ and a KG symptom entity $\xi \in \Omega$, cosine similarity is computed as:

$$
\mathrm{sim}(\nu,\xi)=\frac{\mathbf{e}(\nu)\cdot\mathbf{e}(\xi)}{\|\mathbf{e}(\nu)\|\cdot\|\mathbf{e}(\xi)\|}, \tag{1}
$$

where $\mathbf{e}(\cdot)$ denotes the embedding function. The best semantic match is accepted when its similarity exceeds threshold $\delta$. Layer 3 applies fuzzy matching with the rapidfuzz `token_set_ratio` algorithm to handle spelling variants and abbreviated expressions, using threshold $\tau$. Layer 4 invokes the LLM only for unresolved or ambiguous cases and requires a structured JSON verification output.

Let $s^{\ast}$ and $f^{\ast}$ denote the best semantic and fuzzy candidates, respectively. The resulting cascading matching function is:

$$
\mathrm{match}(\nu)=\begin{cases}
\xi & \text{if } \nu=\xi \text{ by exact match},\\
s^{\ast} & \text{if } \mathrm{sim}(\nu,s^{\ast})\geq\delta,\\
f^{\ast} & \text{if } \mathrm{fuzz}(\nu,f^{\ast})\geq\tau,\\
\mathrm{LLM}(\nu,\Omega) & \text{otherwise}.
\end{cases} \tag{2}
$$

This design resolves common cases through efficient symbolic, vector, and fuzzy operations, while reserving the higher-cost LLM fallback for cases where deterministic matching is insufficient.
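The cascade can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: character-bigram counts stand in for BGE-M3 embeddings, `difflib.SequenceMatcher` stands in for rapidfuzz `token_set_ratio`, the threshold values `delta=0.8` and `tau=0.75` are placeholders (the paper does not report them here), and `llm_verify` is a hypothetical hook for Layer 4.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def embed(text):
    """Toy stand-in for an embedding model: character-bigram counts."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def fuzz(a, b):
    """Stand-in fuzzy score in [0, 1]; not rapidfuzz's token_set_ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def llm_verify(mention, vocab):
    """Hypothetical Layer 4 hook; a real system would query an LLM here."""
    return None


def match(mention, vocab, delta=0.8, tau=0.75):
    """Return (matched_entity, layer_name) following the four-layer cascade."""
    # Layer 1: case-insensitive exact match against the vocabulary.
    for xi in vocab:
        if mention.lower() == xi.lower():
            return xi, "exact"
    # Layer 2: best semantic candidate, accepted if similarity >= delta.
    e = embed(mention)
    s_star = max(vocab, key=lambda xi: cosine(e, embed(xi)))
    if cosine(e, embed(s_star)) >= delta:
        return s_star, "semantic"
    # Layer 3: best fuzzy candidate, accepted if score >= tau.
    f_star = max(vocab, key=lambda xi: fuzz(mention, xi))
    if fuzz(mention, f_star) >= tau:
        return f_star, "fuzzy"
    # Layer 4: defer unresolved mentions to the (hypothetical) LLM verifier.
    return llm_verify(mention, vocab), "llm"
```

Each layer only runs when the cheaper layers above it fail, which is the property the cascade relies on to keep LLM calls rare.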

##### Information\-gain follow\-up selection

At each interaction round, the system selects a discriminative symptom subset $Q_t$ from the unasked symptom space to reduce uncertainty over the current candidate syndrome set $\mathcal{Y}_t$. We use a weighted utility function:

$$
U(Q_t)=\alpha\cdot\mathrm{IG}(Q_t;\mathcal{Y}_t)-\beta\cdot R(Q_t)+\gamma\cdot C(Q_t), \tag{3}
$$

where $\mathrm{IG}$ measures entropy reduction in the candidate syndrome set after asking $Q_t$, $R(Q_t)$ penalizes highly correlated symptom questions, and $C(Q_t)$ encourages symptoms that cover discriminative features across candidate syndrome types. The coefficients $\alpha$, $\beta$, and $\gamma$ control the relative weights. Because exhaustive search over symptom subsets is computationally infeasible when the candidate symptom space is large, the system uses a genetic algorithm for approximate optimization, with $U(Q_t)$ as the fitness function.
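As a rough illustration of Eq. (3), the sketch below scores symptom subsets on a toy graph and searches them with a small genetic algorithm. Several pieces are assumptions rather than details from the paper: the `KG` dictionary is invented, candidates get a uniform prior with deterministic yes/no answers, $R$ is taken as mean pairwise Jaccard overlap, $C$ as the fraction of candidates touched by at least one asked symptom, and the weights and GA settings are arbitrary.

```python
import math
import random
from itertools import combinations

KG = {  # hypothetical toy mapping: syndrome -> symptoms
    "A": {"fever", "thirst", "red tongue"},
    "B": {"fever", "chills", "pale lips"},
    "C": {"fatigue", "pale lips", "chills"},
    "D": {"fatigue", "thirst", "red tongue"},
}


def info_gain(q, cands):
    """Entropy reduction in bits, assuming a uniform prior over candidates."""
    groups = {}
    for y in cands:
        key = tuple(s in KG[y] for s in q)  # answer pattern syndrome y implies
        groups[key] = groups.get(key, 0) + 1
    n = len(cands)
    return math.log2(n) - sum(c / n * math.log2(c) for c in groups.values())


def redundancy(q, cands):
    """Assumed R: mean pairwise Jaccard overlap of the symptoms' candidate sets."""
    sets = [{y for y in cands if s in KG[y]} for s in q]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) if a | b else 0.0 for a, b in pairs) / len(pairs)


def coverage(q, cands):
    """Assumed C: share of candidates exhibiting at least one asked symptom."""
    return sum(any(s in KG[y] for s in q) for y in cands) / len(cands)


def utility(q, cands, alpha=1.0, beta=0.5, gamma=0.3):
    return alpha * info_gain(q, cands) - beta * redundancy(q, cands) + gamma * coverage(q, cands)


def genetic_search(symptoms, cands, k=2, pop=12, gens=20, seed=0):
    """Approximate argmax of utility over size-k subsets of `symptoms` (a list)."""
    rng = random.Random(seed)
    fit = lambda q: utility(q, cands)
    population = [tuple(sorted(rng.sample(symptoms, k))) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fit, reverse=True)
        parents = population[: pop // 2]  # truncation selection keeps the elite
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            child = set(rng.sample(sorted(set(a) | set(b)), k))  # crossover
            if rng.random() < 0.3:  # mutation: swap out one symptom
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice([s for s in symptoms if s not in child]))
            children.append(tuple(sorted(child)))
        population = parents + children
    return max(population, key=fit)
```

On a space this small, exhaustive search over `combinations(symptoms, k)` is trivial; the genetic search only pays off when the unasked symptom space is large, which is the situation the paragraph above describes.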

##### Engineering and visualization implementation

Several implementation choices were made to support responsive interaction. At application startup, syndrome–symptom mappings are loaded from Neo4j into a KG warm-up cache, reducing repeated graph queries during diagnosis. BGE-M3 vectors for KG symptom names are precomputed and persisted with a SHA-256 vocabulary hash as the cache invalidation key. Batch embedding requests use `ThreadPoolExecutor` for parallel computation. LLM structured outputs are handled through a three-stage JSON parsing and repair pipeline: direct parsing, regular-expression extraction from Markdown code fences, and LLM self-repair/regeneration. The front-end uses skeleton loading states to keep the interface stable during API calls.
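The vocabulary-hash invalidation and parallel batch embedding described above might look roughly like the following sketch. The function names, the in-memory `cache` dictionary, and the choice to hash the sorted, de-duplicated names joined by newlines are all assumptions for illustration; `embed_fn` stands for whatever embedding call the real system makes.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def vocab_hash(symptom_names):
    """SHA-256 over the sorted, de-duplicated vocabulary (order-independent)."""
    canonical = "\n".join(sorted(set(symptom_names)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_or_rebuild(cache, symptom_names, embed_fn, workers=4):
    """Return (vectors, rebuilt); re-embed only when the vocabulary changed."""
    key = vocab_hash(symptom_names)
    if cache.get("vocab_hash") == key:
        return cache["vectors"], False  # cache hit: vocabulary unchanged
    names = sorted(set(symptom_names))
    # Embed names in parallel; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        vectors = dict(zip(names, pool.map(embed_fn, names)))
    cache.clear()
    cache.update(vocab_hash=key, vectors=vectors)
    return vectors, True
```

Keying the cache on a hash of the vocabulary means that adding, removing, or renaming a symptom in the graph forces a rebuild, while a mere reordering does not.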

The KG visualization is implemented with an ECharts force\-directed graph embedded in the right panel of the diagnostic interface\. The visualization refreshes when confirmed symptoms or candidate syndrome types change, using red symptom nodes, green syndrome nodes, and highlighted selected nodes to distinguish entity roles\. During treatment presentation, a 3D acupoint model supports rotation and zoom interactions for locating recommended acupoints, while symptom keywords in the chat interface are highlighted to make extracted symptom evidence visible to users\. Additional implementation details are provided in Supplementary Information Section S2\.

### 2.4 Automated Evaluation Methodology

To evaluate key design decisions reproducibly, we used a mixed strategy: a quantitative accuracy evaluation for syndrome differentiation (RQ1) and an LLM-as-a-Judge framework for paired, structured assessments of interaction and presentation choices (RQ2, RQ3.1, RQ3.2). RQ1 evaluates Top-1 syndrome differentiation accuracy on a de-identified CEMRs dataset containing 2,000 records and 147 fine-grained syndrome labels. For the LLM-as-a-Judge experiments, cases were drawn from a separate de-identified set of 160 TCM electronic medical records. We used stratified random sampling by syndrome type (random seed $= 42$), allocated quotas proportionally to syndrome frequency while ensuring at least one case per sampled syndrome type, and selected $N = 30$ cases covering no fewer than five syndrome types. RQ2 and RQ3.1 used all 30 cases; RQ3.2 used 25 cases after excluding records for which the treatment recommendation API did not return valid references.
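One plausible reading of the sampling step is sketched below. The rounding rule, the way quotas are nudged back to the target size, and the record layout (dictionaries with a `syndrome` key) are assumptions; the paper only states proportional quotas, a minimum of one case per sampled syndrome type, and seed 42. The sketch also assumes there are no more strata than requested cases.

```python
import random
from collections import defaultdict


def stratified_sample(records, n, seed=42):
    """Draw n records with per-syndrome quotas proportional to frequency (min 1)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec["syndrome"]].append(rec)
    if len(strata) > n or n > len(records):
        raise ValueError("sketch assumes len(strata) <= n <= len(records)")
    total = len(records)
    quotas = {s: min(len(v), max(1, round(n * len(v) / total))) for s, v in strata.items()}
    # Rounding and the minimum of one can overshoot or undershoot n; nudge back.
    while sum(quotas.values()) > n:
        s = max((k for k in quotas if quotas[k] > 1), key=lambda k: quotas[k])
        quotas[s] -= 1
    while sum(quotas.values()) < n:
        s = max((k for k in quotas if quotas[k] < len(strata[k])),
                key=lambda k: len(strata[k]) - quotas[k])
        quotas[s] += 1
    # Iterate strata in sorted order so the draw is reproducible for a given seed.
    return [r for s in sorted(strata) for r in rng.sample(strata[s], quotas[s])]
```

Using a dedicated `random.Random(seed)` instance rather than the global generator keeps the draw reproducible even if other code consumes random numbers.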

Claude Sonnet 4\.5 \(Anthropic\) served as the automated judge model\. To reduce same\-source bias, the judge model was selected from a different vendor than the system model and patient simulator, which were from the Google model family\. All judging calls used deterministic decoding \(temperature=0\)\. For each evaluation dimension, the judge assigned a 1–5 Likert score, where 1 indicates very poor performance and 5 indicates very good performance, and returned structured JSON containing scores and brief reasoning\. Outputs were validated through the same three\-stage JSON handling strategy used by the system: direct parsing, regular\-expression extraction, and pattern\-based fallback\. Missing dimensions triggered up to two retries; if retries failed, the default minimum score was assigned for the missing dimension\.
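The judge-output handling can be sketched as follows, under stated assumptions: the dimension keys are hypothetical names for the five RQ2 dimensions, the expected JSON shape (`{"scores": {...}}`) is invented, `call_judge` is a placeholder for the real API call, and each retry simply replaces the previous attempt. The three stages mirror the description above: direct parsing, regular-expression extraction from a code fence, and a pattern-based fallback.

```python
import json
import re

DIMENSIONS = [  # hypothetical keys for the five RQ2 dimensions
    "diagnostic_transparency", "evidence_traceability", "reasoning_confidence",
    "information_completeness", "overall_trust",
]
FENCE = "`" * 3  # built at runtime so this example contains no literal fence
FENCED = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def parse_scores(raw):
    """Recover whatever valid 1-5 dimension scores can be found in `raw`."""
    candidates = [raw]                       # stage 1: direct parsing
    m = FENCED.search(raw)
    if m:
        candidates.append(m.group(1))        # stage 2: fenced JSON extraction
    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        scores = data.get("scores", data) if isinstance(data, dict) else None
        if isinstance(scores, dict):
            return {d: scores[d] for d in DIMENSIONS
                    if isinstance(scores.get(d), int) and 1 <= scores[d] <= 5}
    found = {}                               # stage 3: pattern-based fallback
    for d in DIMENSIONS:
        hit = re.search(rf"{d}\W+([1-5])\b", raw)
        if hit:
            found[d] = int(hit.group(1))
    return found


def judge_with_retries(call_judge, max_retries=2, min_score=1):
    """Call the judge up to 1 + max_retries times; fill gaps with min_score."""
    found = {}
    for _ in range(1 + max_retries):
        found = parse_scores(call_judge())
        if all(d in found for d in DIMENSIONS):
            break
    return {d: found.get(d, min_score) for d in DIMENSIONS}
```

Defaulting a missing dimension to the minimum score is conservative: a judge that fails to answer can only lower, never raise, a condition's rating.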

Given the paired design, small sample size ($n \leq 30$), and ordinal score data, between-condition differences were tested using the two-sided Wilcoxon signed-rank test with significance level $\alpha = 0.05$. Benjamini–Hochberg false discovery rate correction was applied across dimensions for each research question. Effect sizes were reported as paired-sample Cohen’s $d$ for RQ2 and RQ3.2. For RQ3.1, because reverse-scored cognitive-load dimensions produced many zero differences, we used rank-biserial $r$, computed from the Wilcoxon statistic as $r = 1 - 2W/[n(n+1)]$.
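Two of the quantities above are easy to reproduce with the standard library, as in this sketch. It assumes the paired Cohen's $d$ is the mean of the paired differences divided by their sample standard deviation (the paper does not say which paired variant it uses), and it implements the usual step-up Benjamini–Hochberg adjustment. The Wilcoxon test itself is not reimplemented here; SciPy's `scipy.stats.wilcoxon` is one common option.

```python
from statistics import mean, stdev


def cohens_d_paired(baseline, treatment):
    """Assumed variant: mean paired difference / sample SD of the differences."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    return mean(diffs) / stdev(diffs)  # raises if all differences are identical


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted
```

A dimension is then declared significant when its adjusted p-value falls below the chosen level, here $\alpha = 0.05$.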

##### Evaluation Framework Overview

We address four evaluation questions aligned with the three research questions. RQ1 evaluates syndrome differentiation accuracy against representative LLM baselines.

RQ2 tests whether KG visualization improves diagnostic trust by comparing the same consultation with and without KG augmentation. A Gemini 3 Flash patient simulator completed up to five diagnostic rounds for each case, and the judge then evaluated diagnostic transparency, evidence traceability, reasoning confidence, information completeness, and overall trust from the patient perspective.

RQ3.1 tests whether multimodal treatment plans reduce cognitive load by comparing a plain-text plan (Plan A) with a structured multimodal plan (Plan B) containing step-by-step guidance, AI-generated ingredient illustrations, and 3D acupoint-model descriptions. The judge used the persona of a non-expert user with limited medical background and assessed language comprehensibility, preparation clarity, acupoint localizability, action execution confidence, and overall cognitive load; the cognitive-load item was reverse-scored so that higher values consistently indicate better user experience.

RQ3.2 tests whether reference display improves evidence credibility by comparing treatment recommendations with and without references. In the reference condition, the judge received titles, URLs, abstract snippets, and scraped page-content excerpts truncated to the first 2,000 characters, and assessed source credibility, evidence verifiability, information confidence, comprehension support, and overall evidence quality.

For all paired A/B judging experiments, condition order was randomized deterministically by case ID. Full prompts for the patient simulator and all judge conditions are documented in Supplementary Information Section S5.
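Two small mechanics from this protocol, deterministic order assignment and reverse scoring, can be sketched as below. Hashing the case ID with SHA-256 is an assumption; the paper states only that the order is randomized deterministically by case ID. The 1–5 bounds follow the Likert scale described earlier.

```python
import hashlib


def condition_order(case_id):
    """Pick the A/B presentation order from a stable hash of the case ID."""
    digest = hashlib.sha256(str(case_id).encode("utf-8")).digest()
    # The first byte's parity decides the order; same ID gives the same order.
    return ("A", "B") if digest[0] % 2 == 0 else ("B", "A")


def reverse_score(score, low=1, high=5):
    """Flip a Likert score so that higher always means better."""
    return low + high - score
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is salted per process for strings and would not give the same order across runs.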

## 3 Results

### 3.1 Evaluation Strategy

This paper focuses on overall system design and interaction experience rather than algorithmic peak performance. We adopted a four-pronged evaluation strategy: (1) Technical accuracy verification: quantitative syndrome differentiation accuracy on a clinical dataset, compared against mainstream large language models; (2) Case study: a complete diagnostic workflow demonstrating how design decisions operate together (presented in Supplementary Information Section S3); (3) System capability assessment: response time and rendering performance analysis; (4) LLM-as-a-Judge automated evaluation: paired comparison evaluations assessing trust, cognitive load, and reference credibility to quantify each design decision’s effect. All evaluation data were de-identified, and no real patient interactions were involved.

### 3.2 RQ1: Syndrome Differentiation Accuracy

##### Dataset

We used the CEMRs dataset containing 2,000 de\-identified Chinese electronic medical records covering 147 unique syndrome pattern labels drawn from real clinical settings\.

##### Comparison Methods

To illustrate the benefit of knowledge graph constraints, we compared against two baselines: \(1\) DeepSeek R1\[[31](https://arxiv.org/html/2606.06869#bib.bib31)\], a general\-purpose large language model with strong reasoning capabilities; \(2\) HuatuoGPT\-Vision\[[32](https://arxiv.org/html/2606.06869#bib.bib32)\], a medical domain\-specific model fine\-tuned on medical corpora\. This comparison aims to demonstrate the effect of knowledge graph constraints, not to claim comprehensive superiority\.

##### Results and Analysis

Table [1](https://arxiv.org/html/2606.06869#S3.T1) presents the Top-1 syndrome differentiation accuracy on the CEMRs dataset.

Table 1: Syndrome Differentiation Accuracy Comparison on the CEMRs Dataset

| Method | Top-1 Accuracy |
| --- | --- |
| Our System | 47.11% |
| DeepSeek R1 | 19.76% |
| HuatuoGPT-Vision | 11.23% |

Several contextual factors inform these results. The dataset contains 147 fine-grained syndrome labels (far exceeding the 10–20 coarse-grained categories in common benchmarks), and cases come from real clinical records with inherent noise such as colloquial language and ambiguity.
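For readers who want the metric spelled out, Top-1 accuracy can be computed as below. The labels are invented placeholders and the function name is hypothetical. The last line adds one piece of context: with 147 labels, uniform random guessing would be correct about 0.7% of the time, a figure that ignores any imbalance in how often each label occurs.

```python
from typing import Sequence


def top1_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of cases whose first-ranked prediction equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# Toy labels, invented for illustration only.
gold = ["syndrome_a", "syndrome_b", "syndrome_c", "syndrome_a"]
predicted = ["syndrome_a", "syndrome_c", "syndrome_c", "syndrome_b"]
accuracy = top1_accuracy(predicted, gold)  # 2 of 4 predictions match

# Accuracy of uniform random guessing over 147 labels (ignores label imbalance).
chance_level = 1 / 147
```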

After introducing knowledge graph constraints, non-standardized outputs decreased by approximately 32% compared to unconstrained LLMs. This indicates that the knowledge graph narrows the generation space, although hallucination cannot be eliminated entirely.
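One way to quantify such a reduction is to count outputs that fall outside the graph's label vocabulary. The sketch below is an assumption-laden illustration: the source does not define how non-standard outputs were counted, nor whether the 32% figure is a relative reduction or a difference in percentage points. The vocabulary, model outputs, and function names are all hypothetical.

```python
from typing import Iterable, Set


def non_standard_rate(outputs: Iterable[str], vocabulary: Set[str]) -> float:
    """Share of outputs that are not an exact label in the vocabulary."""
    outputs = list(outputs)
    misses = sum(1 for label in outputs if label not in vocabulary)
    return misses / len(outputs)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, e.g. 0.5 -> 0.25 gives 0.5."""
    return (before - after) / before


# Hypothetical vocabulary and model outputs, not drawn from the paper's data.
kg_syndromes = {"syndrome_a", "syndrome_b", "syndrome_c"}
unconstrained = ["syndrome_a", "syndrome a-like pattern", "unknown", "syndrome_c"]
constrained = ["syndrome_a", "syndrome_b", "unknown", "syndrome_c"]

rate_before = non_standard_rate(unconstrained, kg_syndromes)  # 2 of 4 outside
rate_after = non_standard_rate(constrained, kg_syndromes)     # 1 of 4 outside
reduction = relative_reduction(rate_before, rate_after)
```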

##### Reproducibility Statement

Due to privacy restrictions, the CEMRs dataset cannot be publicly released\. We provide 160 de\-identified sample cases \(public\_dataset\.json\) with complete prompt templates for independent validation\.

### 3.3 RQ2: Effect of Knowledge Graph Enhancement on Trust

Using the LLM-as-a-Judge approach, we generated diagnostic results with and without knowledge graph enhancement for 30 CEMRs cases. A large language model simulating target users performed paired comparison evaluations across five trust dimensions: diagnostic transparency, evidence traceability, reasoning confidence, information completeness, and trust level. Each dimension was tested using the Wilcoxon signed-rank test with Benjamini–Hochberg correction. Table [2](https://arxiv.org/html/2606.06869#S3.T2) summarizes the paired trust evaluation results.

Table 2: RQ2: Impact of Knowledge Graph Visualization on Diagnostic Trust (LLM-as-Judge, N=30)

| Dimension | With KG | Without KG | $p$-value | $p_{\mathrm{adj}}$ | Cohen’s $d$ |
| --- | --- | --- | --- | --- | --- |
| Diagnostic Transparency‡ | 2.87 ± 0.63 | 2.17 ± 0.38 | <0.001 | <0.001 | 1.35 |
| Evidence Traceability‡ | 3.37 ± 0.61 | 2.40 ± 0.50 | <0.001 | <0.001 | 1.73 |
| Reasoning Confidence‡ | 2.50 ± 0.57 | 2.10 ± 0.31 | <0.001 | <0.001 | 0.87 |
| Information Completeness‡ | 3.40 ± 0.50 | 2.33 ± 0.48 | <0.001 | <0.001 | 2.18 |
| Trust Level‡ | 2.70 ± 0.60 | 2.10 ± 0.31 | <0.001 | <0.001 | 1.27 |
| Overall‡ | 2.97 ± 0.51 | 2.22 ± 0.28 | <0.001 | <0.001 | 1.82 |

- $p_{\mathrm{adj}}$: Benjamini–Hochberg FDR-corrected $p$-values.
- ‡ Wilcoxon signed-rank test (Shapiro–Wilk $p < 0.05$); unmarked rows use paired $t$-test.

Knowledge graph enhancement produced significant improvements across all five dimensions (all $p_{\mathrm{adj}} < 0.001$), with an overall Cohen’s $d = 1.82$. Information completeness showed the largest effect ($d = 2.18$), indicating that structured knowledge graph presentation substantially enhanced users’ perception of diagnostic information coverage. Evidence traceability followed ($d = 1.73$), reflecting clearer reasoning paths through graph visualization. Reasoning confidence had the smallest effect ($d = 0.87$), still within the large effect range, possibly because this dimension involves deeper cognitive judgment.
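As a consistency check on Table 2, the effect sizes can be approximated from the tabulated means and standard deviations alone. The sketch below divides the mean difference by the root mean square of the two condition SDs. That choice is an assumption: the source only says "paired-sample Cohen’s $d$" and does not state which variant was used, so agreement here is an observation about the numbers, not a description of the authors' procedure.

```python
import math


def cohens_d_from_summary(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Mean difference divided by the root mean square of the two SDs."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# (with-KG mean, SD, without-KG mean, SD, reported d), copied from Table 2.
table2 = {
    "Diagnostic Transparency": (2.87, 0.63, 2.17, 0.38, 1.35),
    "Evidence Traceability": (3.37, 0.61, 2.40, 0.50, 1.73),
    "Reasoning Confidence": (2.50, 0.57, 2.10, 0.31, 0.87),
    "Information Completeness": (3.40, 0.50, 2.33, 0.48, 2.18),
    "Trust Level": (2.70, 0.60, 2.10, 0.31, 1.27),
    "Overall": (2.97, 0.51, 2.22, 0.28, 1.82),
}

recomputed = {
    name: cohens_d_from_summary(m1, s1, m2, s2)
    for name, (m1, s1, m2, s2, _reported) in table2.items()
}
# Largest absolute gap between a recomputed value and the reported one.
largest_gap = max(abs(recomputed[name] - row[4]) for name, row in table2.items())
```

Every recomputed value lands within 0.02 of the reported $d$, which is compatible with rounding in the tabulated inputs.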

### 3.4 RQ3.1: Effect of Multimodal Plans on Cognitive Load

We generated text-only plans (Plan A) and multimodal plans (Plan B, with 3D acupoint models and AI-generated illustrations) for 30 cases, evaluated by a large language model simulating non-specialist users. Five cognitive dimensions were assessed; cognitive load was reverse-scored (higher = lower load; positive $\Delta$ = Plan B advantage). Table [3](https://arxiv.org/html/2606.06869#S3.T3) reports the cognitive load comparison.

Table 3: RQ3: Cognitive Load Comparison, Text-Only vs. Multimodal Presentation (LLM-as-Judge, N=30)

| Dimension | Plan A (Text) | Plan B (Multimodal) | $p$-value | $p_{\mathrm{adj}}$ | Effect Size ($r$) | $\Delta$ |
| --- | --- | --- | --- | --- | --- | --- |
| Language Comprehensibility | 2.10 ± 0.40 | 2.60 ± 0.62 | <0.001 | <0.001 | 0.88 | +0.50 |
| Preparation Clarity | 3.40 ± 0.67 | 3.67 ± 0.76 | 0.046 | 0.053 | 0.57 | +0.27 |
| Acupoint Locatability | 1.63 ± 0.56 | 3.50 ± 0.57 | <0.001 | <0.001 | 1.00 | +1.87 |
| Action Confidence | 1.97 ± 0.32 | 2.67 ± 0.61 | <0.001 | <0.001 | 0.91 | +0.70 |
| Overall Cognitive Load† | 1.43 ± 0.50 | 2.03 ± 0.72 | <0.001 | <0.001 | 0.90 | +0.60 |

- † Reverse scored: higher = lower cognitive load (better).
- $\Delta$ = Plan B mean − Plan A mean. Positive values indicate Plan B is better.
- $p_{\mathrm{adj}}$: Benjamini–Hochberg FDR-corrected $p$-values.

Four of five dimensions reached statistical significance ($p_{\mathrm{adj}} < 0.001$). Acupoint localization showed the most pronounced improvement ($\Delta = +1.87$, $r = 1.00$), directly reflecting the 3D bronze figure model’s advantage for spatial localization. Action confidence ($\Delta = +0.70$, $r = 0.91$) and language comprehension ($\Delta = +0.50$, $r = 0.88$) also improved significantly. Preparation clarity showed a positive trend ($\Delta = +0.27$) but did not reach significance after correction ($p_{\mathrm{adj}} = 0.053$). Overall cognitive load improved significantly ($\Delta = +0.60$, $r = 0.90$), confirming that multimodal plans effectively reduced cognitive load.

### 3.5 RQ3.2: Effect of References on Evidence Credibility

Paired comparison evaluations of plans with and without references were conducted across 30 cases; five were excluded due to invalid reference URLs, yielding 25 valid cases. Five dimensions were assessed: source credibility, evidence verifiability, information confidence, comprehension support, and overall evidence quality. Table [4](https://arxiv.org/html/2606.06869#S3.T4) reports the reference credibility evaluation.

Table 4: RQ3 References: Impact of Structured References on Evidence Quality

| Dimension | With Refs | Without Refs | $p$ | $p_{\text{adj}}$ | $d$ |
| --- | --- | --- | --- | --- | --- |
| Source Credibility | 4.11 ± 1.18 | 2.60 ± 0.53 | <0.001‡ | <0.001 | 1.52 |
| Evidence Verifiability | 3.86 ± 1.30 | 1.30 ± 0.00 | <0.001‡ | <0.001 | 1.98 |
| Information Confidence | 4.48 ± 1.12 | 3.73 ± 0.84 | <0.001‡ | <0.001 | 1.43 |
| Comprehension Support | 4.61 ± 1.10 | 4.66 ± 1.03 | 0.317‡ | 0.317 | -0.20 |
| Overall Evidence Quality | 4.00 ± 1.30 | 2.44 ± 0.43 | <0.001‡ | <0.001 | 1.55 |
| Overall Mean | 4.21 | 2.95 | n/a | n/a | $\Delta = +1.27$ |

- $N = 25$ cases.
- $p$-values: paired $t$-test (normal) or Wilcoxon signed-rank (‡).
- $p_{\text{adj}}$: Benjamini–Hochberg FDR.
- $d$: Cohen’s $d$ (paired).

Plans with references scored 4.21 overall, significantly higher than 2.95 without ($\Delta = +1.27$); four of five dimensions reached significance ($p_{\mathrm{adj}} < 0.001$). Evidence verifiability had the largest effect (Cohen’s $d = 1.98$), indicating that references transformed recommendations from unverifiable assertions into traceable evidence-based information. Overall evidence quality ($d = 1.55$) and source credibility ($d = 1.52$) also showed large improvements. Comprehension support did not reach significance ($p_{\mathrm{adj}} = 0.317$, $d = -0.20$), aligning with expectations: references primarily enhance traceability and credibility rather than plan comprehensibility.
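The overall scores quoted above appear to be the unweighted means of the five dimension means in Table 4. The short check below recomputes them; it is an arithmetic illustration based on the tabulated values, and the source does not state how the Overall Mean row was derived.

```python
# Per-dimension means copied from Table 4: (with references, without references).
table4_means = {
    "Source Credibility": (4.11, 2.60),
    "Evidence Verifiability": (3.86, 1.30),
    "Information Confidence": (4.48, 3.73),
    "Comprehension Support": (4.61, 4.66),
    "Overall Evidence Quality": (4.00, 2.44),
}

with_refs = [w for w, _ in table4_means.values()]
without_refs = [wo for _, wo in table4_means.values()]

# Unweighted averages across the five dimensions.
mean_with = sum(with_refs) / len(with_refs)
mean_without = sum(without_refs) / len(without_refs)
delta = mean_with - mean_without
```

Taking the difference before rounding gives roughly 1.266, which explains why the table lists $\Delta = +1.27$ even though 4.21 minus 2.95 is 1.26.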

Beyond these quantitative evaluations, we conducted illustrative case studies across representative clinical scenarios to validate the system’s end\-to\-end diagnostic workflow, covering patient self\-assessment, physician\-assisted diagnosis, and TCM education contexts\. Full case study results are presented in the Supplementary Information \(Section S3\)\.

### 3.6 System Performance and Limitations

##### System Performance

At the engineering level, the system responded efficiently: the average response time per interaction round was under 30 seconds, each frontend ECharts knowledge graph update rendered within 200 ms, and overall interaction fluency met real-time dialogue requirements.

##### Limitations

The system has several limitations, including knowledge graph syndrome coverage, evaluation scale, LLM output reliability, dataset external validation, and cross-language applicability. These limitations are discussed in detail in Section [4.3](https://arxiv.org/html/2606.06869#S4.SS3).

## 4 Discussion

This section discusses design implications derived from the system’s design and evaluation \(§[4\.1](https://arxiv.org/html/2606.06869#S4.SS1)\), clinical integration considerations \(§[4\.2](https://arxiv.org/html/2606.06869#S4.SS2)\), and limitations \(§[4\.3](https://arxiv.org/html/2606.06869#S4.SS3)\)\.

### 4.1 Design Implications and Responses to Research Questions

##### Implication 1: Diagnosis as Explanation\. Progressive Visualization Builds Trust \(RQ2\)

The system’s real-time progressive KG updating transforms the diagnostic process itself into an explanatory tool: users observe how candidate syndromes narrow across interaction rounds and can see the basis for each follow-up question. The diagnosis-treatment dual-chain visualization further enables users to trace reasoning paths from symptoms to syndrome patterns and onward to treatments, making conclusions traceable and verifiable. This design implies that health informatics systems should treat the diagnostic process, not merely the final decision, as the core object of visualization\[[18](https://arxiv.org/html/2606.06869#bib.bib18),[20](https://arxiv.org/html/2606.06869#bib.bib20)\]. A transparent reasoning process is an important factor for building trust\[[5](https://arxiv.org/html/2606.06869#bib.bib5)\], although transparency alone is not sufficient to address all user risks and must be complemented by contextualized usage guidelines\[[6](https://arxiv.org/html/2606.06869#bib.bib6)\]. The evaluation showed large-effect improvements across all five trust dimensions.
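The narrowing of candidate syndromes across rounds can be pictured as a per-round trace that a visualization layer could render. The sketch below is a simplification under stated assumptions: it filters candidates with hard yes/no matching, whereas the actual system may score candidates more softly, and the graph contents, names, and functions are placeholders with no clinical meaning.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]  # syndrome name -> associated symptoms


def narrow(candidates: Graph, symptom: str, present: bool) -> Graph:
    """Keep only the syndromes that agree with one answer about one symptom."""
    return {
        name: symptoms
        for name, symptoms in candidates.items()
        if (symptom in symptoms) == present
    }


def trace_rounds(graph: Graph, answers: List[Tuple[str, bool]]) -> List[List[str]]:
    """Record the surviving candidate names after every round, starting at round 0."""
    candidates = dict(graph)
    history = [sorted(candidates)]
    for symptom, present in answers:
        candidates = narrow(candidates, symptom, present)
        history.append(sorted(candidates))
    return history


# Placeholder graph; the names carry no clinical meaning.
toy_graph: Graph = {
    "syndrome_a": {"fatigue", "loose stool"},
    "syndrome_b": {"fatigue", "cold limbs", "loose stool"},
    "syndrome_c": {"fatigue", "dry mouth"},
    "syndrome_d": {"cold limbs", "dry mouth"},
}
history = trace_rounds(
    toy_graph,
    [("fatigue", True), ("loose stool", True), ("cold limbs", False)],
)
```

Each entry of `history` is the candidate set a user would see after one more answer, shrinking from four syndromes to one in this toy run.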

##### Implication 2: Soft Scaffold, Hard Boundary\. KG Constraints and Multi\-Round Interaction Synergy \(RQ1\)

Our system uses the KG simultaneously as a “soft scaffold” guiding interaction and a “hard boundary” constraining outputs\. At the interaction level, the information gain\-driven follow\-up strategy selects the most discriminative symptoms within the candidate space, transforming diagnosis into a natural dialogue requiring only 2 to 3 follow\-up rounds\[[7](https://arxiv.org/html/2606.06869#bib.bib7),[24](https://arxiv.org/html/2606.06869#bib.bib24)\]\. At the constraint level, the LLM handles flexible symptom extraction on the input side, while the KG constrains the output space to validated syndrome\-symptom associations on the output side, reducing the risk of hallucinated diagnostic conclusions\[[21](https://arxiv.org/html/2606.06869#bib.bib21),[22](https://arxiv.org/html/2606.06869#bib.bib22),[12](https://arxiv.org/html/2606.06869#bib.bib12),[13](https://arxiv.org/html/2606.06869#bib.bib13)\]\. This complementary architecture addresses the respective limitations of KG\-only rigidity and LLM\-only unreliability\.
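The idea of choosing the most discriminative symptom can be illustrated with a basic expected-information-gain calculation. This is a minimal sketch, not the system's strategy: it assumes equally likely candidates and symptoms that are strictly present or absent for each syndrome, it omits the genetic-algorithm optimization mentioned in the abstract, and the graph and function names are hypothetical.

```python
import math
from typing import Dict, Iterable, Set

Graph = Dict[str, Set[str]]  # syndrome name -> associated symptoms


def information_gain(symptom: str, candidates: Graph) -> float:
    """Expected entropy reduction (bits) from a yes/no answer about `symptom`."""
    total = len(candidates)
    if total == 0:
        raise ValueError("at least one candidate syndrome is required")
    yes = sum(1 for symptoms in candidates.values() if symptom in symptoms)
    no = total - yes
    prior_entropy = math.log2(total)
    # After the answer, the surviving candidates are again equally likely.
    posterior_entropy = sum((k / total) * math.log2(k) for k in (yes, no) if k > 0)
    return prior_entropy - posterior_entropy


def next_question(candidates: Graph, asked: Iterable[str] = ()) -> str:
    """Pick the unasked symptom with the highest expected information gain."""
    pool = sorted({s for symptoms in candidates.values() for s in symptoms} - set(asked))
    return max(pool, key=lambda s: information_gain(s, candidates))


# Placeholder graph; the names carry no clinical meaning.
toy_graph: Graph = {
    "syndrome_a": {"fatigue", "loose stool"},
    "syndrome_b": {"fatigue", "loose stool", "cold limbs"},
    "syndrome_c": {"fatigue", "dry mouth"},
    "syndrome_d": {"fatigue", "night sweats"},
}
best = next_question(toy_graph)
```

In this toy graph a symptom shared by every candidate carries no information, while one that splits the four candidates evenly yields a full bit, so it is asked first.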

##### Implication 3: Heterogeneous Modality Division\. Multimodal Presentation Lowers Cognitive Barriers \(RQ3\)

Different modalities serve different cognitive functions: text conveys logical reasoning, images convey intuitive understanding, and 3D models convey spatial localization\[[9](https://arxiv.org/html/2606.06869#bib.bib9),[10](https://arxiv.org/html/2606.06869#bib.bib10)\]\. This division enables differentiated information access: patients grasp treatment essentials through intuitive images, physicians consult detailed formula compositions and KG reasoning paths, and students use interactive 3D models for acupoint localization\[[19](https://arxiv.org/html/2606.06869#bib.bib19),[25](https://arxiv.org/html/2606.06869#bib.bib25)\]\. Evaluation confirmed that the multimodal plan significantly reduced cognitive load compared to text\-only presentation, with acupoint localization showing the most pronounced improvement\.

### 4.2 Clinical Integration Considerations

The system should function as a decision\-support tool embedded within physicians’ existing workflows, with final decisions made by licensed physicians\. Deployment requires professional training, informed consent regarding AI involvement, and data privacy compliance\[[2](https://arxiv.org/html/2606.06869#bib.bib2),[3](https://arxiv.org/html/2606.06869#bib.bib3),[4](https://arxiv.org/html/2606.06869#bib.bib4)\]\.

### 4.3 Limitations

We identify the following limitations of the current system:

1. Limited KG Coverage: The knowledge graph covers 241 syndrome patterns, but TCM literature documents over 1,000 types; rare syndromes and regional differentiation systems (e.g., Lingnan medicine) remain excluded.
2. Lack of Large-Scale User Study: Evaluation relied on the LLM-as-a-Judge framework, which may exhibit systematic biases, and on limited sample sizes ($N = 25$–$30$). A controlled user study with practitioners remains necessary for validating clinical utility.
3. LLM Content Reliability: Although KG constraints reduced non-standardized outputs by 32%, treatment plans may still contain inaccurate recommendations. The system currently lacks a human-AI collaborative review mechanism for physician oversight.
4. No Real Clinical Validation: Evaluation used retrospective CEMRs data without prospective validation in real clinical settings. A prospective clinical trial is necessary to assess real-world performance.

### 4.4 Conclusion

This paper presented a knowledge graph-driven TCM visual diagnostic system that addresses three interrelated challenges in AI-assisted syndrome differentiation: opaque reasoning, passive interaction, and simplistic treatment presentation. By combining progressive KG visualization with LLM-based symptom extraction, the system makes diagnostic reasoning transparent and interactive while constraining outputs to validated medical knowledge. Evaluation demonstrated substantial trust improvements (Cohen’s $d = 1.82$) and reduced cognitive load through multimodal treatment plans. KG constraints also reduced non-standardized LLM outputs by 32%. These results suggest that structured knowledge constraints can complement generative AI capabilities in clinical decision support. Current limitations include reliance on automated evaluation with limited sample sizes and incomplete KG coverage. Future work will prioritize prospective clinical validation with practitioners and patients, expanded syndrome coverage, and EHR integration to support real-world deployment.

## Ethics statement

This study focuses on system design and technical implementation\. It does not involve human experimentation or patient data collection, and no ethics review approval was required\. Expert consultations during the system evaluation phase were conducted for the purpose of technical functionality verification and did not involve clinical intervention\.

## Funding

This work was supported in part by the National Natural Science Foundation of China \(Grant No\. 62472121\), the Key Technology Research and Development Program of Shandong Province \(Grant No\. 2025CXPT077\), and the Special Funding Program of Shandong Taishan Scholars Project\.

## Declaration of competing interest

All authors declare no conflicts of interest\.

## Data availability

Due to privacy restrictions, the complete CEMRs dataset cannot be publicly released\. A de\-identified sample dataset and complete prompt templates will be provided as supplementary material for independent validation\.

## Declaration of generative AI and AI\-assisted technologies in the manuscript preparation process

During the preparation of this work, the authors used OpenAI Codex for LaTeX template migration and formatting support\. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the article\.


## References

- Zhu [2007] Wenfeng Zhu. *Zhongyi Zhenduan Xue (Diagnostics of Traditional Chinese Medicine)*. China Press of Traditional Chinese Medicine, Beijing, 2 edition, 2007.
- Thirunavukarasu et al. [2023] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. *Nature Medicine*, 29(8):1930–1940, 2023. [10.1038/s41591-023-02448-8](https://doi.org/10.1038/s41591-023-02448-8).
- Nazi and Peng [2024] Zabir Al Nazi and Wei Peng. Large language models in healthcare and medical domain: A review. *Informatics*, 11(3):57, 2024. [10.3390/informatics11030057](https://doi.org/10.3390/informatics11030057).
- Yang et al. [2023] Rui Yang, Ting Fang Tan, Wei Lu, Arun James Thirunavukarasu, Daniel Shu Wei Ting, and Nan Liu. Large language models in health care: Development, applications, and challenges. *Health Care Science*, 2(4):255–263, 2023. [10.1002/hcs2.61](https://doi.org/10.1002/hcs2.61).
- Savage et al. [2024] Thomas Savage, Ashwin Nayak, Robert Gallo, Ekanath Rangan, and Jonathan H. Chen. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. *npj Digital Medicine*, 7(1):20, 2024. [10.1038/s41746-024-01010-1](https://doi.org/10.1038/s41746-024-01010-1).
- Barman et al. [2024] Kristian González Barman, Nathan Wood, and Pawel Pawlowski. Beyond transparency and explainability: on the need for adequate and contextualized user guidelines for LLM use. *Ethics and Information Technology*, 26(3):47, 2024. [10.1007/s10676-024-09778-2](https://doi.org/10.1007/s10676-024-09778-2).
- Andukuri et al. [2024] Chinmaya Andukuri, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah D Goodman. STaR-GATE: Teaching language models to ask clarifying questions. *arXiv preprint arXiv:2403.19154*, 2024. [10.48550/arXiv.2403.19154](https://doi.org/10.48550/arXiv.2403.19154). URL [https://arxiv.org/abs/2403.19154](https://arxiv.org/abs/2403.19154). Preprint.
- Liao et al. [2023] Lizi Liao, Grace Hui Yang, and Chirag Shah. Proactive conversational agents in the post-ChatGPT world. In *Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 3452–3455, 2023. [10.1145/3539618.3594250](https://doi.org/10.1145/3539618.3594250).
- Zaretsky et al. [2024] Jonah Zaretsky, Jeong Min Kim, Samuel Baskharoun, Yunan Zhao, Jonathan Austrian, Yindalon Aphinyanaphongs, Ravi Gupta, Saul B. Blecker, and Jonah Feldman. Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. *JAMA Network Open*, 7(3):e240357, 2024. [10.1001/jamanetworkopen.2024.0357](https://doi.org/10.1001/jamanetworkopen.2024.0357).
- Baxter et al. [2025] Kimberley A. Baxter, Nidhi Sachdeva, and Sabine Baker. The application of cognitive load theory to the design of health and behavior change programs: Principles and recommendations. *Health Education & Behavior*, 52(4):469–477, 2025. [10.1177/10901981251327185](https://doi.org/10.1177/10901981251327185).
- Pal et al. [2023] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Med-HALT: Medical domain hallucination test for large language models. In *Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)*, pages 314–334, Singapore, 2023. Association for Computational Linguistics. [10.18653/v1/2023.conll-1.21](https://doi.org/10.18653/v1/2023.conll-1.21). URL [https://aclanthology.org/2023.conll-1.21/](https://aclanthology.org/2023.conll-1.21/).
- Huang et al. [2025] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. *ACM Transactions on Information Systems*, 43(2):1–55, 2025. [10.1145/3703155](https://doi.org/10.1145/3703155).
- Kim et al. [2025] Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Chanwoo Park, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, et al. Medical hallucination in foundation models and their impact on healthcare. *medRxiv*, 2025. [10.1101/2025.02.28.25323115](https://doi.org/10.1101/2025.02.28.25323115). URL [https://www.medrxiv.org/content/10.1101/2025.02.28.25323115v2](https://www.medrxiv.org/content/10.1101/2025.02.28.25323115v2). Preprint.
- Chen et al. [2023a] Wei Chen, Zhiwei Li, Hongyi Fang, Qianyuan Yao, Cheng Zhong, Jianye Hao, Qi Zhang, Xuanjing Huang, Jiajie Peng, and Zhongyu Wei. A benchmark for automatic medical consultation system: frameworks, tasks and datasets. *Bioinformatics*, 39(1):btac817, 2023a. [10.1093/bioinformatics/btac817](https://doi.org/10.1093/bioinformatics/btac817).
- Yi et al. [2025] Zihao Yi, Jiarui Ouyang, Zhe Xu, Yuwen Liu, Tianhao Liao, Haohao Luo, and Ying Shen. A survey on recent advances in LLM-based multi-turn dialogue systems. *ACM Computing Surveys*, 58(6):1–38, 2025. [10.1145/3771090](https://doi.org/10.1145/3771090).
- Liu et al. [2025] Xinyi Liu, Dachun Sun, Yi Fung, Dilek Hakkani-Tur, and Tarek F. Abdelzaher. DocCHA: Towards LLM-augmented interactive online diagnosis system. In *Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue*, pages 609–619, Avignon, France, 2025. Association for Computational Linguistics. URL [https://aclanthology.org/2025.sigdial-1.49/](https://aclanthology.org/2025.sigdial-1.49/).
- Chen et al. [2023b] Yirong Chen, Zhenyu Wang, Xiaofen Xing, Huimin Zheng, Zhipei Xu, Kai Fang, Junhong Wang, Sihang Li, Jieling Wu, Qi Liu, and Xiangmin Xu. BianQue: Balancing the questioning and suggestion ability of health LLMs with multi-turn health conversations polished by ChatGPT. *arXiv preprint arXiv:2310.15896*, 2023b. [10.48550/arXiv.2310.15896](https://doi.org/10.48550/arXiv.2310.15896). URL [https://arxiv.org/abs/2310.15896](https://arxiv.org/abs/2310.15896). Preprint.
- Gómez-Romero et al. [2018] Juan Gómez-Romero, Miguel Molina-Solana, Axel Oehmichen, and Yike Guo. Visualizing large knowledge graphs: A performance analysis. *Future Generation Computer Systems*, 89:224–238, 2018. [10.1016/j.future.2018.06.015](https://doi.org/10.1016/j.future.2018.06.015).
- Chen et al. [2024a] Zhuo Chen, Yichi Zhang, Yin Fang, Yuxia Geng, Lingbing Guo, Xiang Chen, Qian Li, Wen Zhang, Jiaoyan Chen, Yushan Zhu, et al. Knowledge graphs meet multi-modal learning: A comprehensive survey. *arXiv preprint arXiv:2402.05391*, 2024a. [10.48550/arXiv.2402.05391](https://doi.org/10.48550/arXiv.2402.05391). URL [https://arxiv.org/abs/2402.05391](https://arxiv.org/abs/2402.05391). Preprint.
- Yan et al. [2025] Youfu Yan, Yu Hou, Yongkang Xiao, Rui Zhang, and Qianwen Wang. KNowNEt: Guided health information seeking from LLMs via knowledge graph integration. *IEEE Transactions on Visualization and Computer Graphics*, 31(1):547–557, 2025. [10.1109/TVCG.2024.3456364](https://doi.org/10.1109/TVCG.2024.3456364).
- Sukhwal et al. [2025] Prakash C Sukhwal, Vaibhav Rajan, and Atreyi Kankanhalli. A joint LLM-KG system for disease Q&A. *IEEE Journal of Biomedical and Health Informatics*, 29(3):2257–2270, 2025. [10.1109/JBHI.2024.3514659](https://doi.org/10.1109/JBHI.2024.3514659).
- Zhao et al. [2025] Xuejiao Zhao, Siyan Liu, Su-Yin Yang, and Chunyan Miao. MedRAG: Enhancing retrieval-augmented generation with knowledge graph-elicited reasoning for healthcare copilot. In *Proceedings of the ACM on Web Conference 2025*, pages 4442–4457. ACM, 2025. [10.1145/3696410.3714782](https://doi.org/10.1145/3696410.3714782).
- Limsopatham and Collier [2016] Nut Limsopatham and Nigel Collier. Normalising medical concepts in social media texts by learning semantic representation. In *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 1014–1023, Berlin, Germany, 2016. Association for Computational Linguistics. [10.18653/v1/P16-1096](https://doi.org/10.18653/v1/P16-1096). URL [https://aclanthology.org/P16-1096/](https://aclanthology.org/P16-1096/).
- Rao and Daumé III [2018] Sudha Rao and Hal Daumé III. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 2737–2746, Melbourne, Australia, 2018. Association for Computational Linguistics. [10.18653/v1/P18-1255](https://doi.org/10.18653/v1/P18-1255). URL [https://aclanthology.org/P18-1255/](https://aclanthology.org/P18-1255/).
- Galmarini et al. [2024] Elisa Galmarini, Laura Marciano, and Peter Johannes Schulz. The effectiveness of visual-based interventions on health literacy in health care: a systematic review and meta-analysis. *BMC Health Services Research*, 24(1):718, 2024. [10.1186/s12913-024-11138-1](https://doi.org/10.1186/s12913-024-11138-1).
- Wei et al. [2018] Zhongyu Wei, Qianlong Liu, Baolin Peng, Huaixiao Tou, Ting Chen, Xuanjing Huang, Kam-fai Wong, and Xiangying Dai. Task-oriented dialogue system for automatic diagnosis. In *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)*, pages 201–207, Melbourne, Australia, 2018. Association for Computational Linguistics. [10.18653/v1/P18-2033](https://doi.org/10.18653/v1/P18-2033). URL [https://aclanthology.org/P18-2033/](https://aclanthology.org/P18-2033/).
- Liu et al. [2022] Wenge Liu, Yi Cheng, Hao Wang, Jianheng Tang, Yafei Liu, Ruihui Zhao, Wenjie Li, Yefeng Zheng, and Xiaodan Liang. “My Nose Is Running.” “Are You Also Coughing?”: Building a medical diagnosis agent with interpretable inquiry logics. In *Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence*, pages 4266–4272, 2022. [10.24963/ijcai.2022/592](https://doi.org/10.24963/ijcai.2022/592).
- Muralidhar et al. [2025] Deepa Muralidhar, Rafik Belloum, and Ashwin Ashok. Operationalizing selective transparency using progressive disclosure in artificial intelligence clinical diagnosis systems. *International Journal of Human-Computer Studies*, 204:103591, 2025. [10.1016/j.ijhcs.2025.103591](https://doi.org/10.1016/j.ijhcs.2025.103591).
- Wang et al. [2025] Xiao Wang, Mengjue Tan, Qiao Jin, Guangzhi Xiong, Yu Hu, Aidong Zhang, Zhiyong Lu, and Minjia Zhang. MedCite: Can language models generate verifiable text for medicine? In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 18891–18913, Vienna, Austria, 2025. Association for Computational Linguistics. [10.18653/v1/2025.findings-acl.967](https://doi.org/10.18653/v1/2025.findings-acl.967). URL [https://aclanthology.org/2025.findings-acl.967/](https://aclanthology.org/2025.findings-acl.967/).
- Lee et al. [2013] In-Seon Lee, Soon-Ho Lee, Song-Yi Kim, Hyejung Lee, Hi-Joon Park, and Younbyoung Chae. Visualization of the meridian system based on biomedical information about acupuncture treatment. *Evidence-Based Complementary and Alternative Medicine: eCAM*, 2013:872142, 2013. [10.1155/2013/872142](https://doi.org/10.1155/2013/872142).
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. *Nature*, 645(8081):633–638, 2025. [10.1038/s41586-025-09422-z](https://doi.org/10.1038/s41586-025-09422-z).
- Chen et al. [2024b] Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, and Benyou Wang. Towards injecting medical visual knowledge into multimodal LLMs at scale. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 7346–7370, Miami, Florida, USA, 2024b. Association for Computational Linguistics. [10.18653/v1/2024.emnlp-main.418](https://doi.org/10.18653/v1/2024.emnlp-main.418). URL [https://aclanthology.org/2024.emnlp-main.418/](https://aclanthology.org/2024.emnlp-main.418/).
