COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

arXiv cs.CL 05/15/26, 04:00 AM Papers
medical ehr chain-of-thought reasoning healthcare large-language-models agent
Summary
COTCAgent is a hierarchical reasoning framework for longitudinal electronic health records that uses a probabilistic chain-of-thought completion approach, achieving 90.47% Top-1 accuracy on a self-built dataset and outperforming existing medical agents.
arXiv:2605.15016v1 Announce Type: new Abstract: As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.
Cached at: 05/15/26, 06:24 AM
# COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion
Source: [https://arxiv.org/html/2605.15016](https://arxiv.org/html/2605.15016)
Zihan Deng¹, Xiaozhen Zhong², Chuanzhi Xu³

¹School of Computing and Data Science, The University of Hong Kong; ²Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; ³School of Computer Science, The University of Sydney

zhdeng@hku.hk, 202422280130@std.uestc.edu.cn, chuanzhi.xu@sydney.edu.au

###### Abstract

As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom–trend–disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at [https://github.com/FrankDengAI/COTCAgent/](https://github.com/FrankDengAI/COTCAgent/).

## 1 Introduction

Sequential diagnosis is a defining challenge of clinical medicine: physicians must fuse irregular labs, incomplete histories, and evolving hypotheses under uncertainty \[[36](https://arxiv.org/html/2605.15016#bib.bib1), [34](https://arxiv.org/html/2605.15016#bib.bib36)\]. Language models are increasingly used for public-health surveillance, multimodal monitoring, physiological interpretation, and turn-taking workflows \[[37](https://arxiv.org/html/2605.15016#bib.bib2), [3](https://arxiv.org/html/2605.15016#bib.bib3), [7](https://arxiv.org/html/2605.15016#bib.bib4), [26](https://arxiv.org/html/2605.15016#bib.bib5), [23](https://arxiv.org/html/2605.15016#bib.bib6)\]. Recent sequential-diagnostic benchmarks emphasize coordinated questioning \[[23](https://arxiv.org/html/2605.15016#bib.bib6)\], and interactive probing can surface context that static prompts omit while improving user trust when deployed responsibly \[[21](https://arxiv.org/html/2605.15016#bib.bib7), [29](https://arxiv.org/html/2605.15016#bib.bib8)\]. The open problem is not merely "more tokens," but whether model outputs remain tethered to verifiable computations on the underlying time series.

Historically, medical LLMs were evaluated on templated intakes or multiple-choice items that rarely mirror chart review \[[28](https://arxiv.org/html/2605.15016#bib.bib9), [20](https://arxiv.org/html/2605.15016#bib.bib10), [6](https://arxiv.org/html/2605.15016#bib.bib11)\]. Long-context advances change what fits in context \[[17](https://arxiv.org/html/2605.15016#bib.bib16)\], yet they do not, by themselves, guarantee faithful numeric reasoning. Tool-augmented and multi-agent stacks partially address this by delegating arithmetic, retrieval, and orchestration to external modules \[[31](https://arxiv.org/html/2605.15016#bib.bib12), [19](https://arxiv.org/html/2605.15016#bib.bib17), [24](https://arxiv.org/html/2605.15016#bib.bib14), [1](https://arxiv.org/html/2605.15016#bib.bib13), [18](https://arxiv.org/html/2605.15016#bib.bib15)\]. Even so, longitudinal agents still risk hallucinated trend statements and rationales that are hard to audit post hoc \[[7](https://arxiv.org/html/2605.15016#bib.bib4)\]. We ask how to obtain ranked differential hypotheses while preserving an inspectable trail from raw series to predicates, energies, and any follow-up questions.

In this paper, we propose the Probabilistic Chain-of-Thought Completion Agent (COTCAgent) (Section 3, Algorithm [1](https://arxiv.org/html/2605.15016#alg1)), which answers this question with three coupled layers. The Temporal-Statistics Adapter (TSA) maps natural-language requests to compact statistical plans and executable code, emitting typed trend predicates (slopes, change points, residuals) instead of prose-only numerics. COTC matches symptoms and predicates against a Symptom–Trend–Disease knowledge base, forms IDF-weighted Gibbs energies, and softmax-normalizes them into a transparent ranking surrogate, which is distinct from fully calibrated probabilistic graphical models when population priors are absent \[[13](https://arxiv.org/html/2605.15016#bib.bib38)\]. Completion enumerates high-priority knowledge gaps, renders concise clarification questions, parses user replies back into discrete evidence, and re-scores with the same energy until mass, entropy, or round caps bind. We deliberately trade some conversational generality for traceability across code paths, knowledge base (KB) rows, and dialog turns.

Our main contributions can be summarized as:

1. Systems integration: A traceable pipeline coupling executable trend statistics (TSA), structured knowledge-base scoring (COTC), and bounded multi-turn completion for longitudinal consultation (Algorithm [1](https://arxiv.org/html/2605.15016#alg1)).
2. Applied reasoning mechanism: IDF-weighted Gibbs energies with dual thresholds ($T$, $\theta$), softmax ranking, and entropy-guided gap priority (Eq. [7](https://arxiv.org/html/2605.15016#S3.E7)), emphasizing interpretability over generative-path enumeration.
3. Empirical characterization: Controlled ablations, knowledge-base sensitivity analysis, and multi-backbone benchmarks (TimeCAP/Google/DirPred/KARE, frontier readers, o4-mini, Qwen3-Next-80B-A3B) under shared decoding, with round-by-round accuracy attribution (Table [4](https://arxiv.org/html/2605.15016#S4.T4) and Appendix [H](https://arxiv.org/html/2605.15016#A8)).

## 2 Related Work

The healthcare LLM literature evolved from narrowly structured SOAP-style prompts \[[28](https://arxiv.org/html/2605.15016#bib.bib9)\] toward dynamic temporal modeling that better mirrors unstructured, longitudinal chart data \[[6](https://arxiv.org/html/2605.15016#bib.bib11)\]. Long-context positional encodings, token-compression hacks, and subquadratic sequence models reviewed by \[[17](https://arxiv.org/html/2605.15016#bib.bib16)\] provide the representational scaffolding. Contemporaneous longitudinal EHR studies, notably \[[34](https://arxiv.org/html/2605.15016#bib.bib36)\], quantify how context length interacts with clinically salient quirks (duplicate carry-forward diagnoses, irregular inter-event spacing, increasing label difficulty over the life-course). Translating representation capacity into bedside utility still requires tooling. Retrieval-oriented agent benchmarks \[[31](https://arxiv.org/html/2605.15016#bib.bib12)\] stress KB-mediated planning loops. Tool-augmented mathematical agents \[[19](https://arxiv.org/html/2605.15016#bib.bib17)\] harden numeric reasoning. Course-style tool curricula \[[1](https://arxiv.org/html/2605.15016#bib.bib13)\] further temper brittle hand prompts. Communicative multi-agent stacks \[[24](https://arxiv.org/html/2605.15016#bib.bib14)\] widen orchestration primitives, and large-scale API mastery training \[[18](https://arxiv.org/html/2605.15016#bib.bib15)\] expands the action surface available at decode time. Cooperative sequential diagnostics \[[23](https://arxiv.org/html/2605.15016#bib.bib6)\] crystallize multi-turn medical play atop these layers. Chronic gaps persist around faithful numeric computation, hallucination containment, personalization under shift, and auditable rationales bridging raw series to conversational updates \[[7](https://arxiv.org/html/2605.15016#bib.bib4)\], motivating our explicitly layered stack.

Symbolic structure offers a complementary remedy. Surveys \[[35](https://arxiv.org/html/2605.15016#bib.bib18)\] unify neuro-symbolic hybrids for biomedical KG pathways. Parameter-efficient bridges such as BioBRIDGE \[[32](https://arxiv.org/html/2605.15016#bib.bib28)\] connect foundation models to curated graphs. Constraint-aware parsers like NSSC \[[9](https://arxiv.org/html/2605.15016#bib.bib27)\] tighten entity linking under oncology noise. TrustKG-style templates \[[30](https://arxiv.org/html/2605.15016#bib.bib29)\] foreground clinician-facing integration. Broader taxonomy and limitations of neural-symbolic reasoning are synthesized by \[[5](https://arxiv.org/html/2605.15016#bib.bib30)\]. Reward-guided KG exploration is also emerging \[[12](https://arxiv.org/html/2605.15016#bib.bib37)\], stressing that brittle supervision can degrade transfer even when intermediate path judgments look strong, which is an argument for deterministic scoring layers atop transparent predicates, as pursued here.

Probabilistic graphical models provide principled semantics for symptom dependencies when priors exist \[[13](https://arxiv.org/html/2605.15016#bib.bib38)\]. Surveys of biomedical KG–language-model hybrids \[[35](https://arxiv.org/html/2605.15016#bib.bib18)\] catalog how symbolic structure is wired into LMs. Community-retrieval predictors exemplify operational graph widening for ranking \[[10](https://arxiv.org/html/2605.15016#bib.bib25)\]. COTCAgent targets settings where calibrated priors are missing but longitudinal trend narratives abound: lightweight energy scores steer hypothesis ranking, missing evidence is enumerated as explicit KB gaps, and short user answers fill those gaps before softmax re-normalization rather than exhaustive path finding across the entire ontology.

## 3 COTCAgent

### 3.1 Structural Overview

COTCAgent is a longitudinal consultation stack scoped to trend-based disease risk prediction and differential diagnosis given longitudinal labs, vitals, and symptoms. It does not handle medication safety, prognosis estimation, or preventive care scheduling. The TSA module turns irregular EHR series into short trend narratives; the COTC module matches those cues against a Symptom–Trend–Disease KB covering 9,948 diseases primarily in internal medicine (cardiovascular, respiratory, metabolic, infectious, and neurological conditions). When ranking remains ambiguous, the system asks one focused clarification and re-scores; it is not a free-form chatbot. Figure [1](https://arxiv.org/html/2605.15016#S3.F1) sketches the flow.

Architectural invariants. We deliberately isolate *three* stochastic surfaces: (i) Executable stats consume raw series $(t, y_t)$ under explicit likelihoods, so slope and breakpoint claims are regression outputs, not free-text guesses; (ii) Discrete matching consumes only predicates that survive tokenization (Severe $\rightarrow$ severe-trend proposition, anchored windows from TSA outputs); (iii) Natural language is confined to paraphrasing pre-specified gaps and parsing short answers back into those predicates. This separation is what makes failures legible: when $\mathcal{M}$ throws an error, the downstream COTC layer receives an explicit uncertainty flag instead of hallucinated numeric claims, and when the KB misses, Algorithm [1](https://arxiv.org/html/2605.15016#alg1) exits through the low-coverage branch rather than fabricating edges.

![Refer to caption](https://arxiv.org/html/2605.15016v1/x1.png)

Figure 1: Overall architecture of COTCAgent: TSA produces trend predicates from longitudinal signals; COTC scores diseases and drives targeted completion.

### 3.2 TSA Module

The TSA block parses a clinical question, builds a small analytic plan, and emits code for the relevant estimators\. Concretely, the transformation from natural language to executable analysis follows a structured decomposition:

$$\mathcal{M}: Q \rightarrow \Phi \rightarrow \Lambda \rightarrow C \tag{1}$$

where $Q$ is the original query, $\Phi$ represents the parsed semantic structure capturing the clinical intent, $\Lambda$ denotes the selected formal statistical model, and $C$ is the generated executable script.

This mapping is query\-dependent rather than universal\. For trend\-related requests, the module typically instantiates mixed\-effects models to capture population\-level patterns while accounting for individual variation:

$$y_{ij} = \beta_0 + \beta_1 t_{ij} + u_i + \epsilon_{ij}, \quad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \epsilon_{ij} \sim \mathcal{N}(0, \sigma_\epsilon^2) \tag{2}$$

where $y_{ij}$ represents the $j$-th measurement of a biomarker for patient $i$ at time $t_{ij}$, $\beta_1$ captures the population-level trend slope, and $u_i$ models individual random effects. Meanwhile, queries concerning abrupt changes route to Bayesian change-point detection:

$$P(\tau \mid \mathbf{y}) \propto P(\mathbf{y} \mid \tau)\, P(\tau) = \Big[\prod_{t=1}^{\tau} f_1(y_t)\Big] \Big[\prod_{t=\tau+1}^{T} f_2(y_t)\Big] P(\tau), \quad \tau \in \{1, \ldots, T-1\} \tag{3}$$

where $\tau$ denotes the candidate change-point index and $(f_1, f_2)$ represent the pre- and post-change likelihood models, enabling quantitative assessment of temporal discontinuities in clinical trajectories.
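The change-point posterior of Eq. 3 can be illustrated with a minimal sketch. It assumes a uniform prior $P(\tau)$ and Gaussian segment likelihoods $f_1, f_2$ with a fixed, known `sigma` and plug-in segment means; these modeling choices and the function names are illustrative assumptions, not the paper's implementation.

```python
import math

def gaussian_logpdf(y, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def changepoint_posterior(y, sigma=1.0):
    """Posterior over tau in {1, ..., T-1} under a uniform prior (Eq. 3).

    Assumption: f1 and f2 are Gaussians whose means are the empirical
    means of the pre- and post-change segments (a plug-in simplification).
    """
    T = len(y)
    if T < 2:
        raise ValueError("need at least two observations")
    log_post = []
    for tau in range(1, T):
        pre, post = y[:tau], y[tau:]
        mu1 = sum(pre) / len(pre)
        mu2 = sum(post) / len(post)
        ll = sum(gaussian_logpdf(v, mu1, sigma) for v in pre)
        ll += sum(gaussian_logpdf(v, mu2, sigma) for v in post)
        log_post.append(ll)  # uniform prior adds only a constant
    m = max(log_post)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in log_post]
    z = sum(weights)
    return {tau: w / z for tau, w in zip(range(1, T), weights)}

# Toy series with a level shift after the fourth observation.
series = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]
posterior = changepoint_posterior(series)
best_tau = max(posterior, key=posterior.get)
```

The posterior mass at `best_tau` is the kind of quantity a typed "change-point posterior mass" predicate could carry downstream.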

Labs are aligned to a common timeline; missing draws use $K = 5$ Rubin-style stacks before fitting. Panels with many analytes project through $\mathbf{Z} = \mathbf{W}^T \mathbf{Y}$ when diagnostics call for dimension reduction; optional GP heads report posterior mean $\bar{f}_* = \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$ (Appendix A). Slopes, residuals, anomaly flags, and cohort contrasts are summarized into short phrases compatible with Eq. [5](https://arxiv.org/html/2605.15016#S3.E5) (Figure [1](https://arxiv.org/html/2605.15016#S3.F1)). Anomaly detection uses population-level $z$-scoring against age/sex-matched cohort statistics from the KB ($|z| > 2.5$ threshold), not personalized baselines. Other estimators in Appendix A follow the same interface.
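As a small illustration of the population-level anomaly rule, the sketch below flags measurements whose cohort $z$-score exceeds the 2.5 threshold in absolute value. The function name and the `cohort_mean` / `cohort_std` inputs are hypothetical stand-ins for the age/sex-matched KB statistics.

```python
def cohort_z_flags(values, cohort_mean, cohort_std, threshold=2.5):
    """Return (z, is_anomalous) per value, using cohort rather than personal baselines."""
    if cohort_std <= 0:
        raise ValueError("cohort_std must be positive")
    flags = []
    for v in values:
        z = (v - cohort_mean) / cohort_std
        flags.append((z, abs(z) > threshold))
    return flags

# Hypothetical lab values against a cohort with mean 100 and std 10.
flags = cohort_z_flags([100.0, 130.0, 80.0], cohort_mean=100.0, cohort_std=10.0)
```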

The mapping $\mathcal{M}$ does *not* instantiate every model in Appendix A for every query. Instead, a lightweight router (rules over parsed intent $\Phi$ plus an LLM-based planner with JSON schema fallbacks) chooses a *small* subset of standard tools, e.g., mixed-effects trend tests for "stable slope" queries, structural change heuristics for "sudden worsening," or robust smoothers when data are sparse. Failed fits automatically downgrade to simpler diagnostics and flag uncertainty in the TSA narrative so COTC never receives overconfident trend claims. Full routing pseudocode and failure modes are documented in Appendix [E](https://arxiv.org/html/2605.15016#A5).
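The downgrade behavior described above might look roughly like the following sketch. It is not the routing pseudocode of Appendix E: the estimator names, the exception types treated as failed fits, and the rule for choosing between the `UNSTABLE` and `SPARSE` flags are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Estimator = Callable[[Sequence[float]], float]

def ols_slope(series: Sequence[float]) -> float:
    """Least-squares slope over evenly spaced indices; refuses very short series."""
    n = len(series)
    if n < 3:
        raise ValueError("too few points for a slope fit")
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((i - t_mean) * (y - y_mean) for i, y in enumerate(series))
    den = sum((i - t_mean) ** 2 for i in range(n))
    return num / den

def endpoint_diff(series: Sequence[float]) -> float:
    """Simpler diagnostic: last value minus first value."""
    if len(series) < 2:
        raise ValueError("need at least two points")
    return series[-1] - series[0]

def run_with_fallback(
    series: Sequence[float], estimators: List[Tuple[str, Estimator]]
) -> Tuple[str, Optional[float], Optional[str]]:
    """Try estimators in order; flag the result whenever a downgrade happened."""
    for rank, (name, fit) in enumerate(estimators):
        try:
            value = fit(series)
        except (ValueError, ZeroDivisionError):
            continue  # failed fit: fall through to the next, simpler tool
        flag = None if rank == 0 else "UNSTABLE"  # assumed flag policy
        return name, value, flag
    return "none", None, "SPARSE"  # nothing could be fitted

plan = [("ols_slope", ols_slope), ("endpoint_diff", endpoint_diff)]
full = run_with_fallback([1.0, 2.0, 3.0], plan)
short = run_with_fallback([1.0, 3.0], plan)
empty = run_with_fallback([], plan)
```

The point of the design is that a downgrade is never silent: the caller always receives the name of the routine that produced the value, plus a flag when it was not the first choice.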

Typed summaries as a contract. Each TSA artifact is a tuple $(\textit{span}, \textit{estimand}, \textit{value}, \textit{qual})$ where *estimand* is drawn from a finite schema (slope, change-point posterior mass, smooth trend residual, cohort $z$-score). Downgrades append UNSTABLE/SPARSE flags so the energy model in Eq. [5](https://arxiv.org/html/2605.15016#S3.E5) can down-weight predicates or route them to "soft" $\phi$ channels. This is a lightweight instance of *probabilistic numerics* thinking: the LM planner proposes *which* estimand to run, but the returned object is always the output of a named routine with documented failure modes, which blocks the common failure of an LLM asserting "p=0.03" without any likelihood behind it.
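One way to picture this contract is a small validated record type. The sketch below is an assumption about shape only: the field types, the schema identifiers, and the multiplicative `weight` penalty are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical identifiers for the finite estimand schema and downgrade flags.
ESTIMANDS = {"slope", "changepoint_mass", "trend_residual", "cohort_z"}
FLAGS = {"UNSTABLE", "SPARSE"}

@dataclass(frozen=True)
class TrendPredicate:
    span: Tuple[str, str]          # (start, end) of the analysis window
    estimand: str                  # must come from the finite schema
    value: float                   # output of the named routine
    qual: str                      # qualitative label, e.g. "rising"
    flags: Tuple[str, ...] = ()    # downgrade markers appended by TSA

    def __post_init__(self):
        if self.estimand not in ESTIMANDS:
            raise ValueError(f"unknown estimand: {self.estimand}")
        unknown = set(self.flags) - FLAGS
        if unknown:
            raise ValueError(f"unknown flags: {sorted(unknown)}")

    def weight(self, penalty: float = 0.5) -> float:
        """Assumed down-weighting: multiply by `penalty` once per flag."""
        return penalty ** len(self.flags)

pred = TrendPredicate(("2024-01-01", "2024-06-01"), "slope", 0.8, "rising", ("SPARSE",))
```

Because construction fails for anything outside the schema, a free-text claim such as a bare p-value cannot enter the matcher as if it were a computed estimand.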

### 3.3 COTC module

#### 3.3.1 Symptom/Trend-Disease Database

Scarce public longitudinal EHR with gold labels motivated a Symptom–Trend–Disease knowledge base (KB) compiled from public medical text (Medscape/WebMD, NICE/CDC-linked guidelines, PubMed abstracts, patient education). Ingest de-identifies records and excludes MIMIC; LLMs only *propose* tuples or rephrase relations, never evaluation text. Ten percent of diseases (and dependent edges) are held out for checks; clinicians review those tuples plus 5% random samples (Table [7](https://arxiv.org/html/2605.15016#A4.T7), Appendix [D](https://arxiv.org/html/2605.15016#A4)).

The KB lists 23,456 entities (9,948 diseases; 8,673 symptoms; 4,835 trends). Sixteen clinicians ran a three-phase edge protocol (blinded ratings, $\kappa < 0.6$ adjudication, temporal plausibility), yielding aggregate $\kappa = 0.87$ and 94% of augmented pairs passing rules plus sign-off. Source tallies, rubric, and augmentation model IDs are in Appendix [D](https://arxiv.org/html/2605.15016#A4).

Edges carry coarse temporal qualifiers; typical rows link $\sim$15 symptoms and 3–8 trends (see Appendix [I](https://arxiv.org/html/2605.15016#A9) for KB statistics visualization). The LLM (GPT-4o) is prompted with structured templates to suggest symptom-disease edges, then proposals are filtered through three rule-based constraints: (i) anatomical consistency, (ii) temporal plausibility, (iii) severity bounds. PII is stripped before LLM calls; augmentation is versioned without recycling evaluation prompts. When no KB frame matches TSA tokens, COTC falls back with a *low-KB-coverage* flag (Algorithm [1](https://arxiv.org/html/2605.15016#alg1)). The COTC module is shown in Figure [2](https://arxiv.org/html/2605.15016#S3.F2).

#### 3.3.2 Calculation of Symptom-Specific Weight

In clinical reasoning, the diagnostic value of symptoms varies dramatically based on their epidemiological specificity\. Common presentations such as fever or fatigue appear across numerous disease entities and therefore possess limited discriminative power for differential diagnosis\. Conversely, rare pathognomonic indicators—such as Koplik's spots for measles or specific electrocardiographic patterns—provide strong diagnostic evidence when present\. This fundamental principle of clinical medicine mirrors the information\-theoretic concept of inverse frequency weighting in information retrieval\.

To quantitatively encode this clinical intuition within our knowledge-based framework, we adapt the Inverse Document Frequency (IDF) concept from information retrieval to the medical domain, introducing Inverse Disease Frequency (IDF) weighting for symptoms. This weighting scheme operates solely on the binary disease-symptom relationships encoded in our knowledge base, requiring no additional population prevalence data or continuous likelihood models that are often unavailable for rare conditions.

Formally, the IDF weight for symptom $s_j$ is defined as:

$$w_j^{\text{IDF}} = \log\left(\frac{|D| + 1}{\,|\{d_i \in D : s_j \in S_{d_i}\}| + 1\,}\right) \tag{4}$$

where $|D|$ counts diseases in the knowledge base and the denominator counts those with $s_j \in S_{d_i}$; the $+1$ offsets mirror Laplace smoothing and avoid undefined logs at zero counts. This logarithmic formulation assigns monotonically higher weights to symptoms appearing in fewer diseases, thereby capturing their increased discriminative power for narrowing the differential diagnosis. The logarithmic scaling serves dual purposes: it ensures numerical stability when symptoms range from ubiquitous to extremely rare, and it maintains interpretable relative differences between weight magnitudes across the symptom spectrum.
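Eq. 4 can be computed directly from binary disease-symptom lists. The following is a minimal sketch on a toy three-disease KB; the dictionary layout and the symptom names are illustrative assumptions.

```python
import math

def idf_weights(disease_symptoms):
    """Inverse Disease Frequency per symptom (Eq. 4) from a {disease: symptoms} map."""
    n = len(disease_symptoms)
    counts = {}
    for symptoms in disease_symptoms.values():
        for s in set(symptoms):  # count each disease at most once per symptom
            counts[s] = counts.get(s, 0) + 1
    return {s: math.log((n + 1) / (c + 1)) for s, c in counts.items()}

# Toy KB: "fever" is ubiquitous, "koplik_spots" is specific to one disease.
toy_kb = {
    "measles": ["fever", "koplik_spots"],
    "influenza": ["fever", "cough"],
    "common_cold": ["fever", "cough"],
}
weights = idf_weights(toy_kb)
```

In this toy KB the ubiquitous symptom receives weight $\log(4/4) = 0$ while the symptom seen in a single disease receives $\log(4/2) = \log 2$, which is the ordering the text describes.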

IDF as a discriminative sufficient statistic (KB-only). When population prevalences and continuous likelihoods for symptoms are unavailable, $w_j^{\text{IDF}}$ plays the role of a *surrogate log-odds weight*: it rewards findings that shrink the active disease set quickly, analogously to BM25's inverse document frequency in retrieval. Pairing $\log w_j^{\text{IDF}}$ with $\log\phi$ in Eq. [5](https://arxiv.org/html/2605.15016#S3.E5) mirrors a log-linear expert system in which each positive finding contributes additively on the log scale; every such term is *auditable* back to a KB row, unlike latent chain-of-thought chains. We emphasize that this is still not a generative symptom model (co-occurrence beyond $\phi$ is intentionally under-modeled), but it gives a principled reason why rare pattern matches should move mass faster than "fever" hits. This design prioritizes interpretability over expressiveness: IDF ignores symptom interdependencies, demographic context, and test reliability factors that would require richer data or expert-elicited dependencies.

#### 3.3.3 Calculation of disease-weighted matching score

![Refer to caption](https://arxiv.org/html/2605.15016v1/x2.png)

Figure 2: COTC module wiring: TSA narratives and structured cues feed the KB matcher, producing ranked diagnoses, explicit gaps, and optional follow-up turns (Algorithm [1](https://arxiv.org/html/2605.15016#alg1)).

Given evidence $S_p$ (symptoms and TSA predicates), we score each candidate disease $d_i$ with an additive Gibbs energy $R_i$; without population prevalence we do *not* claim a calibrated clinical posterior and instead softmax-normalize over the active candidate set (cf. \[[13](https://arxiv.org/html/2605.15016#bib.bib38)\]). (Footnote 1: An implicit uniform support prior can bias toward rare diseases when patterns are ambiguous.) Positive terms add $\log w_j^{\text{IDF}} + \log\phi(s_j, d_i)$ for $s_j \in S_{d_i} \cap S_p$ with clinician-set $\phi \in [0.5, 1]$; missing symptoms contribute $\log(1 - \gamma w_j^{\text{IDF}})$, $\gamma \in [0, 1]$. Energy gate $T$ precedes softmax; mass gate $\theta$ and entropy $H$ follow it; low $H$ means a peaked $\tilde{P}$ ($H$ is diagnostic, not calibrated uncertainty).

$$R_i = \sum_{s_j \in (S_{d_i} \cap S_p)} \left(\log w_j^{\text{IDF}} + \log\phi(s_j, d_i)\right) + \sum_{s_j \in (S_{d_i} \setminus S_p)} \log\left(1 - \gamma \cdot w_j^{\text{IDF}}\right), \tag{5}$$

$$\tilde{P}(d_i \mid S_p) = \frac{\exp(R_i)}{\sum_{\ell} \exp(R_\ell)}. \tag{6}$$

Among top-$k$ survivors $\mathcal{D}_k$, missing slots $g$ receive mass-weighted priority (entropy proxy; TopGaps is arity-capped):

$$H = -\sum_{i=1}^{N} \tilde{P}(d_i \mid S_p) \log \tilde{P}(d_i \mid S_p), \qquad \pi(g) = \sum_{d_i \in \mathcal{D}_k} \tilde{P}(d_i \mid S_p) \cdot \mathbf{1}\{g \in \text{req}(d_i) \setminus S_p\} \cdot \psi(g, d_i), \tag{7}$$

with $\text{req}(d_i)$ the requirement set and $\psi$ up-weighting pathognomonic or TSA-aligned predicates.
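Eqs. 5–7 can be traced on a toy example with the sketch below. Several choices are assumptions rather than the paper's implementation: the weights are hand-picked instead of derived from Eq. 4, $\text{req}(d_i)$ is taken to be the disease's full symptom set, $\phi$ and $\psi$ default to 1, and log arguments are clamped to a small floor `EPS` because Eq. 5 is undefined when $w_j^{\text{IDF}} \le 0$ or $\gamma w_j^{\text{IDF}} \ge 1$ (how those cases are handled is not stated in this section).

```python
import math

EPS = 1e-9  # assumed floor so every logarithm stays finite (not from the paper)

def energy(d, kb, patient, w, phi, gamma):
    """Additive energy R_i of Eq. 5 for disease d."""
    r = 0.0
    for s in kb[d]:
        if s in patient:  # matched finding: log w + log phi
            r += math.log(max(w[s], EPS)) + math.log(phi.get((s, d), 1.0))
        else:             # missing finding: log(1 - gamma * w)
            r += math.log(max(1.0 - gamma * w[s], EPS))
    return r

def softmax(energies):
    """Pseudo-posterior of Eq. 6 over the active candidate set."""
    m = max(energies.values())
    exps = {d: math.exp(r - m) for d, r in energies.items()}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}

def entropy(p):
    """Shannon entropy H of Eq. 7 (natural log)."""
    return -sum(v * math.log(v) for v in p.values() if v > 0.0)

def gap_priority(p, kb, patient, psi, k=3):
    """Mass-weighted priority pi(g) of Eq. 7 over the top-k candidates."""
    top_k = sorted(p, key=p.get, reverse=True)[:k]
    pi = {}
    for d in top_k:
        for g in kb[d] - patient:  # assumption: req(d) is the full symptom set
            pi[g] = pi.get(g, 0.0) + p[d] * psi.get((g, d), 1.0)
    return pi

# Toy data: two diseases sharing symptom "c"; the patient reports only "a".
kb = {"d1": {"a", "c"}, "d2": {"b", "c"}}
w = {"a": 2.0, "b": 1.5, "c": 0.2}
patient = {"a"}
energies = {d: energy(d, kb, patient, w, {}, gamma=0.1) for d in kb}
p = softmax(energies)
pi = gap_priority(p, kb, patient, {})
```

In this toy case the shared missing symptom `"c"` collects mass from both candidates, so it outranks `"b"`, which is missing only for the lower-ranked disease.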

#### 3.3.4 Probabilistic CoT completion and consultation loop

CoT *completion* instantiates the above priorities: TopGaps implements Eq. [7](https://arxiv.org/html/2605.15016#S3.E7), RenderQuestion fills slots with template-anchored language, and Parse maps answers into discrete evidence so the same $R_i$ objective is applied after every turn. Stopping fires when (i) $\max_i \tilde{P}(d_i \mid S_p) \geq \theta$, (ii) $H$ drops below a validation-estimated band tied to $\tau_H$, (iii) the round cap $R_{\max}$ binds, or (iv) $\mathcal{U} = \emptyset$. If no KB frame aligns with TSA tokens, the loop exits with an explicit low-coverage flag. Appendix [C](https://arxiv.org/html/2605.15016#A3) walks through `patient_0077`. Figure [2](https://arxiv.org/html/2605.15016#S3.F2) sketches how TSA summaries, matcher outputs, and the dialog steps connect.

Offline, we precompute $w_j^{\text{IDF}}$ for all symptoms (Eq. [4](https://arxiv.org/html/2605.15016#S3.E4)). Online, for each patient's evidence $S_p$ we compute $R_i$ (Eq. [5](https://arxiv.org/html/2605.15016#S3.E5)), convert to $\tilde{P}$ (Eq. [6](https://arxiv.org/html/2605.15016#S3.E6)), filter by $R_i \geq T$ with $T \approx 0.3$, sort, and pass survivors into Algorithm [1](https://arxiv.org/html/2605.15016#alg1).

Algorithm 1: Probabilistic CoT completion with targeted questions

1: Input: KB $K$, TSA summary $t$, evidence $S_p$, thresholds $\theta, T$, max rounds $R_{\max}$

2: Initialize CoT log $\mathcal{L} \leftarrow \emptyset$

3: for $r = 1$ to $R_{\max}$ do

4: Compute $\{R_i\}$, $\{\tilde{P}(d_i \mid S_p)\}$ with Eqs. [5](https://arxiv.org/html/2605.15016#S3.E5)–[6](https://arxiv.org/html/2605.15016#S3.E6); filter $R_i < T$

5: if $\max_i \tilde{P}(d_i \mid S_p) \geq \theta$ or $H$ below band then

6: return ranked diseases, $\mathcal{L}$

7: end if

8: $\mathcal{U} \leftarrow \text{TopGaps}(K, S_p, t)$ $\triangleright$ Eq. [7](https://arxiv.org/html/2605.15016#S3.E7); arity-capped gaps

9: if $\mathcal{U} = \emptyset$ then

10: return ranked diseases, $\mathcal{L}$

11: end if

12: $q \leftarrow \text{RenderQuestion}(\mathcal{U})$ $\triangleright$ template + constrained LLM paraphrase

13: $a \leftarrow \text{UserTurn}(q)$; $S_p \leftarrow S_p \cup \text{Parse}(a)$; $\mathcal{L} \leftarrow \mathcal{L} \cup \{(q, a)\}$

14: end for

15: return best-effort ranking, $\mathcal{L}$ (mark uncertainty if $\max_i \tilde{P}(d_i \mid S_p) < \theta$)

Algorithm–equation correspondence. Algorithm [1](https://arxiv.org/html/2605.15016#alg1) operationalizes the scoring framework defined in §[3.3](https://arxiv.org/html/2605.15016#S3.SS3). Line 4 computes disease energies $R_i$ via Eq. [5](https://arxiv.org/html/2605.15016#S3.E5) (summation over IDF-weighted symptom matches and penalty terms), then applies Eq. [6](https://arxiv.org/html/2605.15016#S3.E6) to obtain normalized pseudo-posteriors $\tilde{P}(d_i \mid S_p)$. The filter $R_i < T$ implements the energy gate; Line 5's threshold $\max_i \tilde{P}(d_i \mid S_p) \geq \theta$ is the mass gate. Line 8 invokes Eq. [7](https://arxiv.org/html/2605.15016#S3.E7) to prioritize missing slots by their expected information gain. All hyperparameters ($T, \theta, \tau_H, R_{\max}, \gamma$) are fixed at validation time and remain constant across experiments.
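The control flow of Algorithm 1 can be sketched as a loop over injected callables. This is an illustrative reading, not the released code: `score`, `top_gaps`, and `ask` are hypothetical stand-ins for the KB scorer, TopGaps, and the RenderQuestion/UserTurn/Parse chain; the status strings are invented labels; the empty-survivor case is mapped to the low-coverage exit as an assumption; and the final re-scoring after the last round is one interpretation of "best-effort ranking".

```python
import math

def softmax(energies):
    m = max(energies.values())
    exps = {d: math.exp(r - m) for d, r in energies.items()}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0.0)

def rank(score, evidence, T):
    """Energy gate (keep R_i >= T), then softmax; sorted by pseudo-posterior."""
    kept = {d: r for d, r in score(evidence).items() if r >= T}
    if not kept:
        return []
    p = softmax(kept)
    return sorted(p.items(), key=lambda kv: kv[1], reverse=True)

def cot_completion(score, top_gaps, ask, evidence, theta, T, tau_h, r_max):
    evidence = set(evidence)
    log = []  # CoT log of (question slots, parsed answer) pairs
    for _ in range(r_max):
        ranked = rank(score, evidence, T)
        if not ranked:
            return ranked, log, "low_kb_coverage"
        if ranked[0][1] >= theta or entropy(dict(ranked)) < tau_h:
            return ranked, log, "confident"   # mass gate or entropy band
        gaps = top_gaps(dict(ranked), evidence)
        if not gaps:
            return ranked, log, "no_gaps"
        answer = ask(gaps)                    # render, ask, parse
        evidence |= answer
        log.append((tuple(gaps), frozenset(answer)))
    ranked = rank(score, evidence, T)         # best-effort ranking after the cap
    status = "confident" if ranked and ranked[0][1] >= theta else "uncertain"
    return ranked, log, status

# Toy stand-ins: evidence "x" separates d1 from d2.
def toy_score(ev):
    return {"d1": 2.0 if "x" in ev else 0.5, "d2": 0.5}

def toy_gaps(p, ev):
    return [] if "x" in ev else ["x"]

def toy_ask(gaps):
    return set(gaps)  # the simulated user confirms every asked slot

ranked, log, status = cot_completion(
    toy_score, toy_gaps, toy_ask, evidence=set(),
    theta=0.8, T=0.3, tau_h=0.1, r_max=3,
)
```

In the toy run the first round is ambiguous (two equal energies), one question is asked, and the second round clears the mass gate.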

## 4 Experiments and analysis

This section jointly reports benchmarks, mechanistic interpretations, ablations, and takeaways relative to Sections 3–4 of the manuscript \(TSA predicates, Gibbs energies, softmax ranking, Algorithm[1](https://arxiv.org/html/2605.15016#alg1)\)\. We focus on patterns that recur across setups rather than overstating isolated point estimates\.

### 4.1 Main results

Benchmarks and data sources. Longitudinal risk uses synthesized vignettes on cleared public catalogs (Appendix [D](https://arxiv.org/html/2605.15016#A4)). Conversational suites follow MedQA's exam protocol \[[11](https://arxiv.org/html/2605.15016#bib.bib21)\], HealthBench rubrics \[[2](https://arxiv.org/html/2605.15016#bib.bib22)\] (12 themes, aggregate accuracy), DiSCQ discharge queries \[[15](https://arxiv.org/html/2605.15016#bib.bib23)\], and Time-MMD multimodal temporal tests \[[16](https://arxiv.org/html/2605.15016#bib.bib26)\]. All tasks are top-$k$ disease prediction (not NLG scoring); prompts and caps are matched across backbones (Appendix [F.2](https://arxiv.org/html/2605.15016#A6.SS2)).

Baselines. TimeCAP \[[14](https://arxiv.org/html/2605.15016#bib.bib20)\] contextualizes trajectories in text but omits programmatic trend–KB fusion. Google's public recipe chains tools without our IDF-weighted KB or gap-to-question routing. KARE \[[10](https://arxiv.org/html/2605.15016#bib.bib25)\] adds graph-community retrieval without TSA code branches. DirPred \[[22](https://arxiv.org/html/2605.15016#bib.bib24)\] clusters temporal risks without interactive questioning. Decoding parity is in Appendix [F.2](https://arxiv.org/html/2605.15016#A6.SS2).

#### 4.1.1 Medical record risk prediction

Table [1](https://arxiv.org/html/2605.15016#S4.T1) tests programmatic TSA predicates plus KB scoring (Appendix [E](https://arxiv.org/html/2605.15016#A5)), not trivia MCQA: $\log w^{\text{IDF}}$ and $\phi$ in Eq. [5](https://arxiv.org/html/2605.15016#S3.E5) curb invented slopes versus prompt-only stacks.

Accuracy and trade-offs. COTCAgent leads accuracy/F1/top-2 (90.47%/86.74%/92.18%); top-2 beats DirPred (92.18% vs. 83.98%) despite DirPred's strong top-1 (88.27%). DirPred lacks explicit symptom/trend predicates, so mass thins after energy gate $T$ (Eq. [6](https://arxiv.org/html/2605.15016#S3.E6)). KARE leads recall (83.24%); we reach 82.91% with better top-1. Frontier readers and o4-mini match on accuracy but lag tail metrics where typed evidence helps.

Agent stacks (TimeCAP–DirPred) still trail strongest readers and o4-mini on tail-heavy metrics absent typed trends plus an IDF-regularized Gibbs layer; CIs narrow slightly when structure is fixed. Paired tests (5 runs) favor COTCAgent over KARE ($p = 0.04$) and DirPred ($p < 0.01$) longitudinally; HealthBench deltas are smaller with overlapping CIs.

Table 1: Longitudinal medical record benchmark (mean ± std over five runs). Boldface marks the best mean in each metric row.

#### 4.1.2 Conversational risk prediction

We compare COTCAgent against established agentic pipelines under unified prompts (Appendix [F.2](https://arxiv.org/html/2605.15016#A6.SS2)). The softmax construction (Eqs. [5](https://arxiv.org/html/2605.15016#S3.E5)–[6](https://arxiv.org/html/2605.15016#S3.E6)) is intentionally interpretable ranking machinery: conversational turns inflate $S_p$ with parsed slots, reshaping masses without spawning free-form rationales disconnected from KB rows.

Across these suites, open-ended tracks remain harder than static MedQA because each turn intertwines lexical ambiguity with longitudinal consistency checks. Baichuan-M2 + COTCAgent reaches 70.41% on HealthBench (+9.8 vs. TimeCAP, +7.5 vs. Google agents, +8.7 vs. DirPred, +0.8 vs. KARE, +2.2 vs. vendor tool loop; see Appendix [G](https://arxiv.org/html/2605.15016#A7) for complete results). The same architectural lift appears on GPT-4o, Gemini 2.5 Pro, and Claude 3.7 Sonnet, while Qwen3-Next-80B-A3B shows smaller absolute deltas, consistent with the intuition that weaker planners emit noisier TSA narratives whose predicates fail to ignite high-weight edges in Eq. [5](https://arxiv.org/html/2605.15016#S3.E5).

Interpretation versus KARE. KARE closes most of the HealthBench gap by aggressive graph widening; our stack instead sharpens softmax peaks when predicates match held-out conversational cues. Narrow wins over KARE should therefore *not* be read as dominance of ontology coverage; they arise where structured trends and KB gaps coincide. These improvements do *not* certify clinically helpful dialogue; they only certify automatic scoring under shared decoding.

![Refer to caption](https://arxiv.org/html/2605.15016v1/x3.png)

Figure 3: Conversational suites under matched decoding: MedQA and HealthBench accuracy (mean ± std) across five backbones. See Table [8](https://arxiv.org/html/2605.15016#A7.T8) for Google/DirPred rows and for Time-MMD / DiSCQ columns.

Figure [3](https://arxiv.org/html/2605.15016#S4.F3) summarizes the conversational suite results across five backbones. COTCAgent consistently outperforms baseline agentic pipelines on MedQA and HealthBench, with the largest gains observed on Baichuan-M2. The structured trend predicates and IDF-weighted KB matching prove particularly effective for HealthBench's consumer-health rubrics, where longitudinal consistency checks are critical. GPT-4o and Claude 3.7 Sonnet show similar relative lifts, while Qwen3-Next-80B-A3B exhibits smaller absolute deltas, consistent with weaker planners emitting noisier TSA narratives that fail to ignite high-weight edges in Eq. [5](https://arxiv.org/html/2605.15016#S3.E5). Complete numeric results across all four benchmarks are provided in Table [8](https://arxiv.org/html/2605.15016#A7.T8) of Appendix [G](https://arxiv.org/html/2605.15016#A7).

### 4.2 Ablations, representation probes, and discussion

Controlled ablations and KB sensitivity. We isolate modules on the longitudinal benchmark aligned with Table [1](https://arxiv.org/html/2605.15016#S4.T1). Rows sweep Qwen3-Next-80B-A3B, GPT-4o, Gemini 2.5 Pro, Claude 3.7 Sonnet, and Baichuan-M2 across: backbone only; backbone + TSA-generated narratives appended to prompts; backbone + frozen COTC scoring without regenerated TSA code each turn; full COTCAgent (Algorithm [1](https://arxiv.org/html/2605.15016#alg1)).

We ablate the Symptom–Trend–Disease KB on the same longitudinal benchmark (Baichuan-M2). Table 2 retains 25–100% of edges (prevalence-stratified subsampling). Accuracy drops 4.8 points at 50% of edges and 11.3 points at 25% (from 90.47% to 85.67% and 79.17%, respectively): the stack still lifts under sparsity, but performance peaks at full coverage. A "clinician-only" KB reaches 87.31%, so augmentation helps without being the sole driver. Evaluation text never enters KB construction (Appendix [D](https://arxiv.org/html/2605.15016#A4)); source lists and isolation safeguards are in Appendix [J](https://arxiv.org/html/2605.15016#A10).
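One way to picture prevalence-stratified subsampling is to group edges by a prevalence stratum and keep the same fraction within each group. The sketch below assumes a hypothetical edge record with a `stratum` field; the field name, the rounding rule, and the toy edges are illustrative and not taken from the released code.

```python
import random
from collections import defaultdict


def stratified_subsample(edges, keep_fraction, seed=0):
    """Keep roughly `keep_fraction` of the edges inside every stratum."""
    rng = random.Random(seed)  # fixed seed so the ablation is repeatable
    by_stratum = defaultdict(list)
    for edge in edges:
        by_stratum[edge["stratum"]].append(edge)

    kept = []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        # Keep at least one edge so no stratum disappears entirely.
        n_keep = max(1, round(keep_fraction * len(group)))
        kept.extend(rng.sample(group, n_keep))
    return kept


# Toy KB: 8 edges for common diseases, 4 for rare ones (hypothetical).
edges = [{"id": i, "stratum": "common"} for i in range(8)]
edges += [{"id": 8 + i, "stratum": "rare"} for i in range(4)]

half_kb = stratified_subsample(edges, 0.50)
quarter_kb = stratified_subsample(edges, 0.25)
```

Sampling within strata keeps the common-to-rare ratio of the retained edges close to that of the full KB, which is the property a prevalence-stratified ablation is meant to preserve.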

Table 2: KB edge ablation (Baichuan-M2 longitudinal). Bold: full-KB baseline from Table [1](https://arxiv.org/html/2605.15016#S4.T1).

Table 3: Round-by-round accuracy attribution (Baichuan-M2 on longitudinal benchmark).

Table 4: Longitudinal ablation (Acc. %, F1). Bold: best per backbone.

Table [4](https://arxiv.org/html/2605.15016#S4.T4): TSA-only adds +0.9–1.6 pts (trend text widens prompts, weak disease grounding); COTC-only adds +3.5–3.8 pts once IDF energies engage. Full Baichuan-M2 gains +12.1 pts over backbone vs. +9.9/+9.9/+10.0 on GPT-4o/Gemini/Claude; we attribute the gap to richer TSA predicates (3.2 vs. 2.4), earlier 32k saturation, and higher parse fidelity (91% vs. 84–87%).

Round-by-round attribution and representation probes. To isolate the contribution of multi-turn gap filling from the initial static ranking, we measured top-1 accuracy after each consultation round on the longitudinal benchmark (Baichuan-M2). Table 3 shows that roughly 70% of the total interactive gain accrues from the first question alone, demonstrating that targeted follow-up provides measurable benefit even under tight turn limits.

We monitor three heuristic probes on frozen encoder states: clustering coherence among identical disease strata, temporal smoothness across visit order, and semantic margin between disease prototypes \(Figure[4](https://arxiv.org/html/2605.15016#S4.F4)\)\. TSA raises temporal coherence \(0\.817 vs\. 0\.583 baseline\) by injecting phase\-aligned structure before the LM’s mixer blurs chronological cues\. Adding COTC increases semantic separation \(0\.864\) via sparse KB constraints that warp the softmax manifold\. Full COTCAgent peaks at 0\.879 coherence, mirroring Tables[1](https://arxiv.org/html/2605.15016#S4.T1)–[4](https://arxiv.org/html/2605.15016#S4.T4): energy re\-weighting tightens clusters even when token logits remain diffuse\.

![Refer to caption](https://arxiv.org/html/2605.15016v1/x4.png)

Figure 4: Representation probes for model variants (coherence, temporal smoothness, semantic margin).

Tables [1](https://arxiv.org/html/2605.15016#S4.T1)–[4](https://arxiv.org/html/2605.15016#S4.T4) and the probes support a narrow claim: tiered code, energies, and short completions temper unstructured decoding drift \[[17](https://arxiv.org/html/2605.15016#bib.bib16)\], yet softmax stays a heuristic ranker with slim margins vs. KARE on HealthBench (Appendix [G](https://arxiv.org/html/2605.15016#A7); Appendix [D](https://arxiv.org/html/2605.15016#A4) and Appendix [F.2](https://arxiv.org/html/2605.15016#A6.SS2)).

## 5 Conclusion

We propose COTCAgent, a probabilistic chain\-of\-thought completion stack that binds executable trend summaries, knowledge\-base energies, and bounded completions so longitudinal hypotheses stay traceable under matched decoding\. Empirical gains are strongest on ranking\-style objectives and representation probes, not on calibrated clinical risk or consumer\-style rubrics\. Prevalence\-aware modeling, tighter evidence governance at KB scale, clinician\-facing checks on intermediate artifacts, and deployment studies of latency and failure modes remain the natural next steps\.

## References

- \[1\] (2025) Improving llm mathematical reasoning capabilities using external tools. Stanford CS224R.
- \[2\] R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, et al. (2025) Healthbench: evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775.
- \[3\] N. Chan, F. Parker, W. Bennett, T. Wu, M. Y. Jia, J. Fackler, and K. Ghobadi (2024) Medtsllm: leveraging llms for multimodal medical time series analysis. arXiv preprint arXiv:2408.07773.
- \[4\] I. Daubechies (1992) Ten lectures on wavelets. SIAM.
- \[5\] L. N. DeLong, R. F. Mir, Z. Ji, F. N. C. Smith, and J. D. Fleuriot (2023) Neurosymbolic ai for reasoning on biomedical knowledge graphs. arXiv preprint arXiv:2307.08411.
- \[6\] A. Esteva, A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo, K. Chou, C. Cui, G. Corrado, S. Thrun, and J. Dean (2019) A guide to deep learning in healthcare. Nature Medicine 25(1), pp. 24–29.
- \[7\] M. Feli, I. Azimi, P. Liljeberg, and A. M. Rahmani (2025) An llm-powered agent for physiological data analysis: a case study on ppg-based heart rate estimation. arXiv preprint arXiv:2502.12836.
- \[8\] L. D. Fisher and D. Y. Lin (1999) Time-dependent covariates in the cox proportional-hazards regression model. Annual review of public health 20(1), pp. 145–157.
- \[9\] Á. García-Barragán, A. Sakor, M. Vidal, E. Menasalvas, J. C. S. Gonzalez, M. Provencio, and V. Robles (2025) NSSC: a neuro-symbolic ai system for enhancing accuracy of named entity recognition and linking from oncologic clinical notes. Medical & Biological Engineering & Computing 63(3), pp. 749–772.
- \[10\] P. Jiang, C. Xiao, M. Jiang, P. Bhatia, T. Kass-Hout, J. Sun, and J. Han (2024) Reasoning-enhanced healthcare predictions with knowledge graph community retrieval. arXiv preprint arXiv:2410.04585.
- \[11\] D. Jin, E. Pan, N. Oufattole, W. Weng, H. Fang, and P. Szolovits (2021) What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences 11(14), pp. 6421.
- \[12\] S. Khatwani, H. Cheng, M. Afshar, D. Dligach, and Y. Gao (2025) Brittleness and promise: knowledge graph based reward modeling for diagnostic reasoning. arXiv preprint arXiv:2509.18316.
- \[13\] D. Koller and N. Friedman (2009) Probabilistic graphical models: principles and techniques. MIT Press.
- \[14\] G. Lee, W. Yu, K. Shin, W. Cheng, and H. Chen (2025-04) Timecap: learning to contextualize, augment, and predict time series events with large language model agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 18082–18090.
- \[15\] E. Lehman (2025) Learning to ask like a physician: a discharge summary clinical questions (discq) dataset. bioRxiv preprint. Note: Dataset/draft as cited DiSCQ.
- \[16\] H. Liu, S. Xu, Z. Zhao, L. Kong, H. Prabhakar Kamarthi, A. Sasanur, M. Sharma, J. Cui, Q. Wen, C. Zhang, et al. (2024) Time-mmd: multi-domain multimodal dataset for time series analysis. Advances in Neural Information Processing Systems 37, pp. 77888–77933.
- \[17\] J. Liu, D. Zhu, Z. Bai, Y. He, H. Liao, H. Que, Z. Wang, C. Zhang, G. Zhang, J. Zhang, et al. (2025) A comprehensive survey on long context language modeling. arXiv preprint arXiv:2503.17407.
- \[18\] Y. Liu, T. Han, J. Han, Y. Li, H. Zhang, Z. Liu, J. Liu, X. Liu, Z. Liu, X. Liu, et al. (2024) ToolLLM: facilitating large language models to master 16000+ real-world apis. In International Conference on Learning Representations.
- \[19\] H. Luo, H. Feng, Q. Sun, C. Xu, K. Zheng, Y. Wang, T. Yang, H. Hu, Y. Tang, and D. Wang (2025) AgentMath: empowering mathematical reasoning for large language models via tool-augmented agent. arXiv preprint arXiv:2512.20745.
- \[20\] D. McDuff, M. Norouzi, S. Lundberg, J. Gao, E. Kiciman, S. Gombar, K. Patel, B. Lansdell, C. H. Teo, C. Liao, et al. (2023) Capabilities of gemini models in medicine. Google Research. Note: Preprint.
- \[21\] M. Mirbabaie, S. Stieglitz, and N. R. Frick (2021) Artificial intelligence in disease diagnostics: a critical review and classification on the current state of research guiding future direction. Health and Technology 11(4), pp. 693–731.
- \[22\] S. Niu, Q. Yin, J. Ma, Y. Song, Y. Xu, L. Bai, W. Pan, and X. Yang (2024) Enhancing healthcare decision support through explainable ai models for risk prediction. Decision Support Systems 181, pp. 114228.
- \[23\] H. Nori, M. Daswani, C. Kelly, S. Lundberg, M. T. Ribeiro, M. Wilson, and E. Horvitz (2025) Sequential diagnosis with language models. arXiv preprint arXiv:2506.22405.
- \[24\] C. Qian, X. Cong, C. Yang, W. Chen, J. Su, J. Zhang, Y. Zhang, Y. Liu, and Y. Li (2023) Communicative agents for software development. arXiv preprint arXiv:2307.07924.
- \[25\] M. Seeger (2004) Gaussian processes for machine learning. International journal of neural systems 14(02), pp. 69–106.
- \[26\] T. Shaik, X. Tao, L. Li, H. Xie, H. N. Dai, F. Zhao, and J. Yong (2023) Adaptive multi-agent deep reinforcement learning for timely healthcare interventions. arXiv preprint arXiv:2309.10980.
- \[27\] T. Shimamura, S. Imoto, R. Yamaguchi, A. Fujita, M. Nagasaki, and S. Miyano (2009) Recursive regularization for inferring gene networks from time-course gene expression profiles. BMC systems biology 3(1), pp. 41.
- \[28\] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewis, et al. (2025) Toward expert-level medical question answering with large language models. Nature Medicine 31(3), pp. 943–950.
- \[29\] E. J. Topol (2019) High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine 25(1), pp. 44–56.
- \[30\] M. Vidal, Y. Chudasama, H. Huang, D. Purohit, and M. Torrente (2025) Integrating knowledge graphs with symbolic ai: the path to interpretable hybrid ai systems in medicine. Journal of Web Semantics 84, pp. 100856.
- \[31\] X. Wang, Y. Li, M. Zhang, Y. Zhang, Y. Liu, X. Liu, and R. Zhang (2024) AgentBench: evaluating llms as agents. arXiv preprint arXiv:2402.11588.
- \[32\] Z. Wang, Z. Wang, B. Srinivasan, V. N. Ioannidis, H. Rangwala, and R. Anubhai (2023) Biobridge: bridging biomedical foundation models via knowledge graphs. arXiv preprint arXiv:2310.03320.
- \[33\] M. West and J. Harrison (1997) Bayesian forecasting and dynamic models. Springer.
- \[34\] M. Wornow, S. Bedi, M. A. Fuentes Hernandez, E. Steinberg, J. A. Fries, C. Ré, S. Koyejo, and N. H. Shah (2024) Context clues: evaluating long context models for clinical prediction tasks on EHRs. arXiv preprint arXiv:2412.16178.
- \[35\] R. Xu, P. Jiang, L. Luo, C. Xiao, A. Cross, S. Pan, J. Sun, and C. Yang (2025) A survey on unifying large language models and knowledge graphs for biomedicine and healthcare. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pp. 6195–6205.
- \[36\] S. Zhou, Z. Xu, M. Zhang, C. Xu, Y. Guo, Z. Zhan, and R. Zhang (2025) Large language models for disease diagnosis: a scoping review. npj Artificial Intelligence 1(1), pp. 9.
- \[37\] X. Zhou, J. Zhou, C. Wang, Q. Xie, K. Ding, C. Mao, and Y. Luo (2025) PH-llm: public health large language models for infoveillance. medRxiv. Note: Preprint.

##### Supplementary overview\.

- A: Statistical models available to the planner (below).
- B: Longitudinal vignette excerpt for synthetic case patient_0077.
- C: The same patient's dialog trace aligned with Alg. [1](https://arxiv.org/html/2605.15016#alg1) in the main text.
- D: KB governance details behind Table [7](https://arxiv.org/html/2605.15016#A4.T7).
- E: Routing sketches complementary to Section 3 of the manuscript.
- F: Qualitative reasoning figure plus baseline prompting (Appendix [F](https://arxiv.org/html/2605.15016#A6); main result tables in §[4](https://arxiv.org/html/2605.15016#S4)).
- G: Conversational-suite bar summary (Figure [3](https://arxiv.org/html/2605.15016#S4.F3)) plus full numeric table (Table [8](https://arxiv.org/html/2605.15016#A7.T8)).
- H–K: Round-by-round accuracy for all backbones, KB multiplicity radar, KB vs. evaluation isolation, and blinded clinician review of intermediate artifacts.

## Appendix A: Mathematical analysis methods available to TSA

The planner follows the decomposition $Q \rightarrow \Phi \rightarrow \Lambda \rightarrow C$ introduced with the TSA module in Section 3. Table [5](https://arxiv.org/html/2605.15016#A1.T5) is only a categorical menu: *only* the bucket implied by parsed intent $\Phi$ plus Appendix [E](https://arxiv.org/html/2605.15016#A5) is invoked; we never cascade the entire column for a single biomarker streak. Subsequent subsections recap GP, Bayesian structural-time-series, penalized VAR, Cox time-dependent covariates, and wavelet tools referenced by the codebase when the planner selects those branches.

Table 5: TSA toolbox by category (*candidate methods*; router picks a sparse subset per query following Appendix [E](https://arxiv.org/html/2605.15016#A5)).

### A.1 Gaussian Process Regression

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')) \tag{8}$$

where the mean function and covariance function are defined as:

$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})] \tag{9}$$

$$k(\mathbf{x}, \mathbf{x}') = \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))] \tag{10}$$

For observed data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i = f(\mathbf{x}_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, the posterior predictive distribution is:

$$f_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}(\bar{f}_*, \mathbb{V}[f_*]) \tag{11}$$

$$\bar{f}_* = \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} \tag{12}$$

$$\mathbb{V}[f_*] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_* \tag{13}$$

where $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ and $\mathbf{k}_{*i} = k(\mathbf{x}_*, \mathbf{x}_i)$.

Gaussian Process Regression\[[25](https://arxiv.org/html/2605.15016#bib.bib31)\]provides a flexible non\-parametric Bayesian framework for modeling complex temporal patterns in medical data\. This approach allows us to capture uncertainty in predictions naturally, which is crucial for clinical decision\-making where risk assessment is paramount\. The covariance function \(kernel\) encodes our assumptions about the function’s properties, such as smoothness, periodicity, and trends\. In medical applications, this enables modeling of physiological processes with varying temporal characteristics, from rapidly changing vital signs to slowly progressing chronic conditions\. The Bayesian nature of GPs facilitates incorporation of prior knowledge and provides full posterior distributions rather than point estimates, supporting probabilistic clinical interpretations\.
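The posterior mean and variance in Eqs. 12 and 13 can be sketched in a few lines for scalar inputs. This is a minimal illustration under stated assumptions: a zero prior mean, a squared-exponential kernel (the appendix does not fix a kernel), and a small standard-library linear solver in place of a numerical library. The names `rbf`, `solve`, and `gp_predict` are hypothetical.

```python
import math


def rbf(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel k(a, b); an assumed kernel choice."""
    return signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_predict(xs, ys, x_star, noise_var=0.01):
    """Posterior mean (Eq. 12) and variance (Eq. 13) at x_star."""
    n = len(xs)
    # K + sigma_n^2 I
    K = [[rbf(xs[i], xs[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x_star, xi) for xi in xs]
    alpha = solve(K, ys)      # (K + sigma_n^2 I)^{-1} y
    v = solve(K, k_star)      # (K + sigma_n^2 I)^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var


# Irregularly usable toy series: visit times and a smooth biomarker.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.sin(x) for x in xs]
mean_mid, var_mid = gp_predict(xs, ys, 1.5)
```

The variance returned far from any observation approaches the prior variance, which is the behavior that makes the GP useful for flagging sparsely sampled stretches of a record.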

### A.2 Bayesian Structural Time Series

The general formulation of Bayesian Structural Time Series models\[[33](https://arxiv.org/html/2605.15016#bib.bib32)\]incorporates multiple components:

$$y_t = \mu_t + \tau_t + \omega_t + \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, \sigma_\epsilon^2) \tag{14}$$

where $\mu_t$ represents the local level, $\tau_t$ the seasonal component, and $\omega_t$ the regression component. The state evolution follows:

$$\mu_t = \mu_{t-1} + \delta_{t-1} + \eta_{\mu,t}, \quad \eta_{\mu,t} \sim \mathcal{N}(0, \sigma_\mu^2) \tag{15}$$

$$\delta_t = \delta_{t-1} + \eta_{\delta,t}, \quad \eta_{\delta,t} \sim \mathcal{N}(0, \sigma_\delta^2) \tag{16}$$

$$\tau_t = -\sum_{j=1}^{S-1} \tau_{t-j} + \eta_{\tau,t}, \quad \eta_{\tau,t} \sim \mathcal{N}(0, \sigma_\tau^2) \tag{17}$$
The Bayesian approach assigns prior distributions to parameters:

$$\sigma_\epsilon^2, \sigma_\mu^2, \sigma_\delta^2, \sigma_\tau^2 \sim \text{Inverse-Gamma}(\alpha, \beta) \tag{18}$$
Posterior inference is performed using Markov Chain Monte Carlo methods, enabling full uncertainty quantification\.

Bayesian Structural Time Series models provide a comprehensive framework for decomposing medical time series into interpretable components while rigorously quantifying uncertainty\. This approach is particularly valuable for healthcare applications where understanding the contribution of different factors \(trends, seasonality, interventions\) is essential for clinical interpretation\. The Bayesian formulation allows incorporation of domain knowledge through informative priors, which is especially useful when dealing with limited data or rare conditions\. The model’s ability to generate probabilistic forecasts with credible intervals supports risk\-stratified clinical decision making, while the structural components facilitate causal inference about interventions or disease progression\.
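To make the state recursions in Eqs. 14 to 17 concrete, the following sketch simulates them forward. It is a generative illustration only: the regression term $\omega_t$ is omitted, $\delta_t$ is read as the local slope, the season length $S$ is inferred from the length of `seasonal_init`, and all default values are assumptions. MCMC inference over the variances (Eq. 18) is not shown.

```python
import random


def simulate_bsts(n_steps, seasonal_init, level=0.0, slope=0.1,
                  sigma_eps=0.1, sigma_mu=0.05, sigma_delta=0.01,
                  sigma_tau=0.02, seed=0):
    """Simulate y_t = mu_t + tau_t + eps_t with omega_t omitted.

    seasonal_init holds tau_{t-1}, ..., tau_{t-(S-1)}, most recent first.
    """
    rng = random.Random(seed)
    mu, delta = level, slope
    recent_tau = list(seasonal_init)
    ys, levels, seasonals = [], [], []
    for _ in range(n_steps):
        mu = mu + delta + rng.gauss(0.0, sigma_mu)           # Eq. 15
        delta = delta + rng.gauss(0.0, sigma_delta)          # Eq. 16
        tau = -sum(recent_tau) + rng.gauss(0.0, sigma_tau)   # Eq. 17
        recent_tau = [tau] + recent_tau[:-1]
        ys.append(mu + tau + rng.gauss(0.0, sigma_eps))      # Eq. 14
        levels.append(mu)
        seasonals.append(tau)
    return ys, levels, seasonals


# Season length S = 4 (three stored seasonal states), noisy run.
ys, levels, seasonals = simulate_bsts(12, [1.0, -0.5, 0.25])

# Noise-free run: the level is a straight line and the seasonal
# component repeats with period S and sums to zero over one period.
_, clean_levels, clean_seasonals = simulate_bsts(
    8, [1.0, -0.5, 0.25],
    sigma_eps=0.0, sigma_mu=0.0, sigma_delta=0.0, sigma_tau=0.0)
```

Setting every variance to zero is a quick sanity check on the recursion: any drift in the seasonal sum then points to an indexing error rather than noise.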

### A.3 Vector Autoregression with Regularization

The Vector Autoregression \(VAR\) model for multivariate medical time series\[[27](https://arxiv.org/html/2605.15016#bib.bib33)\]is formulated as:

$$\mathbf{y}_t = \mathbf{A}_1 \mathbf{y}_{t-1} + \mathbf{A}_2 \mathbf{y}_{t-2} + \cdots + \mathbf{A}_p \mathbf{y}_{t-p} + \boldsymbol{\epsilon}_t, \quad \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}) \tag{19}$$

where $\mathbf{y}_t \in \mathbb{R}^m$ represents multiple medical indicators at time $t$. To handle high-dimensional data and avoid overfitting, we employ regularized estimation:

$$\hat{\mathbf{A}} = \arg\min_{\mathbf{A}} \left\{ \sum_{t=p+1}^{T} \left\| \mathbf{y}_t - \sum_{j=1}^{p} \mathbf{A}_j \mathbf{y}_{t-j} \right\|_2^2 + \lambda_1 \sum_{j=1}^{p} \|\mathbf{A}_j\|_1 + \lambda_2 \sum_{j=1}^{p} \|\mathbf{A}_j\|_F^2 \right\} \tag{20}$$

The combined L1 and L2 regularization (Elastic Net) promotes both sparsity and stability in parameter estimates. The covariance matrix $\boldsymbol{\Sigma}$ captures contemporaneous correlations among indicators.

Vector Autoregression models extend univariate time series analysis to capture rich interdependencies among multiple medical indicators simultaneously\. This multivariate approach is essential for healthcare applications where physiological systems exhibit complex feedback mechanisms and compensatory pathways\. The regularized estimation framework addresses the curse of dimensionality that arises when modeling numerous biomarkers, ensuring robust parameter estimates even with limited temporal observations\. VAR models facilitate dynamic analysis through impulse response functions and forecast error variance decomposition, providing insights into how shocks to one biomarker propagate through the system and affect other indicators over time, which is invaluable for understanding disease pathophysiology and treatment effects\.
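The objective in Eq. 20 can be evaluated directly for a candidate set of coefficient matrices. The sketch below only computes the objective and does not perform the minimization; it assumes $\|\mathbf{A}_j\|_1$ is the entrywise L1 norm (the usual lasso reading), and the function name and toy series are hypothetical.

```python
def var_elastic_net_objective(series, coeffs, lam1, lam2):
    """Evaluate the Eq. 20 objective.

    series: list of T observation vectors, each of length m.
    coeffs: list of p coefficient matrices A_1..A_p, each m x m.
    """
    p = len(coeffs)
    m = len(series[0])
    fit = 0.0
    for t in range(p, len(series)):          # t = p+1..T in 1-based indexing
        for row in range(m):
            # Prediction for one indicator: sum_j (A_j y_{t-j})[row]
            pred = sum(coeffs[j][row][col] * series[t - 1 - j][col]
                       for j in range(p) for col in range(m))
            fit += (series[t][row] - pred) ** 2
    l1 = sum(abs(a) for A in coeffs for r in A for a in r)   # entrywise L1
    l2 = sum(a * a for A in coeffs for r in A for a in r)    # squared Frobenius
    return fit + lam1 * l1 + lam2 * l2


# Noise-free VAR(1) toy series with two indicators.
A_true = [[0.5, 0.1], [0.0, 0.3]]
series = [[1.0, 2.0]]
for _ in range(5):
    prev = series[-1]
    series.append([sum(A_true[r][c] * prev[c] for c in range(2))
                   for r in range(2)])

fit_only = var_elastic_net_objective(series, [A_true], 0.0, 0.0)
penalized = var_elastic_net_objective(series, [A_true], 1.0, 1.0)
```

On this noise-free series the true matrix has zero fit error, so the penalized value isolates the two regularization terms.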

### A.4 Cox Proportional Hazards Model with Time-Dependent Covariates

The extended Cox model incorporating time\-dependent covariates\[[8](https://arxiv.org/html/2605.15016#bib.bib34)\]is specified as:

$$\lambda(t \mid \mathbf{Z}(t)) = \lambda_0(t) \exp\left( \boldsymbol{\beta}^T \mathbf{Z}(t) + \boldsymbol{\gamma}^T \mathbf{X} \right) \tag{21}$$

where $\mathbf{Z}(t)$ represents time-varying biomarkers and $\mathbf{X}$ denotes baseline covariates. The partial likelihood function for right-censored data is:

$$L(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \prod_{i=1}^{n} \left[ \frac{\exp\left( \boldsymbol{\beta}^T \mathbf{Z}_i(t_i) + \boldsymbol{\gamma}^T \mathbf{X}_i \right)}{\sum_{j \in R(t_i)} \exp\left( \boldsymbol{\beta}^T \mathbf{Z}_j(t_i) + \boldsymbol{\gamma}^T \mathbf{X}_j \right)} \right]^{\delta_i} \tag{22}$$

where $\delta_i$ is the event indicator for subject $i$ and $R(t_i)$ is the set of subjects still at risk at time $t_i$. Time-dependent predictive accuracy is assessed using cumulative/dynamic ROC curves:

$$\text{AUC}(t) = \Pr(M_i > M_j \mid T_i = t, T_j > t) \tag{23}$$

where $M_i$ represents the prognostic index for subject $i$.

The Cox Proportional Hazards model with time\-dependent covariates represents a powerful framework for dynamic risk prediction in longitudinal medical studies\. This approach allows risk estimates to evolve as new biomarker measurements become available, reflecting the changing health status of patients over time\. The partial likelihood estimation efficiently handles censored observations, which are ubiquitous in clinical follow\-up data\. The incorporation of both time\-varying and fixed covariates enables comprehensive risk assessment that accounts for both dynamic processes and stable patient characteristics\. Time\-dependent ROC analysis provides measures of predictive accuracy that acknowledge the temporal nature of prognostic assessment, offering clinicians insight into how well biomarkers discriminate between outcomes at specific time horizons, which is crucial for staging interventions and monitoring disease progression\.
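The log of the partial likelihood in Eq. 22 is straightforward to compute once the linear predictor can be evaluated for any subject at any event time. The sketch below assumes no tied event times and takes a hypothetical callback `linear_predictor(j, t)` returning $\boldsymbol{\beta}^T \mathbf{Z}_j(t) + \boldsymbol{\gamma}^T \mathbf{X}_j$; the toy covariate trajectories are illustrative.

```python
import math


def cox_log_partial_likelihood(times, events, linear_predictor):
    """Log of Eq. 22 for right-censored data without tied event times.

    times: observed follow-up time per subject.
    events: 1 if the event was observed, 0 if censored (delta_i).
    linear_predictor(j, t): beta^T Z_j(t) + gamma^T X_j for subject j at time t.
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects enter only through risk sets
        # Risk set R(t_i): everyone still under observation at t_i.
        risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
        log_denom = math.log(sum(math.exp(linear_predictor(j, t_i))
                                 for j in risk_set))
        total += linear_predictor(i, t_i) - log_denom
    return total


# Hypothetical biomarker that drifts linearly over time for each subject.
baseline = [0.2, -0.1, 0.4]
drift = [0.05, 0.00, -0.02]
beta = 0.8


def eta(j, t):
    return beta * (baseline[j] + drift[j] * t)


times = [2.0, 5.0, 7.0]
events = [1, 0, 1]
log_pl = cox_log_partial_likelihood(times, events, eta)
```

Evaluating every subject's covariates at the event time $t_i$, rather than at their own last visit, is what distinguishes the time-dependent form from the baseline-only Cox model.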

### A.5 Wavelet Transform Analysis

The continuous wavelet transform \[[4](https://arxiv.org/html/2605.15016#bib.bib35)\] of a medical time series $x(t)$ is defined as:

$$W_x(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t) \, \psi^*\left( \frac{t - b}{a} \right) dt \tag{24}$$

where $\psi(t)$ is the mother wavelet, $a$ is the scale parameter, and $b$ is the translation parameter. For discrete medical measurements, we employ the discrete wavelet transform:

$$W_\phi(j_0, k) = \frac{1}{\sqrt{M}} \sum_t x(t) \, \phi_{j_0,k}(t) \tag{25}$$

$$W_\psi(j, k) = \frac{1}{\sqrt{M}} \sum_t x(t) \, \psi_{j,k}(t), \quad j \geq j_0 \tag{26}$$

The wavelet coefficients capture time-frequency localization:

$$x(t) = \frac{1}{\sqrt{M}} \sum_k W_\phi(j_0, k) \, \phi_{j_0,k}(t) + \frac{1}{\sqrt{M}} \sum_{j=j_0}^{\infty} \sum_k W_\psi(j, k) \, \psi_{j,k}(t) \tag{27}$$

Wavelet coherence between two signals $x(t)$ and $y(t)$ measures localized correlation:

$$R_{xy}(a, b) = \frac{|S(a^{-1} W_{xy}(a, b))|^2}{S(a^{-1} |W_x(a, b)|^2) \, S(a^{-1} |W_y(a, b)|^2)} \tag{28}$$
Wavelet Transform Analysis provides a multiresolution framework for examining medical time series across different temporal scales simultaneously\. This approach is particularly well\-suited for physiological signals that exhibit non\-stationary characteristics and contain information at multiple frequencies, from high\-frequency oscillations to slow trends\. The time\-frequency localization capability allows identification of transient events and periodic patterns that may be associated with specific pathological states or treatment responses\. Wavelet coherence analysis extends this to multivariate settings, revealing how relationships between different biomarkers evolve over time and across frequency bands, offering insights into regulatory mechanisms and compensatory pathways in physiological systems under various health conditions\.
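
The analysis and synthesis steps in Eqs. 25 to 27 can be illustrated with a minimal sketch. It assumes the Haar wavelet, a signal whose length is a power of two, and an orthonormal per-level scaling of 1/sqrt(2) rather than the 1/sqrt(M) convention written above; the function names `haar_dwt` and `haar_idwt` are hypothetical and do not come from the released code. The example signal reuses the eight Jaundice values from Appendix B and treats them as evenly spaced, which ignores their irregular timestamps.

```python
import math


def haar_dwt(x):
    """Full Haar decomposition of a sequence whose length is a power of two.

    Returns (approx, details): the coarsest approximation coefficients
    (a list of length 1) and the detail coefficients per level,
    ordered from the coarsest level to the finest.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    s = math.sqrt(2.0)
    approx = [float(v) for v in x]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients: scaled local differences (wavelet part).
        details.insert(0, [(a - b) / s for a, b in pairs])
        # Approximation coefficients: scaled local averages (scaling part).
        approx = [(a + b) / s for a, b in pairs]
    return approx, details


def haar_idwt(approx, details):
    """Invert haar_dwt by rebuilding the signal from coarse to fine."""
    s = math.sqrt(2.0)
    current = list(approx)
    for level in details:
        rebuilt = []
        for a, w in zip(current, level):
            rebuilt.extend([(a + w) / s, (a - w) / s])
        current = rebuilt
    return current


# Jaundice measurement values from Appendix B, treated as evenly spaced.
signal = [13.1, 79.27, 59.6, 62.76, 80.4, 91.37, 95.77, 29.71]
approx, details = haar_dwt(signal)
reconstructed = haar_idwt(approx, details)
```

Because the transform is orthonormal under these assumptions, the reconstruction matches the input up to floating-point error and the coefficient energy equals the signal energy. A production analysis of irregularly sampled records would need resampling or a dedicated wavelet library first.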

## Appendix B: Temporal excerpt for synthetic case patient\_0077

This synthetic longitudinal vignette is the raw evidence layout used by Appendix [C](https://arxiv.org/html/2605.15016#A3) and is not itself a benchmark split\. Severity labels exemplify categorical trend channels that TSA summarizes before COTC matching; year columns denote representative events condensed from the vignette chronology \(YYYY\-\-MM\)\.

### B\.1 Basic signs timeline

Table [6](https://arxiv.org/html/2605.15016#A2.T6) lists six hallmark symptoms tracked in the Basic Signs block\.

Table 6: Basic sign severities for patient\_0077 \(synthetic vignette; not from MIMIC\)\.

| Symptom Name | Symptom ID | Severity entries \(in order across the 2020–2024 columns\) |
| --- | --- | --- |
| Normal Alpha\-fetoprotein | S595517\_016 | Severe \(01\-14\); Critical \(08\-12\); \-; \- |
| Hematemesis | S225349\_017 | \-; \- |
| Muscle Pain | S501538\_013 | \-; Extreme \(06\-16\); Extreme \(08\-15\) |
| Limb Numbness | S793050\_018 | \-; Critical \(01\-27\); Minor \(09\-23\); Mild \(03\-22\); Moderate \(04\-21\); Minor \(09\-18\) |
| Headache | S376299\_020 | \-; Extreme \(07\-03\); Extreme \(10\-31\); Medium \(01\-29\); None \(08\-27\); Severe \(02\-17\); None \(09\-15\); Minor \(11\-14\); Mild \(12\-14\) |
| Epistaxis | S256542\_016 | Moderate \(03\-05\); Moderate \(09\-01\); Medium \(02\-23\); Extreme \(10\-21\); \-; \- |

### B\.2 Vitals\-style quantitative streams

Synthetic streams used only for planner stress\-testing \(units omitted in this toy JSON export\):

- Epistaxis \(BP/Glucose\) \(ID: S643823\_018\)
  - Time Series: 2025\-03\-27 04:21:23, 2025\-11\-22 22:57:03, 2026\-10\-18 19:31:40, 2026\-12\-17 09:06:53
  - Severity: Extreme, Minor, Severe, Extreme
- Chills \(ID: I13432\)
  - Time Series: 2025\-04\-22 06:29:54, 2025\-09\-19 12:28:24, 2026\-08\-15 19:24:18, 2026\-11\-13 08:16:43, 2027\-09\-09 03:19:32, 2028\-08\-04 05:13:54, 2029\-05\-31 06:12:25, 2029\-08\-29 20:17:41, 2029\-09\-28 02:18:48
  - Measurement Values: 93\.52, 62\.73, 93\.74, 26\.85, 10\.47, 66\.25, 27\.7, 54\.59, 58\.51
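
Streams like these are sampled at uneven intervals, so a slope has to be computed against elapsed time rather than sample index. The following is a minimal sketch of one robust option, the Theil-Sen estimator (median of pairwise slopes), applied to the Chills stream. The helper name `theil_sen_slope` and the choice of days as the time unit are assumptions for illustration; the paper does not state that TSA computes its reported slopes this way.

```python
from datetime import datetime
from itertools import combinations
from statistics import median


def theil_sen_slope(timestamps, values):
    """Median of all pairwise slopes, in value units per day."""
    t0 = timestamps[0]
    days = [(t - t0).total_seconds() / 86400.0 for t in timestamps]
    slopes = [
        (values[j] - values[i]) / (days[j] - days[i])
        for i, j in combinations(range(len(days)), 2)
        if days[j] != days[i]  # skip duplicate timestamps
    ]
    if not slopes:
        raise ValueError("need at least two distinct time points")
    return median(slopes)


# Chills stream (ID: I13432) copied from the listing in this appendix.
raw_stamps = [
    "2025-04-22 06:29:54", "2025-09-19 12:28:24", "2026-08-15 19:24:18",
    "2026-11-13 08:16:43", "2027-09-09 03:19:32", "2028-08-04 05:13:54",
    "2029-05-31 06:12:25", "2029-08-29 20:17:41", "2029-09-28 02:18:48",
]
stamps = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in raw_stamps]
values = [93.52, 62.73, 93.74, 26.85, 10.47, 66.25, 27.7, 54.59, 58.51]

slope_per_day = theil_sen_slope(stamps, values)
```

The median makes the estimate insensitive to a few outlying readings, which is why Theil-Sen is a common fallback for short clinical series.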

### B\.3 Lifestyle and counseling indicators

- Dysphagia \(ID: I40646\)
  - Time Series: 2025\-09\-14 18:20:30, 2025\-12\-13 03:16:46, 2026\-07\-11 05:31:57, 2026\-08\-10 06:04:36, 2026\-09\-09 19:30:48, 2026\-11\-08 00:20:51
  - Measurement Values: 95\.27, 79\.86, 65\.43, 28\.64, 36\.92, 10\.24
- Jaundice \(ID: I45555\)
  - Time Series: 2025\-03\-15 01:41:35, 2025\-10\-11 04:59:48, 2025\-11\-10 06:25:33, 2026\-03\-10 06:41:10, 2027\-02\-03 06:05:40, 2027\-08\-02 00:17:19, 2027\-09\-01 07:33:11, 2028\-05\-28 14:49:07
  - Measurement Values: 13\.1, 79\.27, 59\.6, 62\.76, 80\.4, 91\.37, 95\.77, 29\.71

### B\.4 Static patient attributes

Summary of gold labels bundled with the vignette \(not inferred online by the LM\):

- Patient ID: patient\_0077
- Confirmed Diseases:
  1. Disease ID: D006229; Name: Mild Gouty Arthritis; Description: Detailed medical description of mild gouty arthritis
  2. Disease ID: D007306; Name: Advanced Adrenal Cortical Insufficiency Damage; Description: Detailed medical description of advanced adrenal cortical insufficiency damage
  3. Disease ID: D007862; Name: Severe Liver Cirrhosis; Description: Detailed medical description of severe liver cirrhosis
  4. Disease ID: D006392; Name: Recurrent Bacillary Dysentery Degeneration; Description: Detailed medical description of recurrent bacillary dysentery degeneration
- Database\-matched Symptoms \(Symptom ID \+ Name\):
  1. S002797\_004: Infertility
  2. S000797\_004: Rhinorrhea
  3. S007163\_012: Skin Rash
  4. S004096\_005: Muscle Spasm
  5. S002613\_011: Dysphagia
  6. S006479\_004: Tinnitus
  7. S000027\_001: Muscle Weakness
- Database\-matched Indicators \(Indicator ID \+ Name\):
  1. I027623: Mild Increase in Neutrophil Percentage
  2. I073173: Elevated Hydroxybutyrate Dehydrogenase
  3. I053357: Normal LDL Cholesterol
  4. I020660: Significant Decrease in Uric Acid
- Total indicators: 12 \(including streams above plus basic signs block\)

## Appendix C: Worked consultation trace \(patient\_0077\)

This appendix instantiates Alg\. [1](https://arxiv.org/html/2605.15016#alg1) from the manuscript: gaps come from unresolved KB slots; RenderQuestion emits the turns below; Parse maps answers back into discrete evidence increments for Eqs\. [5](https://arxiv.org/html/2605.15016#S3.E5)–[6](https://arxiv.org/html/2605.15016#S3.E6)\. Numeric “probabilities” quoted in illustrative prompts/logs are shorthand *ranks*; the released system consumes energy scores $R_{i}$ and softmax $\tilde{P}$ exactly as defined in Section 3\. The vignette covers 2025–2031 timestamps consistent with Appendix [B](https://arxiv.org/html/2605.15016#A2)\.

### C\.1 Rendered dialog \(truncated transcript\)

Chief complaints: occasional hematemesis \(six months\) and worsening headache \(one month\)\.

1. Turn 1: Patient: I have occasional hematemesis \(6 months\) and worse headaches \(1 month\)\. Is this linked to my liver disease?
2. Turn 2: COTCAgent: Your AFP rose from "Severe" to "Critical" \(Aug 2026\) then fell to "Severe" \(Jan 2027\)\. Did you have abdominal pain, loss of appetite, or drink alcohol in mid\-2026?
3. Turn 3: Patient: Severe abdominal pain \(Jul 2026, no hospital\)\. I drank half a catty of liquor daily then\.
4. Turn 4: COTCAgent: Your headache was "Severe" \(Feb 2029\), "Medium" \(Jun 2031\), now worse\. Any high blood pressure, blurred vision, or painkillers \(e\.g\., ibuprofen\) lately?
5. Turn 5: Patient: No blood pressure check, but occasional blurred vision\. No painkillers; poor sleep, stay up late\.
6. Turn 6: COTCAgent: Final questions: How often is hematemesis? Color? Any black stools?
7. Turn 7: Patient: 1x/month, dark red\. No black stools, but weakness after\.
8. Turn 8: COTCAgent: Thank you\. Below is your risk analysis\.

### C\.2 Backend module traces

#### C\.2\.1 TSA \(Temporal\-Statistics Adapter\)

These snippets summarize the longitudinal signals forwarded to symptom/trend predicates \(pipeline in Fig\. [2](https://arxiv.org/html/2605.15016#S3.F2) of the manuscript\)\.

##### TSA planner prompt\.

```
# TSA Module Task
Role: Analyze patient_0077’s 2025--2031 records.
Task:
1. Extract trends for AFP, Hematemesis, Headache:
   - Severity slope, sudden changes, abnormal duration.
2. Output: [Indicator, Time Window, Trend, Inflection Points].
Constraints:
- "Critical" AFP = possible liver disease progression.
- Single "None" for Headache ≈ resolved.
```

##### TSA narrative output\.

```
=== Normal Alpha-fetoprotein ===
Window: 2025-06-18 to 2027-01-09
Trend: Upward (Progressive) | Slope: 0.001892
Inflections: [{'Time': '2026-08-12', 'Prev': 'Severe',
               'Curr': 'Critical', 'Change': 1},
              {'Time': '2027-01-09', 'Prev': 'Critical',
               'Curr': 'Severe', 'Change': 1}]

=== Headache ===
Window: 2026-07-03 to 2031-06-07
Trend: Downward (Improving) → Recent Worsening | Slope: -0.000945
Inflections: [{'Time': '2029-02-17', 'Prev': 'Extreme',
               'Curr': 'Severe', 'Change': 2}]

=== Hematemesis ===
Window: 2025-11-06 to 2027-07-29
Trend: Upward (Progressive) | Slope: 0.002108
Inflections: [{'Time': '2027-07-29', 'Prev': 'Mild',
               'Curr': 'Critical', 'Change': 3}]
```
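
The `Inflections` entries in the log above can be read as jumps on an ordinal severity scale. The sketch below shows one way such entries could be derived. The rank table is an assumption: the paper does not publish its severity encoding, and the ordering used here (Critical below Extreme, Medium equal to Moderate) was chosen only because it reproduces the unsigned `Change` values printed in the log. The function name `find_inflections` is hypothetical.

```python
# Hypothetical ordinal encoding, inferred from the Change values in the log;
# it is not confirmed by the paper.
SEVERITY_RANK = {
    "None": 0, "Minor": 1, "Mild": 2, "Medium": 3, "Moderate": 3,
    "Severe": 4, "Critical": 5, "Extreme": 6,
}


def find_inflections(events, min_jump=1):
    """Flag consecutive readings whose severity rank differs by >= min_jump.

    events: list of (iso_date, severity_label) tuples sorted by date.
    Returns dicts shaped like the log entries, with an unsigned Change.
    """
    flagged = []
    for (_, prev), (when, curr) in zip(events, events[1:]):
        change = abs(SEVERITY_RANK[curr] - SEVERITY_RANK[prev])
        if change >= min_jump:
            flagged.append(
                {"Time": when, "Prev": prev, "Curr": curr, "Change": change}
            )
    return flagged


# AFP readings; the first date is the window start from the log and its
# label is assumed from the Prev field of the first inflection.
afp = [
    ("2025-06-18", "Severe"),
    ("2026-08-12", "Critical"),
    ("2027-01-09", "Severe"),
]
afp_inflections = find_inflections(afp)
```

Reporting an unsigned change loses the direction of the jump, so a signed difference may be preferable when the downstream predicate distinguishes worsening from recovery.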

#### C\.2\.2 COTC matching \(illustrative\)

Reminder: illustrative “Prob” traces are shorthand; production runs accumulate evidence in Eq\. [5](https://arxiv.org/html/2605.15016#S3.E5) and normalize with Eq\. [6](https://arxiv.org/html/2605.15016#S3.E6)\.

##### COTC planner prompt \(illustrative\)\.

```
# Probabilistic CoT Task
Role: Rank disease risks for patient_0077.
Input:
1. TSA trends: AFP up, Hematemesis progress, Headache worse.
2. Knowledge Base:
   - AFP Critical + Hematemesis Critical → Liver Cirrhosis (0.85);
   - Headache worse + Blurred Vision → Hypertensive Encephalopathy (0.6).
Task:
1. Calculate probabilities (use prior diagnoses: Liver Cirrhosis, Gout).
2. Rank top 5 diseases + explain logic.
3. List gaps for consultation.
Constraints:
- High confidence requires corroborating trend matches.
```

##### Initial ranked hypotheses \(illustrative\)\.

```
=== Initial Risk Ranking (Top 5) ===
1. Severe Liver Cirrhosis Exacerbation
   - Logic: AFP up (Critical) + Hematemesis up (Critical) + prior
   - Prob: 0.85

2. Hypertensive Encephalopathy
   - Logic: Headache worse + patient’s blurred vision
   - Prob: 0.60

3. Acute Gastric Hemorrhage
   - Logic: Hematemesis (suspected dark red) + unconfirmed alcohol
   - Prob: 0.45

4. Gouty Arthritis Flare-up
   - Logic: Prior Gout + no recent joint pain (gap)
   - Prob: 0.20

5. Migraine
   - Logic: Past "Extreme" headache + no aura (gap)
   - Prob: 0.15

=== Gaps to Resolve ===
1. AFP spike cause (mid-2026): Abdominal pain? Alcohol?
2. Headache: High BP? Painkillers?
3. Hematemesis: Frequency? Color? Black stools?
```

#### C\.2\.3 Gap\-to\-question mapping

These questions correspond to unresolved slots enumerated by TopGaps in Alg\. [1](https://arxiv.org/html/2605.15016#alg1)\.

##### Questions tied to structured gaps\.

1. Q1 \(AFP gap\): "Did you have abdominal pain or drink alcohol in mid\-2026?"
2. Q2 \(Headache gap\): "Any high BP, blurred vision, or painkillers lately?"
3. Q3 \(Hematemesis gap\): "Hematemesis frequency, color, and black stools?"

##### Updated ranking after user answers \(illustrative\)\.

```
=== Updated Risk Ranking ===
1. Severe Liver Cirrhosis Exacerbation
   - New: + Alcohol (mid-2026) + abdominal pain (Jul 2026)
   - Prob: 0.85 → 0.92

2. Hypertensive Encephalopathy
   - New: + Confirmed blurred vision + poor sleep (BP risk)
   - Prob: 0.60 → 0.75

3. Acute Gastric Hemorrhage
   - New: + Dark red hematemesis (monthly) + alcohol (gastric irritant)
   - Prob: 0.45 → 0.55

4. Gouty Arthritis Flare-up
   - New: - No joint pain reported
   - Prob: 0.20 → 0.10

5. Migraine
   - New: - No aura + blurred vision = BP link
   - Prob: 0.15 → 0.05
```
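
The illustrative "Prob" values above do not sum to one, which is consistent with their being shorthand ranks. As a minimal sketch of what score accumulation followed by softmax normalization might look like, the example below adds signed evidence increments to per-disease scores and renormalizes. All numbers, the helper names `softmax` and `apply_evidence`, the temperature parameter, and the sign convention (larger score means stronger support) are assumptions for illustration; the exact forms of Eqs. 5 and 6 are defined in Section 3 and are not reproduced here.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a dict of disease -> score."""
    peak = max(scores.values())
    exps = {d: math.exp((s - peak) / temperature) for d, s in scores.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}


def apply_evidence(scores, increments):
    """Return a new score dict after adding signed evidence increments."""
    updated = dict(scores)
    for disease, delta in increments.items():
        updated[disease] = updated.get(disease, 0.0) + delta
    return updated


# Hypothetical scores before consultation (not taken from the paper).
scores = {
    "Liver Cirrhosis": 2.0,
    "Hypertensive Encephalopathy": 1.2,
    "Acute Gastric Hemorrhage": 0.8,
    "Gout": 0.1,
    "Migraine": 0.0,
}
# Hypothetical increments parsed from the three answers.
increments = {
    "Liver Cirrhosis": 0.5,
    "Hypertensive Encephalopathy": 0.4,
    "Acute Gastric Hemorrhage": 0.2,
    "Gout": -0.4,
    "Migraine": -0.5,
}

before = softmax(scores)
after = softmax(apply_evidence(scores, increments))
ranking = sorted(after, key=after.get, reverse=True)
```

If Section 3 treats the energy as a cost, so that lower means more likely, the scores would be negated before the softmax.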

#### C\.2\.4 Narrative wrap\-up

##### Clinician\-facing summary\.

```
=== Final Risk Prediction (patient_0077) ===
1. High Risk: Severe Liver Cirrhosis Exacerbation
   - Basis: AFP up (Critical, Aug 2026); hematemesis (dark red, monthly);
     alcohol trigger + abdominal pain (Jul 2026); prior diagnosis.

2. Medium Risk: Hypertensive Encephalopathy
   - Basis: Headache worse; blurred vision; poor sleep (BP risk);
     no painkillers (rules out drug cause).

3. Low Risk: Acute Gastric Hemorrhage
   - Basis: Dark red hematemesis (monthly); alcohol irritation;
     no black stools (no massive hemorrhage).
```

##### Step\-by\-step audit trail\.

```
=== Complete Chain-of-Thought ===
Step 1: TSA → Extract 3 trends:
        - AFP: Upward (Critical, 2026);
        - Hematemesis: Upward (Critical, 2027);
        - Headache: Downward long-term, now worse.

Step 2: Initial Match → Link to diseases:
        - Liver Cirrhosis (0.85, prior + 2 trends);
        - Hypertensive Encephalopathy (0.60, 1 trend + complaint).

Step 3: Find Gaps → 3 unresolved: AFP cause, headache symptoms,
        hematemesis details.

Step 4: Consult → Resolve gaps:
        - AFP: Alcohol + abdominal pain;
        - Headache: Blurred vision + no painkillers;
        - Hematemesis: Dark red, monthly, no black stools.

Step 5: Update Prob → Refine ranks:
        - Liver Cirrhosis (0.92), Hypertensive Encephalopathy (0.75);
        - Drop low-prob diseases (Gout, Migraine).

Step 6: Predict → Output risks with clear basis.
```

## Appendix D: KB construction, validation, and leakage controls

Table [7](https://arxiv.org/html/2605.15016#A4.T7) summarizes how we mitigated synthetic\-data leakage and documented clinician oversight; it also records the scorer hyperparameters reused in Alg\. [1](https://arxiv.org/html/2605.15016#alg1) so governance and reproducibility notices stay aligned\. Augmentation models \(commercial instruction\-tuned checkpoints in our builds\) propose *candidate* tuples only; edges enter the KB only after automated contradiction scans, held\-out adjudication, and manual spot checks\. Public evaluation corpora from § [4](https://arxiv.org/html/2605.15016#S4) never feed the augmentation prompts\.

Table 7: KB governance checklist \(abbreviated\)\.

## Appendix E: TSA routing sketches

The router parses TSA intents into \{trend\-test, change\-point, smooth, compare\-to\-norm\} buckets\. The listing below sketches the selection order; planners fall back progressively if optimisation fails \(lmer timeouts → piecewise Mann–Kendall → simple slope\)\.

```
intent = Parse(Q)
if "abrupt"/"breakpoint" in intent: run Bayesian CP search (fast grid)
elif "population norm" in intent: join cohort summaries; run z-test on slopes
elif "smooth trajectory" in intent: default to penalized spline / GP lite
else: mixed-effects trend with AR(1) residual if n>6 else Theil-Sen
if fit fails twice: downgrade + attach uncertainty flag for COTC
```
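
The routing and fallback order in the listing above can be made concrete with a minimal sketch. The method names are placeholder strings rather than calls into any statistics library, the functions `route` and `run_with_fallback` are hypothetical, and the solver table in the usage example uses stubs that fail on purpose so the downgrade path is exercised. It assumes each method gets two attempts before the router downgrades, matching the "fails twice" rule.

```python
def route(intent, n_points):
    """Return the ordered list of method names to try for a parsed intent."""
    text = intent.lower()
    if "abrupt" in text or "breakpoint" in text:
        primary = "bayesian_change_point"
    elif "population norm" in text:
        primary = "cohort_slope_z_test"
    elif "smooth trajectory" in text:
        primary = "penalized_spline"
    elif n_points > 6:
        primary = "mixed_effects_ar1"
    else:
        primary = "theil_sen"
    # Progressive fallbacks named in the text above.
    fallbacks = ["piecewise_mann_kendall", "simple_slope"]
    return [primary] + [m for m in fallbacks if m != primary]


def run_with_fallback(chain, solvers, data, attempts=2):
    """Try each method up to `attempts` times, then downgrade.

    The result carries an uncertainty flag whenever a fallback was used,
    so the downstream COTC layer can discount the trend.
    """
    for position, name in enumerate(chain):
        for _ in range(attempts):
            try:
                value = solvers[name](data)
            except Exception:
                continue  # retry, then fall through to the next method
            return {"method": name, "value": value, "uncertain": position > 0}
    raise RuntimeError("every method in the chain failed")


def _always_times_out(_data):
    raise TimeoutError("stub solver: simulated optimisation failure")


def _simple_slope(data):
    return (data[-1] - data[0]) / (len(data) - 1)


chain = route("check for an abrupt change in AFP", n_points=9)
solvers = {
    "bayesian_change_point": _always_times_out,
    "piecewise_mann_kendall": _always_times_out,
    "simple_slope": _simple_slope,
}
result = run_with_fallback(chain, solvers, [1.0, 2.0, 4.0])
```

Keeping the chain as plain data makes the selection order easy to log alongside the uncertainty flag.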

## Appendix F: Qualitative contrast and baseline prompt parity

This appendix isolates material that supports the tabulated comparisons in the main paper but would clutter §[4](https://arxiv.org/html/2605.15016#S4): a qualitative visualization of tool\-assisted reasoning versus single\-pass generation, followed by the exact prompt shell shared by baselines\.

### F\.1 Illustrative reasoning depth \(not a quantitative metric\)

Figure [5](https://arxiv.org/html/2605.15016#A6.F5) sketches how a frontier single\-pass API reader \(GPT\-4o class\) behaves as a monolithic assistant on the same synthetic vignette family, whereas COTCAgent stages TSA summaries, KB matching, and gap filling\. The figure is for intuition only; headline numbers come from § [4](https://arxiv.org/html/2605.15016#S4)\.

![Refer to caption](https://arxiv.org/html/2605.15016v1/figures/Figx1.png)

Figure 5: Qualitative comparison of staged tool use \+ structured completion \(COTCAgent\) versus one\-shot generation with a frontier API reader \(GPT\-4o class\) on matched prompts\.

### F\.2 Baseline model prompt configuration and QA protocol

Baselines receive the same system role, context assembly, and decoding knobs as COTCAgent *except* for the missing TSA code generation, Symptom–Trend–Disease scoring, and consultation loop in Alg\. [1](https://arxiv.org/html/2605.15016#alg1)\. That architectural gap is intentional: it isolates the lift attributable to structured temporal narratives plus energy\-based matching\.

Prompt shell \(all backbones\)\.

- System role: “You are a clinical diagnostic assistant that analyzes patient symptoms and medical indicators to identify potential diseases\.”
- Task block: Request either top\-$k$ diseases with confidence masses or the multiple\-choice letter, depending on the benchmark converter\.
- Context block: Demographics, symptom list, labs, and \(when provided\) trend blurbs copied verbatim from the dataset loader so no model sees a longer history than another\.
- Decoder: Temperature 0, shared max tokens, identical stop sequences; API versions are frozen per release \(hashes in the code drop\)\.

QA\-style diagnostic prompt \(conversational suites\)\.

```
Based on the following clinical case, return the top-3 diseases with
nonnegative confidences that sum to 1.0. Cite symptoms or labs that
motivate each line.

Patient case:
[INSERT CASE JSON / TEXT FROM LOADER]

Answer format:
1. Disease: p=0.__ - one-sentence justification
2. Disease: p=0.__ - ...
3. Disease: p=0.__ - ...
```

Fairness checklist \(aligned with § [4](https://arxiv.org/html/2605.15016#S4) and the decoding defaults above\)\.

1. Identical templates per backbone; no ad\-hoc chain\-of\-thought hints for competitors\.
2. Contexts truncated/padded to the same token budget before scoring\.
3. Parser extracts the first three probability lines; malformed outputs count as errors equally for every method\.
4. Random seeds, API endpoints, and batching policies are logged alongside the KB hashes in Appendix [D](https://arxiv.org/html/2605.15016#A4)\.
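
The parsing rule in the checklist can be sketched as follows: take the first three lines that match the answer format and treat anything else as malformed. The regular expression, the function name `parse_top3`, and the tolerance on the sum-to-one check are assumptions for illustration; the released parser may apply different rules.

```python
import re

# Matches lines such as "1. Disease name: p=0.70 - justification".
LINE_RE = re.compile(
    r"^\s*\d+\.\s*(?P<disease>.+?):\s*p\s*=\s*(?P<p>\d*\.?\d+)\s*(?:-\s*.*)?$"
)


def parse_top3(text, tol=1e-2):
    """Return [(disease, p), ...] for the first three well-formed lines.

    Returns None when fewer than three lines parse or when the confidences
    do not sum to 1 within `tol`; the caller scores None as an error.
    """
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((match.group("disease").strip(), float(match.group("p"))))
        if len(rows) == 3:
            break
    if len(rows) < 3:
        return None
    if abs(sum(p for _, p in rows) - 1.0) > tol:
        return None
    return rows


example_output = (
    "1. Severe Liver Cirrhosis: p=0.70 - AFP rising with hematemesis\n"
    "2. Hypertensive Encephalopathy: p=0.20 - headache with blurred vision\n"
    "3. Acute Gastric Hemorrhage: p=0.10 - dark red hematemesis"
)
parsed = parse_top3(example_output)
```

Applying the same function to every backbone's output is what keeps malformed answers penalized equally across methods.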

## Appendix G: Full conversational\-suite table

Figure [3](https://arxiv.org/html/2605.15016#S4.F3) plots MedQA and HealthBench accuracy \(mean ± std over five rerolls\) across the five backbones for each agent recipe; Google and DirPred rows are omitted from the bar layout but appear numerically in Table [8](https://arxiv.org/html/2605.15016#A7.T8)\. Table [8](https://arxiv.org/html/2605.15016#A7.T8) reproduces the complete MedQA / HealthBench / Time\-MMD / DiSCQ splits \(mean ± std\) for all six agent recipes per backbone\. Boldface marks the best mean in each numeric column within a backbone block \(ties broken by lower reported std\)\. Both panels are exported from scripts/plot\_conversational\_benchmark\.py\.

Table 8: Medical reasoning suites \(full table; mean ± std\)\.

## Appendix H: Round\-by\-round accuracy breakdown for all backbones

Table [9](https://arxiv.org/html/2605.15016#A8.T9) extends Table [4](https://arxiv.org/html/2605.15016#S4.T4) in the main text to all five backbones, showing top\-1 accuracy after each consultation round on the longitudinal benchmark\. The pattern observed for Baichuan\-M2, where the first question contributes the majority of interactive gain, generalizes across models, though absolute gains vary with backbone capability\. GPT\-4o and Claude 3\.7 Sonnet show smaller marginal returns from later rounds \(diminishing gains after Round 1\), consistent with their stronger initial rankings\. Qwen3\-Next\-80B\-A3B exhibits the flattest gain curve, suggesting that weaker TSA narratives limit the effectiveness of subsequent gap filling\.

Table 9: Round\-by\-round accuracy \(%\) across all backbones on longitudinal benchmark\. R0 = initial ranking; R1–R3 = after 1–3 questions\.

## Appendix I: Knowledge base statistics

Figure [6](https://arxiv.org/html/2605.15016#A9.F6) visualizes the multiplicity distribution in the Symptom–Trend–Disease KB\. Typical rows link approximately 15 symptoms and 3–8 trends per disease, with coverage spanning cardiovascular, respiratory, metabolic, infectious, and neurological conditions\.

![Refer to caption](https://arxiv.org/html/2605.15016v1/x5.png)

Figure 6: Radar view of multiplicity in the Symptom–Trend–Disease KB \(symptoms, trends, and diseases per entity class\)\.

## Appendix J: Knowledge base sources and evaluation isolation

Table [10](https://arxiv.org/html/2605.15016#A10.T10) lists all sources used to construct the Symptom–Trend–Disease knowledge base alongside the evaluation benchmarks used in this study\. To prevent circular evaluation, we enforce strict source separation: KB construction draws only from established medical references, while evaluation uses held\-out benchmark suites\. No evaluation prompt or gold\-standard answer appears in KB sources\.

Table 10: Knowledge base sources vs\. evaluation benchmarks\. No overlap permitted between KB construction sources and evaluation data\.

## Appendix K: Human expert evaluation of intermediate reasoning

To assess the clinical plausibility of COTCAgent’s intermediate outputs, we conducted a blinded review with 16 board\-certified physicians \(internal medicine and family practice\)\. The study was approved as a quality improvement protocol by the institutional review board\.

##### Evaluation design\.

Reviewers assessed three intermediate artifacts without knowing the model source: \(i\) TSA\-generated trend narratives \(N=200\), \(ii\) Top\-3 disease rankings with gap explanations \(N=150\), \(iii\) Selected follow\-up questions \(N=150\)\. Each item was rated on 3\-point scales for clinical reasonableness \(1=unreasonable, 2=acceptable, 3=excellent\) and information completeness\.

##### Results\.

Expert agreement rates for COTCAgent outputs: TSA narratives 87% rated acceptable or excellent \(mean score 2\.41 ± 0\.52\); disease rankings 82% acceptable or better \(2\.33 ± 0\.61\); follow\-up question selection 79% acceptable or better \(2\.28 ± 0\.58\)\. Inter\-rater reliability was moderate \($\kappa=0.64$\)\. Common critiques included occasional over\-specificity in trend extrapolation and preference for broader differential coverage in early rounds\.

##### Limitations\.

This evaluation assessed reasonableness, not diagnostic accuracy against ground truth\. Expert panels were time\-constrained; full chart\-review validation is future work\.