Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

arXiv cs.AI Papers

Summary

This paper presents Traj-Evolve, a self-evolving multi-agent system that uses an experience pool and multi-agent reinforcement learning to model patient trajectories from longitudinal EHRs for lung cancer early detection, outperforming strong baselines.

arXiv:2606.02812v1 Announce Type: new Abstract: Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.

Cached at: 06/03/26, 09:41 AM

# Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection
Source: [https://arxiv.org/html/2606.02812](https://arxiv.org/html/2606.02812)
Sihang Zeng (1,2), Matthew Thompson (3), Ruth Etzioni (2), Meliha Yetisgen (1)
(1) University of Washington, (2) Fred Hutch Cancer Center, (3) Google
zengsh@uw.edu, melihay@uw.edu



## 1 Introduction

Lung cancer is the leading cause of cancer-related mortality worldwide Sung et al. ([2021](https://arxiv.org/html/2606.02812#bib.bib1)); Lancaster et al. ([2022](https://arxiv.org/html/2606.02812#bib.bib2)), and early detection substantially improves patient outcomes Lancaster et al. ([2022](https://arxiv.org/html/2606.02812#bib.bib2)). Longitudinal electronic health records (EHRs) offer a uniquely powerful opportunity for early detection, as they accumulate a rich, multimodal clinical history including diagnoses, procedures, laboratory values, vital signs, medications, and unstructured clinical notes, which collectively encode subtle disease trajectories preceding a cancer diagnosis Jensen et al. ([2012](https://arxiv.org/html/2606.02812#bib.bib19)); Kim et al. ([2019](https://arxiv.org/html/2606.02812#bib.bib20)). Within these trajectories lie early signals of risk and their trends, such as recurrent respiratory symptoms, chronic pulmonary conditions, or incidental radiographic findings documented years before diagnosis D’Arcy et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib12)); Ganti et al. ([2021](https://arxiv.org/html/2606.02812#bib.bib3)).

Extracting and temporally reasoning over these signals from long and noisy patient trajectories, however, is challenging. Recent studies evaluated LLM-based approaches for generalizable modeling from heterogeneous EHR data Cui et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib53)); Kruse et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib54)); Zeng et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib35), [2026](https://arxiv.org/html/2606.02812#bib.bib55)). Among these, Traj-CoA is a multi-agent framework that leverages chain-of-agents and a long-term memory to facilitate temporal reasoning over patient trajectories for cancer early detection, eliminating complex feature engineering while achieving zero-shot performance comparable to supervised machine learning and deep learning models Zeng et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib35), [2026](https://arxiv.org/html/2606.02812#bib.bib55)).

Despite these advances, existing LLM-based longitudinal EHR modeling systems share a fundamental limitation: they are static. Every patient is processed in isolation, relying solely on the LLM’s frozen parametric knowledge and a fixed prompt. This stands in sharp contrast to expert clinical practice, where diagnostic judgement is continually refined by accumulated experience with similar patients. This process is central to how clinicians recognise atypical presentations, such as early lung cancer in a never-smoker with an otherwise unremarkable history Eva ([2005](https://arxiv.org/html/2606.02812#bib.bib37)); Patel et al. ([2005](https://arxiv.org/html/2606.02812#bib.bib38)). For lung cancer early detection, where cases are clinically heterogeneous and often subtly distinguished from controls by patterns distributed across years of records, the inability of a system to learn from past verified cases can limit both performance and robustness, particularly in minority subgroups such as never-smokers.

Emerging research on self-evolving LLM agents has the potential to address this gap. Rather than treating the model as immutable, self-evolving agents continually update their behavior through interaction and feedback, evolving their memory, prompts, tools, or parameters as new experience accumulates Gao et al. ([2025a](https://arxiv.org/html/2606.02812#bib.bib39)); Zhang et al. ([2025c](https://arxiv.org/html/2606.02812#bib.bib40)). For example, memory-based approaches save the problem-solving trajectories as experience into an external database, which could guide future decisions through retrieval-augmented generation (RAG) Shinn et al. ([2023](https://arxiv.org/html/2606.02812#bib.bib41)); Zhao et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib42)); Wu et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib43)); Zhou et al. ([2025a](https://arxiv.org/html/2606.02812#bib.bib44)); Tang et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib56)). In parallel, reinforcement-learning (RL) based approaches such as reward-ranked fine-tuning (RAFT) Dong et al. ([2023](https://arxiv.org/html/2606.02812#bib.bib45)); Xiong et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib46)); Zhang et al. ([2025b](https://arxiv.org/html/2606.02812#bib.bib62)) and multi-agent RL variants Ma et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib47)); Liao et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib48)); Zhang et al. ([2025a](https://arxiv.org/html/2606.02812#bib.bib49)) enable collaborative agent systems to internalize successful reasoning patterns directly into their parameters.

In healthcare, self-evolving agents have been explored in synthetic or simulated patient interactions Li et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib50)); Almansoori et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib51)) and medical question answering Chen et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib52)). Designing self-evolving systems for patient trajectory modeling in early cancer detection poses a distinct challenge: it requires complex temporal reasoning over years of noisy, multimodal data and the ability to draw reusable insights from heterogeneous clinical cases. Existing techniques may not be readily applicable to this scenario, and their performance in it remains unclear. To our knowledge, no prior work has designed self-evolving agents to enhance longitudinal EHR modeling for real-world early cancer detection.

To bridge this gap, we present Traj-Evolve, a self-evolving multi-agent framework for patient trajectory modeling that extends the Traj-CoA architecture Zeng et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib35)) with two complementary evolutionary mechanisms: an evolving experience pool (ExPool) and multi-agent reinforcement learning (MARL). Collectively, these mechanisms enable Traj-Evolve to learn from its own experience as it processes more patients, continually refining its temporal reasoning and learning from “patients-like-me”, and thereby improving performance over time. These two mechanisms transform patient trajectory modeling from a static, isolated prediction task into a continuously improving clinical learning system.

We evaluated Traj\-Evolve on a large longitudinal cohort from a medical center, using five years of multimodal EHR history to predict incident lung cancer within the subsequent year among the overall population and the particularly challenging subgroup of never\-smokers\. We benchmarked against a comprehensive suite of baselines spanning clinical risk models, supervised machine learning, sequential deep learning, clinical BERT\-based models, and LLM\-based systems\.

The main contributions of this work are:

- We introduce Traj-Evolve, to our knowledge the first self-evolving multi-agent framework for longitudinal EHR modeling applied to a real-world clinical prediction task.
- We design two complementary evolutionary mechanisms: an evolving experience pool (ExPool) that provides non-parametric, few-shot “patients-like-me” retrieval, and a MARL procedure that parametrically optimizes inter-agent and agent-memory collaboration using rejection-sampled high-reward trajectories.
- We demonstrate that Traj-Evolve achieves state-of-the-art discrimination for one-year lung cancer prediction in the overall population and the challenging never-smoker population.
- We provide detailed analyses of the self-evolving dynamics, supporting the vision of a continuously improving clinical decision-support system.

![Refer to caption](https://arxiv.org/html/2606.02812v1/x1.png)

Figure 1: Overview of the Traj-Evolve architecture and self-evolving workflow. The top panel illustrates the self-evolving process, wherein the system accumulates experience from prior verified patients to iteratively update Traj-Evolve and facilitate prediction for a new patient. The bottom panel details the pipeline.

## 2 Related Works

#### LLM\-based Patient Trajectory Modeling

Recent work increasingly leverages strong LLMs for zero- or few-shot reasoning over heterogeneous clinical histories, including DT-GPT for clinical variable forecasting Makarov et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib33)), EHR2Path for scalable patient pathway prediction Pellegrini et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib29)), TIMER for temporal instruction tuning Cui et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib53)), and Kruse et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib54)) for long-context summarization. Yet single-LLM pipelines remain limited by the lost-in-the-middle phenomenon Liu et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib79)) on very long EHRs and by the complexity of specific clinical prediction tasks, motivating multi-agent designs that decompose longitudinal EHR modeling into simpler subtasks. MoMA Gao et al. ([2025b](https://arxiv.org/html/2606.02812#bib.bib70)) coordinates modality-specialized agents for clinical prediction, CARE-AD Li et al. ([2025b](https://arxiv.org/html/2606.02812#bib.bib71)) and ClinNoteAgents Zhou et al. ([2025b](https://arxiv.org/html/2606.02812#bib.bib72)) decompose reasoning across specialist agents, and Traj-CoA Zeng et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib35), [2026](https://arxiv.org/html/2606.02812#bib.bib55)) extends chain-of-agents Zhang et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib69)) with long-term memory for cancer early detection. Complementary efforts such as CliCARE Li et al. ([2025a](https://arxiv.org/html/2606.02812#bib.bib73)) and TRACE Qu and Färber ([2026](https://arxiv.org/html/2606.02812#bib.bib74)) further explore temporal knowledge graph and dual-memory approaches. However, these systems are static: each patient is reasoned about in isolation, with no mechanism or evaluation for accumulating verified clinical experience over time.

#### Self\-Evolving Agents

Gao et al. ([2025a](https://arxiv.org/html/2606.02812#bib.bib39)) organize self-evolving agents along three axes: what, when, and how to evolve. Along what to evolve, prior work targets memory, prompts, tools, or model parameters; along when, adaptation can be intra- or inter-test-time; along how, it is driven by textual feedback or scalar rewards in single- or multi-agent settings. For example, memory-evolving methods such as Reflexion Shinn et al. ([2023](https://arxiv.org/html/2606.02812#bib.bib41)), ExpeL Zhao et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib42)), Memento Zhou et al. ([2025a](https://arxiv.org/html/2606.02812#bib.bib44)), and Agent KB Tang et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib56)) store and retrieve past trajectories as non-parametric experience. Parameter-evolving methods internalize successful experience via model training, including supervised fine-tuning and reinforcement learning Zelikman et al. ([2022](https://arxiv.org/html/2606.02812#bib.bib75)); Zuo et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib76)); Dong et al. ([2023](https://arxiv.org/html/2606.02812#bib.bib45)); Wang et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib77)) and multi-agent extensions Zhang et al. ([2025a](https://arxiv.org/html/2606.02812#bib.bib49)); Ma et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib47)); Liao et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib48)). These two families are typically pursued in isolation. In healthcare, self-evolution has so far been confined to simulated or interactive settings, including Agent Hospital Li et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib50)), MedAgentSim Almansoori et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib51)), MDTeamGPT Chen et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib52)), and EvoClinician He et al. ([2026](https://arxiv.org/html/2606.02812#bib.bib78)).

To our knowledge, no prior work has applied self\-evolving agents to longitudinal EHR modeling for real\-world clinical prediction\. Traj\-Evolve fills this gap by jointly evolving memory \(ExPool\) and parameters \(MARL\) at inter\-test\-time, unified by a leave\-one\-out cross\-retrieval procedure that aligns training\- and inference\-time augmentation for lung cancer early detection\.

## 3 Methods

### 3.1 Problem Formulation

#### Lung Cancer Early Detection

Let $\mathcal{P}=\{p_{i}\}_{i=1}^{N}$ denote a cohort of patients. For each patient $p_{i}$, we observe a longitudinal multimodal EHR sequence

$$\mathcal{X}_{i}=\big\{(t_{i,j},e_{i,j})\big\}_{j=1}^{T_{i}},\quad t_{i,j}\leq t_{i}^{\star},\tag{1}$$

where $t_{i}^{\star}$ is the patient-specific index date (time of prediction), $T_{i}$ is the number of dated entries within the available EHR, and each event $e_{i,j}$ at time $t_{i,j}$ consists of either a structured record (diagnosis, medication, lab, vital, or procedure code) or unstructured clinical text (notes and radiology reports). The binary target $y_{i}\in\{0,1\}$ indicates whether $p_{i}$ receives a first primary lung-cancer diagnosis within one year after $t_{i}^{\star}$.

The task of lung cancer early detection is to learn a function $f_{\theta}:\mathcal{X}_{i}\mapsto(s_{i},r_{i})$ that maps the longitudinal record to an integer risk score $s_{i}\in\{1,\dots,10\}$ and a natural-language rationale $r_{i}$.

#### Self\-Evolving System

Beyond standard generalization, we additionally require $f_{\theta}$ to improve as more patients are seen and verified, yielding a self-evolving system. We formalize this as follows. Patients arrive sequentially as a stream $p_{1},p_{2},\dots$. At each step $t$, the system maintains an experience set $\mathcal{E}_{t}$ that summarizes all patients that have been processed and verified against a known diagnostic status so far. The primary objective of a self-evolving system is to continuously improve the performance of $f_{\theta(\mathcal{E}_{t})}$ as the experience set $\mathcal{E}_{t}$ grows. This captures two coupled challenges: $f_{\theta(\mathcal{E}_{t})}$ must perform complex temporal reasoning over long and noisy EHRs to produce $(s_{t},r_{t})$, and it additionally needs to construct and leverage $\mathcal{E}_{t}$ so that performance improves with $|\mathcal{E}_{t}|$.

We randomly split the dataset into training $\mathcal{D}_{tr}$, validation $\mathcal{D}_{val}$, and test $\mathcal{D}_{test}$ sets. In practice, we simulated the stream of verified patients by drawing a random sample from the training set that grew incrementally over time.

### 3.2 Background: Traj-CoA Base System

We build on Traj-CoA (Zeng et al., [2025](https://arxiv.org/html/2606.02812#bib.bib35), [2026](https://arxiv.org/html/2606.02812#bib.bib55)), a static chain-of-agents (CoA) Zhang et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib69)) backbone that handles long-context EHRs. Each $\mathcal{X}_{i}$ is serialized into an LLM-friendly XML representation and segmented into $C_{i}$ chronologically ordered chunks $\{c_{i,1},\dots,c_{i,C_{i}}\}$ such that each chunk fits within the LLM context limit.

A sequence of $C_{i}$ worker agents, all parameterized by $\theta_{w}$, processes the chunks sequentially. Concretely, the $\ell$-th worker maintains a running summary $u_{i,\ell}$ and a long-term episodic memory (LTM) $\mathcal{M}_{i,\ell}$. The final worker summary $u_{i}=u_{i,C_{i}}$ encapsulates the full EHR trajectory, while $\mathcal{M}_{i}=\mathcal{M}_{i,C_{i}}$ stores a condensed timeline of lung-cancer-related events extracted across all chunks. A manager agent $\pi_{\theta_{m}}$ then synthesizes both signals to produce the final output:

$$(s_{i},r_{i})=\pi_{\theta_{m}}\big(u_{i},\mathcal{M}_{i}\big).\tag{2}$$

This zero-shot pipeline decomposes complex temporal reasoning into inter-agent and agent-memory collaboration tasks, improving patient trajectory modeling performance. However, it treats each patient in isolation, conditioning only on the LLM’s parametric knowledge. This creates a gap with expert clinical practice, which relies on accumulated case experience. To bridge this gap, we design two complementary self-evolving mechanisms atop Traj-CoA for our self-evolving Traj-Evolve system (Figure [1](https://arxiv.org/html/2606.02812#S1.F1)).
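The control flow of the base pipeline can be sketched as follows. This is only an illustrative skeleton, not the authors' implementation: `WorkerState`, `run_chain`, `toy_worker`, and `toy_manager` are hypothetical names, and the two toy functions are simple string heuristics standing in for the LLM calls made by the worker and manager agents.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class WorkerState:
    summary: str = ""                                 # running summary u_{i,l}
    memory: List[str] = field(default_factory=list)   # long-term memory M_{i,l}


# Hypothetical signatures; in the real system both would wrap LLM calls.
WorkerFn = Callable[[str, WorkerState], Tuple[str, List[str]]]
ManagerFn = Callable[[str, List[str]], Tuple[int, str]]


def run_chain(chunks: List[str], worker: WorkerFn, manager: ManagerFn) -> Tuple[int, str]:
    """Pass chronologically ordered chunks through the workers, then the manager (Eq. 2)."""
    state = WorkerState()
    for chunk in chunks:
        # Each worker sees the current chunk plus the state handed over by its predecessor.
        summary, new_events = worker(chunk, state)
        state = WorkerState(summary=summary, memory=state.memory + new_events)
    # The manager maps (u_i, M_i) to a risk score and a rationale.
    return manager(state.summary, state.memory)


def toy_worker(chunk: str, state: WorkerState) -> Tuple[str, List[str]]:
    """Toy stand-in: append the chunk to the summary and flag chunks mentioning a nodule."""
    summary = (state.summary + " " + chunk).strip()
    events = [chunk] if "nodule" in chunk else []
    return summary, events


def toy_manager(summary: str, memory: List[str]) -> Tuple[int, str]:
    """Toy stand-in: the score grows with the number of flagged events, capped at 10."""
    score = min(10, 1 + 3 * len(memory))
    return score, f"{len(memory)} flagged event(s) in timeline"


if __name__ == "__main__":
    print(run_chain(["2019: chronic cough", "2021: lung nodule on CT"], toy_worker, toy_manager))
```

The point of the sketch is the data flow: only the running summary and the memory cross chunk boundaries, which is what keeps each call within the context limit.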

![Refer to caption](https://arxiv.org/html/2606.02812v1/x2.png)

Figure 2: Methodological design of the self-evolving mechanisms. (A) Construction and inference pipeline for the ExPool. (B) Construction of MARL training data. (C) Integration of ExPool and MARL.

### 3.3 Evolving Experience Pool (ExPool)

Inspired by the procedural memory approach that saves successful reasoning traces for future problem solving of similar tasks Tang et al. ([2025](https://arxiv.org/html/2606.02812#bib.bib56)), we design an evolving experience pool (ExPool, Figure [2](https://arxiv.org/html/2606.02812#S3.F2)A). ExPool equips Traj-Evolve with a non-parametric procedural memory that saves certain reasoning traces of verified patients. As Traj-Evolve generates predictions that can be subsequently verified against ground-truth diagnostic status (e.g., confirmed cancer diagnosis or benign status), these verified reasoning traces can serve as experience for future cases. This design is analogous to expert clinical reasoning, in which a patient’s presentation is rarely adjudicated in isolation but rather by analogy to remembered cases that share a similar longitudinal clinical pattern. Correct reasoning traces may offer reusable diagnostic patterns, while incorrect predictions may expose model vulnerabilities and trigger self-reflection. These data-driven patterns evolve the system beyond its current boundary, forming the self-evolving capability.

#### ExPool Construction

To capture the current system’s boundary, i.e., the optimal diagnostic ability, we constructed ExPool using a rejection sampling approach that selected best-of-N from a set of roll-out reasoning traces. Formally, for each patient $p_{i}\in\mathcal{D}_{tr}$, we draw $m$ independent roll-outs from the base system under an elevated sampling temperature $\tau>1$ to diversify reasoning traces,

$$\big\{(u_{i}^{(j)},\mathcal{M}_{i}^{(j)},s_{i}^{(j)},r_{i}^{(j)})\big\}_{j=1}^{m}\sim f_{\theta}(\mathcal{X}_{i};\tau),\tag{3}$$

and retain a single optimal trace via label-conditioned rejection sampling

$$j^{\star}_{i}=\begin{cases}\arg\max_{j}s_{i}^{(j)},&y_{i}=1,\\ \arg\min_{j}s_{i}^{(j)},&y_{i}=0.\end{cases}\tag{4}$$

Notably, these optimal reasoning traces may mix correct and incorrect predictions, which provides different signals for ExPool.
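As a minimal sketch of Eq. (4), the selection step reduces to taking the highest-scoring roll-out for a case and the lowest-scoring roll-out for a control. The `Trace` container and `select_trace` function are hypothetical names, and breaking ties by first occurrence is an assumption, since the text does not specify a tie-breaking rule.

```python
from typing import List, NamedTuple


class Trace(NamedTuple):
    summary: str        # final worker summary u_i^(j)
    memory: List[str]   # long-term memory M_i^(j)
    score: int          # predicted risk s_i^(j), an integer in 1..10
    rationale: str      # manager rationale r_i^(j)


def select_trace(rollouts: List[Trace], label: int) -> Trace:
    """Label-conditioned best-of-m selection (Eq. 4).

    Cases (label 1) keep the roll-out with the highest risk score, controls
    (label 0) the one with the lowest. Ties go to the earliest roll-out, which
    is an assumption made for this sketch.
    """
    if not rollouts:
        raise ValueError("at least one roll-out is required")
    if label == 1:
        return max(rollouts, key=lambda t: t.score)
    return min(rollouts, key=lambda t: t.score)


if __name__ == "__main__":
    rollouts = [Trace("u", [], s, f"rationale {s}") for s in (2, 3, 1, 2)]
    # For a true case the best available trace still scores low: it is kept
    # in the pool even though it would count as an incorrect prediction.
    print(select_trace(rollouts, label=1).score)
```

The usage example shows why the pool can contain incorrect predictions: selection is relative to the other roll-outs for the same patient, not to a correctness threshold.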

These selected traces then populate ExPool, a vector database where the retrieval keys are the embeddings of the final worker agent summaries:

$$\mathbf{v}_{i}=\phi(u_{i}^{\star}),\tag{5}$$

in which $\phi(\cdot)$ denotes an embedding model, and we index each experience as a key-value pair

$$\mathcal{E}=\Big\{\big(\mathbf{v}_{i},\;(r_{i}^{\star},s_{i}^{\star},y_{i})\big)\Big\}_{i\in\mathcal{D}_{\text{tr}}},\tag{6}$$

where the value stores the manager’s rationale, predicted risk, and ground-truth label. This design makes it feasible to embed and index long patient trajectories in the latent space, providing an efficient approach for experience retrieval.

#### ExPool Inference

During inference, ExPool functions using a retrieval-augmented generation (RAG) Xu et al. ([2024](https://arxiv.org/html/2606.02812#bib.bib60)) mechanism. For each new patient $p_{q}$, we adopt semantic retrieval by querying $\mathcal{E}$ with $\mathbf{v}_{q}=\phi(u_{q})$ and returning the top-$k$ nearest neighbors as “patients-like-me” using cosine similarity:

$$\mathcal{N}_{k}(q)=\underset{i\in\mathcal{E}}{\operatorname{Top-}k}\;\frac{\mathbf{v}_{q}^{\top}\mathbf{v}_{i}}{\|\mathbf{v}_{q}\|\,\|\mathbf{v}_{i}\|}.\tag{7}$$

The semantic matching is chosen over exact matching because it yields a soft neighborhood that balances diversity and specificity. Retrieved patients may exhibit different clinical profiles, matching the patient $p_{q}$ in diverse ways. Rather than collapsing this heterogeneity into a hard label via majority voting, we delegate the comparative reasoning to the manager agent, conditioning it on the full retrieved set:

$$(s_{q},r_{q})=\pi_{\theta_{m}}\big(u_{q},\,\mathcal{M}_{q},\,\{(r_{i}^{\star},s_{i}^{\star},y_{i})\}_{i\in\mathcal{N}_{k}(q)}\big).\tag{8}$$
ExPool therefore provides the manager agent with diverse but clinically relevant “patients\-like\-me” as in\-context examples, and the manager agent comprehensively reasons over all information for prediction\. As ExPool continuously scales with newly verified patients, the retrieval of highly specific neighbors becomes increasingly precise, shifting the system from isolated predictions to a progressively self\-evolving framework\.
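The retrieval step in Eq. (7) can be illustrated with a brute-force cosine search over an in-memory pool. This is a sketch under stated assumptions: the paper describes ExPool as a vector database, whereas the list-based `pool`, the `Entry` alias, and the function names below are hypothetical, and the two-dimensional vectors are toy values standing in for real embeddings.

```python
import math
from typing import List, Sequence, Tuple

# One ExPool entry: (key v_i, value (rationale r*, score s*, label y)), as in Eq. (6).
Entry = Tuple[Sequence[float], Tuple[str, int, int]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Sequence[float], pool: List[Entry], k: int) -> List[Tuple[str, int, int]]:
    """Return the stored values of the k entries most similar to the query (Eq. 7)."""
    ranked = sorted(pool, key=lambda entry: cosine(query, entry[0]), reverse=True)
    return [value for _, value in ranked[:k]]


if __name__ == "__main__":
    pool: List[Entry] = [
        ([1.0, 0.0], ("rationale A", 2, 0)),
        ([0.0, 1.0], ("rationale B", 8, 1)),
        ([1.0, 1.0], ("rationale C", 6, 1)),
    ]
    # The retrieved (rationale, score, label) triples would be placed in the
    # manager prompt as few-shot context, as in Eq. (8).
    print(retrieve_top_k([1.0, 0.1], pool, k=2))
```

Returning the stored values rather than a vote over labels mirrors the design choice in the text: the manager agent, not the retriever, performs the comparative reasoning.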

### 3.4 Multi-Agent Reinforcement Learning (MARL)

ExPool adapts the system at inference time, but it leaves the underlying agent parameters $(\theta_{w},\theta_{m})$ unchanged and acts only on the final manager step. The intermediate worker reasoning, which decides what gets distilled into $\mathcal{M}_{i}$ (agent-memory communication) and how the chunk-by-chunk summary is built up (inter-agent communication), does not receive a learning signal from the verified outcomes. To close this gap, we introduce a second self-evolving mechanism that internalizes successful reasoning traces directly into the model parameters (Figure [2](https://arxiv.org/html/2606.02812#S3.F2)B). Specifically, we adapt reward-ranked fine-tuning (RAFT) (Dong et al., [2023](https://arxiv.org/html/2606.02812#bib.bib45)), a reinforcement learning strategy that optimizes models using the reasoning trace with the highest reward among all self-generated roll-outs, to our multi-agent setting.

#### Reward and Accepted Set

Parallel to the construction of ExPool, we generated $m$ roll-outs with an elevated temperature. We use the ground-truth label $y_{i}$ as a binary reward signal. Unlike ExPool, which retains the optimal trace per patient regardless of its correctness, MARL retains only clinically consistent traces in its accepted set $\mathcal{A}$:

$$\mathcal{A}=\big\{(i,j):R(s_{i}^{(j)},y_{i})=1\big\},\tag{9}$$

where the reward function is

$$R(s,y)=\begin{cases}\mathbbm{1}[s\geq 6],&y=1,\\ \mathbbm{1}[s\leq 4],&y=0.\end{cases}\tag{10}$$

Eq. ([10](https://arxiv.org/html/2606.02812#S3.E10)) acts as a hard rejection filter, ensuring the resulting fine-tuning data consists exclusively of logically sound and clinically accurate reasoning traces.
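Eqs. (9) and (10) translate directly into a small filter. In the sketch below, `reward` and `accepted_set` are hypothetical names, and the dictionaries of scores and labels are toy inputs; note that a score of 5 earns no reward under either label, so such roll-outs never enter the accepted set.

```python
from typing import Dict, List, Tuple


def reward(score: int, label: int) -> int:
    """Binary reward from Eq. (10): cases need a score >= 6, controls a score <= 4."""
    if label == 1:
        return int(score >= 6)
    return int(score <= 4)


def accepted_set(scores: Dict[str, List[int]], labels: Dict[str, int]) -> List[Tuple[str, int]]:
    """Accepted (patient, roll-out index) pairs from Eq. (9).

    `scores[pid]` holds the m roll-out risk scores for patient `pid`.
    """
    return [
        (pid, j)
        for pid, rollout_scores in scores.items()
        for j, s in enumerate(rollout_scores)
        if reward(s, labels[pid]) == 1
    ]


if __name__ == "__main__":
    scores = {"p1": [7, 5, 3, 9], "p2": [2, 6, 4, 5]}
    labels = {"p1": 1, "p2": 0}
    print(accepted_set(scores, labels))
```

Unlike the ExPool selection, several roll-outs from the same patient can be accepted here, and a patient whose roll-outs all fail the threshold contributes nothing to training.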

#### Decoupled Optimization

For each accepted trace $(i,j)\in\mathcal{A}$, we decompose the roll-out into per-agent input-output pairs. Let $\mathbf{x}^{w}_{i,j,c}$ and $\mathbf{y}^{w}_{i,j,c}$ denote the input and output of the $c$-th worker, and let $\mathbf{x}^{m}_{i,j},\mathbf{y}^{m}_{i,j}$ denote the manager pair. To preserve temporal balance across the chain, we subsample worker positions to retain the first, the last, and two randomly selected intermediate worker agents per reasoning trace:

$$\mathcal{C}_{i,j}=\{1,\,C_{i}\}\,\cup\,\mathrm{Sample}_{2}\big(\{2,\dots,C_{i}-1\}\big),\tag{11}$$

yielding the worker and manager training sets

$$\mathcal{D}_{w}=\big\{(\mathbf{x}^{w}_{i,j,c},\mathbf{y}^{w}_{i,j,c}):(i,j)\in\mathcal{A},\,c\in\mathcal{C}_{i,j}\big\},\tag{12}$$

$$\mathcal{D}_{m}=\big\{(\mathbf{x}^{m}_{i,j},\mathbf{y}^{m}_{i,j}):(i,j)\in\mathcal{A}\big\}.\tag{13}$$

The worker and manager parameters are updated independently to preserve their specialized roles:

$$\theta_{w}^{\star}=\arg\min_{\theta_{w}}\sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}_{w}}-\log\pi_{\theta_{w}}(\mathbf{y}\mid\mathbf{x}),\tag{14}$$

$$\theta_{m}^{\star}=\arg\min_{\theta_{m}}\sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}_{m}}-\log\pi_{\theta_{m}}(\mathbf{y}\mid\mathbf{x}).\tag{15}$$

Optimizing $\theta_{w}$ refines sequential inter-agent reasoning and worker-memory collaboration, while optimizing $\theta_{m}$ refines how the final summary and the LTM are aggregated into a risk estimation. As the verified patient cohort expands, this MARL approach benefits from more training data and drives continuous self-evolution: it reinforces successful temporal reasoning and collaborative memory distillation, progressively enhancing Traj-Evolve’s intrinsic diagnostic capabilities.
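The worker subsampling in Eq. (11) might be implemented along the following lines. The function name `sample_worker_positions` is hypothetical, and the handling of short chains (keeping every intermediate position when fewer than two exist) is an assumption, since the text does not say what happens when $C_{i}<4$.

```python
import random
from typing import List


def sample_worker_positions(num_chunks: int, rng: random.Random) -> List[int]:
    """Pick 1-indexed worker positions per Eq. (11).

    Keeps the first and last workers plus two randomly chosen intermediate
    ones. If fewer than two intermediate workers exist, all of them are kept
    (an assumption made for this sketch).
    """
    if num_chunks < 1:
        raise ValueError("a trace needs at least one chunk")
    endpoints = {1, num_chunks}
    intermediate = list(range(2, num_chunks))  # positions 2 .. C_i - 1
    picked = rng.sample(intermediate, min(2, len(intermediate)))
    return sorted(endpoints | set(picked))


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(sample_worker_positions(8, rng))
    print(sample_worker_positions(3, rng))
```

Capping each trace at four worker examples keeps long trajectories from dominating the worker training set, which is one reading of the "temporal balance" motivation given above.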

### 3.5 Combining ExPool and MARL

ExPool provides retrieved “patients-like-me” but does not update parameters; vanilla MARL updates parameters but lacks an explicit mechanism to incorporate similar patients during final risk estimation. We unify the two through a leave-one-out cross-retrieval procedure (Figure [2](https://arxiv.org/html/2606.02812#S3.F2)C) that prevents data leakage and matches training and inference input formats, while injecting retrieval signals into the optimization.

For each $p_{i}\in\mathcal{D}_{\text{tr}}$, we construct a patient-specific dynamic pool $\mathcal{E}_{-i}=\mathcal{E}\setminus\{i\}$ from which we retrieve $\mathcal{N}_{k}(i)$ for the manager agent during the MARL roll-out phase. The manager input is augmented with the retrieved patients for prediction, after which the rejection filter in Eq. ([10](https://arxiv.org/html/2606.02812#S3.E10)) is applied to obtain the accepted set $\mathcal{A}^{\,\text{loo}}$. Formally, the accepted set is defined as:

$$\mathcal{A}^{\,\text{loo}}=\Big\{(i,j):R\big(s_{i}^{(j)}(\mathcal{E}_{-i}),\,y_{i}\big)=1\Big\},\tag{16}$$

where $s_{i}^{(j)}(\mathcal{E}_{-i})$ denotes the $j$-th roll-out conditioned on neighbours retrieved from $\mathcal{E}_{-i}$. This methodological synthesis yields optimal traces that benefit from the augmented context provided by “patients-like-me” from the dynamic ExPool, thereby generating potentially better training data compared to MARL in isolation. The worker and manager agents were subsequently trained using the standard MARL protocol, while inference was conducted utilizing the standard ExPool RAG methodology on the full ExPool from all training patients.
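The leave-one-out step amounts to removing the training patient's own entry before ranking neighbours. The sketch below assumes an in-memory pool keyed by patient identifier; `retrieve_loo`, the identifiers, and the toy vectors are all hypothetical. Without the exclusion, a training patient would typically retrieve their own stored trace (together with the ground-truth label), which is the leakage the procedure is meant to prevent.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_loo(
    query: Sequence[float],
    patient_id: str,
    pool: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Top-k neighbour ids from E_{-i}, i.e. the pool with `patient_id` removed."""
    candidates = {pid: vec for pid, vec in pool.items() if pid != patient_id}
    ranked = sorted(candidates, key=lambda pid: cosine(query, candidates[pid]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    pool = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    # Querying with patient "a"'s own embedding: "a" is excluded from the result.
    print(retrieve_loo(pool["a"], "a", pool, k=2))
```

At inference time the same ranking would run against the full pool, since a new patient has no entry to exclude.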

## 4 Experiments

#### Dataset

We predict first primary lung cancer diagnoses within a one-year window using an in-house longitudinal EHR dataset. The input $\mathcal{X}_{i}$ comprises up to five years of EHR history before the index date $t_{i}^{\star}$ (the completion of a chest radiology exam). Cases and controls are 1:10 matched on index exam type and date. The training set $\mathcal{D}_{\text{tr}}$ contains 13,629 patients. We evaluate on two disjoint, held-out test sets: an overall cohort ($n=1{,}000$) and a clinically more challenging never-smoker cohort ($n=835$). Notably, patient inputs are exceptionally long, with median XML token counts exceeding 60k overall and 80k for never-smokers (Table [3](https://arxiv.org/html/2606.02812#A1.T3); Appendix [A](https://arxiv.org/html/2606.02812#A1)).

#### Implementation Details

Traj-Evolve uses GPT-OSS-20B (OpenAI et al., [2025](https://arxiv.org/html/2606.02812#bib.bib66)) as the base LLM and nomic-embed-text-v1.5 (Nussbaum et al., [2024](https://arxiv.org/html/2606.02812#bib.bib58)) for ExPool embeddings. During roll-out, we sample $m=4$ traces at temperature $\tau=1.5$. Both MARL agents are trained via QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2606.02812#bib.bib67)) for one epoch.
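As a brief illustration of why a temperature above 1 tends to diversify the sampled roll-outs, the sketch below applies standard temperature scaling to a set of made-up logits. It assumes the usual softmax-with-temperature formulation; the logits and the function name are hypothetical and say nothing about the actual token distributions of GPT-OSS-20B.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], tau: float) -> List[float]:
    """Convert logits to sampling probabilities after dividing by temperature tau."""
    if tau <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
p_default = softmax_with_temperature(toy_logits, tau=1.0)
p_rollout = softmax_with_temperature(toy_logits, tau=1.5)
print(p_default)
print(p_rollout)
```

At $\tau=1.5$ the most likely option loses probability mass to the less likely ones, so the $m=4$ sampled traces are more likely to differ from one another, giving the rejection filter a wider set of candidates.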

#### Baselines

We compare against five baseline categories (Appendix [B.2](https://arxiv.org/html/2606.02812#A2.SS2)): (1) clinical risk models (LCRAT (Katki et al., [2016](https://arxiv.org/html/2606.02812#bib.bib13))); (2) supervised ML (Logistic Regression, XGBoost (Chen and Guestrin, [2016](https://arxiv.org/html/2606.02812#bib.bib63))); (3) sequential deep learning (RETAIN (Choi et al., [2016](https://arxiv.org/html/2606.02812#bib.bib25)), PatientTM (Silva and Matos, [2022](https://arxiv.org/html/2606.02812#bib.bib64))); (4) clinical BERT (Clinical ModernBERT (Lee et al., [2025](https://arxiv.org/html/2606.02812#bib.bib65))); and (5) GPT-OSS-20B-based LLM pipelines (vanilla LLM, RAG, and Traj-CoA (Zeng et al., [2025](https://arxiv.org/html/2606.02812#bib.bib35))).

#### Evaluation

For binary risk classification, we report AUROC, AUPRC, and F1 as primary metrics, alongside sensitivity, specificity, PPV, and NPV computed from the predicted risk $s_{i}$ and binary target $y_{i}$. All results are reported as bootstrap means and standard errors over 1,000 resamples of each test set.
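The bootstrap reporting described above can be sketched for AUROC as follows. This is a minimal illustration, assuming resampling with replacement at the patient level and taking the standard deviation of the resampled statistics as the standard error. Skipping resamples that contain a single class is an added assumption, and the toy scores and labels are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outranks a random control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(
    scores: Sequence[float],
    labels: Sequence[int],
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Bootstrap mean and standard error of AUROC."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # assumption: skip single-class resamples
            stats.append(auroc([scores[i] for i in idx], ys))
    return mean(stats), stdev(stats)


# Hypothetical predicted risks and outcomes for eight patients.
scores = [0.9, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0]
boot_mean, boot_se = bootstrap_auroc(scores, labels, n_resamples=200)
print(auroc(scores, labels), boot_mean, boot_se)
```

The pairwise AUROC here is quadratic in the number of cases and controls, which is fine for a toy example; a rank-based implementation would be preferable for test sets of the size used in the paper.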

## 5 Results

Table 1: Model performance comparison on overall population.

### 5.1 Main Results

#### Overall Population

As shown in Table[1](https://arxiv.org/html/2606.02812#S5.T1), Traj\-Evolve achieves state\-of\-the\-art discrimination in the overall population\. The combined Traj\-Evolve \(ExPool\+MARL\) achieves the best AUROC \(0\.86\), AUPRC \(0\.32\), and F1 \(0\.42\)\. It outperforms the strongest static LLM baseline, Traj\-CoA \(AUROC 0\.81\), as well as LCRAT \(0\.69\), XGBoost \(0\.76\), zero\-shot GPT\-OSS\-20B \(0\.79\), and RAG \(0\.81\)\. The two self\-evolving variants, ExPool \(0\.84\) and MARL \(0\.84\), also outperform all baselines when evaluated independently\. These results show that both mechanisms contribute meaningful improvements\.

Table 2: Model performance comparison on never-smoker population.
#### Never\-smoker Population

Table[2](https://arxiv.org/html/2606.02812#S5.T2)presents the never\-smoker population results\. For this cohort, traditional models degrade sharply \(LCRAT 0\.61; XGBoost 0\.71\), reflecting their reliance on smoking\-related features\. Traj\-Evolve remains robust with the ExPool variant \(0\.82\), MARL variant \(0\.82\), and the combined system \(AUROC 0\.84; AUPRC 0\.28; F1 0\.31\)\. These results demonstrate that the self\-evolving design of Traj\-Evolve can effectively adapt to this clinically challenging population\.

#### Reasoning Quality

Using a pairwise LLM-as-a-judge protocol similar to Zeng et al. ([2026](https://arxiv.org/html/2606.02812#bib.bib55)), we compare the final outputs from the static Traj-CoA and our Traj-Evolve system (Figure [3](https://arxiv.org/html/2606.02812#S5.F3)). Traj-Evolve is preferred over Traj-CoA in 69% of overall judgments, with consistent wins on detail, clinical reasoning, and temporal coherence rubrics (68%–73%). This indicates Traj-Evolve’s reasoning quality also improves alongside discrimination.

![Refer to caption](https://arxiv.org/html/2606.02812v1/x3.png)

Figure 3: LLM-as-a-judge evaluation comparing Traj-Evolve and Traj-CoA.

### 5.2 Evolutionary Dynamics of ExPool

#### Retrieval quality improves monotonically with pool size\.

As ExPool grows from <100 to 5,000 verified patients, average embedding distance between an index patient and its top-$k$ neighbors decreases (Figure [4](https://arxiv.org/html/2606.02812#S5.F4)A), Spearman correlations on age and predicted risk increase (Figure [4](https://arxiv.org/html/2606.02812#S5.F4)B), and purity by case status, sex, and smoking status rises well above random retrieval baselines (Figure [4](https://arxiv.org/html/2606.02812#S5.F4)C). The pool therefore retrieves progressively more clinically relevant neighbors as experience accumulates.
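To make the purity measurement concrete, the sketch below computes one plausible version of it. The exact definition is not given in this section, so the code assumes purity is the average fraction of retrieved neighbors that share the index patient's attribute value, and that the random baseline is the chance that two patients drawn at random (with replacement) share a value. The patient IDs and attribute values are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def retrieval_purity(
    neighbors: Dict[str, List[str]], attribute: Dict[str, Hashable]
) -> float:
    """Mean fraction of retrieved neighbors sharing the index patient's value."""
    per_patient = []
    for i, nbrs in neighbors.items():
        if nbrs:  # patients with no retrieved neighbors are skipped
            matches = sum(attribute[j] == attribute[i] for j in nbrs)
            per_patient.append(matches / len(nbrs))
    return sum(per_patient) / len(per_patient)


def random_purity_baseline(attribute: Dict[str, Hashable]) -> float:
    """Expected purity if neighbors were drawn at random: sum of squared class shares."""
    counts = Counter(attribute.values())
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


# Hypothetical top-1 retrieval results and smoking status.
neighbors = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["a"]}
smoking = {"a": "never", "b": "never", "c": "ever", "d": "ever"}

purity = retrieval_purity(neighbors, smoking)
baseline = random_purity_baseline(smoking)
print(purity, baseline)
```

In this toy case three of four index patients retrieve a neighbor with the same smoking status, so purity (0.75) exceeds the random baseline (0.5), the same kind of gap the dashed lines in Figure 4C are meant to expose.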

#### Optimal $k$ shifts from diversity to specificity.

Predictive AUROC (Figure [4](https://arxiv.org/html/2606.02812#S5.F4)D) reveals a tradeoff between retrieval diversity and specificity. When ExPool is small, a larger $k$ is best, potentially leveraging diversity to compensate for sparse coverage; once ExPool matures, a smaller $k$ wins, indicating that dense pools may prefer specificity-driven retrieval. Across all sizes, the few-shot framework remains above the zero-shot static Traj-CoA baseline.

![Refer to caption](https://arxiv.org/html/2606.02812v1/x4.png)

Figure 4: Evolution of the ExPool. As ExPool size increases: A, average distance to retrieved neighbors decreases; B, index-neighbor Spearman correlations (age, risk score) increase; and C, retrieval purity (case status, sex, never-smoker) improves. Dashed lines in C indicate random retrieval baselines. D, AUROC trajectories for $k\in\{5,10,15\}$ retrieved patients (mean of 3 seeds). The dashed line denotes the baseline without ExPool (0.814).

### 5.3 Evolutionary Dynamics of MARL

#### Decoupled training yields asymmetric learning curves\.

Worker and manager agents both show sharp initial loss decreases followed by stabilization \(Figure[5](https://arxiv.org/html/2606.02812#S5.F5)A\)\. However, the manager loss converges quickly, while the worker loss continues to decline through 5,000 training samples\. This is consistent with the intuition that manager summarization can be easier to internalize than fine\-grained, sequential temporal reasoning and collaborative memory distillation across worker agents and the LTM\.

#### Test AUROC scales with verified experience

Test AUROC rises from about 0\.81 to 0\.83 as the MARL training pool grows to 5,000 samples \(Figure[5](https://arxiv.org/html/2606.02812#S5.F5)B\)\. The simultaneous decrease of worker loss and continued AUROC gains suggest that optimizing temporal reasoning may bridge the accumulation of verified experience and task performance\.

![Refer to caption](https://arxiv.org/html/2606.02812v1/x5.png)

Figure 5: Evolving MARL performance during agent training. A, Loss curves for the worker and manager agents across training iterations. B, Model performance (mean of 3 seeds) as the number of training samples increases.

### 5.4 Mechanism Analysis

#### ExPool and MARL exert complementary effects on the score distribution\.

Density plots of Traj-Evolve risk scores against the static Traj-CoA (Figure [6](https://arxiv.org/html/2606.02812#S5.F6)) reveal distinct optimization patterns of the two evolving strategies. ExPool acts primarily on the negative class, pulling control scores downward (improving specificity) at the cost of mildly depressing some case scores. MARL acts primarily on ambiguous mid-range cases, lifting them toward high risk estimates (improving sensitivity). Their combination preserves both effects, producing the cleanest separation between cases and controls. This analysis is also consistent with the observed performance metrics in Table [1](https://arxiv.org/html/2606.02812#S5.T1) and Table [2](https://arxiv.org/html/2606.02812#S5.T2), where Traj-Evolve generally has higher specificity with ExPool, higher sensitivity with MARL, and balanced sensitivity and specificity for the combined system.
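The link between score shifts and threshold metrics can be sketched with a toy example. It assumes a fixed decision threshold of 0.5 (the threshold actually used is not stated in this section) and uses made-up scores that mimic the two described effects: pushing control scores down and lifting mid-range case scores up.

```python
from typing import Dict, Sequence


def confusion_metrics(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV at a fixed decision threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


labels = [1, 1, 0, 0]                  # two cases, two controls (hypothetical)
baseline = [0.7, 0.4, 0.6, 0.2]        # static scores
expool_like = [0.7, 0.4, 0.3, 0.1]     # control scores pulled downward
marl_like = [0.7, 0.6, 0.6, 0.2]       # mid-range case lifted upward

for name, scores in [("baseline", baseline),
                     ("expool_like", expool_like),
                     ("marl_like", marl_like)]:
    print(name, confusion_metrics(scores, labels, threshold=0.5))
```

Lowering control scores moves a false positive below the threshold and raises specificity, while lifting an ambiguous case moves a false negative above it and raises sensitivity, which is the pattern the density plots describe.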

![Refer to caption](https://arxiv.org/html/2606.02812v1/x6.png)

Figure 6: Optimization properties of Traj-Evolve’s self-evolving mechanisms. Density scatter plots comparing the predicted risk scores of Traj-Evolve variants against the static Traj-CoA baseline. Arrows illustrate how strongly Traj-Evolve shifts the scores relative to Traj-CoA (wider means stronger). Points are jittered to facilitate visualization.

## 6 Conclusion

We present Traj\-Evolve, a self\-evolving multi\-agent framework for lung cancer early detection from longitudinal EHRs\. By combining non\-parametric few\-shot retrieval \(ExPool\) with parametric reasoning internalization \(MARL\), Traj\-Evolve emulates the continuous learning of experienced clinicians\. Traj\-Evolve establishes a new state\-of\-the\-art for one\-year lung cancer prediction and shows particular robustness in challenging subgroups like never\-smokers\. Furthermore, we demonstrate that ExPool and MARL are highly complementary, improving specificity and sensitivity, respectively\. Future work will extend this framework to real\-world streaming settings with richer reward signals and additional clinical tasks\.

## Limitations

First, our evaluation was conducted within a single-institution retrospective case-control design. Prospective and multi-institutional validation will be essential to characterize calibration, generalizability across health systems and patient populations, and performance under genuine deployment prevalence. Second, the self-evolving mechanisms depend on ground-truth diagnostic labels. In clinical practice, definitive labels for incident lung cancer may take months to years to materialize and are subject to verification noise, loss to follow-up, and ascertainment bias. The current framework also assumes that label-conditioned rejection sampling reliably identifies the optimal reasoning trace, yet a roll-out with the correct final prediction does not guarantee correct intermediate reasoning. Incorporating process-level reward signals (Zhang et al., [2025b](https://arxiv.org/html/2606.02812#bib.bib62)) may mitigate these issues. Finally, this paper focused on one-year incident lung cancer prediction with a five-year look-back window. Extending Traj-Evolve to other outcomes, longer or shorter prediction horizons, and continuously updating patient trajectories in real-world streaming settings will be necessary to establish the framework as a general-purpose paradigm for self-evolving longitudinal EHR modeling.

## Ethical Considerations

We received approval from the Institutional Review Board (IRB) to access the patient data. All code and data are stored and executed on HIPAA-compliant servers.

## Acknowledgement

This work was supported by the National Institutes of Health (NIH), National Cancer Institute (Grant No. 1R01CA248422-01A1). Additional support was provided by the NIH under award number R35CA274442 to R.E. and the Rosalie and Harold Rea Brown Endowed Chair at Fred Hutch Cancer Center to R.E.

## References

- MedAgentSim: self\-evolving multi\-agent simulations for realistic clinical interactions\.InInternational Conference on Medical Image Computing and Computer\-Assisted Intervention,pp\. 362–372\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p5.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Chen, X\. Li, T\. Yang, H\. Wang, W\. Dong, and Y\. Gao \(2025\)Mdteamgpt: a self\-evolving llm\-based multi\-agent framework for multi\-disciplinary team medical consultation\.arXiv preprint arXiv:2503\.13856\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p5.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Chen and C\. Guestrin \(2016\)XGBoost: A Scalable Tree Boosting System\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,pp\. 785–794\.Note:arXiv:1603\.02754 \[cs\]External Links:[Link](http://arxiv.org/abs/1603.02754),[Document](https://dx.doi.org/10.1145/2939672.2939785)Cited by:[§B\.2](https://arxiv.org/html/2606.02812#A2.SS2.p1.1),[§4](https://arxiv.org/html/2606.02812#S4.SS0.SSS0.Px3.p1.1)\.
- E\. Choi, M\. T\. Bahadori, J\. Sun, J\. Kulas, A\. Schuetz, and W\. Stewart \(2016\)Retain: an interpretable predictive model for healthcare using reverse time attention mechanism\.Advances in neural information processing systems29\.Cited by:[§B\.2](https://arxiv.org/html/2606.02812#A2.SS2.p1.1),[§4](https://arxiv.org/html/2606.02812#S4.SS0.SSS0.Px3.p1.1)\.
- H\. Cui, A\. Unell, B\. Chen, J\. A\. Fries, E\. Alsentzer, S\. Koyejo, and N\. H\. Shah \(2025\)Timer: temporal instruction modeling and evaluation for longitudinal clinical records\.npj Digital Medicine8\(1\),pp\. 577\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p2.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- M\. E\. D’Arcy, R\. M\. Pfeiffer, M\. C\. Bradley, P\. H\. Hoang, T\. Tran, J\. P\. McElderry, M\. Li, M\. Kebede, C\. T\. DellaValle, S\. Rivas,et al\.\(2025\)Inflammatory diseases and risk of lung cancer among individuals who have never smoked\.Nature communications16\(1\),pp\. 5095\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p1.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\)QLoRA: Efficient Finetuning of Quantized LLMs\.arXiv\.Note:arXiv:2305\.14314 \[cs\]External Links:[Link](http://arxiv.org/abs/2305.14314)Cited by:[§B\.1](https://arxiv.org/html/2606.02812#A2.SS1.p1.2),[§4](https://arxiv.org/html/2606.02812#S4.SS0.SSS0.Px2.p1.2)\.
- H\. Dong, W\. Xiong, D\. Goyal, Y\. Zhang, W\. Chow, R\. Pan, S\. Diao, J\. Zhang, K\. Shum, and T\. Zhang \(2023\)Raft: reward ranked finetuning for generative foundation model alignment\.arXiv preprint arXiv:2304\.06767\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p4.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1),[§3\.4](https://arxiv.org/html/2606.02812#S3.SS4.p1.2)\.
- K\. W\. Eva \(2005\)What every teacher needs to know about clinical reasoning\.Medical education39\(1\),pp\. 98–106\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p3.1)\.
- A\. K\. Ganti, A\. B\. Klein, I\. Cotarla, B\. Seal, and E\. Chou \(2021\)Update of incidence, prevalence, survival, and initial treatment in patients with non–small cell lung cancer in the us\.JAMA oncology7\(12\),pp\. 1824–1832\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p1.1)\.
- H\. Gao, J\. Geng, W\. Hua, M\. Hu, X\. Juan, H\. Liu, S\. Liu, J\. Qiu, X\. Qi, Y\. Wu,et al\.\(2025a\)A survey of self\-evolving agents: what, when, how, and where to evolve on the path to artificial super intelligence\.arXiv preprint arXiv:2507\.21046\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p4.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Gao, M\. Rahman, J\. Caskey, M\. Oguss, A\. O’Rourke, R\. Brown, A\. Stey, A\. Mayampurath, M\. M\. Churpek, G\. Chen, and M\. Afshar \(2025b\)MoMA: a mixture\-of\-multimodal\-agents architecture for enhancing clinical prediction modelling\.npj Digital Medicine9\(1\),pp\. 46\(en\)\.External Links:ISSN 2398\-6352,[Link](https://www.nature.com/articles/s41746-025-02219-4),[Document](https://dx.doi.org/10.1038/s41746-025-02219-4)Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. He, J\. Liu, Z\. Hu, Y\. Chen, Y\. Liu, Y\. Sui, Y\. Li, N\. Chen, J\. Hu, B\. Hooi, X\. Xu, and J\. Bian \(2026\)EvoClinician: A Self\-Evolving Agent for Multi\-Turn Medical Diagnosis via Test\-Time Evolutionary Learning\.arXiv\.Note:arXiv:2601\.22964 \[cs\.AI\]External Links:[Link](http://arxiv.org/abs/2601.22964),[Document](https://dx.doi.org/10.48550/arXiv.2601.22964)Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- P\. B\. Jensen, L\. J\. Jensen, and S\. Brunak \(2012\)Mining electronic health records: towards better research applications and clinical care\.Nature Reviews Genetics13\(6\),pp\. 395–405\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p1.1)\.
- H\. A\. Katki, S\. A\. Kovalchik, C\. D\. Berg, L\. C\. Cheung, and A\. K\. Chaturvedi \(2016\)Development and validation of risk models to select ever\-smokers for ct lung cancer screening\.Jama315\(21\),pp\. 2300–2311\.Cited by:[§B\.2](https://arxiv.org/html/2606.02812#A2.SS2.p1.1),[§4](https://arxiv.org/html/2606.02812#S4.SS0.SSS0.Px3.p1.1)\.
- E\. Kim, S\. M\. Rubinstein, K\. T\. Nead, A\. P\. Wojcieszynski, P\. E\. Gabriel, and J\. L\. Warner \(2019\)The evolving use of electronic health records \(ehr\) for research\.InSeminars in radiation oncology,Vol\.29,pp\. 354–361\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p1.1)\.
- M\. Kruse, S\. Hu, N\. Derby, Y\. Wu, S\. Stonbraker, B\. Yao, D\. Wang, E\. Goldberg, and Y\. Gao \(2025\)Large language models with temporal reasoning for longitudinal clinical summarization and prediction\.InFindings of ACL\. EMNLP\. Conference on Empirical Methods in Natural Language Processing,Vol\.2025,pp\. 20715–20735\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p2.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient Memory Management for Large Language Model Serving with PagedAttention\.arXiv\.Note:arXiv:2309\.06180 \[cs\.LG\]External Links:[Link](http://arxiv.org/abs/2309.06180),[Document](https://dx.doi.org/10.48550/arXiv.2309.06180)Cited by:[§B\.1](https://arxiv.org/html/2606.02812#A2.SS1.p1.2)\.
- H\. L\. Lancaster, M\. A\. Heuvelmans, and M\. Oudkerk \(2022\)Low\-dose computed tomography lung cancer screening: clinical evidence and implementation research\.Journal of internal medicine292\(1\),pp\. 68–80\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p1.1)\.
- S\. A\. Lee, A\. Wu, and J\. N\. Chiang \(2025\)Clinical ModernBERT: An efficient and long context encoder for biomedical text\.arXiv\.Note:arXiv:2504\.03964 \[cs\]External Links:[Link](http://arxiv.org/abs/2504.03964),[Document](https://dx.doi.org/10.48550/arXiv.2504.03964)Cited by:[§B\.2](https://arxiv.org/html/2606.02812#A2.SS2.p1.1),[§4](https://arxiv.org/html/2606.02812#S4.SS0.SSS0.Px3.p1.1)\.
- D\. Li, J\. Liang, W\. Li, X\. Wang, L\. Cao, and K\. Yu \(2025a\)CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records\.\(en\)\.External Links:[Link](https://arxiv.org/abs/2507.22533v2)Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Li, Y\. Lai, W\. Li, J\. Ren, M\. Zhang, X\. Kang, S\. Wang, P\. Li, Y\. Zhang, W\. Ma,et al\.\(2024\)Agent hospital: a simulacrum of hospital with evolvable medical agents\.arXiv preprint arXiv:2405\.02957\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p5.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Li, X\. Wang, D\. Berlowitz, J\. Mez, H\. Lin, and H\. Yu \(2025b\)CARE\-AD: a multi\-agent large language model framework for Alzheimer’s disease prediction using longitudinal clinical notes\.npj Digital Medicine8\(1\),pp\. 541\(en\)\.External Links:ISSN 2398\-6352,[Link](https://www.nature.com/articles/s41746-025-01940-4),[Document](https://dx.doi.org/10.1038/s41746-025-01940-4)Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Liao, M\. Wen, J\. Wang, and W\. Zhang \(2025\)Marft: multi\-agent reinforcement fine\-tuning\.arXiv preprint arXiv:2504\.16129\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p4.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\)Lost in the Middle: How Language Models Use Long Contexts\.Transactions of the Association for Computational Linguistics12,pp\. 157–173\.Note:Place: Cambridge, MAExternal Links:[Link](https://aclanthology.org/2024.tacl-1.9/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00638)Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Ma, T\. Hu, Z\. Pu, B\. Liu, X\. Ai, Y\. Liang, and M\. Chen \(2024\)Coevolving with the other you: fine\-tuning llm with sequential cooperative multi\-agent reinforcement learning\.Advances in Neural Information Processing Systems37,pp\. 15497–15525\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p4.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Makarov, M\. Bordukova, P\. Quengdaeng, D\. Garger, R\. Rodriguez\-Esteban, F\. Schmich, and M\. P\. Menden \(2025\)Large language models forecast patient health trajectories enabling digital twins\.npj Digital Medicine8\(1\),pp\. 588\.Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Nussbaum, J\. X\. Morris, B\. Duderstadt, and A\. Mulyar \(2024\)Nomic Embed: Training a Reproducible Long Context Text Embedder\.\(en\)\.External Links:[Link](https://arxiv.org/abs/2402.01613v2)Cited by:[§4](https://arxiv.org/html/2606.02812#S4.SS0.SSS0.Px2.p1.2)\.
- OpenAI, S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, B\. Barak, A\. Bennett, T\. Bertao, N\. Brett, E\. Brevdo, G\. Brockman, S\. Bubeck, C\. Chang, K\. Chen, M\. Chen, E\. Cheung, A\. Clark, D\. Cook, M\. Dukhan, C\. Dvorak, K\. Fives, V\. Fomenko, T\. Garipov, K\. Georgiev, M\. Glaese, T\. Gogineni, A\. Goucher, L\. Gross, K\. G\. Guzman, J\. Hallman, J\. Hehir, J\. Heidecke, A\. Helyar, H\. Hu, R\. Huet, J\. Huh, S\. Jain, Z\. Johnson, C\. Koch, I\. Kofman, D\. Kundel, J\. Kwon, V\. Kyrylov, E\. Y\. Le, G\. Leclerc, J\. P\. Lennon, S\. Lessans, M\. Lezcano\-Casado, Y\. Li, Z\. Li, J\. Lin, J\. Liss, Lily, Liu, J\. Liu, K\. Lu, C\. Lu, Z\. Martinovic, L\. McCallum, J\. McGrath, S\. McKinney, A\. McLaughlin, S\. Mei, S\. Mostovoy, T\. Mu, G\. Myles, A\. Neitz, A\. Nichol, J\. Pachocki, A\. Paino, D\. Palmie, A\. Pantuliano, G\. Parascandolo, J\. Park, L\. Pathak, C\. Paz, L\. Peran, D\. Pimenov, M\. Pokrass, E\. Proehl, H\. Qiu, G\. Raila, F\. Raso, H\. Ren, K\. Richardson, D\. Robinson, B\. Rotsted, H\. Salman, S\. Sanjeev, M\. Schwarzer, D\. Sculley, H\. Sikchi, K\. Simon, K\. Singhal, Y\. Song, D\. Stuckey, Z\. Sun, P\. Tillet, S\. Toizer, F\. Tsimpourlas, N\. Vyas, E\. Wallace, X\. Wang, M\. Wang, O\. Watkins, K\. Weil, A\. Wendling, K\. Whinnery, C\. Whitney, H\. Wong, L\. Yang, Y\. Yang, M\. Yasunaga, K\. Ying, W\. Zaremba, W\. Zhan, C\. Zhang, B\. Zhang, E\. Zhang, and S\. Zhao \(2025\)Gpt\-oss\-120b & gpt\-oss\-20b Model Card\.arXiv\.Note:arXiv:2508\.10925 \[cs\]External Links:[Link](http://arxiv.org/abs/2508.10925),[Document](https://dx.doi.org/10.48550/arXiv.2508.10925)Cited by:[§B\.2](https://arxiv.org/html/2606.02812#A2.SS2.p1.1),[§4](https://arxiv.org/html/2606.02812#S4.SS0.SSS0.Px2.p1.2)\.
- V\. L\. Patel, J\. F\. Arocha, and J\. Zhang \(2005\)Thinking and reasoning in medicine\.The Cambridge handbook of thinking and reasoning14\(727\-750\),pp\. 1\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p3.1)\.
- C\. Pellegrini, E\. Özsoy, D\. Bani\-Harouni, M\. Keicher, and N\. Navab \(2025\)From ehrs to patient pathways: scalable modeling of longitudinal health trajectories with llms\.arXiv preprint arXiv:2506\.04831\.Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Qu and M\. Färber \(2026\)TRACE: Temporal Reasoning via Agentic Context Evolution for Streaming Electronic Health Records \(EHRs\)\.\(en\)\.External Links:[Link](https://arxiv.org/abs/2602.12833v1)Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- M\. D\. Ruopp, N\. J\. Perkins, B\. W\. Whitcomb, and E\. F\. Schisterman \(2008\)Youden Index and Optimal Cut\-Point Estimated from Observations Affected by a Lower Limit of Detection\.Biometrical journal\. Biometrische Zeitschrift50\(3\),pp\. 419–430\.External Links:ISSN 0323\-3847,[Link](https://pmc.ncbi.nlm.nih.gov/articles/PMC2515362/),[Document](https://dx.doi.org/10.1002/bimj.200710415)Cited by:[§B\.2](https://arxiv.org/html/2606.02812#A2.SS2.p3.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p4.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- J\. F\. Silva and S\. Matos \(2022\)Modelling patient trajectories using multimodal information\.Journal of Biomedical Informatics134,pp\. 104195\.External Links:ISSN 1532\-0464,[Link](https://www.sciencedirect.com/science/article/pii/S1532046422002003),[Document](https://dx.doi.org/10.1016/j.jbi.2022.104195)Cited by:[§B\.2](https://arxiv.org/html/2606.02812#A2.SS2.p1.1),[§4](https://arxiv.org/html/2606.02812#S4.SS0.SSS0.Px3.p1.1)\.
- H\. Sung, J\. Ferlay, R\. L\. Siegel, M\. Laversanne, I\. Soerjomataram, A\. Jemal, and F\. Bray \(2021\)Global cancer statistics 2020: globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries\.CA: a cancer journal for clinicians71\(3\),pp\. 209–249\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p1.1)\.
- X\. Tang, T\. Qin, T\. Peng, Z\. Zhou, D\. Shao, T\. Du, X\. Wei, P\. Xia, F\. Wu, H\. Zhu,et al\.\(2025\)Agent kb: leveraging cross\-domain experience for agentic problem solving\.arXiv preprint arXiv:2507\.06229\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p4.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1),[§3\.3](https://arxiv.org/html/2606.02812#S3.SS3.p1.1)\.
- Z\. Wang, K\. Wang, Q\. Wang, P\. Zhang, L\. Li, Z\. Yang, X\. Jin, K\. Yu, M\. N\. Nguyen, L\. Liu, E\. Gottlieb, Y\. Lu, K\. Cho, J\. Wu, L\. Fei\-Fei, L\. Wang, Y\. Choi, and M\. Li \(2025\)RAGEN: Understanding Self\-Evolution in LLM Agents via Multi\-Turn Reinforcement Learning\.arXiv\.Note:arXiv:2504\.20073 \[cs\.LG\]External Links:[Link](http://arxiv.org/abs/2504.20073),[Document](https://dx.doi.org/10.48550/arXiv.2504.20073)Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Wu, X\. Wang, J\. Mei, P\. Cai, D\. Fu, C\. Yang, L\. Wen, X\. Yang, Y\. Shen, Y\. Wang,et al\.\(2025\)Evolver: self\-evolving llm agents through an experience\-driven lifecycle\.arXiv preprint arXiv:2510\.16079\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p4.1)\.
- W\. Xiong, J\. Yao, Y\. Xu, B\. Pang, L\. Wang, D\. Sahoo, J\. Li, N\. Jiang, T\. Zhang, C\. Xiong,et al\.\(2025\)A minimalist approach to llm reasoning: from rejection sampling to reinforce\.arXiv preprint arXiv:2504\.11343\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p4.1)\.
- Z\. Xu, M\. J\. Cruz, M\. Guevara, T\. Wang, M\. Deshpande, X\. Wang, and Z\. Li \(2024\)Retrieval\-Augmented Generation with Knowledge Graphs for Customer Service Question Answering\.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 2905–2909\.Note:arXiv:2404\.17723 \[cs\]External Links:[Link](http://arxiv.org/abs/2404.17723),[Document](https://dx.doi.org/10.1145/3626772.3661370)Cited by:[§3\.3](https://arxiv.org/html/2606.02812#S3.SS3.SSS0.Px2.p1.4)\.
- E\. Zelikman, Y\. Wu, J\. Mu, and N\. D\. Goodman \(2022\)STaR: Bootstrapping Reasoning With Reasoning\.arXiv\.Note:arXiv:2203\.14465 \[cs\.LG\]External Links:[Link](http://arxiv.org/abs/2203.14465),[Document](https://dx.doi.org/10.48550/arXiv.2203.14465)Cited by:[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Zeng, Y\. Fu, S\. Zhou, Z\. Yu, L\. J\. Liu, J\. Wen, M\. Thompson, R\. Etzioni, and M\. Yetisgen \(2025\)Traj\-coa: patient trajectory modeling via chain\-of\-agents for lung cancer risk prediction\.arXiv preprint arXiv:2510\.10454\.Cited by:[§B\.2](https://arxiv.org/html/2606.02812#A2.SS2.p1.1),[Appendix D](https://arxiv.org/html/2606.02812#A4.p1.1),[§1](https://arxiv.org/html/2606.02812#S1.p2.1),[§1](https://arxiv.org/html/2606.02812#S1.p6.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.02812#S3.SS2.p1.3),[§4](https://arxiv.org/html/2606.02812#S4.SS0.SSS0.Px3.p1.1)\.
- S\. Zeng, Y\. W\. Kim, W\. Lau, E\. Alipour, R\. Etzioni, M\. Yetisgen, and A\. Oka \(2026\)TrajOnco: a multi\-agent framework for temporal reasoning over longitudinal ehr for multi\-cancer early detection\.arXiv preprint arXiv:2604\.10386\.Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p2.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.02812#S3.SS2.p1.3),[§5\.1](https://arxiv.org/html/2606.02812#S5.SS1.SSS0.Px3.p1.1)\.
- K\. Zhang, K\. Tian, R\. Liu, S\. Zeng, X\. Zhu, G\. Jia, Y\. Fan, X\. Lv, Y\. Zuo, C\. Jiang,et al\.\(2025a\)Marti: a framework for multi\-agent llm systems reinforced training and inference\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.02812#S1.p4.1),[§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Zhang, Y\. Zuo, B\. He, Y\. Sun, R\. Liu, C\. Jiang, Y\. Fan, K\. Tian, G\. Jia, P\. Li, Y\. Fu, X\. Lv, Y\. Zhang, S\. Zeng, S\. Qu, H\. Li, S\. Wang, Y\. Wang, X\. Long, F\. Liu, X\. Xu, J\. Ma, X\. Zhu, E\. Hua, Y\. Liu, Z\. Li, H\. Chen, X\. Qu, Y\. Li, W\. Chen, Z\. Yuan, J\. Gao, D\. Li, Z\. Ma, G\. Cui, Z\. Liu, B\. Qi, N\. Ding, and B\. Zhou \(2025b\) A Survey of Reinforcement Learning for Large Reasoning Models\. arXiv\. Note: arXiv:2509\.08827 \[cs\] External Links: [Link](http://arxiv.org/abs/2509.08827), [Document](https://dx.doi.org/10.48550/arXiv.2509.08827) Cited by: [§1](https://arxiv.org/html/2606.02812#S1.p4.1), [Limitations](https://arxiv.org/html/2606.02812#Sx1.p1.1)\.
- Y\. Zhang, R\. Sun, Y\. Chen, T\. Pfister, R\. Zhang, and S\. Ö\. Arik \(2024\) Chain of Agents: Large Language Models Collaborating on Long\-Context Tasks\. arXiv\. Note: arXiv:2406\.02818 \[cs\] External Links: [Link](http://arxiv.org/abs/2406.02818), [Document](https://dx.doi.org/10.48550/arXiv.2406.02818) Cited by: [§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1), [§3\.2](https://arxiv.org/html/2606.02812#S3.SS2.p1.3)\.
- Z\. Zhang, Q\. Dai, X\. Bo, C\. Ma, R\. Li, X\. Chen, J\. Zhu, Z\. Dong, and J\. Wen \(2025c\) A survey on the memory mechanism of large language model\-based agents\. ACM Transactions on Information Systems 43\(6\), pp\. 1–47\. Cited by: [§1](https://arxiv.org/html/2606.02812#S1.p4.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\) ExpeL: LLM agents are experiential learners\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 19632–19642\. Cited by: [§1](https://arxiv.org/html/2606.02812#S1.p4.1), [§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Zhou, Y\. Chen, S\. Guo, X\. Yan, K\. H\. Lee, Z\. Wang, K\. Y\. Lee, G\. Zhang, K\. Shao, L\. Yang, et al\. \(2025a\) Memento: fine\-tuning LLM agents without fine\-tuning LLMs\. arXiv preprint arXiv:2508\.16153\. Cited by: [§1](https://arxiv.org/html/2606.02812#S1.p4.1), [§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Zhou, C\. Li, C\. Yang, and J\. Lu \(2025b\) ClinNoteAgents: An LLM Multi\-Agent System for Predicting and Interpreting Heart Failure 30\-Day Readmission from Clinical Notes\. External Links: [Link](https://arxiv.org/abs/2512.07081v2) Cited by: [§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Zhou, M\. Yetisgen, and M\. Ostendorf \(2026\) RadTimeline: Timeline Summarization for Longitudinal Radiological Lung Findings\. External Links: [Link](https://arxiv.org/abs/2603.22820v1) Cited by: [Appendix A](https://arxiv.org/html/2606.02812#A1.p1.1)\.
- Y\. Zuo, K\. Zhang, S\. Qu, L\. Sheng, X\. Zhu, B\. Qi, Y\. Sun, G\. Cui, N\. Ding, and B\. Zhou \(2025\) TTRL: Test\-Time Reinforcement Learning\. arXiv\. Note: arXiv:2504\.16084 \[cs\] External Links: [Link](http://arxiv.org/abs/2504.16084), [Document](https://dx.doi.org/10.48550/arXiv.2504.16084) Cited by: [§2](https://arxiv.org/html/2606.02812#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix A Dataset Description

This retrospective case\-control study utilized an in\-house longitudinal dataset to predict first primary lung cancer diagnoses within a one\-year prediction period Zhou et al\. \([2026](https://arxiv.org/html/2606.02812#bib.bib57)\)\. The index date for prediction of both cases and controls was defined as the completion time of a qualifying chest\-related radiology exam \(chest CT, abdomen CT, or chest X\-ray\)\. To model patient trajectories, up to five years of EHR history prior to the index date was utilized\. This multimodal EHR data encompasses both structured records \(diagnosis, medication, lab, vital, and procedure\) and unstructured text \(clinical notes and radiology reports\)\.

Eligible patients were over 40 years old\. Cases were defined as individuals with a valid primary lung cancer diagnosis, excluding those with prior cancers\. To ensure predictions informed early detection and excluded active diagnostic workups, the index exam for cases was required to be completed between two months and one year prior to diagnosis\. Controls were defined as individuals with no cancer registry records of any cancer type; their index exams were selected to guarantee at least three years of cancer\-free follow\-up\. To ensure sufficient longitudinal data, included patients were required to have an active clinical encounter history exceeding 120 days within the prior five years\. Controls were matched to cases on index exam type and date \(overlapping within a three\-month window\) using a 1:10 case\-to\-control matching scheme\.
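The matching step can be illustrated with a minimal sketch. The text only states the matching criteria (same index exam type, index dates within a three\-month window, 1:10 ratio); the greedy sampling without replacement, the 90\-day reading of "three months", the record layout, and the name `match_controls` are all assumptions made here for illustration, not the authors' procedure.

```python
import random
from datetime import date

def match_controls(cases, controls, ratio=10, window_days=90, seed=0):
    """Match each case to up to `ratio` unused controls that share the
    index exam type and have an index date within `window_days`.

    Greedy, without replacement (an assumption; the paper does not say).
    """
    rng = random.Random(seed)
    used, matches = set(), {}
    for case in cases:
        eligible = [
            c for c in controls
            if c["id"] not in used
            and c["exam"] == case["exam"]
            and abs((c["index_date"] - case["index_date"]).days) <= window_days
        ]
        # Sample without replacement; fewer than `ratio` if not enough exist.
        chosen = rng.sample(eligible, min(ratio, len(eligible)))
        used.update(c["id"] for c in chosen)
        matches[case["id"]] = [c["id"] for c in chosen]
    return matches

# Hypothetical toy records: 12 eligible controls, 2 ineligible ones.
cases = [{"id": "case1", "exam": "chest CT", "index_date": date(2018, 6, 1)}]
controls = [
    {"id": f"ok{i}", "exam": "chest CT", "index_date": date(2018, 6, 1 + i)}
    for i in range(12)
]
controls.append({"id": "wrong_exam", "exam": "chest X-ray", "index_date": date(2018, 6, 1)})
controls.append({"id": "too_far", "exam": "chest CT", "index_date": date(2019, 6, 1)})
matched = match_controls(cases, controls)
```

A greedy pass like this is order\-dependent; an optimal matching algorithm would be an alternative if controls are scarce.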

The matched cohort was randomly partitioned into mutually exclusive training \(n=13,629\) and validation \(n=300\) sets, with the remainder assigned to a held\-out test set\. To assess model performance and generalizability, we constructed two evaluation cohorts via independent random sampling from this test set: an overall cohort of 1000 patients \(90 cases, 910 controls\) encompassing all smoking statuses, and a never\-smoker cohort of 835 patients \(27 cases, 808 controls\)\.

Baseline demographic, clinical, and lifestyle characteristics of the two test cohorts are summarized in Table[3](https://arxiv.org/html/2606.02812#A1.T3)\. For the overall population, patient trajectories spanned a median of 4\.5 years for cases and 3\.7 years for controls, and contained tens to hundreds of dated entries drawn from diagnoses, procedures, laboratories, medications, vital signs, and free\-text notes\. The per\-patient input was also exceptionally long, with median XML token counts exceeding 60,000 in the overall sample and 80,000 in the never\-smoker cases\. Time\-aware chunking reduced the mean per\-chunk token count to approximately 16,000, allowing Traj\-Evolve to process the full trajectory effectively while preserving chronology\.
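As a rough illustration of time\-aware chunking, the sketch below greedily packs date\-sorted entries into chunks under a token budget so that chronology is preserved within and across chunks. The `Entry` type, the precomputed `n_tokens` field, and the use of 16,000 as a hard budget are assumptions; the text reports 16,000 only as the approximate mean per\-chunk token count and does not specify the chunking rule.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Entry:
    when: date      # date of the EHR entry
    n_tokens: int   # token count of its serialized form (assumed precomputed)

def time_aware_chunks(entries, max_tokens=16_000):
    """Greedily pack date-sorted entries into chunks under a token budget."""
    chunks, current, used = [], [], 0
    for entry in sorted(entries, key=lambda e: e.when):
        # Close the current chunk when this entry would exceed the budget.
        if current and used + entry.n_tokens > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(entry)
        used += entry.n_tokens
    if current:
        chunks.append(current)
    return chunks

# Hypothetical trajectory with three dated entries, given out of order.
entries = [
    Entry(date(2015, 1, 1), 10_000),
    Entry(date(2016, 1, 1), 10_000),
    Entry(date(2014, 1, 1), 5_000),
]
chunks = time_aware_chunks(entries)
```

Here the 2014 and 2015 entries share the first chunk and the 2016 entry starts a second one. A single entry larger than the budget still gets its own chunk in this sketch rather than being split.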

Table 3: Baseline characteristics of cases and controls in the overall and never\-smoker cohorts\.

## Appendix B Experimental Settings

### B\.1 Implementation Details

We serve all LLMs using vLLM 0\.19\.1 Kwon et al\. \([2023](https://arxiv.org/html/2606.02812#bib.bib80)\) and implement MARL with QLoRA Dettmers et al\. \([2023](https://arxiv.org/html/2606.02812#bib.bib67)\) via the unsloth library \(v2026\.2\.1; https://unsloth\.ai/\) on 4 GPUs\. Both the worker and manager agents share the same training configuration: LoRA rank and $\alpha$ are both set to 32, with gradient checkpointing enabled to reduce memory overhead\. We use a per\-device batch size of 1 with gradient accumulation over 8 steps, yielding an effective batch size of 8 per device\. Models are trained for a single epoch with a linear learning rate schedule, a peak learning rate of $2\times 10^{-4}$, 5 warmup steps, and weight decay of 0\.01\. Optimization is performed using 8\-bit AdamW\. For LLM\-as\-a\-judge evaluation, we use GPT\-OSS\-120B as the judge model\.
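The batch size arithmetic and the learning rate schedule can be made concrete with a small sketch. The constants come from the text; the total number of optimizer steps (`total_steps`) is hypothetical, and the decay\-to\-zero shape after warmup is an assumption based on the common linear\-with\-warmup convention, since the text only says "linear".

```python
# Values from the text.
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 8
PEAK_LR = 2e-4
WARMUP_STEPS = 5

def effective_batch_per_device(per_device=PER_DEVICE_BATCH, accum=GRAD_ACCUM_STEPS):
    """Samples contributing to one optimizer step on a single device."""
    return per_device * accum

def linear_lr(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps.

    Decay to zero is an assumption, not stated in the text.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)

# Hypothetical run of 100 optimizer steps.
schedule = [linear_lr(s, total_steps=100) for s in range(101)]
```

With only 5 warmup steps, the peak rate is reached almost immediately and most of the epoch is spent on the decay phase.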

### B\.2 Baselines

We compare against baselines from five categories: \(i\) a clinical risk model, LCRAT \(Katki et al\., [2016](https://arxiv.org/html/2606.02812#bib.bib13)\); \(ii\) supervised machine learning models, logistic regression and XGBoost \(Chen and Guestrin, [2016](https://arxiv.org/html/2606.02812#bib.bib63)\); \(iii\) sequential deep learning models, RETAIN \(Choi et al\., [2016](https://arxiv.org/html/2606.02812#bib.bib25)\) and PatientTM \(Silva and Matos, [2022](https://arxiv.org/html/2606.02812#bib.bib64)\); \(iv\) a clinical BERT encoder, Clinical ModernBERT \(Lee et al\., [2025](https://arxiv.org/html/2606.02812#bib.bib65)\); and \(v\) LLM\-based pipelines built on GPT\-OSS\-20B \(OpenAI et al\., [2025](https://arxiv.org/html/2606.02812#bib.bib66)\) with or without long\-context modeling strategies, including vanilla single\-agent LLM, RAG, and Traj\-CoA \(Zeng et al\., [2025](https://arxiv.org/html/2606.02812#bib.bib35)\)\.

To ensure feasible and fair comparisons, input data modalities were tailored to each model’s specific architectural requirements\. Standard ML models utilized structured medical codes, whereas PatientTM incorporated both codes and unstructured clinical texts \(i\.e\., clinical notes and radiology reports\)\. For BERT\- and LLM\-based architectures, we employed the unified XML\-formatted EHR representation that preserved multimodal information, including codes, free text, and numerical laboratory values\. For long sequences exceeding model context limits, left\-truncation was applied to prioritize the most recent clinical records\.
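Left\-truncation amounts to keeping the tail of the token sequence. The sketch below assumes the serialized record is in chronological order, so the tail holds the most recent entries; the function name and the list\-of\-token\-IDs representation are illustrative.

```python
def left_truncate(token_ids, max_len):
    """Keep the last max_len tokens, dropping the oldest from the left."""
    if max_len <= 0:
        return []
    return token_ids[-max_len:]

recent = left_truncate([1, 2, 3, 4, 5], max_len=3)   # drops tokens 1 and 2
unchanged = left_truncate([1, 2], max_len=5)         # shorter than the limit
```

Truncating at the token level can cut an entry in half; truncating at entry boundaries would be a natural refinement.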

Supervised baselines, Clinical ModernBERT, and the MARL\-optimized Traj\-Evolve were trained on the same training set\. In contrast, the ExPool variant of Traj\-Evolve operated in a few\-shot manner, dynamically retrieving “patients\-like\-me” from the ExPool constructed using the training set, while zero\-shot baselines operated without access to the training set\. Optimal hyperparameter configurations for all trained baselines were determined using the validation set\. During evaluation, the threshold that maximizes Youden’s J\-index Ruopp et al\. \([2008](https://arxiv.org/html/2606.02812#bib.bib81)\) was used\.
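Youden's J is defined as sensitivity + specificity − 1, and the selected threshold is the one that maximizes it. A minimal sketch of that selection follows; the function name, the "predict positive when score ≥ threshold" convention, and the toy data are assumptions for illustration.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one case and one control")
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        # Predict positive when score >= t.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy example: two controls (scores 1, 2) and two cases (scores 3, 4).
threshold, j = youden_threshold([0, 0, 1, 1], [1, 2, 3, 4])
```

In this toy example the scores separate cases from controls perfectly, so the chosen threshold is 3 with J = 1. Ties are resolved in favor of the lowest threshold because the comparison is strict.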

## Appendix C Case Study

To qualitatively demonstrate the nuanced clinical reasoning capabilities enabled by the ExPool, we examined the diagnostic trajectory of a 50\-year\-old Asian female never\-smoker with a history of occupational asbestos exposure\. The index patient presented with a normal pulmonary function test \(PFT\) and cough in 2015, formally documented asbestos exposure in 2016, and the discovery of a 16 mm non\-calcified left upper lobe \(LUL\) nodule in 2018, with normal PFT and absence of pulmonary symptoms\. She was subsequently diagnosed with lung cancer\.

Traj\-Evolve used information from two of the 10 patients retrieved from the ExPool\. Patient 8 was a 55\-year\-old woman who shared documented asbestos exposure and progressive pulmonary fibrosis, but lacked a discrete mass and remained cancer\-free after 3 years\. Patient 10 was a 57\-year\-old Asian female never\-smoker who presented with an 18 mm LUL mass and was confirmed to have lung cancer within 6 months\.

In its generated reasoning rationale, Traj\-Evolve explicitly estimated the index patient’s risk by comparing her longitudinal trajectory to these retrieved cohorts\. The manager agent identified that the index patient “sits biologically between these extremes”, correctly weighing the mitigating factor of normal pulmonary function against the high\-risk modifiers of progressive environmental carcinogen exposure and a suspicious nodule\. The system assigned a risk score of 7/10\. This case exemplifies how the incorporation of verified clinical experience enables Traj\-Evolve to execute robust, comparative clinical judgment in atypical and highly challenging clinical presentations\.

![Refer to caption](https://arxiv.org/html/2606.02812v1/x7.png)

Figure 7: Case study demonstrating Traj\-Evolve \(ExPool\+MARL\)’s reasoning for a never\-smoker patient\. A UMAP projection maps the local semantic neighborhood of the index patient \(star\) alongside retrieved historical cases \(diamonds\)\. The right panel demonstrates how Traj\-Evolve balances the shared signals from retrieved patients and the index patient’s own characteristics\.

## Appendix D Prompts

We follow the same prompt template as Traj\-CoA Zeng et al\. \([2025](https://arxiv.org/html/2606.02812#bib.bib35)\) for Traj\-Evolve without ExPool\. With ExPool, the manager agent’s system prompt and user prompt are shown in Tables [4](https://arxiv.org/html/2606.02812#A4.T4) and [5](https://arxiv.org/html/2606.02812#A4.T5)\.

Table 4: System prompt for the manager agent with ExPool \(Patients\-Like\-Me retrieval\)\.

Table 5: User prompt for the manager agent with ExPool\.
