RAG4Outcome: A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis

arXiv cs.AI 05/25/26, 04:00 AM Papers
Summary
Proposes RAG4Outcome, a retrieval-augmented generation framework integrating multimodal clinical data (PET-CT reports, surgical records, follow-up notes) to improve prognostic prediction in chronic osteomyelitis, enhancing interpretability and clinical reliability.
arXiv:2605.22833v1 Announce Type: cross
# A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis
Source: [https://arxiv.org/html/2605.22833](https://arxiv.org/html/2605.22833)
- Pei Han, Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine, Shanghai, CN, hanpei_cn@163.com
- Jishizhan Chen, University College London, London, UK, jishizhan.chen@ucl.ac.uk
- Yang Wang, Shanghai Sixth People’s Hospital, Shanghai, CN, hongchayang@sina.com
- Xiaolei Diao∗, University College London, London, UK, xiaolei.diao@ucl.ac.uk
- Xianyou Zheng∗, Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine, Shanghai, CN, zhengxianyou@situ.edu.cn
- Pengfei Cheng∗, Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine, Shanghai, CN, chengpf@alumni.sjtu.edu.cn

###### Abstract

Chronic osteomyelitis presents substantial prognostic challenges due to its high recurrence risk and complex postoperative recovery trajectories\. Traditional assessment often relies on manual scoring systems, which limit scalability, efficiency, and consistency in clinical practice\. Furthermore, the heterogeneous nature of clinical data poses challenges for current multimodal learning approaches that require aligned inputs and large annotated datasets\. In this work, we propose RAG4Outcome, a retrieval\-augmented generation \(RAG\) framework for prognostic prediction in chronic osteomyelitis\. Our method integrates multimodal clinical data, including PET\-CT imaging reports, structured surgical and diagnostic records, and unstructured follow\-up notes, into a unified prediction pipeline\. By combining a domain\-specific retrieval corpus with expert\-guided prompting, the framework enables more interpretable, evidence\-grounded, and clinically reliable prognosis\. Preliminary results on real\-world cases demonstrate promising effectiveness and clinical alignment, highlighting the potential of RAG4Outcome for AI\-assisted infection management and postoperative decision support\.

∗ Corresponding author.

## 1 Introduction

Chronic osteomyelitis is a highly challenging orthopedic infection characterized by high recurrence risk and complex postoperative recovery trajectories\[[27](https://arxiv.org/html/2605.22833#bib.bib31),[17](https://arxiv.org/html/2605.22833#bib.bib8),[18](https://arxiv.org/html/2605.22833#bib.bib9)\]\. Accurate prognostic prediction is critical for tailoring personalized treatment strategies and ensuring long\-term care\[[8](https://arxiv.org/html/2605.22833#bib.bib11)\]\. However, current clinical assessment methods typically rely on expert\-defined scoring systems and manual evaluation processes\[[5](https://arxiv.org/html/2605.22833#bib.bib10)\], which are labor\-intensive, time\-consuming, and difficult to scale across different care settings\. This limits their effectiveness for continuous monitoring and early risk stratification in real\-world scenarios\.

In practice, clinicians must reason over a broad range of heterogeneous data types\[[13](https://arxiv.org/html/2605.22833#bib.bib12),[31](https://arxiv.org/html/2605.22833#bib.bib38),[30](https://arxiv.org/html/2605.22833#bib.bib39)\], including medical imaging reports, structured Electronic Health Records \(EHR\), and unstructured reports, to assess a patient’s recovery and recurrence risk\. These multimodal data sources are often incomplete, asynchronous, and loosely aligned, posing significant challenges for conventional multimodal learning models\[[32](https://arxiv.org/html/2605.22833#bib.bib13),[1](https://arxiv.org/html/2605.22833#bib.bib14)\]\. Existing AI\-based methods for postoperative prediction often rely on rigid data alignment, require large amounts of well\-labeled multimodal data\[[1](https://arxiv.org/html/2605.22833#bib.bib14)\], and tend to generalize poorly across real\-world clinical environments\[[9](https://arxiv.org/html/2605.22833#bib.bib15),[12](https://arxiv.org/html/2605.22833#bib.bib16)\]\. Moreover, critical indicators such as infection type, surgical timing, metabolic biomarkers, and symptom evolution are often embedded in unstructured clinical texts, making them difficult to extract and interpret\.

To address the above challenges, we propose RAG4Outcome, a retrieval-augmented multimodal framework designed for prognostic prediction in chronic osteomyelitis. Our method integrates multiple forms of clinical data into a unified prediction model, including PET-CT\[[11](https://arxiv.org/html/2605.22833#bib.bib30)\] imaging reports, structured EHRs, diagnostic and surgical reports, and follow-up documentation. A central Retrieval-Augmented Generation (RAG) module integrates external medical knowledge retrieved from a domain-specific corpus to produce evidence-grounded outcome predictions. This design enables reliable generation without requiring strict data alignment or exhaustive annotations\[[33](https://arxiv.org/html/2605.22833#bib.bib17),[1](https://arxiv.org/html/2605.22833#bib.bib14),[12](https://arxiv.org/html/2605.22833#bib.bib16),[16](https://arxiv.org/html/2605.22833#bib.bib18)\]. To enhance clinical interpretability, we incorporate 12 prognostic indicators identified by orthopedic experts as the most relevant to recovery outcomes in chronic osteomyelitis. These factors guide the structured prompt construction and support targeted retrieval and reasoning. We validate our approach using a real-world dataset of anonymized chronic osteomyelitis patients treated at a tertiary care center, each followed up for 3–6 years (the dataset was collected under institutional ethical approval, ensuring patient privacy and data governance compliance). Case studies on representative patients demonstrate that RAG4Outcome achieves high consistency with clinical scoring systems, while providing transparent, evidence-supported rationales. This work provides a promising step toward the development of reliable, interpretable, and scalable AI tools for infection prognosis and surgical decision support.
Rather than replacing established clinical scoring systems, our goal is to complement them with a transparent retrieval\-supported framework that can synthesize heterogeneous evidence and assist clinicians in real\-world postoperative assessment\. Our contributions are summarized as follows:

- We propose RAG4Outcome, a retrieval-augmented multimodal framework for clinical prognostic prediction in osteomyelitis, capable of integrating heterogeneous and partially missing clinical data into a unified model.
- We design an interpretable prognostic framework that leverages twelve expert-defined indicators and a curated medical retrieval corpus to extract relevant clinical cues and construct structured prompts from multimodal inputs.
- We conduct case-level evaluations using real-world patient data, demonstrating strong agreement with established scoring systems and indicating clinical utility.

## 2 Related Work

### 2.1 Prognostic Assessment in Osteomyelitis

Prognostic assessment in chronic osteomyelitis has traditionally relied on structured scoring systems designed for orthopedic and infection\-related outcome monitoring\. One of the most widely used scales is the Lower Extremity Functional Scale \(LEFS\), a patient\-reported outcome measure that evaluates lower\-limb functionality\[[2](https://arxiv.org/html/2605.22833#bib.bib19),[7](https://arxiv.org/html/2605.22833#bib.bib22),[15](https://arxiv.org/html/2605.22833#bib.bib23)\]\. The Enneking system assesses musculoskeletal tumor patients’ functional status and has been adapted for post\-treatment outcome assessment\[[28](https://arxiv.org/html/2605.22833#bib.bib20)\]\. The Cierny–Mader system evaluates chronic osteomyelitis cases based on anatomic type and host physiology, and is extensively used in clinical risk stratification\[[17](https://arxiv.org/html/2605.22833#bib.bib8),[6](https://arxiv.org/html/2605.22833#bib.bib21)\]\. Different learning strategies are also introduced in the community to optimize the performance of deep learning\-based methods\[[19](https://arxiv.org/html/2605.22833#bib.bib34),[20](https://arxiv.org/html/2605.22833#bib.bib35),[4](https://arxiv.org/html/2605.22833#bib.bib36),[3](https://arxiv.org/html/2605.22833#bib.bib37)\]\. While clinically valuable, these systems require manual interpretation and exhibit subjectivity, limiting scalability and real\-time application\. Furthermore, studies suggest that current scoring methods fall short in supporting automated or continuous prognostic monitoring\[[17](https://arxiv.org/html/2605.22833#bib.bib8)\], highlighting the need for AI\-driven, interpretable alternatives\.

### 2.2 Language Models and Retrieval-Augmented Generation in Medical AI

Large Language Models (LLMs) have shown promise in medical text understanding, summarization, and clinical decision support. BioGPT\[[14](https://arxiv.org/html/2605.22833#bib.bib24)\] and Med-PaLM\[[24](https://arxiv.org/html/2605.22833#bib.bib25),[25](https://arxiv.org/html/2605.22833#bib.bib26)\] demonstrate strong performance on biomedical QA and benchmark challenges. However, LLMs are prone to factual errors, commonly referred to as hallucinations\[[34](https://arxiv.org/html/2605.22833#bib.bib27)\], particularly in scenarios lacking reliable contextual information. To mitigate this, Retrieval-Augmented Generation, a hybrid architecture, combines retrieval-based evidence with generative capabilities. In clinical settings, RAG has been shown to improve factual consistency, support the interpretation of medical guidelines, and reduce hallucination risks\[[33](https://arxiv.org/html/2605.22833#bib.bib17),[10](https://arxiv.org/html/2605.22833#bib.bib28),[29](https://arxiv.org/html/2605.22833#bib.bib40),[21](https://arxiv.org/html/2605.22833#bib.bib41)\]. Some approaches also introduce knowledge bases into these systems\[[23](https://arxiv.org/html/2605.22833#bib.bib32),[22](https://arxiv.org/html/2605.22833#bib.bib33)\]. Yang et al.\[[12](https://arxiv.org/html/2605.22833#bib.bib16)\] applied RAG to guideline interpretation and demonstrated improvements in factual correctness and clinician agreement. Xiong et al.\[[33](https://arxiv.org/html/2605.22833#bib.bib17)\] established benchmarks for RAG in the medical domain and showed its superiority over closed-book LLMs in evidence-aware reasoning tasks.
Recent applications include MedRAG for evidence\-grounded diagnostic assistance\[[10](https://arxiv.org/html/2605.22833#bib.bib28)\], and a zero\-shot RAG\-based framework for automatic disease phenotyping from electronic health records, demonstrated on pulmonary hypertension as a case study\[[26](https://arxiv.org/html/2605.22833#bib.bib29)\]\. By explicitly linking model outputs to retrieved evidence, RAG enhances response consistency and credibility, while also allowing flexible integration of heterogeneous data sources without requiring input alignment or model retraining\.

## 3 The Proposed Method

In this section, we present the RAG4Outcome framework, a retrieval\-augmented multimodal system designed to support clinical prognostic prediction in chronic osteomyelitis\. As illustrated in Figure[1](https://arxiv.org/html/2605.22833#S3.F1), the proposed pipeline consists of two primary modules: \(1\) an Information Extraction Module that processes multimodal clinical data into interpretable textual representations, and \(2\) a RAG\-based Module that performs evidence\-grounded reasoning for outcome prediction using large language models enhanced with domain\-specific retrieval\.

![Refer to caption](https://arxiv.org/html/2605.22833v1/figures/overall_structure.png)

Figure 1: Overview of the RAG4Outcome framework. The system integrates PET-CT reports, EHR surgical records, and follow-up documentation through an information extraction pipeline, and then applies a RAG-based model to perform patient-level outcome prediction.

### 3.1 Overview of the Framework

Given a chronic osteomyelitis patient, the proposed framework takes three types of multimodal clinical data as input, including PET\-CT imaging reports, structured EHRs such as diagnostic and surgical records, and unstructured follow\-up documentation\. These heterogeneous data sources are processed by modality\-specific components, the Medical Image Processor and the Record Processor, to generate coherent and interpretable textual chunks, each representing key clinical factors defined by chronic osteomyelitis experts\.

The extracted chunks are used to construct patient-specific prompts, which are subsequently passed to the RAG-based Module. During inference, the module retrieves relevant domain-specific evidence from an external medical corpus. The final output includes a structured summary of the patient’s condition and a prognostic prediction describing the recovery outcome (excellent, good, fair, or poor), along with supporting explanations grounded in both patient data and retrieved external references. Unlike prior approaches that require tightly aligned multimodal inputs, RAG4Outcome supports heterogeneous, asynchronous, and partially missing data without the need for modality alignment. This is achieved by independently processing each input source and transforming it into a standardized semantic representation suitable for retrieval and reasoning.

### 3.2 Information Extraction Module

We denote a patient case as a collection of heterogeneous clinical documents $\mathcal{D}=\{d_1,d_2,\dots,d_N\}$, where each $d_i$ corresponds to one of the following modalities: PET-CT imaging reports, structured surgical or diagnostic EHRs, and unstructured follow-up documentation. Each input type is processed by a modality-specific preprocessor, namely the Medical Image Processor or the Record Processor, to extract semantically meaningful representations for downstream retrieval and generation.

Medical Image Processor. PET-CT imaging reports are parsed into structured textual fields that describe radiological findings and anatomical annotations. We incorporate Qwen2.5-VL-3B to extract expert-defined indicators from the PET-CT image and its natural language reports.

Record Processor. Surgical, diagnostic, and follow-up documents are segmented into sentences and sections. We apply Qwen3-4B as the processor to generate embeddings for sentences associated with predefined prognostic indicators, allowing targeted semantic understanding.

Each document chunk $d_i$ is encoded into a dense vector representation via a pretrained domain-specific encoder:

$$\mathbf{h}_i=\text{Processor}(d_i) \tag{1}$$

where $\mathbf{h}_i\in\mathbb{R}^{d}$ is the resulting embedding of the document chunk $d_i$. Modalities that are structurally expected but unavailable are handled by padding with null vectors $\mathbf{0}$.
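The padding rule for missing modalities can be sketched as follows. This is a minimal illustration only: `toy_processor` is a hypothetical hash-based stand-in for the actual Processor models, and the modality keys, the embedding size, and the example report text are assumptions made for the sketch.

```python
import hashlib
from typing import Dict, List, Optional

EMBED_DIM = 8  # hypothetical embedding size d
# Assumed modality keys; the paper names the modalities but not these identifiers.
MODALITIES = ("pet_ct_report", "structured_ehr", "follow_up_notes")


def toy_processor(text: str, dim: int = EMBED_DIM) -> List[float]:
    """Stand-in for Processor(d_i): a deterministic pseudo-embedding built from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map the first `dim` bytes of the digest to floats in [-1, 1].
    return [(b / 127.5) - 1.0 for b in digest[:dim]]


def encode_case(case: Dict[str, Optional[str]]) -> Dict[str, List[float]]:
    """Encode every expected modality, padding unavailable ones with the null vector."""
    encoded = {}
    for name in MODALITIES:
        text = case.get(name)
        encoded[name] = toy_processor(text) if text else [0.0] * EMBED_DIM
    return encoded


# Made-up example: the EHR entry is empty and the follow-up notes are absent,
# so both modalities are represented by zero vectors.
case = {"pet_ct_report": "Focal FDG uptake in the distal tibia.", "structured_ehr": None}
embeddings = encode_case(case)
```

Keeping one fixed-size vector per expected modality means downstream components always see the same layout, whether or not a given record exists for the patient.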

### 3.3 RAG-based Module

The RAG\-based module serves as the core reasoning engine in our framework, integrating structured patient data and external clinical knowledge to generate prognostic predictions with explanatory support\. This module consists of three key components: a domain\-specific retriever, an expert\-guided prognostic schema, and a retrieval\-augmented generation model\.

Domain-Specific Retriever: To improve factual consistency and medical relevance, we construct a retrieval corpus $\mathcal{C}=\{c_1,c_2,\dots,c_K\}$ consisting of curated domain-specific resources, including: (1) a general medical knowledge graph, (2) clinical guidelines for osteomyelitis treatment and infection management, (3) evidence-based postoperative outcome studies, (4) surgical decision support documents, and (5) institutional recovery protocols and expert narratives.

Given an input patient representation $z$ derived from the preceding information extraction module, we encode it using a query encoder $\mathcal{E}_q$ and retrieve the top-$k$ relevant documents using a dense retriever:

$$\mathcal{R}=\text{Top-}k\left(\text{FAISS}\left(\mathcal{E}_q(z),\{\mathcal{E}_d(c_j)\}_{j=1}^{K}\right)\right) \tag{2}$$

where $\mathcal{E}_d$ denotes the document encoder. The retrieved evidence set $\mathcal{R}$ is then combined with patient-specific prompts for generation.
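To make the retrieval step in Eq. (2) concrete, here is a minimal sketch that swaps the FAISS index for a brute-force inner-product search in plain Python. The toy vectors stand in for $\mathcal{E}_q(z)$ and $\mathcal{E}_d(c_j)$, and the choice of inner product as the similarity measure is an assumption, since the paper does not state which metric its index uses.

```python
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity between two embeddings of equal length."""
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query: Sequence[float], doc_vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k documents with the highest inner product."""
    scored = [(j, inner_product(query, vec)) for j, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings (illustrative values only).
query_vec = [1.0, 0.0, 0.5]
corpus_vecs = [
    [0.9, 0.1, 0.4],   # c_1
    [0.0, 1.0, 0.0],   # c_2
    [0.5, 0.0, 1.0],   # c_3
    [-1.0, 0.0, 0.0],  # c_4
]
retrieved = top_k(query_vec, corpus_vecs, k=2)  # selects c_1 and c_3
```

An exhaustive scan like this is only practical for small corpora; a vector index serves the same purpose at scale.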

Expert-Guided Prognostic Evaluation: To improve interpretability and anchor the generation process to clinically meaningful concepts, we define a prognostic evaluation $\mathcal{F}=\{f_1,f_2,\dots,f_{12}\}$ comprising twelve expert-selected features related to infection severity, surgical history, and imaging biomarkers. These indicators were identified by orthopedic experts as the most relevant factors for postoperative recovery assessment in chronic osteomyelitis, and are used to guide both structured prompt construction and targeted retrieval. Compared with directly prompting the model using raw multimodal records alone, this expert-guided design provides a more clinically grounded intermediate representation and helps organize heterogeneous patient evidence into a unified prognostic schema.

The twelve indicators span three complementary dimensions and are summarized in Table[1](https://arxiv.org/html/2605.22833#S3.T1)\. Specifically, they include:

- Infection and Surgical History: Aetiopathogenesis, Cierny-Mader classification, time intervals between index, debridement, and revision surgeries, prior debridement count, number of interventions between PET-CT and bone reconstruction, implant removal status, and surgical strategy.
- Clinical Biomarkers: WBC count.
- Radiological Indicators: SUVmax pre/post-debridement, TLG pre/post-debridement, and SUVmax location shift.

These indicators guide the construction of structured prompt templates that capture semantically relevant clinical information\. These factors jointly capture the major dimensions of postoperative prognosis, including infection origin and severity, surgical burden, systemic inflammatory activity, and residual metabolic evidence from PET\-CT\. By explicitly encoding these clinically meaningful variables, the framework is able to preserve medically relevant signals even when the underlying records are incomplete, asynchronous, or distributed across different document types\.

To make this expert\-guided prognostic schema explicit, Table[1](https://arxiv.org/html/2605.22833#S3.T1)summarizes the twelve indicators used in RAG4Outcome, together with their associated clinical roles and primary source modalities\. This design transforms heterogeneous multimodal evidence into a structured and clinically interpretable intermediate representation, improving both prompt consistency and the traceability of the final prognostic prediction\.

Table 1: Expert-defined prognostic indicators used in RAG4Outcome.

Each prognostic factor is instantiated from the extracted patient evidence and used to organize the multimodal record into a form that is more suitable for retrieval and downstream reasoning. This factor-based design also improves robustness in real-world clinical settings, where equivalent prognostic evidence may appear in different formats across PET-CT reports, surgical notes, laboratory tests, and follow-up documentation.

Retrieval-Augmented Generation for Prognostic Prediction: The language generation process is handled by a large language model $\mathcal{G}$ (we applied Qwen3-4B here), which takes as input a concatenated prompt comprising a patient data summary and retrieved evidence passages:

$$x=[\text{Prompt}(z);\mathcal{R}] \tag{3}$$

The prompt design leverages slot-filling mechanisms, where each prognostic factor $f_i\in\mathcal{F}$ is instantiated with extracted patient data and used to condition retrieval and generation:

$$\text{Prompt}(z)=\text{TemplateFill}(f_1(z),f_2(z),\dots,f_{12}(z)) \tag{4}$$

The generation objective is:

$$y=\mathcal{G}(x)=\arg\max_{y}\prod_{t=1}^{T}P(y_t\mid y_{<t},x) \tag{5}$$

where $y=\{y_1,y_2,\dots,y_T\}$ is the generated output sequence, consisting of: (1) a structured textual summary synthesizing patient-specific features, and (2) a final outcome prediction label $o\in\{\text{excellent},\text{good},\text{fair},\text{poor}\}$ with supporting rationale.

In our setting, the final prediction is produced in four levels, corresponding to clinically interpretable postoperative outcomes\. By combining structured patient\-specific factors with retrieved external evidence, the model is encouraged to generate prognostic assessments that are not only coherent and clinically informed, but also more transparent and verifiable\.
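The slot-filling and concatenation steps in Eqs. (3) and (4) might look roughly like the sketch below. The slot names, the prompt wording, the "not available" marker, and the example values are all hypothetical. Only five slots are shown for brevity, whereas the paper's schema has twelve indicators.

```python
from typing import Dict, List, Optional

# Hypothetical slot names for a subset of the expert-defined indicators.
SLOTS = [
    "aetiopathogenesis",
    "cierny_mader_class",
    "prior_debridement_count",
    "wbc_count",
    "suvmax_post_debridement",
]


def template_fill(factors: Dict[str, Optional[str]]) -> str:
    """Render one line per slot; indicators without a value are marked explicitly."""
    lines = []
    for slot in SLOTS:
        value = factors.get(slot)
        lines.append(f"{slot}: {value if value is not None else 'not available'}")
    return "\n".join(lines)


def build_input(factors: Dict[str, Optional[str]], evidence: List[str]) -> str:
    """Concatenate the filled prompt with retrieved passages, mirroring x = [Prompt(z); R]."""
    evidence_block = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(evidence))
    return (
        f"Patient factors:\n{template_fill(factors)}\n\n"
        f"Retrieved evidence:\n{evidence_block}"
    )


# Made-up patient values; wbc_count is deliberately missing.
factors = {
    "aetiopathogenesis": "post-traumatic",
    "cierny_mader_class": "type III",
    "prior_debridement_count": "2",
    "suvmax_post_debridement": "3.1",
}
evidence = ["Illustrative passage about recurrence after repeated debridement."]
x = build_input(factors, evidence)
```

Rendering missing indicators explicitly, rather than dropping them, keeps the prompt layout stable across patients with incomplete records.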

### 3.4 Implementation Details

In practice, the multimodal inputs are first normalized into textual units that can be consistently consumed by the retrieval and generation modules\. PET\-CT reports are converted into structured findings with emphasis on lesion activity, anatomical site, and temporal change, while surgical and follow\-up records are segmented into clinically meaningful units such as diagnosis, intervention history, laboratory findings, and postoperative evolution\. These normalized textual units are then mapped to the predefined prognostic schema and assembled into a patient\-specific structured prompt\.

For retrieval, we use a dense vector index built from a curated domain\-specific corpus containing osteomyelitis\-related guidelines, postoperative management literature, recovery assessment documents, and expert\-authored reference materials\. At inference time, the patient prompt is encoded as a query and matched against the indexed corpus to retrieve the most relevant evidence passages\. The retrieved passages are concatenated with the patient summary and then passed to the generator for final prognostic reasoning\. This design allows the framework to decouple heterogeneous data processing from downstream outcome prediction, making it more robust to missing modalities and variable documentation quality\.

To improve interpretability, the generated output is constrained to include two components: a structured summary of patient\-specific risk factors and a final categorical prognosis accompanied by explicit supporting evidence\. In this way, the model output remains traceable to both input records and external references, which is essential for high\-stakes clinical use\.
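One way to enforce this two-part output constraint is to parse and validate the generated text before it is used. The sketch below assumes a hypothetical `Summary:` / `Prognosis:` / `Evidence:` layout, since the paper does not specify the output format, and the example text is made up.

```python
import re
from typing import NamedTuple

LABELS = ("excellent", "good", "fair", "poor")


class Prognosis(NamedTuple):
    summary: str
    label: str
    evidence: str


# Assumed output layout: three labeled sections in a fixed order.
PATTERN = re.compile(
    r"Summary:\s*(?P<summary>.+?)\s*Prognosis:\s*(?P<label>\w+)\s*Evidence:\s*(?P<evidence>.+)",
    re.DOTALL | re.IGNORECASE,
)


def parse_output(text: str) -> Prognosis:
    """Split generated text into its sections and reject labels outside the label set."""
    match = PATTERN.search(text)
    if match is None:
        raise ValueError("output does not follow the Summary/Prognosis/Evidence layout")
    label = match.group("label").lower()
    if label not in LABELS:
        raise ValueError(f"unknown prognosis label: {label!r}")
    return Prognosis(match.group("summary").strip(), label, match.group("evidence").strip())


# Made-up generated text for illustration.
generated = (
    "Summary: Localized infection, single prior debridement.\n"
    "Prognosis: Good\n"
    "Evidence: Patient record section 2; retrieved passage [1]."
)
parsed = parse_output(generated)
```

Rejecting malformed outputs at this stage means a prediction is only surfaced when both its label and its supporting evidence are present.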

## 4 Experiments and Discussions

![Refer to caption](https://arxiv.org/html/2605.22833v1/figures/Confusion_Matrix.png)

Figure 2: Model prediction analysis on 8 case study patients. (a) Confusion matrix: RAG prediction vs LEFS. (b) Confusion matrix: RAG prediction vs Enneking.

### 4.1 Data Collection

To evaluate the feasibility and interpretability of our proposed RAG4Outcome framework, we conduct preliminary case studies using a real-world dataset collected from a tertiary care hospital in China. The dataset under construction is planned to contain clinical information on 230 chronic osteomyelitis patients, each followed up for 3–6 years. Each patient case includes multimodal clinical records: PET-CT imaging reports, structured EHR data (diagnoses, surgeries, lab values), and longitudinal follow-up notes. For this workshop submission, we focus on a representative subset of 8 patients, denoted as P1–P8, to demonstrate the effectiveness of our retrieval-augmented prognostic reasoning pipeline.

The collected cohort reflects the practical complexity of real\-world postoperative management\. In particular, modality completeness varies substantially across patients, and the temporal spacing between PET\-CT examination, surgery, and follow\-up is not uniform\. Some cases contain rich longitudinal documentation, while others include sparse but clinically critical observations\. This heterogeneity makes the cohort suitable for evaluating whether a prognostic framework can remain clinically informative under realistic documentation constraints rather than under artificially standardized multimodal settings\.

### 4.2 Case Study and Results

We evaluate the proposed RAG4Outcome framework on eight real\-world patient cases drawn from a larger cohort of 230 patients\. The ground\-truth outcome severity for each case was determined according to two widely used clinical scoring systems:

- LEFS Score (Lower Extremity Functional Scale): Outcome levels were categorized based on standard thresholds: severe functional impairment (0–20), moderate (21–40), mild to moderate (41–60), and mild or normal (>60).
- Enneking Score: Outcome levels were defined as Poor (0–9), Fair (10–17), Good (18–25), and Excellent (>26).

To ensure consistent evaluation across cases, all outcomes were standardized into a unified four-level taxonomy: Poor, Fair, Good, and Excellent. This unified scale follows the Enneking system and serves as the label space for model prediction.
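The standardization step can be sketched as two threshold functions. Two points here are assumptions rather than statements from the paper: the ordinal pairing of LEFS categories with the unified labels (severe to Poor, moderate to Fair, and so on), and the treatment of an Enneking score of exactly 26, which the listed ranges leave unassigned and which this sketch places in Excellent.

```python
def lefs_to_level(score: int) -> str:
    """Map a LEFS score to a unified label (ordinal pairing is an assumption)."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score <= 20:
        return "Poor"       # severe functional impairment
    if score <= 40:
        return "Fair"       # moderate
    if score <= 60:
        return "Good"       # mild to moderate
    return "Excellent"      # mild or normal


def enneking_to_level(score: int) -> str:
    """Map an Enneking score to a unified label using the thresholds listed above."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score <= 9:
        return "Poor"
    if score <= 17:
        return "Fair"
    if score <= 25:
        return "Good"
    return "Excellent"      # assumption: a score of exactly 26 is treated as Excellent


# Made-up scores for one illustrative case.
levels = (lefs_to_level(38), enneking_to_level(19))  # ("Fair", "Good")
```

As the example shows, the two reference systems can disagree for the same patient, which is why agreement is reported against each system separately.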

The eight patient cases were selected to represent a diverse range of outcome severities, clinical presentations, and data completeness levels\. Specifically, cases were chosen to cover each of the four unified outcome categories, include both straightforward and clinically challenging instances, and reflect variation in available longitudinal and imaging data\. This selection strategy aims to qualitatively demonstrate the model’s reasoning ability and interpretability across different clinical contexts, rather than to establish statistical significance\. RAG4Outcome predicts one of these four outcome levels for each case, accompanied by an interpretable explanation grounded in both patient\-specific features and retrieved clinical evidence\. Model confidence was quantified using the softmax\-normalized log\-likelihood over the output distribution\.
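A minimal sketch of a softmax-normalized confidence score follows, assuming the model yields one log-likelihood per candidate outcome label. The paper does not detail how these log-likelihoods are obtained, and the numbers below are made up.

```python
import math
from typing import Dict, Tuple


def label_confidence(log_likelihoods: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Softmax-normalize per-label log-likelihoods; return the top label and all probabilities."""
    peak = max(log_likelihoods.values())  # subtract the max for numerical stability
    exps = {label: math.exp(ll - peak) for label, ll in log_likelihoods.items()}
    total = sum(exps.values())
    probs = {label: value / total for label, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Made-up log-likelihoods for the four outcome labels.
scores = {"Excellent": -4.1, "Good": -1.2, "Fair": -2.0, "Poor": -5.3}
predicted, probabilities = label_confidence(scores)
confidence = probabilities[predicted]
```

Under this reading, a borderline case shows up as probability mass split between two adjacent labels, which lowers the reported confidence.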

We visualize and analyze model performance from three perspectives: agreement with clinical reference labels, confidence score distribution, and inter-method prediction alignment. Results are shown in Figures [2](https://arxiv.org/html/2605.22833#S4.F2) and [3](https://arxiv.org/html/2605.22833#S4.F3).

![Refer to caption](https://arxiv.org/html/2605.22833v1/figures/Results_and_Comparison.png)

Figure 3: Model prediction analysis on 8 case study patients. (a) Confidence scores per patient. (b) Comparison of LEFS, Enneking, and RAG predictions for each patient.

Prediction Agreement with Clinical Scores. Figures [2](https://arxiv.org/html/2605.22833#S4.F2)(a) and (b) show confusion matrices comparing RAG4Outcome predictions with LEFS and Enneking ground truth labels, respectively. Most predictions fall on the diagonal, indicating strong agreement. A few off-diagonal entries, such as between “poor” and “fair” in Figure [2](https://arxiv.org/html/2605.22833#S4.F2)(a), suggest minor mismatches likely due to subjective boundaries in clinical scoring. This underscores the robustness of our model across different reference systems.
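For readers who want to reproduce this kind of comparison, a four-level confusion matrix can be tallied as in the sketch below. The label lists are made up for illustration and are not the paper's results.

```python
from typing import List, Sequence

LEVELS = ["Poor", "Fair", "Good", "Excellent"]


def confusion_matrix(reference: Sequence[str], predicted: Sequence[str]) -> List[List[int]]:
    """Rows index the reference label, columns index the predicted label."""
    index = {level: i for i, level in enumerate(LEVELS)}
    matrix = [[0] * len(LEVELS) for _ in LEVELS]
    for ref, pred in zip(reference, predicted):
        matrix[index[ref]][index[pred]] += 1
    return matrix


# Made-up labels for eight cases (NOT the paper's data).
reference = ["Poor", "Poor", "Fair", "Fair", "Good", "Good", "Excellent", "Excellent"]
predicted = ["Poor", "Fair", "Fair", "Fair", "Good", "Good", "Excellent", "Excellent"]

matrix = confusion_matrix(reference, predicted)
# Fraction of cases on the diagonal, i.e. exact agreement with the reference.
agreement = sum(matrix[i][i] for i in range(len(LEVELS))) / len(reference)
```

In this toy example the single off-diagonal entry sits between Poor and Fair, the same kind of adjacent-category mismatch described above.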

Confidence Score Evaluation. The confidence scores range from 0.62 to 0.92, as shown in Figure [3](https://arxiv.org/html/2605.22833#S4.F3)(a), indicating generally high model certainty. Lower confidence in patients P5–P7 corresponds to ambiguous conditions, which also fall near the clinical decision boundaries in LEFS and Enneking scoring.

Three-Way Prediction Comparison. Figure [3](https://arxiv.org/html/2605.22833#S4.F3)(b) presents a side-by-side comparison of LEFS, Enneking, and RAG4Outcome predictions. Most patients show consistent results across all three methods. Discrepancies observed in P5 highlight the potential advantage of retrieval-augmented reasoning in resolving subjective ambiguities in clinical assessment.

Qualitative inspection of the retrieved evidence further supports the validity of the proposed framework\. Across the selected cases, the retrieved passages frequently focused on clinically meaningful concepts such as recurrence risk after repeated debridement, interpretation of postoperative inflammatory activity, and the prognostic relevance of PET\-CT metabolic indicators\. In cases with clearer recovery trajectories, the retrieved evidence was highly consistent with the final prediction and the clinical scores\. In more ambiguous cases, retrieval provided additional contextual support that helped the model articulate why a borderline case should be assigned to a more conservative or more favorable outcome category\. This suggests that the value of RAG4Outcome lies not only in final label prediction, but also in structuring clinically relevant evidence for transparent decision support\.

### 4.3 Ablation Study

While the case study analysis highlights the interpretability of the proposed framework, we further perform a small-scale ablation study on the eight patient cases used in our qualitative evaluation to examine the contribution of retrieval augmentation and PET-CT-derived evidence to the final prognostic prediction. Specifically, we compared three model configurations: (1) the full model, which uses multimodal patient information together with retrieval-augmented external medical evidence; (2) a variant without RAG, which disables retrieval from the external knowledge corpus and relies solely on patient-specific clinical information; and (3) a variant without PET-CT-derived evidence, which uses EHR-based clinical records together with the RAG module but excludes radiology-related postoperative findings.

Table 2: Ablation study of different RAG4Outcome variants on the 8-case evaluation subset.

Given the small cohort size and the four-category prognostic setting, we adopted the Macro-F1 score to evaluate the agreement between model predictions and the unified reference labels (Poor, Fair, Good, and Excellent). As shown in Table [2](https://arxiv.org/html/2605.22833#S4.T2), the full model achieved the best performance, indicating the benefit of combining multimodal clinical evidence with retrieval-augmented reasoning. When the RAG module was removed, performance dropped substantially, suggesting that external medical knowledge plays an important role in supporting robust decision-making, particularly for clinically ambiguous cases. Removing PET-CT-derived evidence also led to a noticeable decline, indicating that radiology-informed postoperative cues remain important for accurate prognosis assessment.
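For reference, Macro-F1 averages the per-class F1 scores with equal weight, so each of the four outcome levels counts the same regardless of how many cases it contains. The sketch below uses made-up labels, not the paper's predictions, and adopts the convention that a class with no reference cases and no predictions contributes an F1 of 0.

```python
from typing import Sequence


def macro_f1(reference: Sequence[str], predicted: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for label in labels:
        pairs = list(zip(reference, predicted))
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


LEVELS = ["Poor", "Fair", "Good", "Excellent"]
# Made-up labels for eight cases (NOT the paper's data).
reference = ["Poor", "Poor", "Fair", "Fair", "Good", "Good", "Excellent", "Excellent"]
predicted = ["Poor", "Fair", "Fair", "Fair", "Good", "Good", "Excellent", "Excellent"]

score = macro_f1(reference, predicted, LEVELS)  # per-class F1: 2/3, 4/5, 1, 1
```

With only eight cases, a single changed prediction shifts the score noticeably (here one error lowers it from 1.0 to 13/15), which is worth keeping in mind when comparing the ablation variants.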

Overall, the ablation results support the design motivation of RAG4Outcome\. Both multimodal evidence integration and retrieval augmentation contribute to model effectiveness, while the combination of the two produces predictions that are more consistent with the unified clinical reference\. These findings further suggest that external retrieval helps compensate for uncertainty and incompleteness in patient records, whereas PET\-CT\-derived indicators provide additional prognostic signals that may not be fully captured from textual EHR evidence alone\. We note that this ablation study is conducted on a small case subset and should therefore be interpreted as preliminary evidence of component\-wise contribution\.

### 4\.4 Qualitative Analysis

To further illustrate the interpretable reasoning process of RAG4Outcome, we present two representative patient cases with different prognostic characteristics\. Rather than focusing only on the final predicted label, this analysis highlights how multimodal patient evidence and retrieved external knowledge are jointly organized into a clinically meaningful reasoning pathway\.

Case P2 \(Excellent Outcome\)\. This patient presented with a relatively localized infection pattern and underwent timely surgical intervention\. PET\-CT\-derived evidence indicated a focal metabolic signal with limited spread, while the clinical record suggested a favorable surgical trajectory\. The model retrieved external evidence related to early intervention and postoperative infection control, and finally predicted an Excellent outcome\. The generated rationale emphasized localized disease burden, timely treatment, and the absence of strong indicators of persistent postoperative activity\. This prediction was consistent with the long\-term follow\-up outcome showing stable recovery without recurrence\.

Case P6 \(Fair/Poor Boundary\)\. In contrast, this patient exhibited a more complex clinical course, including multiple prior debridement procedures and less favorable postoperative characteristics\. PET\-CT\-derived evidence suggested more diffuse metabolic activity, and the EHR record reflected a higher surgical burden\. These signals alone could support a pessimistic interpretation\. However, after retrieval augmentation, the model incorporated additional external evidence related to host condition and treatment strategy, and produced a Fair prediction rather than a strictly Poor one\. The generated rationale noted a guarded prognosis together with the possibility of stabilization under appropriate management\.

Overall, these representative cases show that the value of RAG4Outcome lies not only in categorical prediction, but also in its ability to expose an interpretable reasoning path from multimodal records to retrieved evidence and final clinical judgment\.

### 4\.5 Discussion

The results suggest that RAG4Outcome provides consistent and clinically reliable predictions on the evaluated cases\. Compared to black\-box classifiers, it offers structured justifications aligned with clinical reasoning\. Retrieved evidence often refers to literature on infection recurrence, surgical timing, or imaging biomarkers, providing grounded support for its predictions\. We observed high consistency between the model’s generated rationale and physician assessments\. The agreement with expert scoring systems, combined with the ability to explain its decisions using external references, supports the model’s practical value for prognostic prediction in chronic osteomyelitis\.

We acknowledge that the evaluation on eight cases is preliminary and does not yet allow conclusions about generalizability or robustness\. The primary goal of this case study is to illustrate the model’s interpretability and clinical reasoning process, not to provide statistically significant validation\. A larger\-scale quantitative evaluation involving the full 230\-patient cohort \(and potentially additional cases\) is planned for future work to further assess model performance, generalizability, and data\-driven variability\. Nevertheless, the selected cases provide an informative cross\-section of clinical diversity within the broader cohort, demonstrating how RAG4Outcome handles both typical and complex scenarios\.

## 5 Conclusion

In this work, we present RAG4Outcome, a retrieval\-augmented multimodal framework for postoperative outcome prediction in chronic osteomyelitis\. Our method integrates heterogeneous clinical data into a unified reasoning process supported by expert\-defined prognostic indicators and external clinical knowledge\. Case study results demonstrate promising consistency with gold\-standard clinical scoring systems, alongside high interpretability and confidence transparency\. For future work, we plan to expand the evaluation to the full patient cohort\. Moreover, we aim to refine the retrieval corpus and explore model alignment techniques to enhance factual consistency and domain safety in broader clinical scenarios\.

## Acknowledgments

This work was supported by the Pudong New Area Science and Technology Development Fund\-Public Institution Livelihood Research Special Project \(Healthcare\) \(No\. PKJ2024\-Y05\), and the Clinical Research Program funded by Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine \(No\. ynts202201\)\. This work was also conducted in collaboration with LinkIntelli Technology\.

## References

- \[1\] L\. M\. Amugongo, P\. Mascheroni, S\. Brooks, S\. Doering, and J\. Seidel \(2025\) Retrieval augmented generation for large language models in healthcare: a systematic review\. PLOS Digital Health 4\(6\), pp\. e0000877\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p2.1), [§1](https://arxiv.org/html/2605.22833#S1.p3.1)\.
- \[2\] J\. M\. Binkley, P\. W\. Stratford, S\. A\. Lott, D\. L\. Riddle, and the North American Orthopaedic Rehabilitation Research Network \(1999\) The lower extremity functional scale \(lefs\): scale development, measurement properties, and clinical application\. Physical therapy 79\(4\), pp\. 371–383\. Cited by: [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[3\] J\. Chang, T\. Maltby, A\. Moineddini, D\. Shi, L\. Wu, J\. Chen, J\. Yu, J\. Hung, G\. Viola, A\. Vilches, et al\. \(2025\) Piezoelectric nanofiber–based intelligent hearing system\. Science Advances 11\(19\), pp\. eadl2741\. Cited by: [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[4\] B\. Chen, A\. Solebo, D\. Shi, J\. Wu, and P\. Taylor \(2025\) Minuscule cell detection in as\-oct images with progressive field\-of\-view focusing\. In International Conference on Medical Image Computing and Computer\-Assisted Intervention, pp\. 365–375\. Cited by: [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[5\] G\. I\. Cierny \(1985\) A clinical staging system for adult osteomyelitis\. Contemp Orthop\. 10, pp\. 17–37\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p1.1)\.
- \[6\] J\. D\. Conway, V\. Hambardzumyan, N\. G\. Patel, S\. D\. Giacobbe, and M\. G\. Gesheff \(2021\) Immunological evaluation of patients with orthopedic infections: taking the cierny–mader classification to the next level\. Journal of Bone and Joint Infection 6\(9\), pp\. 433–441\. Cited by: [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[7\] S\. A\. Dingemans, S\. C\. Kleipool, M\. A\. Mulders, J\. Winkelhagen, N\. W\. Schep, J\. C\. Goslings, and T\. Schepers \(2017\) Normative data for the lower extremity functional scale \(lefs\)\. Acta Orthopaedica 88\(4\), pp\. 422–426\. Cited by: [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[8\] M\. Dudareva, A\. Hotchen, M\. A\. McNally, J\. Hartmann\-Boyce, M\. Scarborough, and G\. Collins \(2021\) Systematic review of risk prediction studies in bone and joint infection: are modifiable prognostic factors useful in predicting recurrence?\. Journal of bone and joint infection 6\(7\), pp\. 257–271\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p1.1)\.
- \[9\] O\. Fırat, K\. T\. Okur, Ö\. Koray, Ç\. Mehmet, K\. Hatice, and C\. Ilhami \(2025\) Management and long\-term outcomes of post\-traumatic chronic osteomyelitis in long bones: cierny\-mader types iii and iv\. Cureus 17\(1\)\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p2.1)\.
- \[10\] S\. Hasan and A\. Rezai \(2025\) LLM: retreival vs\. parametric memory tradeoff: a comparison of retrieval\-augmented generation and standalone large language models using ragas answer accuracy\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[11\] V\. Kapoor, B\. M\. McCook, and F\. S\. Torok \(2004\) An introduction to pet\-ct imaging\. Radiographics 24\(2\), pp\. 523–543\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p3.1)\.
- \[12\] Y\. H\. Ke, L\. Jin, K\. Elangovan, H\. R\. Abdullah, N\. Liu, A\. T\. H\. Sia, C\. R\. Soh, J\. Y\. M\. Tung, J\. C\. L\. Ong, C\. Kuo, et al\. \(2025\) Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness\. npj Digital Medicine 8\(1\), pp\. 187\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p2.1), [§1](https://arxiv.org/html/2605.22833#S1.p3.1), [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[13\] D\. P\. Lew and F\. A\. Waldvogel \(2004\) Osteomyelitis\. The Lancet 364\(9431\), pp\. 369–379\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p2.1)\.
- \[14\] R\. Luo, L\. Sun, Y\. Xia, T\. Qin, S\. Zhang, H\. Poon, and T\. Liu \(2022\) BioGPT: generative pre\-trained transformer for biomedical text generation and mining\. Briefings in bioinformatics 23\(6\), pp\. bbac409\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[15\] S\. P\. Mehta, A\. Fulton, C\. Quach, M\. Thistle, C\. Toledo, and N\. A\. Evans \(2016\) Measurement properties of the lower extremity functional scale: a systematic review\. Journal of Orthopaedic & Sports Physical Therapy 46\(3\), pp\. 200–216\. Cited by: [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[16\] K\. K\. Y\. Ng, I\. Matsuba, and P\. C\. Zhang \(2025\) RAG in health care: a novel framework for improving communication and decision\-making by addressing llm limitations\. NEJM AI 2\(1\), pp\. AIra2400380\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p3.1)\.
- \[17\] M\. Panteli and P\. V\. Giannoudis \(2016\) Chronic osteomyelitis: what the surgeon needs to know\. EFORT open Reviews 1\(5\), pp\. 128–135\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p1.1), [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[18\] S\. K\. Schmitt \(2017\) Osteomyelitis\. Infectious Disease Clinics 31\(2\), pp\. 325–338\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p1.1)\.
- \[19\] D\. Shi, X\. Diao, X\. Chen, and C\. M\. John \(2025\) Competitive distillation: a simple learning strategy for improving visual classification\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 2981–2990\. Cited by: [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[20\] D\. Shi, X\. Diao, L\. Shi, H\. Tang, Y\. Chi, C\. Li, and H\. Xu \(2022\) Charformer: a glyph fusion based attentive framework for high\-precision character image denoising\. In Proceedings of the 30th ACM international conference on multimedia, pp\. 1147–1155\. Cited by: [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[21\] D\. Shi, X\. Diao, J\. Wu, H\. Wu, X\. Tang, F\. Naughton, and P\. Bondaronek \(2025\) Graph\-based llm over semi\-structured population data for dynamic policy response\. In International Workshop on Efficient Medical Artificial Intelligence, pp\. 278–288\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[22\] D\. Shi, X\. Li, and F\. Giunchiglia \(2024\) Kae: a property\-based method for knowledge graph alignment and extension\. Journal of Web Semantics 82, pp\. 100832\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[23\] D\. Shi, T\. Wang, H\. Xing, and H\. Xu \(2020\) A learning path recommendation model based on a multidimensional knowledge graph framework for e\-learning\. Knowledge\-Based Systems 195, pp\. 105618\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[24\] K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl, et al\. \(2023\) Large language models encode clinical knowledge\. Nature 620\(7972\), pp\. 172–180\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[25\] K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, M\. Amin, L\. Hou, K\. Clark, S\. R\. Pfohl, H\. Cole\-Lewis, et al\. \(2025\) Toward expert\-level medical question answering with large language models\. Nature Medicine, pp\. 1–8\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[26\] W\. E\. Thompson, D\. M\. Vidmar, J\. K\. De Freitas, J\. M\. Pfeifer, B\. K\. Fornwalt, R\. Chen, G\. Altay, K\. Manghnani, A\. C\. Nelsen, K\. Morland, et al\. \(2023\) Large language models with retrieval\-augmented generation for zero\-shot disease phenotyping\. Deep Generative Models for Health Workshop in NeurIPS 2023\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[27\] I\. Uçkay, K\. Jugun, A\. Gamulin, J\. Wagener, P\. Hoffmeyer, and D\. Lew \(2012\) Chronic osteomyelitis\. Current infectious disease reports 14\(5\), pp\. 566–575\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p1.1)\.
- \[28\] T\. Wada, A\. Kawai, K\. Ihara, M\. Sasaki, T\. Sonoda, T\. Imaeda, and T\. Yamashita \(2007\) Construct validity of the enneking score for measuring function in patients with malignant or aggressive benign tumours of the upper limb\. The Journal of Bone & Joint Surgery British Volume 89\(5\), pp\. 659–663\. Cited by: [§2\.1](https://arxiv.org/html/2605.22833#S2.SS1.p1.1)\.
- \[29\] Z\. Wang, D\. Shi, J\. Zhao, X\. Diao, X\. Tang, and Y\. Qin \(2025\) Automated construction of medical indicator knowledge graphs using retrieval augmented large language models\. arXiv preprint arXiv:2511\.13526\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[30\] J\. Wu, Y\. Kim, D\. Shi, D\. Cliffton, F\. Liu, and H\. Wu \(2024\) Slava\-cxr: small language and vision assistant for chest x\-ray report automation\. arXiv preprint arXiv:2409\.13321\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p2.1)\.
- \[31\] J\. Wu, D\. Shi, A\. Hasan, and H\. Wu \(2023\) KnowLab at radsum23: comparing pre\-trained language models in radiology report summarization\. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pp\. 535–540\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p2.1)\.
- \[32\] P\. Xia, K\. Zhu, H\. Li, T\. Wang, W\. Shi, S\. Wang, L\. Zhang, J\. Zou, and H\. Yao \(2025\) Mmed\-rag: versatile multimodal rag system for medical vision language models\. The Thirteenth International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p2.1)\.
- \[33\] G\. Xiong, Q\. Jin, Z\. Lu, and A\. Zhang \(2024\) Benchmarking retrieval\-augmented generation for medicine\. In Findings of the Association for Computational Linguistics ACL 2024, pp\. 6233–6251\. Cited by: [§1](https://arxiv.org/html/2605.22833#S1.p3.1), [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.
- \[34\] Z\. Zhang, C\. Wang, Y\. Wang, E\. Shi, Y\. Ma, W\. Zhong, J\. Chen, M\. Mao, and Z\. Zheng \(2025\) Llm hallucinations in practical code generation: phenomena, mechanism, and mitigation\. Proceedings of the ACM on Software Engineering 2\(ISSTA\), pp\. 481–503\. Cited by: [§2\.2](https://arxiv.org/html/2605.22833#S2.SS2.p1.1)\.