Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

arXiv cs.CL Papers

Summary

This paper investigates parameter-efficient strategies for adapting large language models to 3D CT report generation, introducing RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that keeps the LLM frozen and requires minimal trainable parameters. It shows that freezing larger LLMs (~1B+) and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency.

arXiv:2606.17213v1 Announce Type: new Abstract: Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.

Cached at: 06/17/26, 05:39 AM

# Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors
Source: [https://arxiv.org/html/2606.17213](https://arxiv.org/html/2606.17213)
Vanshali Sharma¹ (sharmavanshali@gmail.com), Andrea M. Bejar¹∗ (andrea.bejar@northwestern.edu), Halil Ertugrul Aktas¹∗ (halilertugrul.aktas@northwestern.edu), Quoc-Huy Trinh³ (trnhquchuy@yahoo.com.vn), Debesh Jha² (debesh.jha@usd.edu), Gorkem Durak¹ (gorkem.durak@northwestern.edu), Ulas Bagci¹ (ulas.bagci@northwestern.edu)

¹ Department of Radiology, Northwestern University, Chicago, USA
² Department of Computer Science, University of South Dakota, South Dakota, USA
³ Department of Computer Science, Aalto University, Finland

###### Abstract

∗ These authors contributed equally to this work.

Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger ($\approx$1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives. (The source code and models will be made public after the review process.)

## 1 Introduction

Large Language Models (LLMs) are pre-trained on massive amounts of text, which allows them to generalize effectively and perform well on downstream tasks involving zero-shot learning [sanhmultitask, ramesh2021zero] and few-shot in-context learning [brown2020language, wei2022chain]. These properties have inspired their adoption in various vision-based tasks for multimodal applications [guo2023images, li2024dtllm]. Most of these approaches concentrate on end-to-end training or fine-tuning using domain-specific image-text pairs [li2022blip, kim2021vilt]. Specifically, multimodal models that work with medical images and text often rely on fine-tuning because the general data used to pre-train LLMs contains very few medical examples [e3d-gpt]. Specialized biomedical LLMs such as BioGPT [10.1093/bib/bbac409] and BioMedLM [bolton2024biomedlm] have been developed to address the limitations of general LLMs in the medical field. These models are pre-trained on extensive biomedical text corpora like PubMed [white2020pubmed] to improve their ability to recognize medical terminology. However, despite these advances, there is still no systematic approach to integrating LLMs into 3D medical imaging. Although the adoption of LLMs, including their frozen and fine-tuning paradigms, has been extensively investigated in natural image-text settings [llava, li2022blip], their extension to volumetric 3D CT report generation, where diagnostic reasoning is critical, has received comparatively limited attention.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/RAD3D-Preifx_intro_1.png)
Figure 1: Three critical challenges in report generation: (a) Semantic Clinical Gap. (b) Clinical Hallucination. (c) Computational Inefficiency.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/prefix_basic_4.png)
Figure 2: Three variations of the proposed projection module: (a) V-1: the prefix includes image embeddings and the LLM is fine-tuned (suitable for smaller LLMs), (b) V-2: the prefix includes image embeddings with a frozen LLM (suitable for larger LLMs), and (c) V-3: the prefix includes image embeddings and classification logits with a frozen LLM.

Generating clinical reports from 3D CT scans involves three critical issues (as illustrated in Fig. [1](https://arxiv.org/html/2606.17213#S1.F1)): (i) Clinical Hallucination: Fine-tuning large models on limited medical image-text pairs often leads to overfitting and "clinical hallucination", where the model prioritizes linguistic fluency over clinical factuality. Moreover, despite the growing adoption of LLMs in medical imaging, the trade-offs between freezing and fine-tuning models of different scales for 3D CT report generation remain poorly understood. (ii) Computational Inefficiency: The fine-tuning process requires optimizing millions or billions of parameters, which creates significant computational overhead in a resource-constrained clinical setting. (iii) Semantic Clinical Gap: Unlike regular captions and X-ray reports, 3D CT scan reports have long sequences and complex domain-specific language. Although 3D CT scans contain more comprehensive diagnostic information than 2D images, the absence of explicit clinical priors creates a semantic clinical gap between high-dimensional 3D visual features and medical terminology. This issue persists even when using domain-specialized LLMs.

To address the above challenges, we conduct a systematic study of LLM adaptation strategies for 3D CT report generation across five LLMs of varying parameter scales. Our study establishes a comprehensive protocol for LLM training strategies and derives practical recommendations for choosing between freezing and fine-tuning paradigms for 3D CT report generation. Building on these findings, we propose RAD3D-Prefix, a lightweight anomaly-aware prefix conditioning framework that injects critical clinical priors into a frozen LLM through a prefix learning mechanism while allowing faster training with minimal trainable parameters. To this end, our method introduces a projection module that generates an anomaly-aware prefix, which is a fixed-length embedding sequence that represents both image features and multi-label diagnostic classes. This lightweight approach allows LLMs to be used in a zero-shot learning setting.

To systematically demonstrate the impact of our proposed approach, inspired by [mokady2021clipcap, wang2023r2gengpt], we investigated three experimental setups (variants), as illustrated in Fig. [2](https://arxiv.org/html/2606.17213#S1.F2). In variant V-1 (Fig. [2](https://arxiv.org/html/2606.17213#S1.F2)(a)), our projection network and the LLM are trained simultaneously with prefixes comprising image embeddings alone, which are then concatenated with text embeddings. In variant V-2 (Fig. [2](https://arxiv.org/html/2606.17213#S1.F2)(b)), the same prefix settings are used with a frozen LLM. In variant V-3 (Fig. [2](https://arxiv.org/html/2606.17213#S1.F2)(c)), we incorporate the diagnostic details by concatenating multi-abnormality classification logits with the image prefix representation as projection input for a frozen LLM. We examined these approaches using smaller models (with a few million parameters) and larger models ($\approx$1B+ parameters). Unlike [wang2023r2gengpt], our work focuses on improving the core processing of 3D image embeddings and their projection as input to LLMs while preserving clinically significant multi-abnormality entity markers. Our main contributions are summarized below:

- First systematic study of frozen-vs-fine-tuned LLM scaling laws for 3D CT report generation: We conducted extensive experiments on three different model variations to determine the most effective approach for different prefix designs. Further, we compared five LLMs from 96.1M to 1.6B parameters in frozen and fine-tuned setups. This provides actionable guidance (fine-tune $<$1B, freeze $\approx$1B+) that has not been studied in 3D medical imaging and contradicts prevailing natural image findings (LLaVA [llava] and BLIP-2 [li2023blip]).
- A lightweight anomaly-aware prefix projection module: We propose a lightweight anomaly-aware prefix projection module to generate clinical reports for 3D radiology images with minimal parameter training. In contrast to existing 2D image-based vision-language models (VLMs) [medflamingo], RAD3D-Prefix aligns 3D image embeddings and anomaly logits with a frozen LLM. This narrows the semantic clinical gap, unlike natural image models [10.5555/3666122.3668264, jin2024unified], which face semantic gaps, especially when deployed in the medical domain. While the basic prefix learning concept exists [mokady2021clipcap], we extend it to volumetric CT report generation by integrating visual features with diagnostic priors, yielding improved clinical relevance and parameter efficiency.
- Clinically relevant outputs, verified using medical-specific metrics and a reader study by clinical experts: We incorporated multi-anomaly classification logits to retain important clinical details in the generated reports. This explicitly exposes clinical concepts (e.g., effusion, consolidation) to the LLM. We further used medical-specific evaluation metrics to ensure diagnostic precision. Additionally, a reader study with two clinical experts shows that our model generates reports with higher clinical relevance than both the baseline and the variant without clinical priors.
- Outperforms similar-sized and domain-specialized models, while performing comparably to larger models: Our proposed method, despite minimal training, empirically outperforms existing techniques when using frozen LLMs with the same parameter count and specialized domain pre-training. The model also performed comparably to methods employing frozen LLMs with a higher parameter count, as supported by bootstrap analysis. Moreover, using the same vision encoder across all methods shows that the gains stem from the anomaly-aware prefix rather than a heavier backbone.
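
The three variants and the fine-tune/freeze guidance stated above can be summarized as a small configuration table. The following is a minimal sketch, not the authors' code: the `Variant` class, the `suggested_variant` helper, and the hard 1B cut-off are hypothetical names and simplifications introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical description of one experimental setup from Fig. 2."""
    name: str
    prefix_inputs: tuple   # what the projection module receives
    llm_trainable: bool    # True = LLM fine-tuned, False = LLM frozen


VARIANTS = {
    "V-1": Variant("V-1", ("image_embeddings",), llm_trainable=True),
    "V-2": Variant("V-2", ("image_embeddings",), llm_trainable=False),
    "V-3": Variant("V-3", ("image_embeddings", "classification_logits"),
                   llm_trainable=False),
}


def suggested_variant(llm_params: float, use_diagnostic_prior: bool = True) -> str:
    """Encode the reported guidance: fine-tune below ~1B, freeze at ~1B and above.

    The exact 1e9 threshold is an assumption; the paper states the boundary
    only approximately.
    """
    if llm_params < 1e9:
        return "V-1"
    return "V-3" if use_diagnostic_prior else "V-2"


print(suggested_variant(96.1e6))   # V-1
print(suggested_variant(1.6e9))    # V-3
```

RAD3D-Prefix itself corresponds to the V-3 entry; V-1 and V-2 serve as comparison points.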

## 2 Related Work

### 2.1 Medical Report Generation

LLaVA-Med [llavamed], Med-Flamingo [medflamingo], and Med-PaLM [medpalm] are the major models designed for medical report generation that use Vision-Language Models (VLMs) trained on extensive image-text datasets. However, these models have a key limitation: they cannot process 3D medical images like CT and MRI scans because of the high complexity and computational costs involved. To address this, other models such as CT2Rep [ct2rep], CT-AGRG [ct-agrg], E3D-GPT [e3d-gpt], and Med-2E3 [med2e3] have been developed. These solutions capture global features from 3D images and use them as input for text decoders to generate reports.

The CT2Rep model uses a 3D medical vision encoder to extract global features from CT images and integrate them into a language model for report generation, showing initial effectiveness. Building on this, the CT-AGRG model incorporates abnormality-guided recognition, which allows the framework to detect anomalies and generate corresponding medical report descriptions. In addition, E3D-GPT introduced a large-scale 3D medical image dataset and a 3D medical image foundation model based on MAE [he2022masked], which enhances the representation of visual information in the overall vision-language model.

While these methods have promising initial results, they still have significant limitations. They either fine-tune larger LLMs (E3D-GPT), use a simple linear projection of 3D radiology images (CT-AGRG), or fine-tune both the image encoder and text decoder (CT2Rep). This can lead to overfitting and a lack of proper alignment between visual and textual semantics. While recent work [chen2025large] has explored region-guided mechanisms, our work investigates parameter-efficient and diagnostic-prior conditioning for 3D CT report generation.

### 2.2 Vision Projector in Large Vision Language Models (VLMs)

Vision projectors are modules designed to project the context of visual data in the same way as text, which helps align the image and text spaces. Early methods like LLaVA [llava] proposed a simple feed-forward layer for this purpose, with promising initial results. Later, LLaMA 3.2 [llama32] introduced a cross-attention mechanism to tackle this alignment challenge.

In the context of medical imaging, the LLaVA-Med model [llavamed] uses a simple MLP projection layer, similar to the original LLaVA model [llava]. This approach works well for 2D images but is not effective for 3D radiology images. For report generation, models like Med-2E3 [med2e3] and Reg2RG [chen2024large] use adapters to help projectors encode volumetric data, but they still have high computational costs. This is because they use two encoders to generate visual information, which significantly increases overhead. Like our proposed method, the R2GenGPT [wang2023r2gengpt] model uses a frozen LLM with a visual alignment layer. However, its alignment layer is a simple linear projection, which can lead to feature alignment issues.

To address these challenges, we propose RAD3D-Prefix, which uses a transformer-based, anomaly-aware prefix module. This module leverages prefix projection to effectively manage the differences between the image and text embedding spaces. This approach improves computational efficiency and provides better alignment, all while keeping the LLM frozen.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/Prefix_intro_notation.png)
Figure 3: Overview of the proposed RAD3D-Prefix model. The model aligns the image encoder's output and the classification logits to the text embedding space via a lightweight projection network.

## 3 RAD3D-Prefix

In this section, we describe our proposed RAD3D-Prefix model for CT report generation. Given a 3D CT scan image $x_i \in \mathbb{R}^{C \times D \times H \times W}$, where $C$, $D$, $H$, and $W$ denote the channel, depth, height, and width, respectively, our framework aims to generate a patient-specific multi-sentence report impression/finding, $R$, in a clear and coherent manner. We leverage the strengths of the vision foundation model and the LLM. This approach allows for efficient report generation with only a few trainable parameters while still achieving enhanced outcomes and mitigating the previously mentioned limitations.

### 3.1 Overall Framework

RAD3D-Prefix is presented in Fig. [3](https://arxiv.org/html/2606.17213#S2.F3). It comprises a pretrained and frozen 3D image encoder $f_e$, a trainable transformer-based projection network $f_m$, and a frozen text decoder $f_d$. The encoder $f_e$ extracts the visual embedding that can be utilized by the decoder $f_d$ to generate a report $R_i = \{r_1, r_2, \ldots, r_N\} \in \mathbb{V}$, where $\mathbb{V}$ is the vocabulary and $N$ is the length of the report. During this process, the lightweight network $f_m$ projects the visual embeddings obtained from $f_e$ into the token space of $f_d$ using a prefix learning mechanism, thus ensuring alignment between the two modalities. For instance, given a dataset $D = \{(X, \mathcal{R})\}$, we extract the visual embeddings $\mathbf{z}_i$ for image $x_i \in X$ using $f_e$:

$$\mathbf{z}_i = f_e(x_i) \in \mathbb{R}^{v}, \tag{1}$$

To provide the model with crucial clinical context, our approach incorporates multi-anomaly classification logits, $\mathbf{l}_i \in \mathbb{R}^{c}$ (for $c$ anomaly labels). These logits are obtained from a separate, pre-trained classification head on the image encoder $f_e$. We directly fuse this high-level diagnostic information with the raw visual features by concatenating the image embeddings, $\mathbf{z}_i$, with the classification logits, $\mathbf{l}_i$. This concatenated vector, which combines both visual and clinical cues, is then used as input to our transformer-based projection network, $f_m$. The network transforms this combined input into a structured sequence of $L$-dimensional embedding vectors, which act as a prefix for the LLM. Note that RAD3D-Prefix is built on the V-3 variant configuration, where the prefix integrates both image embeddings and classification logits, and the LLM remains frozen. The other two proposed variants, V-1 and V-2, are presented in the subsequent sections for comparative analysis.

### 3.2 Visual Feature Extraction

Considering the complexity of 3D CT scan images, we adopted the recently introduced CT-CLIP [hamamci2024foundation] image encoder, which is pre-trained in a contrastive setting using a large number of 3D CT-focused image-text pairs. This encoder is based on the CT-ViT [hamamci2024generatect] architecture, which processes the input $\mathbf{x} \in \mathbb{R}^{(240) \times 480 \times 480}$ as non-overlapping patches of shape $(10) \times 20 \times 20$, where $10$ is the temporal patch size $t_p$ and $20$ denotes the spatial patch sizes $p_1$ and $p_2$. These patches are reshaped to $B \times T \times H \times W \times (C \cdot t_p \cdot p_1 \cdot p_2)$, where $B$ is the batch size and $T$ is the temporal patch count. A linear transformation is then applied to obtain frame embeddings of shape $B \times T \times \frac{H}{p_1} \times \frac{W}{p_2} \times v$, where $v$ is the final required latent representation dimension. The spatial transformer processes this reshaped tensor, maintaining the same size. Subsequently, it passes through a causal transformer, yielding a tensor of shape $(\frac{H}{p_1} \cdot \frac{W}{p_2}) \times (B \cdot T) \times v$. Such a combination of spatial and causal transformations ensures the retention of 3D information throughout the network. Finally, to obtain the latent representation, the embeddings are processed through a linear layer to convert them into a vector of dimension $v$, where $v$ is set to 512. This latent vector is used as input for the projection network. For in-domain data samples, we freeze all parameters of $f_e$, whereas for out-of-domain data samples, we fine-tune the last layer.
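
To make the patch arithmetic above concrete, the sketch below computes the patch grid and per-patch dimension for the stated input and patch sizes. The helper name `patch_grid` is hypothetical, and a single input channel (`channels=1`) is an assumption, since the text gives the input shape without a channel axis.

```python
def patch_grid(depth, height, width, t_p, p1, p2, channels=1):
    """Count non-overlapping 3D patches and the flattened size of each patch.

    channels=1 is an assumption (single-channel CT); it is not stated in the text.
    """
    if depth % t_p or height % p1 or width % p2:
        raise ValueError("volume dimensions must be divisible by the patch sizes")
    t, hp, wp = depth // t_p, height // p1, width // p2
    return {
        "grid": (t, hp, wp),                     # (T, H/p1, W/p2)
        "num_patches": t * hp * wp,
        "patch_dim": channels * t_p * p1 * p2,   # C * t_p * p1 * p2
    }


info = patch_grid(240, 480, 480, t_p=10, p1=20, p2=20)
print(info["grid"])         # (24, 24, 24)
print(info["num_patches"])  # 13824
print(info["patch_dim"])    # 4000
```

Under these assumptions, each volume yields a 24 x 24 x 24 grid of patches before the linear embedding to dimension $v$.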

### 3.3 Anomaly-Aware Prefix Projection Network

Our proposed model's main trainable component is a lightweight, transformer-based projection network, $f_m$. Its purpose is to align the image-derived embeddings with the LLM's token space. Unlike traditional projectors that only use visual features as a single token or sequence of tokens, our network is anomaly-aware: it generates a semantically disentangled sequence of visual and clinical context and maps it into a sequence of LLM-style tokens. It leverages both a latent vector from the image encoder and a set of multi-label diagnostic classification logits to create a more comprehensive input. This design is motivated by the need to explicitly provide the model with high-level clinical priors, thus retaining critical diagnostic cues that might otherwise be lost.

To integrate the diagnostic information, we concatenate the multi-anomaly classification logits, $\mathbf{l}_i$, with the image features $\mathbf{z}_i$ obtained from $f_e$. This fusion creates a rich, combined representation, defined as $\hat{\mathbf{z}}_i = [\mathbf{z}_i \oplus \mathbf{l}_i] \in \mathbb{R}^{v+c}$, where $\oplus$ denotes concatenation. Initially, these fused embeddings $\hat{\mathbf{z}}_i$ are projected into a structured sequence using a linear transformation. The resulting structured embeddings and a learnable constant serve as input for the projection network. The learnable constant extracts relevant information from the embeddings and adjusts the network to new data samples. The transformer layers within our projection network then operate in a self-attention setting to capture complex dependencies and optimize the representation. The output of this process is the final prefix, a fixed-length embedding sequence called the anomaly-aware prefix, which is then concatenated with the textual report embeddings $\hat{\mathbf{R}}_i = \{\hat{\mathbf{r}}_1, \hat{\mathbf{r}}_2, \ldots, \hat{\mathbf{r}}_N\}$, $\hat{\mathbf{r}}_j \in \mathbb{R}^{h}$, before being fed to the frozen LLM $f_d$. Here $h$ represents the LLM's hidden size. This design ensures that the model is conditioned not only on the raw visual content but also on clinically significant multi-abnormality markers, helping to generate more factual and diagnostically precise reports. The classification logits are kept soft and frozen during report generation to provide valuable clinical priors without introducing backpropagation errors into the classifier. This strategy conditions the LLM to generate clinically accurate reports.
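
The fusion and initial linear projection described above can be illustrated with a small standard-library sketch. It is not the authors' implementation: the function names, the toy prefix length and hidden size, the random weights, and the choice to append the learnable constant along the sequence axis (as in ClipCap-style mapping networks) are all assumptions, and the transformer layers are omitted.

```python
import random


def fuse(z, logits):
    """Return z_hat = [z (+) l]: image features followed by classification logits."""
    return list(z) + list(logits)


def linear_to_sequence(z_hat, weight, bias, prefix_len, hidden):
    """Apply one linear layer, then split the output into prefix_len tokens of size hidden."""
    flat = [sum(w * x for w, x in zip(row, z_hat)) + b for row, b in zip(weight, bias)]
    return [flat[t * hidden:(t + 1) * hidden] for t in range(prefix_len)]


rng = random.Random(0)
v, c = 512, 18              # v = 512 (Sec. 3.2); c = 18 assumes the CT-RATE label set
prefix_len, hidden = 4, 8   # toy sizes (hypothetical); the real h is the LLM hidden size

z = [rng.gauss(0.0, 1.0) for _ in range(v)]        # stand-in for f_e(x_i)
logits = [rng.gauss(0.0, 1.0) for _ in range(c)]   # stand-in for classifier logits
z_hat = fuse(z, logits)

# Randomly initialised linear layer mapping R^{v+c} to prefix_len * hidden values.
weight = [[rng.gauss(0.0, 0.02) for _ in range(v + c)] for _ in range(prefix_len * hidden)]
bias = [0.0] * (prefix_len * hidden)
tokens = linear_to_sequence(z_hat, weight, bias, prefix_len, hidden)

# Learnable constant tokens (zeros here); appending along the sequence axis is an assumption.
learnable_constant = [[0.0] * hidden for _ in range(prefix_len)]
projector_input = tokens + learnable_constant

print(len(z_hat), len(tokens), len(projector_input))  # 530 4 8
```

In a full model, `projector_input` would pass through the self-attention layers of $f_m$, and the resulting fixed-length sequence would serve as the anomaly-aware prefix.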

Training Objective: The training objective is to optimize the lightweight, trainable projection network, $f_m$, while the image encoder $f_e$ and text decoder $f_d$ remain frozen. This objective is achieved by minimizing the negative log-likelihood of the target report sequence. The loss function is defined as:

$$\mathcal{L} = -\sum_{i=1}^{M} \sum_{j=1}^{N} \log p_{\theta}(\hat{\mathbf{r}}_{i,j} \mid \hat{\mathbf{z}}_i, \hat{\mathbf{r}}_{i,1}, \ldots, \hat{\mathbf{r}}_{i,j-1}). \tag{2}$$

Here, $M$ is the number of reports in the dataset, and $N$ is the number of tokens in the $i$-th report. The term $p_{\theta}(\hat{\mathbf{r}}_{i,j} \mid \cdot)$ represents the probability of predicting the $j$-th token of the $i$-th report, $r_{i,j}$, conditioned on the entire prefix, $\hat{\mathbf{z}}_i$, and all previously generated tokens. The parameters being optimized, denoted by $\theta$, belong exclusively to the projection network, $f_m$. The goal is to train this network to generate a prefix representation, $\hat{\mathbf{z}}_i$, that effectively conditions the frozen LLM ($f_d$) to produce the correct report tokens, $r_{i,j}$, in an autoregressive manner. This approach ensures that the LLM's vast, pre-trained knowledge is leveraged while the model learns to bridge the modality gap using only a minimal number of trainable parameters.
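
As a worked instance of the loss in Eq. 2, the sketch below sums the negative log-probabilities that a model assigns to the ground-truth tokens. The function names and the toy probabilities are illustrative assumptions; in practice these probabilities would come from the frozen LLM conditioned on the prefix and the preceding report tokens.

```python
import math


def report_nll(token_probs):
    """Negative log-likelihood of one report, given p(r_ij | prefix, r_i,<j) per token."""
    return -sum(math.log(p) for p in token_probs)


def dataset_loss(batch):
    """Sum of per-report NLLs over M reports (outer sum in Eq. 2)."""
    return sum(report_nll(probs) for probs in batch)


# Toy example: two reports with 2 and 1 tokens (made-up probabilities).
batch = [[0.5, 0.25], [0.125]]
loss = dataset_loss(batch)
print(round(loss, 4))  # 4.1589, i.e. 6 * ln 2
```

Only the parameters of $f_m$ would receive gradients from this loss, since the encoder and decoder stay frozen.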

During the training phase, the projection network is optimized using the training loss $\mathcal{L}$ (Eq. [2](https://arxiv.org/html/2606.17213#S3.E2)), conditioned on the concatenated prefix and report embeddings ($\hat{\mathbf{z}}_i, \hat{\mathbf{r}}_{i,1}, \ldots, \hat{\mathbf{r}}_{i,j-1}$). In contrast, only the prefix projections ($\hat{\mathbf{z}}_i$) are used during evaluation, and the model generates the report tokens iteratively in an autoregressive manner, selecting at each step the token with the highest computed probability. The detailed algorithm for projection network training and the classification module are described in the supplementary material.

### 3.4 Report Decoder

The report decoder $f_d$ is a pre-trained LLM, extensively trained on a large corpus of generic text data. We adopted the LLaMA-3.2-1B model (https://huggingface.co/meta-llama/Llama-3.2-1B), originally designed to generate relevant text responses when given an input text prompt. This limits its ability to handle visual content, as such content is encoded in a different format. Moreover, fine-tuning such large models could interfere with the model's generalizability and lead to suboptimal performance. Therefore, we used the decoder's pre-trained weights without fine-tuning to leverage its already captured rich and hierarchical text representations and avoid overfitting on small medical datasets. However, to align the medical semantic features and the textual data in the report, we used the projection network, which reformulates visual cues and the semantics of the reports in a manner that LLaMA-3.2-1B can effectively interpret and process. The decoder $f_d$ iteratively generates the report tokens when given cues in the form of prefix projections. This process can be defined as:

$$P(w_t \mid w_{<t}, \hat{\mathbf{z}}_{i,1:L_p}) = \text{softmax}\big(\mathbf{w}_{out} \cdot f_d(w_{<t}, \hat{\mathbf{z}}_{i,1:L_p})\big), \tag{3}$$

where $L_p$ is the prefix length, $w_{<t}$ signifies the previously generated tokens, and $\mathbf{w}_{out}$ is the output token projection matrix that maps LLaMA's final hidden state to vocabulary space.
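
The softmax in Eq. 3, combined with the highest-probability token selection described for evaluation, amounts to greedy autoregressive decoding. The sketch below illustrates this with a toy stand-in for the frozen decoder; `toy_logits`, the vocabulary size, and the end-of-sequence id are hypothetical and exist only to make the example runnable.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(next_token_logits, prefix, eos_id, max_len=20):
    """Repeatedly pick the most probable next token until EOS or max_len."""
    tokens = []
    for _ in range(max_len):
        probs = softmax(next_token_logits(prefix, tokens))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens


def toy_logits(prefix, tokens):
    """Hypothetical stand-in for w_out * f_d(w_<t, prefix): emits 1, 2, 3, then EOS (id 0)."""
    vocab_size = 5
    target = len(tokens) + 1 if len(tokens) < 3 else 0
    return [2.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_decode(toy_logits, prefix=None, eos_id=0))  # [1, 2, 3]
```

With a real decoder, `next_token_logits` would run $f_d$ on the prefix embeddings followed by the embeddings of the tokens generated so far.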

## 4 Experiments

### 4.1 Configurations

Datasets: We conducted experiments using two large, publicly available datasets: CT-RATE [hamamci2024foundation] and INSPECT [huang2023inspect]. CT-RATE is used for in-domain evaluation since the CT-CLIP [hamamci2024foundation] encoder is pre-trained on it, while INSPECT serves as an out-of-domain dataset. CT-RATE consists of 50,188 non-contrast chest CT volumes with 18 multi-abnormality labels. We used its official split, i.e., 47,149 and 3,039 scans for training and testing, respectively. The findings section is used for report generation.

Table 1: Comparative analysis of baseline and variant V-1. Best values are in bold.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/model_performance_radar_chart_1.png)
Figure 4: Radar plots showing the impact of fine-tuning (solid) and freezing (dashed) parameters of variable-sized LLMs (96.1M-1B). Sequence: DistilGPT2, GPT2, GPT2-medium, LLaMA-3.2-1B, and BioGPT-Large.

INSPECT comprises CT scans acquired from 19,402 patients, focused on pulmonary embolism. After removing redundant data, we used 17,730 and 3,506 scans for training and testing, respectively. Because the official split details are unavailable, stratified sampling was performed based on 21 distinct anomalies. More details on dataset split preparation are in the supplementary material.

Pre-processing and Training Details: Following CT-CLIP [hamamci2024foundation], each CT volume is resampled to a uniform spacing, i.e., 0.75 mm along the x-axis and the y-axis, and 1.5 mm along the z-axis. All experiments are conducted at $480 \times 480 \times 240$ resolution with Hounsfield Units (HU) clipped to the range -1000 to 1000, followed by normalization. The experiments are performed on an NVIDIA A100 GPU using the PyTorch framework, training all models for 10 epochs to ensure a fair comparison. The projection module is trained using an Adam optimizer with a learning rate of 2e-5.
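
The intensity pre-processing described above can be sketched as follows. The clipping range comes from the text; the normalization step is not specified there, so the linear rescaling to [-1, 1] used here is an assumption, and `preprocess_hu` is a hypothetical helper name.

```python
def preprocess_hu(voxels, lo=-1000.0, hi=1000.0):
    """Clip Hounsfield Units to [lo, hi], then rescale linearly to [-1, 1].

    The rescaling is an assumed normalization; the text only says that
    clipping is followed by normalization.
    """
    clipped = [min(max(v, lo), hi) for v in voxels]
    centre = (hi + lo) / 2.0
    half_range = (hi - lo) / 2.0
    return [(v - centre) / half_range for v in clipped]


print(preprocess_hu([-2000, -1000, 0, 500, 3000]))
# [-1.0, -1.0, 0.0, 0.5, 1.0]
```

Values outside the clipping window (for example, metal artifacts above 1000 HU) saturate at the range limits before scaling.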

Baseline: We compared our method with various state-of-the-art techniques in terms of (a) the vision-text alignment approach, (b) the 3D medical image to report generation task, and (c) the adoption of different-sized LLMs. We primarily compare our work with R2GenGPT [wang2023r2gengpt] due to its overall architectural similarities and emphasis on medical report generation. Although it was originally proposed for 2D images, we trained it by feeding it our extracted visual latent embeddings. R2GenGPT uses a frozen LLM; hence, to additionally compare with a trainable LLM, we created our baseline using a basic CLIP-to-text decoder architecture (https://github.com/fkodom/clip-text-decoder), which uses the same visual embeddings but a conventional projection approach and fine-tunes the LLM during training.

We compared our proposed model with R2GenGPT using several LLMs: a domain-specialized LLM (BioGPT-Large [10.1093/bib/bbac409]), an LLM with the same number of parameters (LLaMA-3.2-1B-Instruct), and an LLM with more parameters originally adopted in their study (LLaMA-2-7b-chat-hf). The other baseline is additionally used to validate our three proposed alignment module variants (V-1, V-2, and V-3, shown in Fig. [2](https://arxiv.org/html/2606.17213#S1.F2)). Further, we performed a comparative analysis with recent work in the same domain. 3D CT report generation is understudied, with only a few methods available, and some of them (E3D-GPT, CT-AGRG) lack publicly available code for reproducibility. Hence, we relied on reported values only. These works are closest in scope, along with CT2Rep.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/eg3.png)
Figure 5: Qualitative example of the baseline and RAD3D-Prefix. Matching sentence pairs are highlighted in the same color.

### 4.2 Results

This section compares the baseline approach and three proposed variants\.

Baseline vs. Variant V-1: To assess the significance of including the visual embedding as a prefix (Variant V-1), we compared its report generation outcomes with those obtained using the baseline (without prefix). As shown in Table [1](https://arxiv.org/html/2606.17213#S4.T1), the majority of metrics favour variant V-1 across four different LLMs with varying parameter counts. The overall results indicate an increase of 9.96% to 29.95% in Avg. BLEU, 0.69% to 8.80% in METEOR, and a maximum improvement of 6.94% and 7.50% in Avg. ROUGE (1-2) and ROUGE-L, respectively. Similarly, an increase of about 1.39% is observed in BERTScore-F1. These improvements highlight the superior performance of variant V-1 over the baseline, underscoring the importance of the visual prefix in enhancing report generation.

Variant V-1 vs. Variant V-2: This comparison addresses the key question of this work, i.e., the impact of fine-tuning versus freezing the parameters of smaller (<1B) and larger (≈1B+) LLMs. Fig. [4](https://arxiv.org/html/2606.17213#S4.F4) illustrates that models with 96.1M to 455M parameters exhibit a performance decline across various metrics, including the GREEN score, when kept frozen. Note that both the fine-tuned (variant V-1) and frozen (variant V-2) models with 96.1M to 1B parameters use the same projection module, which takes visual embeddings as input. In contrast, the last model, BioGPT-Large, is fine-tuned using the standard baseline approach, whereas the R2GenGPT model deploys it as a frozen module. BioGPT-Large exhibited severe overfitting when fine-tuned.

We draw two conclusions from this comparison: (a) As the number of parameters increases, the performance gap between the frozen and fine-tuned models starts to decrease. Increasing the parameters further (beyond 1B) makes the frozen model outperform its fine-tuned counterpart. (b) Our proposed projection network outperforms the conventional mapping approach used in existing work, as demonstrated by the degraded performance of the baseline models in the last plot. Although the overall trend between the frozen and fine-tuned models remains consistent, a clear distinction in performance is evident. We further compare the V-3 variant in the ablation study section below. Qualitative results are shown in Fig. [5](https://arxiv.org/html/2606.17213#S4.F5), and more qualitative samples are given in the supplementary material.

Table 2: Comparison with the state-of-the-art approach of vision and text embedding alignment and different-sized LLM training. The analysis is divided into three categories: (a) LLM (both fine-tuned: baseline, row 1, and frozen: row 2) with specialized domain pre-training, (b) frozen LLM with the same parameter count (row 3), and (c) frozen LLM with a higher parameter count using a conventional alignment approach (row 4). Values in bold denote the best outcome.

Metrics: We report results on four widely used NLG metrics, namely, BLEU (1-4) \[papineni-etal-2002-bleu\], METEOR \[banerjee-lavie-2005-meteor\], ROUGE (1, 2, L) \[lin-2004-rouge\], and BERTScore-F1 \[zhangbertscore\]. In addition, we adopt the GREEN score \[ostmeier-etal-2024-green\], which is specifically relevant to the medical domain. BLEU measures n-gram precision, with a brevity penalty for short outputs, whereas METEOR incorporates synonym matching and word order penalties. ROUGE focuses on n-gram recall, and BERTScore-F1 focuses on semantic similarity using a language model. The GREEN score focuses on the factual correctness of clinical information.
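To make the precision/recall distinction between BLEU and ROUGE concrete, the sketch below computes unigram-only versions of both for one sentence pair. It is a simplified illustration under stated assumptions (whitespace tokenization, a single reference, unigrams only), not the evaluation code used in the paper; the function names and the example sentences are hypothetical.

```python
import math
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> int:
    """Number of candidate tokens matched in the reference (clipped counts)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    return sum(min(count, ref[token]) for token, count in cand.items())


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision multiplied by the brevity penalty."""
    c_len, r_len = len(candidate.split()), len(reference.split())
    if c_len == 0:
        return 0.0
    precision = unigram_overlap(candidate, reference) / c_len
    brevity_penalty = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return brevity_penalty * precision


def rouge1_recall(candidate: str, reference: str) -> float:
    """Share of reference tokens that the candidate recovers."""
    r_len = len(reference.split())
    return unigram_overlap(candidate, reference) / r_len if r_len else 0.0


reference = "no pleural effusion was observed"
candidate = "no pleural effusion"
bleu_value = bleu1(candidate, reference)
rouge_value = rouge1_recall(candidate, reference)
```

Here every candidate token is correct, so precision is 1.0, but the candidate is shorter than the reference: BLEU is reduced by the brevity penalty, while ROUGE-1 recall reports that only three of the five reference tokens were covered.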

Table 3: Comparison of state-of-the-art approaches for 3D image-to-clinical-report generation. † The results are from the original paper on the same test split. - denotes unreported metrics.

### 4.3 Comparison with State-of-the-art Methods

In this section, we compare our approach with state-of-the-art methods for medical report generation. We focus on models with similar (i) architectural frameworks (R2GenGPT) and (ii) methodological objectives (3D image-to-report generation) to ensure a meaningful comparison. Table [2](https://arxiv.org/html/2606.17213#S4.T2) presents a comparative analysis of our approach against the first category of models with (a) a specialized domain LLM, (b) an LLM with the same parameter count, and (c) an LLM with a higher parameter count. We used BioGPT-Large \[10.1093/bib/bbac409\] as a specialized domain LLM in our baseline setting, i.e., a conventional mapping network with fine-tuning. The model experiences significant overfitting, and hence, the results are not reported in the table. We further used the same model with R2GenGPT, i.e., as a frozen module, which performed better than its baseline version.

We further replaced LLaMA\-3\.2\-1B with a larger LLM, DeepSeek\-R1\-Distill\-LLaMA\-8B\. It can be observed that across different LLM configurations, RAD3D\-Prefix consistently outperforms the R2GenGPT baselines\. Using LLaMA\-3\.2\-1B, RAD3D\-Prefix achieves the highest Avg\. BLEU, METEOR, and the best GREEN score \(0\.5488\), indicating improved clinical relevance\. When paired with the larger DeepSeek\-R1\-Distill\-LLaMA\-8B, RAD3D\-Prefix further improves ROUGE metrics and BERTScore\-F1, showing that the framework effectively scales with stronger LLMs\. Although R2GenGPT performs better with a 7B LLM compared to its 1B version, RAD3D\-Prefix with only a 1B LLM already surpasses or matches the 7B baseline across most metrics, highlighting the efficiency and scalability of the proposed approach\.

In addition to the specialized LLMs, we trained R2GenGPT with two more LLMs, LLaMA\-3\.2\-1B\-Instruct and LLaMA\-2\-7b\-chat\-hf\. We selected LLaMA\-3\.2\-1B\-Instruct to conduct a fair comparison with the same version and size of LLaMA\. An instruction variant of LLaMA was selected to serve a similar purpose to the original LLM \(LLaMA\-2\-7b\-chat\-hf\) used by R2GenGPT\. The results indicate that our model clearly outperforms the same\-sized LLM, i\.e\., LLaMA\-3\.2\-1B\-Instruct, achieving a GREEN score of 0\.5488\. Furthermore, even when a larger LLM is deployed in a similar setting, our model outperformed it on several metrics and performed comparably on others\. For a fair comparison, we also replaced our transformer\-based projection network with a linear layer \(similar to R2GenGPT\) for which details are given in the supplementary material\.

We further evaluated our approach against existing 3D medical image-to-report generation methods. Table [3](https://arxiv.org/html/2606.17213#S4.T3) presents results from several recent techniques applied to the same dataset. Our method, RAD3D-Prefix, outperforms most existing techniques on the CT-RATE dataset, with the exception of CT2Rep on certain NLG metrics. However, our model's superiority is evident in its clinical relevance, as demonstrated by a higher GREEN score compared to CT2Rep. Furthermore, RAD3D-Prefix significantly outperformed CT2Rep on the out-of-domain INSPECT dataset, underscoring its generalizability and robustness in real-world scenarios.

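| Dataset | Variant | Visual Prefix | Frozen LLM | Multi-label Classification Logits | Avg. BLEU | METEOR | Avg. ROUGE (1-2) | ROUGE-L | BERTScore-F1 | GREEN (Clinical Efficacy) | No. of Trainable Parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CT-RATE | V-1 | ✓ | ✗ | ✗ | 0.3315 | 0.4456 | 0.4402 | 0.3646 | 0.8826 | 0.4454 | 1.51B |
| CT-RATE | V-2 | ✓ | ✓ | ✗ | 0.3543 | 0.4604 | 0.4677 | 0.4190 | 0.8883 | 0.5428 | 279.09M |
| CT-RATE | V-3 | ✓ | ✓ | ✓ | 0.3637 | 0.4694 | 0.4716 | 0.4256 | 0.8894 | 0.5488 | 279.46M |
| INSPECT | V-1 | ✓ | ✗ | ✗ | 0.0984 | 0.2473 | 0.1586 | 0.1747 | 0.8505 | 0.1565 | 1.51B |
| INSPECT | V-2 | ✓ | ✓ | ✗ | 0.0579 | 0.2028 | 0.1810 | 0.2189 | 0.8659 | 0.2355 | 279.09M |
| INSPECT | V-3 | ✓ | ✓ | ✓ | 0.0657 | 0.2122 | 0.1889 | 0.2268 | 0.8670 | 0.2400 | 279.46M |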

Table 4: Comparative analysis of the role of the presence/absence of different features (Visual Prefix, Frozen LLM, and Multi-label Classification Logits) in the proposed RAD3D-Prefix model.

### 4.4 Ablation Study and Out-of-domain Performance

We performed an ablation study to assess the significance of the different concepts introduced in RAD3D-Prefix, especially the anomaly-aware projection module, as shown in Table [4](https://arxiv.org/html/2606.17213#S4.T4). Additionally, INSPECT, an out-of-domain dataset, is used to assess the impact of the various features on data from a different distribution. Incorporating each feature leads to improvements across most metrics. The GREEN score, which is considered the most relevant metric from a clinical perspective, shows a 23.22% improvement in the variant V-3 setting compared to variant V-1. Similarly, on the INSPECT dataset, a 53.36% improvement is observed.
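The relative GREEN improvements quoted above follow from the V-1 and V-3 rows of Table 4. The short check below recomputes them as (V-3 − V-1) / V-1; the helper name `relative_gain` is hypothetical, and the results agree with the reported figures up to rounding of the four-decimal table values.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# GREEN scores for variants V-1 and V-3, taken from Table 4.
green = {
    "CT-RATE": {"V-1": 0.4454, "V-3": 0.5488},
    "INSPECT": {"V-1": 0.1565, "V-3": 0.2400},
}

gains = {name: relative_gain(s["V-1"], s["V-3"]) for name, s in green.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}% relative GREEN improvement")
```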

UMAP Visualization: To inspect the learned representation in the V-2 and V-3 variants, we projected the embeddings onto three components and visualized them from multiple viewpoints. In Fig. [6](https://arxiv.org/html/2606.17213#S4.F6) (V-2), the blue/cyan ribbon is looser, less structured, and more scattered, suggesting a noisy embedding. In Fig. [6](https://arxiv.org/html/2606.17213#S4.F6) (V-3), the blue/cyan ribbon is tighter and more clearly defined. It forms a more compact S-shaped ribbon, reflecting a more structured organization of the latent space while preserving the overall disease continuum. It also has a denser core with fewer isolated "outlier" points, demonstrating that the anomaly-aware mechanism creates unique vectors for structurally important diseases.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/3d_v2_1.png)

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/3d_v3_1.png)

Figure 6: UMAP visualizations across different projection networks: V-2 (top) and V-3 (bottom).

### 4.5 Classifier Results and Dependency Analysis

Results: To obtain the multi-abnormality classification logits, we fine-tuned the last layer of the CT-CLIP encoder and added a classification layer (Fig. [3](https://arxiv.org/html/2606.17213#S2.F3)). Note that the same CT-CLIP encoder used for CT scan feature extraction is employed for this purpose, but as an independent module, separate from the report generation pipeline. For the CT-RATE dataset, we used the original weights from \[hamamci2024foundation\], since the CT-CLIP encoder is pre-trained on CT-RATE; for the INSPECT dataset, its last layer is fine-tuned. Training uses a binary cross-entropy loss with class-wise weights to address class imbalance. Multi-label classification is performed over the 18 and 21 abnormalities of the CT-RATE and INSPECT datasets, respectively. Therefore, the output dimension equals the number of findings (CT-RATE: 18, INSPECT: 21). We concatenate these 1D logits with the image embeddings, which adds anomaly-specific priors to guide generation. The classification results are shown in Fig. [7](https://arxiv.org/html/2606.17213#S4.F7). These results demonstrate strong anomaly detection, capturing clinical patterns that enhance report generation, particularly for critical conditions (e.g., effusion: 91.0%, cardiomegaly: 90.6%).
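A minimal sketch of the two operations described above, a class-weighted binary cross-entropy and the concatenation of logits with an image embedding, is given below in plain Python. The paper does not specify how the class-wise weights enter the loss, so the per-label positive weighting used here is an assumption, as are the function names, the toy values, and the 4-dimensional stand-in embedding.

```python
import math


def weighted_bce_with_logits(logits, targets, pos_weights):
    """Mean binary cross-entropy over labels, up-weighting positive labels.

    Assumption: class-wise weights scale the positive term of each label.
    """
    total = 0.0
    for z, y, w in zip(logits, targets, pos_weights):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns a logit into a probability
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


def add_diagnostic_prior(image_embedding, logits):
    """Concatenate the 1D classification logits onto the image embedding."""
    return list(image_embedding) + list(logits)


# Toy example with three labels; the first (rare, positive) label gets weight 3.
loss = weighted_bce_with_logits(
    logits=[0.0, 2.0, -1.0], targets=[1, 0, 0], pos_weights=[3.0, 1.0, 1.0]
)

# 18 logits (the CT-RATE label count) appended to a toy 4-dimensional embedding.
conditioned = add_diagnostic_prior([0.1, -0.2, 0.3, 0.4], [0.0] * 18)
```

In a PyTorch training loop, the same weighting could be expressed with `torch.nn.BCEWithLogitsLoss(pos_weight=...)`; whether the authors used that mechanism is not stated.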

Dependency Analysis: We observed multiple cases where the anomaly classifier assigned relatively high confidence to an absent finding, yet the report generation system still described the case correctly, in agreement with the ground-truth report. Samples from three different reports are shown in Table [5](https://arxiv.org/html/2606.17213#S4.T5). Notably, the classifier assigned relatively high confidence scores (>0.6 and >0.5) to pericardial effusion and hiatal hernia, respectively. Despite these misleading classifier signals, the generated reports correctly stated that these abnormalities were not observed, consistent with the reference reports. Similarly, a basic hiatal hernia prior is refined into a specific subtype, "Sliding type hiatal hernia…", which does not exist in the training labels. These examples demonstrate that the report generator does not rely solely on classifier predictions and can override incorrect classifier signals or refine clinical details using information retained in the visual features.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/ctrate_class.png)

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/inspect_class.png)

Figure 7: Classification results on the 18 and 21 multi-abnormality labels of the CT-RATE (left) and INSPECT (right) datasets, respectively.

Table 5: Samples showing visual tokens overriding classifier errors and adding a subtype.

| Mechanism Type | Target Condition | Classifier Logit | Ground Truth Report | RAD3D-Prefix Output |
|---|---|---|---|---|
| False Positive Override | Pericardial Effusion | 0.62017 | "Pericardial effusion-thickening not observed." | "Pericardial effusion-thickening was not observed." |
| False Positive Override | Pericardial Effusion | 0.63840 | "Pericardial effusion-thickening not observed." | "Pericardial effusion-thickening was not observed." |
| False Positive Override | Hiatal Hernia | 0.51572 | Pathology absent in official scan study. | Pathology accurately suppressed from report text. |
| Subtype Addition | Hiatal Hernia | 0.52510 | "Sliding type hiatal hernia was observed…" | "Sliding type hiatal hernia was observed…" |

### 4.6 Statistical Significance Evaluation and Reader Study

Statistical Significance Evaluation: To evaluate the significance of the improvements, we performed bootstrapping with 5,000 iterations. Fig. [8](https://arxiv.org/html/2606.17213#S4.F8)(a) and Fig. [8](https://arxiv.org/html/2606.17213#S4.F8)(b) demonstrate our model's consistent and stable performance, as indicated by the narrow bootstrapped 95% CIs across all metrics on the CT-RATE and INSPECT datasets, respectively. In addition, we compared the V-2 and V-3 variants across the two datasets (Fig. [8](https://arxiv.org/html/2606.17213#S4.F8)(c) and (d)) to assess the significance of introducing classification logits into our proposed projection network. Five out of six metrics show statistically significant improvements (p < 0.05), indicating that the gains of RAD3D-Prefix hold across independent datasets and evaluation dimensions and thereby validating the anomaly-aware prefix. In Fig. [8](https://arxiv.org/html/2606.17213#S4.F8)(c), the GREEN score improvement does not meet the significance threshold, but the marginal p-value (p = 0.0554) suggests that the improvement is borderline and may warrant further investigation. Similarly, in Fig. [8](https://arxiv.org/html/2606.17213#S4.F8)(d), the mean difference in Avg. BLEU is minimally negative (-0.000031), suggesting that V-2 achieved a marginally higher score. However, since the p-value is high (p = 0.4864), this difference is not statistically significant. We further compared our model with the baseline R2GenGPT using a larger LLM (Fig. [8](https://arxiv.org/html/2606.17213#S4.F8)(e)) and a similar-sized LLM (Fig. [8](https://arxiv.org/html/2606.17213#S4.F8)(f)). The consistent performance gain, with all metrics meeting the significance criterion (p < 0.05), highlights the robustness of our method.
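A paired bootstrap of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the text gives only the iteration count (5,000) and the 95% CI, so the percentile interval, the one-sided p-value definition, the function name `paired_bootstrap`, and the synthetic scores are all assumptions.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_iter=5000, seed=0):
    """Bootstrap the mean difference (a - b) over paired per-report scores.

    Returns the observed mean difference, a percentile 95% CI, and a
    one-sided p-value (share of resamples in which a does not beat b).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boot_means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n for _ in range(n_iter)
    )
    lower = boot_means[int(0.025 * n_iter)]
    upper = boot_means[int(0.975 * n_iter) - 1]
    p_value = sum(m <= 0.0 for m in boot_means) / n_iter
    return sum(diffs) / n, (lower, upper), p_value


# Synthetic per-report scores (not from the paper): system A is better by about 0.02.
data_rng = random.Random(1)
scores_b = [data_rng.uniform(0.3, 0.6) for _ in range(100)]
scores_a = [s + 0.02 + data_rng.gauss(0.0, 0.01) for s in scores_b]

mean_diff, (ci_low, ci_high), p = paired_bootstrap(scores_a, scores_b)
```

Resampling the paired differences, rather than each system's scores independently, keeps the per-report pairing intact, which is what makes a small but consistent gain detectable.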

Reader Study: We conducted a reader study with two clinical experts who evaluated 100 randomly selected predicted reports from the baseline and the two variants V-2 and V-3. Averaging the scores from both clinicians shows that V-3 achieved the highest clinically relevant outcomes, improving scores by 9.8% over the baseline and 3.7% over V-2. More details are in the supplementary material.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/RAD3D-Prefix_stat.png)

Figure 8: Forest plots of mean differences (95% CIs) for RAD3D-Prefix on (a) CT-RATE and (b) INSPECT, with comparisons to (c–f) Variant V-2, R2GenGPT (LLaMA-2-7b-chat-hf), and R2GenGPT (LLaMA-3.2-1B-Instruct). Asterisks denote significance levels: \* p < 0.05, \*\* p < 0.01, \*\*\* p < 0.001; ns indicates p ≥ 0.05.

## 5 Conclusion

In this paper, we present RAD3D-Prefix, a model with a lightweight anomaly-aware prefix conditioning module for 3D CT report generation. This approach bridges the semantic clinical gap by aligning high-dimensional volumetric features and structured diagnostic priors with the frozen LLM's text embedding space. Despite requiring fewer trainable parameters, the proposed model achieves a 4.6% improvement in GREEN score over the SoTA. Additionally, we compare different variants to highlight scenarios in which one variant is preferable to another. Through a systematic study, we investigate the impact of model scale and adaptation strategy, establishing practical recommendations for choosing between the freezing and fine-tuning paradigms in 3D medical imaging.

## Appendix

## Dataset Details

#### CT\-RATE

The CT\-RATE dataset consists of 25,692 non\-contrast chest CT volumes, which are expanded to 50,188 using various reconstruction methods\. These scans come from 21,304 unique patients and are accompanied by corresponding radiology text reports, 18 multi\-abnormality labels, and metadata\. During training, samples from 20,000 patients \(47,149 scans\) are used, while the remaining samples from 1,304 patients \(3,039 scans\) are reserved for testing\. The associated radiology reports are segmented into four sections: clinical information, technique, findings, and impression\. However, only the findings section is used to train the report generation model\. For the classification task, the publicly available 18 multi\-abnormality labels are used\.

Table 6:Distribution of anomaly counts in the INSPECT dataset for both the training and test sets\.
#### INSPECT

The INSPECT dataset has 23,248 scans focusing mainly on pulmonary embolism. After removing some redundant data, we used 17,730 and 3,506 scans for training and testing, respectively. As the dataset comes with neither additional abnormality labels nor an official split, ReXKG \[zhang2024uncovering\] is used to extract entities representing abnormalities. These extracted entities are used for two purposes: (a) a stratified train and test dataset split, and (b) multi-anomaly classification to obtain classification logits. After applying ReXKG, the obtained entities are sorted in descending order of their frequency of occurrence. Entities occurring more than 1,000 times across both the training and testing sets are selected for multi-anomaly classification. Also, anomalies with the same Concept Unique Identifier (CUI) are combined, with their frequencies summed accordingly. For example, emboli, embolus, and embolism share the CUI C1704212; therefore, their occurrence frequencies are aggregated under one entity, "Embolism". A list of these anomalies with their occurrence counts is shown in Table [6](https://arxiv.org/html/2606.17213#Sx2.T6). This distribution is used to create a training and test split of the INSPECT dataset via stratified sampling based on the occurrence of anomalies in the reports.
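The CUI-based merging and frequency threshold described above can be sketched as follows. Only the grouping of emboli, embolus, and embolism under CUI C1704212 and the 1,000-occurrence threshold come from the text; the counts, the other entities, and their placeholder identifiers are hypothetical, and the order of merging and thresholding is an assumption.

```python
from collections import Counter

# Hypothetical entity counts; only the C1704212 grouping comes from the text.
entity_counts = {
    "emboli": 900,
    "embolus": 700,
    "embolism": 2400,
    "nodule": 1500,
    "cyst": 300,
}
entity_to_cui = {
    "emboli": "C1704212",
    "embolus": "C1704212",
    "embolism": "C1704212",
    "nodule": "CUI_NODULE",  # placeholder identifier, not a real CUI
    "cyst": "CUI_CYST",      # placeholder identifier, not a real CUI
}


def merge_by_cui(counts, cui_of):
    """Sum the frequencies of all entities that share a CUI."""
    merged = Counter()
    for entity, count in counts.items():
        merged[cui_of[entity]] += count
    return merged


def select_labels(merged, min_count=1000):
    """Keep CUIs occurring more than `min_count` times, most frequent first."""
    return [cui for cui, count in merged.most_common() if count > min_count]


merged = merge_by_cui(entity_counts, entity_to_cui)
labels = select_labels(merged)
```

In this toy example the three embolism surface forms collapse into one label, and the rare "cyst" entity falls below the threshold and is dropped from the classification targets.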

Algorithm 1: Anomaly-Aware Projection Network Training

1. Image embeddings $\mathbf{z}_i$ obtained using $f_e$, text embeddings $\hat{\mathbf{R}}_i$, and classification logits $\mathbf{l}_i$
2. Frozen LLaMA-3.2-1B with $f_m$ projection network for report generation
3. for each batch $(\mathbf{z}_i, \hat{\mathbf{R}}_i, \mathbf{l}_i) \in D$ do
4. Concatenate $\mathbf{z}_i$ with $\mathbf{l}_i$: $\hat{\mathbf{z}}_i \leftarrow \text{concat}(\mathbf{z}_i, \mathbf{l}_i)$, where $\hat{\mathbf{z}}_i \in \mathbb{R}^{v+c}$
5. Construct prefix mask: $\mathbf{p}_{mask} \leftarrow \mathbf{1}^{B \times L_p}$, where $B$ is the batch size and $L_p$ is the prefix length (10 in our case)
6. Concatenate prefix mask with $\hat{\mathbf{R}}_i$ attention mask (obtained using LLaMA-3.2-1B tokenizer): $\text{mask} \leftarrow \text{concat}(\mathbf{p}_{mask}, \hat{\mathbf{R}}_i\text{.attention\_mask}, \text{dim}=1)$
7. Pass $\hat{\mathbf{z}}_i$ through the linear projection layer in $f_m$: $\mathbf{E}_{proj} \leftarrow \mathbf{W}\hat{\mathbf{z}}_i + \mathbf{b}$, where $\mathbf{E}_{proj} \in \mathbb{R}^{L_p \times h}$, $\mathbf{W} \in \mathbb{R}^{(v+c) \times (L_p \cdot h)}$, $\mathbf{b}$ is the bias term, and $h$ is the LLM's hidden size
8. Reshape($\mathbf{E}_{proj}, B, L_p, h$), giving $\mathbf{E}_{proj_{reshape}} \in \mathbb{R}^{B \times L_p \times h}$
9. Define the learnable prefix constant $\mathbf{p}_c$, with $\mathbf{p}_c \in \mathbb{R}^{B \times L_p \times h}$
10. Concatenate $\mathbf{E}_{proj_{reshape}}$ with $\mathbf{p}_c$: $\mathbf{S} \leftarrow \text{concat}(\mathbf{E}_{proj_{reshape}}, \mathbf{p}_c, \text{dim}=1)$
11. Pass $\mathbf{S}$ through $K$ transformer layers (here $K=8$). For each transformer layer, process through multi-head attention and an MLP feedforward network.
12. Compute loss $\mathcal{L}$ (Eq. [2](https://arxiv.org/html/2606.17213#S3.E2))
13. Backpropagate and update only $f_m$ parameters
14. end for
15. return fine-tuned model for report generation
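The tensor shapes in steps 4 to 10 of Algorithm 1 can be traced with a small NumPy sketch. This is a shape walk-through under stated assumptions, not the authors' PyTorch implementation: the sizes `v`, `h`, and `T` are toy values chosen for illustration, the weights are random rather than learned, and the transformer layers, loss, and parameter update (steps 11 to 13) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes. c = 18 and L_p = 10 follow the text; B, v, h, T are assumptions.
B, v, c, L_p, h, T = 2, 512, 18, 10, 64, 7

z = rng.normal(size=(B, v))                   # image embeddings z_i
l = rng.normal(size=(B, c))                   # classification logits l_i
text_mask = np.ones((B, T), dtype=np.int64)   # tokenizer attention mask of the report

# Step 4: concatenate embeddings and logits along the feature axis.
z_hat = np.concatenate([z, l], axis=1)               # (B, v + c)

# Steps 5-6: all-ones prefix mask followed by the text attention mask.
p_mask = np.ones((B, L_p), dtype=np.int64)
mask = np.concatenate([p_mask, text_mask], axis=1)   # (B, L_p + T)

# Steps 7-8: linear projection, then reshape into L_p prefix tokens of width h.
W = rng.normal(scale=0.02, size=(v + c, L_p * h))
b = np.zeros(L_p * h)
E_proj = (z_hat @ W + b).reshape(B, L_p, h)          # (B, L_p, h)

# Steps 9-10: append the prefix constant along the sequence axis.
# One (L_p, h) tensor is shared across the batch here, which is an assumption.
p_c = np.broadcast_to(rng.normal(size=(1, L_p, h)), (B, L_p, h))
S = np.concatenate([E_proj, p_c], axis=1)            # (B, 2 * L_p, h)
```

The row-vector product `z_hat @ W` matches the stated shape of W, (v+c) × (L_p·h). Note that S carries 2·L_p tokens into the transformer layers; the algorithm does not state which of the transformed tokens are finally passed to the LLM, so the sketch stops at step 10.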

Table 7:Comparison with the state\-of\-the\-art approach of vision and text embedding alignment and different sized LLM training based on two more clinically\-oriented metrics\.

## Qualitative Results

More qualitative samples are given in Fig\.[9](https://arxiv.org/html/2606.17213#Sx3.F9)\. It can be seen that the baseline model is overfitted on the dataset, generating the same result regardless of the input\. In contrast, our model performed better with input\-specific output covering most of the anomalies\. Some failure cases of our model involved content where abnormalities were accompanied by specific measurements\. Nevertheless, our model correctly identified critical conditions, including effusions and hiatal hernia\.

Fig. [10](https://arxiv.org/html/2606.17213#Sx3.F10) illustrates reports predicted by the three variants, V-1, V-2, and V-3. It can be observed that RAD3D-Prefix, based on the V-3 variant, produces the report most aligned with the radiologist-annotated ground truth. It is the only variant that correctly identifies "COVID-19 pneumonia", "no obstructive pathology", and "ground-glass opacity" with the correct location. After V-3, V-2 performed better than V-1, capturing "ground-glass opacity" but with an incorrect location, while V-1 fails to detect it entirely.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/eg1.png)

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/eg2.png)

Figure 9: More qualitative examples of the baseline and our proposed method. Matching sentence pairs are highlighted in the same color.

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/sample3.png)

Figure 10: Qualitative sample comparing outcomes of the three variants, V-1, V-2, and V-3.

## GREEN Score Definition and Computation Details

GREEN Score \[ostmeier-etal-2024-green\] is a metric for radiology report generation that uses regular expressions to parse errors in generated reports and to identify matched findings. The score can be calculated as:

$$\mathrm{GREEN\ Score} = \frac{\#\,\text{matched findings}}{\#\,\text{matched findings} + \sum_{i \in \text{sig. errors}} \#\,\text{errors}_i} \tag{4}$$

where a "matched finding" is a clinical observation present in both the generated and reference reports, and "sig. errors" correspond to findings whose omission or hallucination would plausibly impact clinical decision-making. If no matched findings are present, the GREEN score is defined to be zero. Sample GREEN summaries for the V-2 and V-3 variants are shown in Fig. [11](https://arxiv.org/html/2606.17213#Sx4.F11). These two variants are compared specifically because V-3 incorporates classification logits while V-2 does not, enabling a direct comparison of the impact of logits on clinical correctness. Across 3039 test cases, V-3 produces 253 additional matched findings (a 1.43% improvement) compared to V-2, corresponding to approximately one additional finding every 12 cases. Given the sparsity and clinical importance of radiology findings, this improvement is clinically meaningful.
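Eq. (4) and the "one additional finding every 12 cases" figure can be checked with a few lines of Python. The function name `green_score` and the example counts are hypothetical; only the formula, the zero-match convention, and the 3039 and 253 figures come from the text.

```python
from typing import Sequence


def green_score(matched_findings: int, significant_errors: Sequence[int]) -> float:
    """GREEN score: matched findings over matched findings plus significant errors.

    `significant_errors` holds one count per clinically significant error category.
    """
    if matched_findings == 0:
        return 0.0  # defined as zero when nothing matches
    return matched_findings / (matched_findings + sum(significant_errors))


# Hypothetical report: 4 matched findings, 2 significant errors in total.
example = green_score(4, [1, 0, 1])

# Figures from the text: 253 extra matched findings across 3039 test cases.
cases_per_extra_finding = 3039 / 253
```

The example report scores 4 / (4 + 2) ≈ 0.667, and 3039 / 253 ≈ 12.0 reproduces the stated rate of roughly one additional matched finding per 12 cases.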

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/RAD3D-Preifx_greensummary.png)

Figure 11: Samples of GREEN Summary for V-2 and V-3 variants.

## Projection Network Parameter Count Comparison

To support the effectiveness of our proposed framework, we present results obtained by replacing our transformer mapping network with a linear layer (see Table [8](https://arxiv.org/html/2606.17213#Sx5.T8)). This variant still uses our proposed prefix design, which increases the parameter count from 1.05M (linear) to 1.05M × 10 (linear + prefix), still significantly less than the 279.5M of the original transformer + prefix network. Replacing our transformer mapper (279.5M) with a linear layer cuts parameters by about 26× at the cost of only a 0.047 drop in BLEU, while still outperforming R2GenGPT. This shows that the gains stem from the prefix mechanism, not just from increased model size. The 10× extra parameters account for the prefix length (our core proposal).
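The parameter counts quoted above are consistent with a simple count of weights and biases in a linear layer. In the sketch below, the input width `v = 512` and the LLM hidden size `h = 2048` are assumptions, chosen because they reproduce the reported 1.05M figure; the paper does not state these dimensions in this section, and the classification logits are left out of the input width for the same reason.

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Number of trainable parameters in a fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


# Assumed sizes: v = 512 (image embedding), h = 2048 (LLM hidden size).
# L_p = 10 is the prefix length given in the text.
v, h, L_p = 512, 2048, 10

single = linear_params(v, h)            # plain linear mapping to one token
prefixed = linear_params(v, L_p * h)    # linear mapping to L_p prefix tokens
ratio = 279.5e6 / prefixed              # reported transformer mapper vs. linear + prefix

print(f"linear: {single / 1e6:.2f}M, linear + prefix: {prefixed / 1e6:.2f}M")
print(f"transformer mapper is about {ratio:.1f}x larger")
```

Under these assumptions the linear layer has about 1.05M parameters, the prefixed version about 10.5M, and the 279.5M transformer mapper is roughly 26 to 27 times larger, in line with the 26× reduction stated above.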

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/prefixlenVsPerformance.png)

![Refer to caption](https://arxiv.org/html/2606.17213v1/images/layercountVsPerformance.png)

Figure 12: Performance analysis with respect to increasing trainable parameters, influenced by modifications to (a) the prefix length and (b) the number of layers in the transformer-based projection network.

Table 8: Using a linear layer in the projection network instead of the transformer-based network and comparing the setting with R2GenGPT.

## Analysis of Prefix Length and Layers Vs\. Performance and Trainable Parameters

We performed an additional ablation study with different prefix lengths (2, 5, 7, 10, and 14) and different numbers of layers in the transformer architecture of the projection network (2, 4, 8, and 16) to analyze trade-offs between performance and computational overhead. The results are shown in Fig. [12](https://arxiv.org/html/2606.17213#Sx5.F12). Increasing the prefix length significantly improves performance up to a threshold, after which gains saturate. The most significant improvement occurs between prefix lengths 2 and 5, with gains above 60% across five metrics. Beyond prefix length 5, the improvements are marginal despite the increased parameter count. Further, Fig. [12](https://arxiv.org/html/2606.17213#Sx5.F12)(b) shows that while increasing the number of layers dramatically increases trainable parameters, there is no measurable performance gain across NLG metrics. Although a slight improvement is observed in the GREEN metric, the gain is not proportional to the increase in parameters.

## Reader Study

We conducted a double-blind reader study with two clinicians who independently evaluated 100 randomly selected anonymized reports generated by the baseline, V-2, and V-3 models. The evaluation consisted of two sections, clinical accuracy and technical and linguistic quality, each scored on a five-point scale (1–5) according to the following predefined criteria.

1. Clinical Accuracy (1–5)
   - (a) Correct identification of presence/absence of findings
   - (b) No hallucinated (false positive) findings
   - (c) Whether errors would meaningfully affect diagnosis or management
   - (d) Correct anatomical location and placement of findings
2. Technical & Linguistic Quality (1–5)
   - (a) No repetition or redundancy
   - (b) No incomplete or truncated sentences
   - (c) No strange symbols or artifacts (e.g., "#", "?", "Question")

Scale: 1 = Very Poor \| 2 = Poor \| 3 = Acceptable \| 4 = Good \| 5 = Excellent

Averaging the scores obtained from both clinicians, we observe that V-3 improved clinical relevance by 9.8% over the baseline and by 3.7% over V-2, achieving the highest clinical score. While the baseline achieved higher linguistic fluency scores, our proposed variants produced reports with stronger clinical relevance. Notably, clinicians observed that the baseline frequently generated normal outcomes, suggesting overfitting and a tendency to prioritize fluent but non-informative text rather than condition-specific findings. Although both variants scored lower than the baseline in linguistic quality, V-3 still outperformed V-2 by 9.4%.

## References
