Vectors Are Not Neutral: Sensitive-Information Inference from Exported LLM Representations in Summarization
Summary
This paper investigates the risk of sensitive information inference from exported LLM representations in clinical summarization, showing that reducing leakage from one vector artifact does not guarantee privacy in others. It introduces SurfaceLoRA, a fine-tuning method that reduces race recovery from targeted vectors while preserving utility.
View Cached Full Text
Cached at: 05/27/26, 09:05 AM
# Vectors Are Not Neutral: Sensitive-Information Inference from Exported LLM Representations in Summarization
Source: [https://arxiv.org/html/2605.26433](https://arxiv.org/html/2605.26433)
Weixin Liu1Bowen Qu1Juming Xiong1Congning Ni2Bradley A\. Malin1,2Zhijun Yin1,2 1Vanderbilt University 2Vanderbilt University Medical Center \{weixin\.liu,bowen\.qu,juming\.xiong\}@vanderbilt\.edu \{congning\.ni\.1,b\.malin,zhijun\.yin\.1\}@vumc\.org
###### Abstract
Large language model \(LLM\) summarization systems may pass compact vector representations of private inputs to downstream retrieval, monitoring, audit, or analytic workflows\. Even when source documents remain access\-restricted, derived vectors may be handled under different access controls and still support sensitive\-information inference, creating a residual information\-disclosure risk\. We study this issue in clinical discharge\-summary generation as a high\-stakes case study, using electronic health record \(EHR\)\-recorded race as a controlled sensitive\-label audit\. We audit two artifacts that a system might retain or expose to downstream components: the final prompt\-token hidden state and the mean\-pooled prompt representation\. Our results show that reducing recoverability of the case\-study sensitive label from one exported artifact does not necessarily reduce recoverability from another\. As a mitigation case study, we introduceSurfaceLoRA, an exported\-vector\-targeted parameter\-efficient fine\-tuning method that uses a gradient\-reversal discriminator attached to a designated exported vector\. Under a balanced five\-way probing protocol,SurfaceLoRAreduces EHR\-recorded race recoverability from the targeted final\-token artifact toward chance while preserving summarization utility, yet recoverability remains substantially higher from untargeted pooled artifacts\. These findings show that privacy auditing and mitigation should be performed on the exact vector artifact retained or exposed to downstream components\.
Vectors Are Not Neutral: Sensitive\-Information Inference from Exported LLM Representations in Summarization
Weixin Liu1Bowen Qu1Juming Xiong1Congning Ni2Bradley A\. Malin1,2Zhijun Yin1,21Vanderbilt University2Vanderbilt University Medical Center\{weixin\.liu,bowen\.qu,juming\.xiong\}@vanderbilt\.edu\{congning\.ni\.1,b\.malin,zhijun\.yin\.1\}@vumc\.org
## 1Introduction
Large language models \(LLMs\) are increasingly used to summarize long, sensitive documents\. In many summarization workflows, the source text may remain access\-restricted while derived vector artifacts are retained, cached, logged, indexed, or passed to downstream system components for retrieval, monitoring, auditing, or analytics\(Lewiset al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib6); Karpukhinet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib7); Wanget al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib72); Douzeet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib67); Zenget al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib41)\)\. This creates a general information\-disclosure question: even when the raw texts used for summarization are protected, do derived vectors still support inference of sensitive information about the individual described by the source documents?
This question is policy\- and governance\-relevant even without assuming malicious use\. A downstream component, service provider, analyst, or auditor may have access to stored vectors while lacking authorization to inspect the original text or structured sensitive attributes\. If such vectors are treated as derived artifacts that should not reveal subject\-level information, then successful recovery of sensitive information from those vectors is itself a residual disclosure risk\. This concern is consistent with data\-protection guidance that treats anonymization as a risk\-based assessment involving residual risks such as singling out, linkability, and inference\(European Parliament and Council of the European Union,[2016](https://arxiv.org/html/2605.26433#bib.bib74); Article 29 Data Protection Working Party,[2014](https://arxiv.org/html/2605.26433#bib.bib75)\)\. It is also supported by prior work showing that text embeddings can reveal substantial information about the underlying text\(Morriset al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib34); Liet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib35); Chenet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib64); Huanget al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib63); Chenet al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib55)\)and that LLM\-derived internal representations can create inversion or attribute\-inference risks\(Zhuet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib54); Donget al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib58)\)\. Appendix[A](https://arxiv.org/html/2605.26433#A1)provides an expanded operational illustration and threat model\.
We instantiate this general problem in clinical note summarization\. Specifically, we study Brief Hospital Course \(BHC\) generation, where the model summarizes a hospitalization into the discharge\-summary narrative describing the admission, diagnoses, treatments, clinical trajectory, and follow\-up considerations\(Adamset al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib43); Searleet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib68); Yanget al\.,[2022](https://arxiv.org/html/2605.26433#bib.bib38); Aaliet al\.,[2025b](https://arxiv.org/html/2605.26433#bib.bib56)\)\. Clinical data provide a high\-stakes setting in which raw notes and demographic fields are access\-restricted, but derived representations may still be reused for system operations\. As a case\-study sensitive attribute, we audit whether electronic health record \(EHR\)\-recorded race can be inferred from exported vector artifacts\. We use race as a controlled example of sensitive\-attribute inference rather than as the only attribute of concern; other attributes such as age, sex or gender, ethnicity, language, insurance status, and socioeconomic proxies raise analogous concerns and may require separate or multi\-attribute audits\.
The vector exposed by a summarization system can be defined in multiple ways\. In this study, we audit two plausible prompt\-derived artifacts computed before generation:lasttok, the final prompt\-token hidden state immediately before decoding, which serves as a compact prompt\-level vector; andmeanpool, the average of all non\-padding prompt\-token hidden states, which resembles pooled embeddings used for retrieval, semantic indexing, or analytics\(Lewiset al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib6); Karpukhinet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib7); Wanget al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib72); Douzeet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib67)\)\. Using standard post\-hoc probes trained on frozen exported vectors\(Elazar and Goldberg,[2018](https://arxiv.org/html/2605.26433#bib.bib32); Belinkov and Glass,[2019](https://arxiv.org/html/2605.26433#bib.bib14)\), we find that leakage is artifact\-specific: reducing sensitive\-attribute predictability onlasttokdoes not imply reduced predictability onmeanpool\.
As a mitigation case study, we introduceSurfaceLoRA, an exported\-vector\-targeted PEFT method built on LoRA and gradient reversal\(Huet al\.,[2022](https://arxiv.org/html/2605.26433#bib.bib18); Ganinet al\.,[2016](https://arxiv.org/html/2605.26433#bib.bib11)\)\. SurfaceLoRA attaches a training\-time discriminator to a designated exported vector while updating only LoRA adapters and the discriminator\. Its goal is not universal sanitization, but reducing recoverability of a specified sensitive label from the specific vector artifact a system intends to retain or expose\. We evaluate frozen exported vectors with post\-hoc linear and nonlinear probes that are separate from the training\-time discriminator\.
Our results show thatSurfaceLoRAcan drive EHR\-recorded race predictability on the targetedlasttokartifact toward chance under a balanced five\-way probing protocol while preserving BHC summarization utility\. By contrast, EHR\-recorded race remains substantially more recoverable from pooled artifacts such asmeanpool\. In addition, the utility–leakage trade\-off is non\-monotonic across training, motivating held\-out checkpoint selection using the same exported artifact that will be retained or audited\.
Contributions\.We make three contributions: \(i\) we frame exported summarization vectors as concrete audit targets for sensitive\-information inference; \(ii\) we show that near\-chance recovery from one exported vector can coexist with substantial recovery from another; and \(iii\) we introduceSurfaceLoRA, an artifact\-targeted PEFT mitigation method that reduces recoverability onlasttokwhile preserving summarization utility, while pooled and multi\-attribute settings require separate auditing\.
## 2Related Work
Our work connects summarization, representation privacy, and adversarial mitigation\. Summarization systems are typically evaluated for generation quality, but sensitive\-document summarization also raises questions about what information is preserved in intermediate artifacts that are stored, reused, or exposed\. We study this broader issue through discharge\-oriented clinical summarization as a high\-stakes case study\. Language models have shown strong performance on hospital\-course and discharge\-summary generation\(Adamset al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib43); Searleet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib68); Yanget al\.,[2022](https://arxiv.org/html/2605.26433#bib.bib38); Chenet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib39); Aaliet al\.,[2025b](https://arxiv.org/html/2605.26433#bib.bib56)\), and MIMIC\-derived corpora provide credentialed\-access benchmarks built from deidentified electronic health record data\(Johnsonet al\.,[2016](https://arxiv.org/html/2605.26433#bib.bib5),[2023a](https://arxiv.org/html/2605.26433#bib.bib4); Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2),[b](https://arxiv.org/html/2605.26433#bib.bib56)\)\. Recent clinical summarization work has emphasized verification and risk\-aware evaluation beyond surface\-level generation quality\(Asgariet al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib61); Chunget al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib62)\)\. We extend this risk\-aware view from generated summaries to exported vector representations\.
Representation leakage is often assessed via probing, where an attacker is trained to recover sensitive attributes from frozen representations\(Elazar and Goldberg,[2018](https://arxiv.org/html/2605.26433#bib.bib32); Belinkov and Glass,[2019](https://arxiv.org/html/2605.26433#bib.bib14)\)\. However, probe results depend on attacker capacity and the specific representation choice under audit\(Hewitt and Liang,[2019](https://arxiv.org/html/2605.26433#bib.bib15); Pimentelet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib16)\)\. Complementary attacks can operate on generated text or model outputs\(Carliniet al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib44)\), while embedding\-inversion attacks recover text or attributes from embedding representations\(Morriset al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib34); Liet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib35); Chenet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib64); Huanget al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib63); Chenet al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib55)\)\. Recent studies further show that LLM\-derived embeddings and internal states can expose sensitive information\(Zhuet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib54); Donget al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib58)\)\. Less is known about artifact\-specific leakage in summarization workflows, where different prompt\-derived vectors may be retained or reused for different downstream purposes\.
Mitigation strategies operate at different points in the pipeline\. Text\-level approaches such as de\-identification and redaction reduce disclosure in the raw text before downstream modeling\(Dernoncourtet al\.,[2017](https://arxiv.org/html/2605.26433#bib.bib46)\), but they do not directly address what remains recoverable from transformed or embedded representations after a model processes the text\. Representation\-level approaches instead aim to reduce protected\-attribute recoverability from learned features\. Adversarial learning with gradient reversal trains task\-useful representations while discouraging prediction of protected attributes\(Ganinet al\.,[2016](https://arxiv.org/html/2605.26433#bib.bib11); Edwards and Storkey,[2015](https://arxiv.org/html/2605.26433#bib.bib12); Madraset al\.,[2018](https://arxiv.org/html/2605.26433#bib.bib13); Zhanget al\.,[2018](https://arxiv.org/html/2605.26433#bib.bib28); Elazar and Goldberg,[2018](https://arxiv.org/html/2605.26433#bib.bib32)\)\. Post\-hoc methods such as INLP, linear adversarial concept erasure, and LEACE remove linearly recoverable information from learned representations\(Ravfogelet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib17),[2022](https://arxiv.org/html/2605.26433#bib.bib40); Belroseet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib45)\)\. We apply a lightweight LoRA\-based intervention\(Huet al\.,[2022](https://arxiv.org/html/2605.26433#bib.bib18)\), but our focus is deployment\-oriented: auditing and mitigating the exact exported vector, then selecting checkpoints using a held\-out utility–leakage trade\-off\.
## 3Methods
#### Overview\.
We instantiate the exported\-vector audit in BHC generation, a clinical summarization case study where prompt\-derived vectors may be cached, indexed, logged, or reused downstream\(Lewiset al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib6); Karpukhinet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib7); Wanget al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib72); Douzeet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib67); Zenget al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib41)\)\. For each example, the prompt contains the system instruction, source clinical context, and assistant generation header, but not the target BHC, generated summary, or race label\. We audit two pre\-generation prompt artifacts:lasttok, the final prompt\-token hidden state, andmeanpool, the mean prompt\-token hidden state\. Given a frozen artifact, post\-hoc probes predict the five\-way EHR\-recorded race label from that exact vector\.
### 3\.1Datasets and Pre\-processing
Our primary dataset is MIMIC\-IV\-Ext\-BHC \(v1\.2\.0\), a PhysioNet hospitalization summarization corpus derived from deidentified MIMIC\-IV EHR data\(Johnsonet al\.,[2023a](https://arxiv.org/html/2605.26433#bib.bib4); Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2)\); each instance pairs a discharge note with the BHC removed and a cleaned BHC target\. We also evaluate on Discharge Me \(v1\.3\), a BHC\-adjacent MIMIC\-derived task with different input construction, including chief complaint, diagnosis codes, and radiology reports\(Johnsonet al\.,[2016](https://arxiv.org/html/2605.26433#bib.bib5),[2023a](https://arxiv.org/html/2605.26433#bib.bib4); Xu,[2024](https://arxiv.org/html/2605.26433#bib.bib33)\)\.
For each dataset, we extract encounter\-level EHR\-recorded race labels and map them into five groups \(White,Black,Hispanic,Asian,Other\), excludingUnknownbecause its assignment is ambiguous\. We split the primary dataset at the patient level into disjoint train/validation/test sets with 193,470/21,552/24,445 instances\. To audit race recoverability under controlled class proportions, we construct race\-balanced subsets within each split by sampling an equal number of examples per race group, yieldingbal\_train\(20,000 total\),bal\_val\(2,500 total\), andtest\_balanced\(2,500 total\)\. These subsets are used for race\-audit components, including adversary training and probe\-based evaluation, while utility fine\-tuning uses the full training split\. Additional preprocessing and split statistics are provided in Appendix[D](https://arxiv.org/html/2605.26433#A4); Discharge Me details are provided in Appendix[Q](https://arxiv.org/html/2605.26433#A17)\. Because the audit subsets are class\-balanced, chance accuracy is0\.200\.20\.
### 3\.2Base Summarization Model
We fine\-tune a locally deployed Llama\-3\.1\-8B\-Instruct model for BHC generation\(Llama Team, AI @ Meta,[2024](https://arxiv.org/html/2605.26433#bib.bib26)\)\. For a given sample, letccdenote the source clinical context\. In the primary dataset,ccis the discharge\-note input with the BHC section removed; it may contain detailed clinical narrative text, but it is not the full longitudinal EHR record\. We renderccwith the system instruction and assistant generation header to obtain the input promptxpromptx\_\{\\mathrm\{prompt\}\}\. Structured demographic labels, including EHR\-recorded race, are not included inxpromptx\_\{\\mathrm\{prompt\}\}and are used only for the adversarial and post\-hoc leakage audits\. Lety1:Ty\_\{1:T\}denote the target BHC tokens\. We optimize the standard token\-level cross\-entropy over target tokens:
ℒutil\(θ\)=−∑t=1Tlogpθ\(yt∣xprompt,y<t\)\.\\mathcal\{L\}\_\{\\text\{util\}\}\(\\theta\)=\-\\sum\_\{t=1\}^\{T\}\\log p\_\{\\theta\}\(y\_\{t\}\\mid x\_\{\\mathrm\{prompt\}\},y\_\{<t\}\)\.\(1\)whereθ\\thetadenotes the trainable PEFT parameters \(LoRA adapters\) on top of a fixed backbone; unless stated otherwise, the base LLM weights are kept fixed\. During supervised fine\-tuning, the rendered promptxpromptx\_\{\\mathrm\{prompt\}\}is concatenated with the target BHC tokens, and prompt tokens are masked so that gradients are computed only on the target sequence\. For prompt\-only representation extraction, adversarial training, and probing, we use onlyxpromptx\_\{\\mathrm\{prompt\}\}without the target BHC\. We truncatexpromptx\_\{\\mathrm\{prompt\}\}to at most 1,024 tokens for prompt\-only passes, and truncate the full prompt–target training sequence to at most 1,536 tokens for supervised fine\-tuning\. Results on another LLM backbone, Qwen\-2\.5\-7B\-Instruct\(Qwen Teamet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib27)\), are reported in Appendix[P](https://arxiv.org/html/2605.26433#A16)\.
### 3\.3SurfaceLoRA: Exported\-Vector\-Targeted Mitigation via Adversarial PEFT
Figure 1:SurfaceLoRA mitigation training\.SurfaceLoRA combines \(A\) a utility branch for summary generation with \(B\) an adversarial branch that attaches a gradient reversal layer \(GRL\)\-based training\-time discriminator to the exported prompt vector\. Herezlastz\_\{\\mathrm\{last\}\}is the exportedlasttok\_L\-1artifact at the generation boundary\. The discriminator predicts a sensitive attribute fromzlastz\_\{\\mathrm\{last\}\}; reversed gradients update LoRA adapters so the exported artifact becomes less predictive while preserving generation utility\. The discriminator is separate from the post\-hoc probe attackers used for evaluation\. Only the LoRA adapters and discriminator are updated; the backbone, LM head, and decoding procedure remain fixed\. In our empirical case study, the generation task is Brief Hospital Course summarization and the audited sensitive attribute is EHR\-recorded race\.SurfaceLoRAis a mitigation method, not an attribute\-inference model\. It trains LoRA adapters so that a designated exported artifact remains useful for summary generation while becoming less predictive of a specified sensitive label under a training\-time adversarial discriminator\. In our empirical case study, the task is BHC summarization and the sensitive label is EHR\-recorded race\. The training\-time discriminator is a lightweight linear classifier in the defaultlasttok\-targeted experiments\. For the meanpool\-targeted stress test, we also use a two\-layer MLP discriminator, because pooled prompt representations may contain distributed attribute\-associated signals that are not fully captured by a linear head\.
For post\-hoc leakage evaluation, we use two attacker classes on frozen exported representations: a multinomial logistic\-regression probe as the primary linear attacker and a two\-layer MLP probe as a stronger nonlinear attacker\. These probes are trained only after model training and are independent of the training\-time discriminator used bySurfaceLoRA\. They are not used to update the LLM or LoRA adapters\. We include both linear and nonlinear probes because probe\-based leakage estimates can depend on attacker capacity as well as the representation under audit\(Belinkov and Glass,[2019](https://arxiv.org/html/2605.26433#bib.bib14); Hewitt and Liang,[2019](https://arxiv.org/html/2605.26433#bib.bib15); Pimentelet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib16)\)\.
The design principle is simple: the vector targeted during training should match the vector that will be retained, reused, or exposed\. We instantiate this principle with LoRA\-based PEFT\(Huet al\.,[2022](https://arxiv.org/html/2605.26433#bib.bib18)\)and a gradient reversal layer \(GRL\)\(Ganinet al\.,[2016](https://arxiv.org/html/2605.26433#bib.bib11)\)\. This follows prior work on adversarially learned and adversarially fair representations\(Edwards and Storkey,[2015](https://arxiv.org/html/2605.26433#bib.bib12); Madraset al\.,[2018](https://arxiv.org/html/2605.26433#bib.bib13); Zhanget al\.,[2018](https://arxiv.org/html/2605.26433#bib.bib28); Elazar and Goldberg,[2018](https://arxiv.org/html/2605.26433#bib.bib32)\)\. At each training step, SurfaceLoRA combines a supervised fine\-tuning \(SFT\) batch for summary generation with a prompt\-only balanced batch for the adversarial sensitive\-label objective\. Only LoRA adapters and the discriminator head are updated; the backbone, LM head, and decoding procedure remain fixed\.
Formally, letxpromptx\_\{\\mathrm\{prompt\}\}denote the rendered input prompt used immediately before generation\. It contains the system instruction, the source clinical context, and the assistant generation header, but does not contain the target BHC, any generated tokens, or the race label\. We run a prompt\-only forward pass onxpromptx\_\{\\mathrm\{prompt\}\}withoutput\_hidden\_states=True\. With left padding, leti⋆=max\{i:attention\_maski=1\}i^\{\\star\}=\\max\\\{i:\\texttt\{attention\\\_mask\}\_\{i\}=1\\\}be the index of the last non\-padding token inxpromptx\_\{\\mathrm\{prompt\}\}\.
We define thelasttokartifact as
fθ\(xprompt\)=𝐡i⋆\(L−1\)\(xprompt\)\.f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)\\;=\\;\\mathbf\{h\}^\{\(L\-1\)\}\_\{i^\{\\star\}\}\(x\_\{\\mathrm\{prompt\}\}\)\.\(2\)This vector is available at the generation boundary and can be stored as one fixed\-size prompt\-level artifact\.
#### Layer indexing\.
We use 0\-indexed transformer blocks; thusL−1L\{\-\}1denotes the final block\.
#### Alternative pooled artifact \(meanpool\)\.
In addition tolasttok, we audit a pooled prompt representation that averages prompt\-token states\. Let𝒫\\mathcal\{P\}denote the set of non\-padding prompt token indices inxpromptx\_\{\\mathrm\{prompt\}\}, including the system instruction, source clinical context, and assistant generation header\. We define
fθmeanpool\(xprompt\)=1\|𝒫\|∑i∈𝒫𝐡i\(L−1\)\(xprompt\)\.f^\{\\texttt\{meanpool\}\}\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)\\;=\\;\\frac\{1\}\{\|\\mathcal\{P\}\|\}\\sum\_\{i\\in\\mathcal\{P\}\}\\mathbf\{h\}^\{\(L\-1\)\}\_\{i\}\(x\_\{\\mathrm\{prompt\}\}\)\.\(3\)
We usemeanpoolfor additional representation auditing in Section[4\.4](https://arxiv.org/html/2605.26433#S4.SS4)\. For the meanpool\-targeted variant, the discriminator input is replaced withfθmeanpool\(xprompt\)f^\{\\texttt\{meanpool\}\}\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)while decoding remains unchanged\. Boundary and rendering variants are detailed in Appendix[E](https://arxiv.org/html/2605.26433#A5)\.
Given a chosen exported artifactfθ\(xprompt\)f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\), we attach the training\-time discriminatorgϕg\_\{\\phi\}to predict a categorical sensitive label\. In our case study, this label is the 5\-way EHR\-recorded race labelr∈\{1,…,5\}r\\in\\\{1,\\dots,5\\\}:
ℒadv\(θ,ϕ\)=CE\(gϕ\(fθ\(xprompt\)\),r\)\.\\mathcal\{L\}\_\{\\text\{adv\}\}\(\\theta,\\phi\)=\\mathrm\{CE\}\\\!\\left\(g\_\{\\phi\}\(f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)\),\\,r\\right\)\.\(4\)
Training couples the summarization objective with exported\-vector\-targeted adversarial pressure through a gradient reversal layer \(GRL\)\. At each step, we computeℒutil\\mathcal\{L\}\_\{\\text\{util\}\}on an SFT batch andℒadv\\mathcal\{L\}\_\{\\text\{adv\}\}on a prompt\-only balanced batch \(frombal\_train\), and update parameters synchronously\. The GRL reverses gradients fromℒadv\\mathcal\{L\}\_\{\\text\{adv\}\}flowing into the trainable adapters, yielding:
minθ\\displaystyle\\min\_\{\\theta\}ℒutil\(θ\)−λℒadv\(θ,ϕ\),\\displaystyle\\mathcal\{L\}\_\{\\text\{util\}\}\(\\theta\)\-\\lambda\\,\\mathcal\{L\}\_\{\\text\{adv\}\}\(\\theta,\\phi\),\(5\)minϕ\\displaystyle\\min\_\{\\phi\}ℒadv\(θ,ϕ\)\.\\displaystyle\\mathcal\{L\}\_\{\\text\{adv\}\}\(\\theta,\\phi\)\.We implement this with PEFT: only LoRA adapters \(and the discriminator head\) are trainable, while the backbone remains fixed\. Extended implementation details, including chat\-template rendering, boundary extraction, representation variants, and the training schedule, are provided in Appendix[E](https://arxiv.org/html/2605.26433#A5)\.
### 3\.4Training, Decoding, and Baselines
We tune adversarial strength by sweepingλ∈\{0\.0,0\.02,0\.05,0\.10,0\.20,0\.50\}\\lambda\\in\\\{0\.0,0\.02,0\.05,0\.10,0\.20,0\.50\\\}and train each configuration for 2,000 steps using AdamW with learning rate2×10−42\\times 10^\{\-4\}\. Utility and adversarial objectives are optimized jointly with a fixed 1:1 utility/adversarial batch ratio\. Consistent with the setup above, prompt\-only forward passes used for representation extraction, adversarial training, and probing truncatexpromptx\_\{\\mathrm\{prompt\}\}to at most 1,024 tokens, while supervised fine\-tuning truncates the full prompt–target sequence to at most 1,536 tokens\. At inference time, we provide the truncatedxpromptx\_\{\\mathrm\{prompt\}\}as input and use greedy decoding withmax\_new\_tokens=256; this generation cap is separate from the input truncation lengths\. We keep decoding fixed across methods\. For the meanpool\-targeted analysis in Section[4\.4](https://arxiv.org/html/2605.26433#S4.SS4), we use the same training, truncation, and decoding settings but attach a two\-layer multi\-layer perceptron \(MLP\) adversary tomeanpool\_L\-1; optimization details and full sweep results are provided in Appendix[F](https://arxiv.org/html/2605.26433#A6)and Appendix[L\.2](https://arxiv.org/html/2605.26433#A12.SS2)\.
#### Prompt\-only baselines\.
We compareSurfaceLoRAagainst two prompt\-only baselines that use the same frozen instruction\-tuned backbone without LoRA or adversarial training\.Baseuses the standard clinical summarization instruction\.Neutraluses the same summarization instruction with an additional demographic\-neutrality directive\. For both baselines, we render the source clinical context intoxpromptx\_\{\\mathrm\{prompt\}\}, extract the same exported prompt representations, and evaluate them with the same post\-hoc probing and ROUGE protocols as the trained models\.
### 3\.5Evaluation Metrics
To assess representation leakage, we report probe accuracy and LeakageGap \(Eq\.[6](https://arxiv.org/html/2605.26433#A5.E6)\) for predicting the case\-study sensitive label, EHR\-recorded race, from the exported prompt vector\. Probe accuracy is the balanced five\-way classification accuracy onbal\_valortest\_balanced, where chance accuracy is0\.200\.20\. LeakageGap is the absolute deviation from chance, with smaller values indicating lower recoverability under the specified post\-hoc probe class\. We report these metrics for the linear and nonlinear post\-hoc probes described in Section[3\.3](https://arxiv.org/html/2605.26433#S3.SS3); implementation details are provided in Appendix[G](https://arxiv.org/html/2605.26433#A7)\.
To simulate deployment\-time model selection, we select checkpoints using a held\-outbal\_valtrade\-off audit rather than defaulting to the final training step\. This validation procedure is specified before test evaluation and uses onlybal\_val, nottest\_balanced\. For the exported artifact under consideration, we first apply the LR\-based validation leakage budget and then choose the feasible checkpoint with the highest validation ROUGE\-L\. If no checkpoint satisfies the budget, we fall back to a Pareto\-optimal low\-leakage checkpoint onbal\_val\. The MLP probe is used only for post\-hoc evaluation, not checkpoint selection\. The full checkpoint selection rule is detailed in Appendix[H](https://arxiv.org/html/2605.26433#A8)\.
To assess summarization utility, we report ROUGE\-1/2/L\(Lin,[2004](https://arxiv.org/html/2605.26433#bib.bib19)\)\. We also report two clinical utility proxies in Appendix[I](https://arxiv.org/html/2605.26433#A9): BERTScore\-F1\(Zhanget al\.,[2020b](https://arxiv.org/html/2605.26433#bib.bib20)\)and concept\-level overlap computed using an i2b2\-2010 clinical named entity recognition \(NER\) model\(Uzuneret al\.,[2011](https://arxiv.org/html/2605.26433#bib.bib22)\)\. Finally, to assess residual privacy risk in generated text, we report output\-attacker accuracy and Macro\-AUROC ontest\_balanced\. We further report mention rates, defined as the percentage of generated summaries containing explicit race\- or ethnicity\-related terms; Appendix[K\.3](https://arxiv.org/html/2605.26433#A11.SS3)further decomposes these terms into generic meta terms \(e\.g\., “race” or “ethnicity”\) and explicit group identifiers\. Full output\-attacker details are provided in Appendix[I](https://arxiv.org/html/2605.26433#A9)\. As an additional attribute stress test, Appendix[M](https://arxiv.org/html/2605.26433#A13)reports a binary EHR\-recorded gender audit using the same exported\-vector probing protocol\.
## 4Results
### 4\.1Unified Test Evaluation
Table[1](https://arxiv.org/html/2605.26433#S4.T1)provides the unified test\-set evaluation ontest\_balanced, including prompt\-only baselines, validation\-screened early checkpoints, and full\-run checkpoints trained to step 2,000\. The table reports summarization utility with ROUGE\-1/2/L and representation\-level recoverability of the case\-study sensitive attribute, EHR\-recorded race, from the targetedlasttok\_L\-1artifact using both LR and MLP probes\. Here, “validation\-selected” means that checkpoint selection is performed on the same exported artifact used for the audit\. We use a pre\-specified validation leakage budget ofLeakageGapLR≤0\.025\\mathrm\{LeakageGap\}\_\{\\mathrm\{LR\}\}\\leq 0\.025, i\.e\., within 2\.5 percentage points of the balanced five\-way chance baseline of0\.200\.20\. Among checkpoints satisfying this budget, we choose the one with the highest validation ROUGE\-L; if no checkpoint satisfies the budget, we fall back to a low\-leakage Pareto point\.
Table 1:Test\-set evaluation under a unified protocol\. ROUGE is computed ontest\_balanced\(n=2,500n=2\{,\}500\); LR and MLP probes are fit onbal\_train\(n=20,000n=20\{,\}000\) and evaluated ontest\_balanced\. Bold marks the validation\-selected checkpoint chosen by maximizing validation ROUGE\-L subject to the LR\-based validation leakage budget on the exported artifact \(LeakageGapLR≤0\.025\\mathrm\{LeakageGap\}\_\{\\mathrm\{LR\}\}\\leq 0\.025\)\. R\-1/R\-2/R\-L denote ROUGE\-1/2/L; Acc reports probe accuracy with 95% CI; Gap is\|Acc−0\.2\|\|\\mathrm\{Acc\}\-0\.2\|\. ROUGE CIs are paired bootstrap percentile intervals \(B=10,000B\{=\}10\{,\}000\); probe CIs are patient\-level stratified cluster bootstrap percentile intervals \(B=10,000B\{=\}10\{,\}000\), resamplingsubject\_idwithin each race group; see Appendix[J\.3](https://arxiv.org/html/2605.26433#A10.SS3)\.
### 4\.2Prompting and Post\-hoc Removal are Insufficient
We first examine the prompt\-only baselines defined in Section[3\.4](https://arxiv.org/html/2605.26433#S3.SS4)\. As shown in Table[1](https://arxiv.org/html/2605.26433#S4.T1), bothBaseandNeutralremain above chance under LR and MLP probes, indicating that instruction\-level demographic neutrality is not sufficient to remove recoverable signal about the case\-study sensitive attribute from the exported prompt artifact\. TheNeutralbaseline also induces output meta\-term contamination: Appendix[K\.3](https://arxiv.org/html/2605.26433#A11.SS3)shows a 46\.44% meta\-term rate driven by prompt\-induced wording such as “race” or “racial identity” rather than genuine demographic disclosure\.
We also evaluate additional leakage\-mitigation baselines onlasttok\_L\-1under the same balanced five\-way audit\. Appendix[J\.2](https://arxiv.org/html/2605.26433#A10.SS2)shows that cached\-vector post\-hoc transforms mainly reduce*linear*recoverability: PCA removal gives the smallest LR LeakageGap ontest\_balanced\(Acc=0\.2048=0\.2048, LeakageGap=0\.0048=0\.0048\), but the MLP probe remains above chance \(Acc=0\.2120=0\.2120\)\. The XCov training\-time decorrelation baseline also increases recoverability relative to no removal on the exported artifact\. Together, these results motivate training\-time, artifact\-aligned mitigation rather than relying on prompt\-only constraints or linear post\-processing alone\.
### 4\.3SurfaceLoRA Approaches the Chance Baseline on the TargetedlasttokArtifact
Table[1](https://arxiv.org/html/2605.26433#S4.T1)shows that the validation\-selected checkpoint \(λ=0\.02\\lambda\{=\}0\.02, step 1200\) approaches the chance baseline on the targeted artifact while preserving utility\. It achieves LR ProbeAcc=0\.203=0\.203\(95% CI \[0\.188, 0\.219\]\) and MLP ProbeAcc=0\.206=0\.206\(95% CI \[0\.191, 0\.222\]\), with chance \(0\.200\.20\) inside both intervals\. ROUGE\-L remains strong at14\.5414\.54\(95% CI \[14\.19, 14\.86\]\), substantially above prompt\-only baselines\. Thus, approximately chance\-level recovery is observed for the designated exported artifact under both evaluated probe classes\.
### 4\.4Leakage Depends on the Exported Representation
We next test whether mitigation transfers beyond the targetedlasttok\_L\-1artifact\. Table[2](https://arxiv.org/html/2605.26433#S4.T2)shows that suppression is artifact\-specific:SurfaceLoRAreduces targetedlasttokvariants to near chance \(LR/MLP≈0\.203\\approx 0\.203–0\.2080\.208\), butmeanpoolremains substantially predictive \(LR≈0\.316\\approx 0\.316–0\.3230\.323; MLP≈0\.270\\approx 0\.270–0\.3260\.326\)\. Therefore, chance\-level recovery on one exported artifact does not imply model\-wide sanitization; if mean\-pooled embeddings are exported, the audit and adversarial target should use that representation\.
We further run a meanpool\-targeted sweep with an MLP discriminator and perform checkpoint selection onbal\_val\. Even the lowest\-leakage validation checkpoint remains far above chance \(LR=0\.308=0\.308, Gap=0\.108=0\.108\), indicating thatmeanpoolis a harder artifact to sanitize; full validation results are reported in Appendix[L\.2](https://arxiv.org/html/2605.26433#A12.SS2)\.
Table 2:Representation\-choice sensitivity ontest\_balancedfor the balanced five\-way audit; chance accuracy is0\.200\.20\.B/N/Sdenoteprompt\_base,prompt\_neutral, andSurfaceLoRA\(λ=0\.02\\lambda\{=\}0\.02, step 1200\)\. Entries are LR/MLP probe accuracies\.L\-4meandenotes the last\-four\-layer mean\. Bold marks targetedlasttokvariants for the selectedSurfaceLoRAcheckpoint\.
### 4\.5Trade\-off Fragility Requires Checkpoint Selection
The leakage–utility trade\-off is non\-monotonic in both training time and adversarial strength\. We therefore treat checkpoint selection as part of the deployment procedure rather than defaulting to the final training step\. This is not post\-hoc selection on the test set: before test evaluation, checkpoints are selected using a pre\-specified validation rule onbal\_valfor the same exported representation that will be audited in deployment \(Appendix[H](https://arxiv.org/html/2605.26433#A8)\)\.
Checkpoint choice is consequential\. Forλ=0\.02\\lambda\{=\}0\.02, the validation\-selected step\-1200 checkpoint is near chance under the MLP probe ontest\_balanced\(Acc=0\.206=0\.206, Gap=0\.006=0\.006\), but continuing the same training to step 2000 increases recoverability to Acc=0\.228=0\.228\(Gap=0\.028=0\.028\) \(Table[1](https://arxiv.org/html/2605.26433#S4.T1)\)\. This rebound suggests that adversarial pressure can temporarily reduce probe\-accessible sensitive\-attribute signal in the targeted representation, while later utility optimization or representation reorganization can make attribute\-associated structure recoverable again\. Accordingly, deployment should audit saved checkpoints on a held\-out validation split and select the checkpoint using the same exported representation and the pre\-specified LR validation probe used for model selection\.
### 4\.6Output\-Level Attribute Inference and Task Utility
Representation\-level mitigation does not imply output\-level sensitive\-attribute invariance\. Generated summaries remain above chance under the Bio\_ClinicalBERT output attacker\. The selected checkpoint modestly improves over prompt\-only baselines \(Acc0\.2990\.299vs\.0\.3060\.306; Macro\-AUROC0\.6160\.616vs\.0\.6310\.631\), but does not approach chance\. A diagnostic attacker trained on gold BHC targets performs similarly \(Acc0\.2980\.298, Macro\-AUROC0\.6210\.621\), suggesting that case\-study attribute correlates are present in human\-written summaries\. Mention\-rate and group\-wise utility analyses are reported in Appendices[K\.3](https://arxiv.org/html/2605.26433#A11.SS3)and[L\.3](https://arxiv.org/html/2605.26433#A12.SS3)\.
## 5Discussion: Artifact\-Specific Auditing for Sensitive\-Information Inference
Exported vectors from LLM\-based summarization systems should be treated as artifact\-specific privacy and governance surfaces\. A mitigation claim is meaningful only when it identifies both the sensitive information being audited and the exact vector artifact retained, logged, indexed, or reused\. Prior work shows that embeddings and LLM\-derived representations can reveal information about underlying text or attributes\(Morriset al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib34); Liet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib35); Huanget al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib63); Chenet al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib55); Zhuet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib54); Donget al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib58)\)\. Our results show that recoverability can also vary across prompt\-derived artifacts from the same summarization model, so auditing one vector does not justify claims about another\.
SurfaceLoRAreduces recoverability of a chosen sensitive attribute from its designated vector artifact, but should not be interpreted as a universal sanitizer\. Although our case study audits EHR\-recorded race, it is attribute\-agnostic in form: given an audit label, the discriminator can target another sensitive attribute, and the resulting leakage–utility trade\-off should be evaluated separately\.
Alternative exports such asmeanpoolcan remain probeable even when the targetedlasttokartifact approaches chance\-level recovery\. One possible explanation is thatmeanpoolaggregates race\-associated cues distributed across many prompt tokens, making leakage harder to suppress with a low\-rank PEFT intervention such asSurfaceLoRA, especially when the adversarial loss targets a single exported vector\. This suggests that pooled artifacts may require dedicated mitigation, such as directly targetingmeanpoolor higher\-capacity adapters\.
Finally, representation\-level mitigation should not be interpreted as output\-level sensitive\-attribute invariance\. Longer training or stronger adversarial strength do not guarantee monotonically improved invariance, so checkpoint selection should be part of deployment\. In other words, it should be selected on a held\-out validation trade\-off set by maximizing summarization utility metric \(e\.g\., ROUGE\-L\) subject to a pre\-specified LeakageGap budget, with a Pareto minimum\-leakage fallback\.
## 6Conclusion
We study sensitive\-information inference from exported LLM representations in summarization systems\. We instantiate this problem in clinical BHC summarization, using EHR\-recorded race as a high\-stakes case\-study label for auditing whether sensitive information can be recovered from exported prompt\-vector artifacts\. We find that reducing recoverability from one exported vector artifact does not guarantee reduced recoverability from another\. As a mitigation case study,SurfaceLoRAreduces recoverability from the targetedlasttokartifact toward chance while preserving summarization utility\. However, pooled representations such asmeanpoolremain substantially predictive\. We also find non\-monotonic leakage–utility trade\-offs, motivating checkpoint selection for the exact vector artifact that will be retained or exposed\. Overall, privacy auditing for summarization representations should be artifact\-specific: the vector exposed in deployment is the vector that must be audited and, when necessary, mitigated\.
## 7Limitations
Our study has several limitations that motivate future work\. We study sensitive\-information inference from exported LLM representations, and instantiate the problem in one empirical setting: BHC summarization with EHR\-recorded race as a controlled case\-study audit label\. This scope enables a focused artifact\-specific analysis\. However, it does not establish generality across all sensitive information, summarization domains, institutions, or model architectures\.SurfaceLoRAis designed to reduce recoverability from the representation a system plans to retain or expose, withlasttok\_L\-1as the default targeted artifact in our experiments; accordingly, its strongest effect is expected on that artifact rather than across all internal states\. Other exported artifacts, such asmeanpoolprompt embeddings, may retain sensitive information, reinforcing the need to audit the exact representation used by the system\. Our evaluation uses linear and nonlinear MLP probes without exhausting all possible attackers, layers, representation combinations, or information\-theoretic leakage guarantees\. A further limitation is multi\-attribute scalability: practical governance may require limiting inference of several sensitive factors simultaneously, such as age, sex or gender, ethnicity, language, insurance status, or socioeconomic proxies\. Simply applying the same tuning process separately to each factor may be inefficient and simultaneous multi\-attribute mitigation remains an important open problem\. Finally, EHR\-recorded race should be interpreted as an administrative category rather than a biological attribute; other sensitive or socially mediated labels will require separate definitions, baselines, probing protocols, and mitigation audits\.
## Ethical Considerations
We use credentialed\-access MIMIC data under the PhysioNet data use agreement \(DUA\) and do not release any protected text\(Goldbergeret al\.,[2000](https://arxiv.org/html/2605.26433#bib.bib3); Johnsonet al\.,[2023a](https://arxiv.org/html/2605.26433#bib.bib4); Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2)\)\. Our goal is to study representation\-level sensitive\-information recoverability as an audit and governance risk, instantiated here with EHR\-recorded race as a case\-study label, not to enable attribute inference in practice\. We report coarse EHR\-recorded race groupings derived from administrative registration fields\. Because race and ethnicity fields in healthcare databases can be incomplete, noisy, and institution\-dependent\(Johnsonet al\.,[2023b](https://arxiv.org/html/2605.26433#bib.bib71)\), these labels should be interpreted as recorded administrative categories rather than ground\-truth identity categories or biological variables\. Our findings therefore concern the recoverability of recorded categories and their textual proxies, not biological race\. We did not send any dataset content to third\-party APIs, consistent with the dataset rules\.
#### Data and code availability\.
Code, including training/evaluation scripts, configuration files, deterministic split seeds, prompt templates, and preprocessing instructions, will be released upon publication, excluding protected clinical text\. Due to the MIMIC DUA, we cannot share raw notes; we will provide instructions to reproduce results for credentialed PhysioNet users\.
## References
- A\. Aali, D\. Van Veen, Y\. Arefeen, J\. Hom, C\. Bluethgen, E\. P\. Reis, S\. Gatidis, N\. Clifford, J\. Daws, A\. Tehrani, J\. Kim, and A\. Chaudhari \(2025a\)MIMIC\-IV\-Ext\-BHC: labeled clinical notes dataset for hospital course summarization\.PhysioNet\.Note:Version 1\.2\.0External Links:[Document](https://dx.doi.org/10.13026/5gte-bv70),[Link](https://doi.org/10.13026/5gte-bv70)Cited by:[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§C\.1](https://arxiv.org/html/2605.26433#A3.SS1.p1.1),[§C\.1](https://arxiv.org/html/2605.26433#A3.SS1.p2.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.26433#S3.SS1.p1.1),[Ethical Considerations](https://arxiv.org/html/2605.26433#Sx1.p1.1)\.
- A\. Aali, D\. Van Veen, Y\. I\. Arefeen, J\. Hom, C\. Bluethgen, E\. P\. Reis, S\. Gatidis, N\. Clifford, J\. Daws, A\. S\. Tehrani, J\. Kim, and A\. S\. Chaudhari \(2025b\)A dataset and benchmark for hospital course summarization with adapted large language models\.Journal of the American Medical Informatics Association32\(3\),pp\. 470–479\.External Links:[Document](https://dx.doi.org/10.1093/jamia/ocae312)Cited by:[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§C\.1](https://arxiv.org/html/2605.26433#A3.SS1.p1.1),[§C\.1](https://arxiv.org/html/2605.26433#A3.SS1.p2.1),[§1](https://arxiv.org/html/2605.26433#S1.p3.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1)\.
- What’s in a summary? laying the groundwork for advances in hospital\-course summarization\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tur, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\),Online,pp\. 4794–4811\.External Links:[Link](https://aclanthology.org/2021.naacl-main.382/),[Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.382)Cited by:[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p3.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1)\.
- E\. Alsentzer, J\. Murphy, W\. Boag, W\. Weng, D\. Jindi, T\. Naumann, and M\. McDermott \(2019\)Publicly available clinical BERT embeddings\.InProceedings of the 2nd Clinical Natural Language Processing Workshop,A\. Rumshisky, K\. Roberts, S\. Bethard, and T\. Naumann \(Eds\.\),Minneapolis, Minnesota, USA,pp\. 72–78\.External Links:[Link](https://aclanthology.org/W19-1909/),[Document](https://dx.doi.org/10.18653/v1/W19-1909)Cited by:[§I\.1](https://arxiv.org/html/2605.26433#A9.SS1.SSS0.Px1.p1.2)\.
- Article 29 Data Protection Working Party \(2014\)Opinion 05/2014 on anonymisation techniques\.Note:WP216External Links:[Link](https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf)Cited by:[§1](https://arxiv.org/html/2605.26433#S1.p2.1)\.
- E\. Asgari, N\. Montaña\-Brown, M\. Dubois, S\. Khalil, J\. Balloch, J\. A\. Yeung, and D\. Pimenta \(2025\)A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation\.npj Digital Medicine8\(1\),pp\. 274\.External Links:[Document](https://dx.doi.org/10.1038/s41746-025-01670-7)Cited by:[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1)\.
- Y\. Belinkov and J\. Glass \(2019\)Analysis methods in neural language processing: a survey\.Transactions of the Association for Computational Linguistics7,pp\. 49–72\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00254)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px2.p1.1),[§B\.2](https://arxiv.org/html/2605.26433#A2.SS2.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p4.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§3\.3](https://arxiv.org/html/2605.26433#S3.SS3.p2.1)\.
- N\. Belrose, D\. Schneider\-Joseph, S\. Ravfogel, R\. Cotterell, E\. Raff, and S\. Biderman \(2023\)LEACE: perfect linear concept erasure in closed form\.Advances in Neural Information Processing Systems36,pp\. 66044–66063\.Cited by:[§2](https://arxiv.org/html/2605.26433#S2.p3.1)\.
- N\. Carlini, F\. Tramèr, E\. Wallace, M\. Jagielski, A\. Herbert\-Voss, K\. Lee, A\. Roberts, T\. Brown, D\. Song, Ú\. Erlingsson, A\. Oprea, and C\. Raffel \(2021\)Extracting training data from large language models\.In30th USENIX Security Symposium \(USENIX Security 21\),pp\. 2633–2650\.Cited by:[§2](https://arxiv.org/html/2605.26433#S2.p2.1)\.
- Y\. Chen, H\. Lent, and J\. Bjerva \(2024\)Text embedding inversion security for multilingual language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 7808–7827\.External Links:[Link](https://aclanthology.org/2024.acl-long.422/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.422)Cited by:[§1](https://arxiv.org/html/2605.26433#S1.p2.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1)\.
- Y\. Chen, Q\. Xu, and J\. Bjerva \(2025\)ALGEN: few\-shot inversion attacks on textual embeddings via cross\-model alignment and generation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 24330–24348\.External Links:[Link](https://aclanthology.org/2025.acl-long.1185/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1185),ISBN 979\-8\-89176\-251\-0Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p2.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§5](https://arxiv.org/html/2605.26433#S5.p1.1)\.
- Z\. Chen, A\. Hernández Cano, A\. Romanou, A\. Bonnet, K\. Matoba, F\. Salvi, M\. Pagliardini, S\. Fan, A\. Köpf, A\. Mohtashami, A\. Sallinen, A\. Sakhaeirad, V\. Swamy, I\. Krawczuk, D\. Bayazit, A\. Marmet, S\. Montariol, M\. Hartley, M\. Jaggi, and A\. Bosselut \(2023\)Meditron\-70B: scaling medical pretraining for large language models\.External Links:2311\.16079Cited by:[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1)\.
- P\. Chung, A\. Swaminathan, A\. J\. Goodell, Y\. Kim, S\. M\. Reincke, L\. Han, B\. Deverett, M\. A\. Sadeghi, A\. Ariss, M\. Ghanem, D\. Seong, A\. A\. Lee, C\. E\. Coombes, B\. Bradshaw, M\. A\. Sufian, H\. J\. Hong, T\. P\. Nguyen, M\. R\. Rasouli, K\. Kamra, M\. A\. Burbridge, J\. C\. McAvoy, R\. Saffary, S\. P\. Ma, D\. Dash, J\. Xie, E\. Y\. Wang, C\. A\. Schmiesing, N\. Shah, and N\. Aghaeepour \(2025\)Verifying facts in patient care documents generated by large language models using electronic health records\.NEJM AI\.External Links:[Document](https://dx.doi.org/10.1056/AIdbp2500418)Cited by:[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1)\.
- T\. Dao \(2024\)FlashAttention\-2: faster attention with better parallelism and work partitioning\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=mZn2Xyh9Ec)Cited by:[§F\.2](https://arxiv.org/html/2605.26433#A6.SS2.SSS0.Px6.p1.1)\.
- F\. Dernoncourt, J\. Y\. Lee, O\. Uzuner, and P\. Szolovits \(2017\)De\-identification of patient notes with recurrent neural networks\.Journal of the American Medical Informatics Association24\(3\),pp\. 596–606\.External Links:[Document](https://dx.doi.org/10.1093/jamia/ocw156)Cited by:[§2](https://arxiv.org/html/2605.26433#S2.p3.1)\.
- T\. Dong, Y\. Meng, S\. Li, G\. Chen, Z\. Liu, and H\. Zhu \(2025\)Depth gives a false sense of privacy: LLM internal states inversion\.In34th USENIX Security Symposium \(USENIX Security 25\),pp\. 1629–1648\.Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p2.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§5](https://arxiv.org/html/2605.26433#S5.p1.1)\.
- M\. Douze, A\. Guzhva, C\. Deng, J\. Johnson, G\. Szilvasy, P\. Mazaré, M\. Lomeli, L\. Hosseini, and H\. Jégou \(2024\)The Faiss library\.External Links:2401\.08281,[Document](https://dx.doi.org/10.48550/arXiv.2401.08281)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p3.1),[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p4.1),[§1](https://arxiv.org/html/2605.26433#S1.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p4.1),[§3](https://arxiv.org/html/2605.26433#S3.SS0.SSS0.Px1.p1.1)\.
- H\. Edwards and A\. Storkey \(2015\)Censoring representations with an adversary\.External Links:1511\.05897Cited by:[§B\.3](https://arxiv.org/html/2605.26433#A2.SS3.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p3.1),[§3\.3](https://arxiv.org/html/2605.26433#S3.SS3.p3.1)\.
- B\. Efron and R\. J\. Tibshirani \(1993\)An introduction to the bootstrap\.Monographs on Statistics and Applied Probability, Vol\.57,Chapman and Hall/CRC,New York\.Cited by:[§J\.3](https://arxiv.org/html/2605.26433#A10.SS3.SSS0.Px1.p1.3)\.
- Y\. Elazar and Y\. Goldberg \(2018\)Adversarial removal of demographic attributes from text data\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 11–21\.External Links:[Link](https://aclanthology.org/D18-1002/),[Document](https://dx.doi.org/10.18653/v1/D18-1002)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px2.p1.1),[§B\.2](https://arxiv.org/html/2605.26433#A2.SS2.p1.1),[§I\.1](https://arxiv.org/html/2605.26433#A9.SS1.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2605.26433#S1.p4.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§2](https://arxiv.org/html/2605.26433#S2.p3.1),[§3\.3](https://arxiv.org/html/2605.26433#S3.SS3.p3.1)\.
- European Parliament and Council of the European Union \(2016\)Regulation \(EU\) 2016/679 of the European Parliament and of the Council of 27 april 2016\.Note:Official Journal of the European UnionExternal Links:[Link](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679)Cited by:[§1](https://arxiv.org/html/2605.26433#S1.p2.1)\.
- Y\. Ganin, E\. Ustinova, H\. Ajakan, P\. Germain, H\. Larochelle, F\. Laviolette, M\. Marchand, and V\. Lempitsky \(2016\)Domain\-adversarial training of neural networks\.Journal of Machine Learning Research17\(59\),pp\. 1–35\.Cited by:[§B\.3](https://arxiv.org/html/2605.26433#A2.SS3.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p5.1),[§2](https://arxiv.org/html/2605.26433#S2.p3.1),[§3\.3](https://arxiv.org/html/2605.26433#S3.SS3.p3.1)\.
- J\. W\. Gichoya, I\. Banerjee, A\. R\. Bhimireddy, J\. L\. Burns, L\. A\. Celi, L\. Chen, R\. Correa, N\. Dullerud, M\. Ghassemi, S\. Huang, P\. Kuo, M\. P\. Lungren, L\. J\. Palmer, B\. J\. Price, S\. Purkayastha, A\. T\. Pyrros, L\. Oakden\-Rayner, C\. Okechukwu, L\. Seyyed\-Kalantari, H\. Trivedi, R\. Wang, Z\. Zaiman, and H\. Zhang \(2022\)AI recognition of patient race in medical imaging: a modelling study\.The Lancet Digital Health4\(6\),pp\. e406–e414\.External Links:[Document](https://dx.doi.org/10.1016/S2589-7500%2822%2900063-2)Cited by:[§B\.2](https://arxiv.org/html/2605.26433#A2.SS2.p1.1)\.
- A\. L\. Goldberger, L\. A\. N\. Amaral, L\. Glass, J\. M\. Hausdorff, P\. Ch\. Ivanov, R\. G\. Mark, J\. E\. Mietus, G\. B\. Moody, C\. Peng, and H\. E\. Stanley \(2000\)PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals\.Circulation101\(23\),pp\. e215–e220\.External Links:[Document](https://dx.doi.org/10.1161/01.CIR.101.23.e215)Cited by:[§Q\.1](https://arxiv.org/html/2605.26433#A17.SS1.SSS0.Px1.p1.1),[§C\.1](https://arxiv.org/html/2605.26433#A3.SS1.SSS0.Px1.p1.1),[Ethical Considerations](https://arxiv.org/html/2605.26433#Sx1.p1.1)\.
- S\. Heidari, T\. F\. Babor, P\. De Castro, S\. Tort, and M\. Curno \(2016\)Sex and gender equity in research: rationale for the SAGER guidelines and recommended use\.Research Integrity and Peer Review1\(1\),pp\. 2\.External Links:[Document](https://dx.doi.org/10.1186/s41073-016-0007-6)Cited by:[Appendix M](https://arxiv.org/html/2605.26433#A13.p1.1)\.
- J\. Hewitt and P\. Liang \(2019\)Designing and interpreting probes with control tasks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 2733–2743\.External Links:[Link](https://aclanthology.org/D19-1275/),[Document](https://dx.doi.org/10.18653/v1/D19-1275)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px2.p1.1),[§B\.2](https://arxiv.org/html/2605.26433#A2.SS2.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§3\.3](https://arxiv.org/html/2605.26433#S3.SS3.p2.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§B\.3](https://arxiv.org/html/2605.26433#A2.SS3.p1.1),[§E\.7](https://arxiv.org/html/2605.26433#A5.SS7.p1.3),[§1](https://arxiv.org/html/2605.26433#S1.p5.1),[§2](https://arxiv.org/html/2605.26433#S2.p3.1),[§3\.3](https://arxiv.org/html/2605.26433#S3.SS3.p3.1)\.
- Y\. Huang, Y\. Tsai, H\. Hsiao, H\. Lin, and S\. Lin \(2024\)Transferable embedding inversion attack: uncovering privacy risks in text embeddings without model queries\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 4193–4205\.External Links:[Link](https://aclanthology.org/2024.acl-long.230/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.230)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p2.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§5](https://arxiv.org/html/2605.26433#S5.p1.1)\.
- A\. E\. W\. Johnson, L\. Bulgarelli, L\. Shen, A\. Gayles, A\. Shammout, S\. Horng, T\. J\. Pollard, S\. Hao, B\. Moody, B\. Gow, L\. H\. Lehman, L\. A\. Celi, and R\. G\. Mark \(2023a\)MIMIC\-IV, a freely accessible electronic health record dataset\.Scientific Data10\(1\),pp\. 1\.External Links:[Document](https://dx.doi.org/10.1038/s41597-022-01899-x)Cited by:[§Q\.1](https://arxiv.org/html/2605.26433#A17.SS1.SSS0.Px1.p1.1),[§Q\.1](https://arxiv.org/html/2605.26433#A17.SS1.SSS0.Px2.p1.1),[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§C\.1](https://arxiv.org/html/2605.26433#A3.SS1.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.26433#S3.SS1.p1.1),[Ethical Considerations](https://arxiv.org/html/2605.26433#Sx1.p1.1)\.
- A\. E\. W\. Johnson, T\. J\. Pollard, L\. Shen, L\. H\. Lehman, M\. Feng, M\. Ghassemi, B\. Moody, P\. Szolovits, L\. A\. Celi, and R\. G\. Mark \(2016\)MIMIC\-III, a freely accessible critical care database\.Scientific Data3,pp\. 160035\.External Links:[Document](https://dx.doi.org/10.1038/sdata.2016.35)Cited by:[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.26433#S3.SS1.p1.1)\.
- J\. A\. Johnson, B\. Moore, E\. K\. Hwang, A\. Hickner, and H\. Yeo \(2023b\)The accuracy of race and ethnicity data in US\-based healthcare databases: a systematic review\.The American Journal of Surgery226\(4\),pp\. 463–470\.External Links:[Document](https://dx.doi.org/10.1016/j.amjsurg.2023.05.021)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p4.1),[Ethical Considerations](https://arxiv.org/html/2605.26433#Sx1.p1.1)\.
- V\. Karpukhin, B\. Oguz, S\. Min, P\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\. Yih \(2020\)Dense passage retrieval for open\-domain question answering\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Online,pp\. 6769–6781\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.550),[Link](https://aclanthology.org/2020.emnlp-main.550/)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p3.1),[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p4.1),[§1](https://arxiv.org/html/2605.26433#S1.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p4.1),[§3](https://arxiv.org/html/2605.26433#S3.SS0.SSS0.Px1.p1.1)\.
- P\. Koehn \(2004\)Statistical significance tests for machine translation evaluation\.InProceedings of the 2004 Conference on Empirical Methods in Natural Language Processing,D\. Lin and D\. Wu \(Eds\.\),Barcelona, Spain,pp\. 388–395\.External Links:[Link](https://aclanthology.org/W04-3250/)Cited by:[§J\.3](https://arxiv.org/html/2605.26433#A10.SS3.SSS0.Px1.p1.3)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, and S\. Riedel \(2020\)Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.InAdvances in Neural Information Processing Systems,Vol\.33,pp\. 9459–9474\.Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p3.1),[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p4.1),[§1](https://arxiv.org/html/2605.26433#S1.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p4.1),[§3](https://arxiv.org/html/2605.26433#S3.SS0.SSS0.Px1.p1.1)\.
- H\. Li, M\. Xu, and Y\. Song \(2023\)Sentence embedding leaks more information than you expect: generative embedding inversion attack to recover the whole sentence\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 14022–14040\.External Links:[Link](https://aclanthology.org/2023.findings-acl.881/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.881)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p2.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§5](https://arxiv.org/html/2605.26433#S5.p1.1)\.
- C\. Lin \(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,Barcelona, Spain,pp\. 74–81\.External Links:[Link](https://aclanthology.org/W04-1013/)Cited by:[§3\.5](https://arxiv.org/html/2605.26433#S3.SS5.p3.1)\.
- Llama Team, AI @ Meta \(2024\)The Llama 3 herd of models\.External Links:2407\.21783Cited by:[§F\.1](https://arxiv.org/html/2605.26433#A6.SS1.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.26433#S3.SS2.p1.6)\.
- D\. Madras, E\. Creager, T\. Pitassi, and R\. Zemel \(2018\)Learning adversarially fair and transferable representations\.InProceedings of the 35th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.80,pp\. 3384–3393\.Cited by:[§B\.3](https://arxiv.org/html/2605.26433#A2.SS3.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p3.1),[§3\.3](https://arxiv.org/html/2605.26433#S3.SS3.p3.1)\.
- S\. Mohammed, J\. Matos, M\. Doutreligne, L\. A\. Celi, and T\. Struja \(2023\)Racial disparities in invasive ICU treatments among septic patients: high resolution electronic health records analysis from MIMIC\-IV\.The Yale Journal of Biology and Medicine96\(3\),pp\. 293–306\.External Links:[Document](https://dx.doi.org/10.59249/wdji8829)Cited by:[Appendix D](https://arxiv.org/html/2605.26433#A4.SS0.SSS0.Px2.p1.1)\.
- J\. Morris, V\. Kuleshov, V\. Shmatikov, and A\. Rush \(2023\)Text embeddings reveal \(almost\) as much as text\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 12448–12460\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.765/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.765)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p2.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§5](https://arxiv.org/html/2605.26433#S5.p1.1)\.
- Z\. Obermeyer, B\. Powers, C\. Vogeli, and S\. Mullainathan \(2019\)Dissecting racial bias in an algorithm used to manage the health of populations\.Science366\(6464\),pp\. 447–453\.External Links:[Document](https://dx.doi.org/10.1126/science.aax2342)Cited by:[§B\.2](https://arxiv.org/html/2605.26433#A2.SS2.p1.1)\.
- T\. Pimentel, J\. Valvoda, R\. H\. Maudslay, R\. Zmigrod, A\. Williams, and R\. Cotterell \(2020\)Information\-theoretic probing for linguistic structure\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 4609–4622\.External Links:[Link](https://aclanthology.org/2020.acl-main.420/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.420)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px2.p1.1),[§B\.2](https://arxiv.org/html/2605.26433#A2.SS2.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§3\.3](https://arxiv.org/html/2605.26433#S3.SS3.p2.1)\.
- Qwen Team, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Lin, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2024\)Qwen2\.5 technical report\.External Links:2412\.15115Cited by:[Appendix P](https://arxiv.org/html/2605.26433#A16.p1.1),[§3\.2](https://arxiv.org/html/2605.26433#S3.SS2.p1.10)\.
- A\. Rajkomar, M\. Hardt, M\. D\. Howell, G\. Corrado, and M\. H\. Chin \(2018\)Ensuring fairness in machine learning to advance health equity\.Annals of Internal Medicine169\(12\),pp\. 866–872\.External Links:[Document](https://dx.doi.org/10.7326/M18-1990)Cited by:[§B\.2](https://arxiv.org/html/2605.26433#A2.SS2.p1.1)\.
- S\. Ravfogel, Y\. Elazar, H\. Gonen, M\. Twiton, and Y\. Goldberg \(2020\)Null it out: guarding protected attributes by iterative nullspace projection\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 7237–7256\.External Links:[Link](https://aclanthology.org/2020.acl-main.647/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.647)Cited by:[§J\.2](https://arxiv.org/html/2605.26433#A10.SS2.SSS0.Px1.p1.1),[Table 7](https://arxiv.org/html/2605.26433#A10.T7.3.8.5.1.1.1),[Appendix O](https://arxiv.org/html/2605.26433#A15.SS0.SSS0.Px8.p1.3),[§B\.3](https://arxiv.org/html/2605.26433#A2.SS3.p1.1),[§B\.5](https://arxiv.org/html/2605.26433#A2.SS5.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p3.1)\.
- S\. Ravfogel, M\. Twiton, Y\. Goldberg, and R\. D\. Cotterell \(2022\)Linear adversarial concept erasure\.InProceedings of the 39th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.162,pp\. 18400–18421\.Cited by:[§B\.3](https://arxiv.org/html/2605.26433#A2.SS3.p1.1),[§2](https://arxiv.org/html/2605.26433#S2.p3.1)\.
- T\. Searle, Z\. Ibrahim, J\. Teo, and R\. J\. B\. Dobson \(2023\)Discharge summary hospital course summarisation of inpatient electronic health record text with clinical concept guided deep pre\-trained transformer models\.Journal of Biomedical Informatics141,pp\. 104358\.External Links:[Document](https://dx.doi.org/10.1016/j.jbi.2023.104358)Cited by:[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p3.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1)\.
- Ö\. Uzuner, B\. R\. South, S\. Shen, and S\. L\. DuVall \(2011\)2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text\.Journal of the American Medical Informatics Association18\(5\),pp\. 552–556\.External Links:[Document](https://dx.doi.org/10.1136/amiajnl-2011-000203)Cited by:[§3\.5](https://arxiv.org/html/2605.26433#S3.SS5.p3.1)\.
- J\. Wang, X\. Yi, R\. Guo, H\. Jin, P\. Xu, S\. Li, X\. Wang, X\. Guo, C\. Li, X\. Xu, K\. Yu, Y\. Yuan, Y\. Zou, J\. Long, Y\. Cai, Z\. Li, Z\. Zhang, Y\. Mo, J\. Gu, R\. Jiang, Y\. Wei, and C\. Xie \(2021\)Milvus: a purpose\-built vector data management system\.InProceedings of the 2021 International Conference on Management of Data,pp\. 2614–2627\.Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p3.1),[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p4.1),[§1](https://arxiv.org/html/2605.26433#S1.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p4.1),[§3](https://arxiv.org/html/2605.26433#S3.SS0.SSS0.Px1.p1.1)\.
- J\. Xu \(2024\)Discharge Me: BioNLP ACL’24 shared task on streamlining discharge documentation\.PhysioNet\.Note:Version 1\.3External Links:[Document](https://dx.doi.org/10.13026/0zf5-fx50),[Link](https://doi.org/10.13026/0zf5-fx50)Cited by:[§Q\.1](https://arxiv.org/html/2605.26433#A17.SS1.SSS0.Px1.p1.1),[§Q\.1](https://arxiv.org/html/2605.26433#A17.SS1.SSS0.Px2.p1.1),[§Q\.1](https://arxiv.org/html/2605.26433#A17.SS1.SSS0.Px4.p1.1),[§3\.1](https://arxiv.org/html/2605.26433#S3.SS1.p1.1)\.
- X\. Yang, A\. Chen, N\. PourNejatian, H\. C\. Shin, K\. E\. Smith, C\. Parisien, C\. Compas, C\. Martin, A\. B\. Costa, M\. G\. Flores, Y\. Zhang, T\. Magoc, C\. A\. Harle, G\. Lipori, D\. A\. Mitchell, W\. R\. Hogan, E\. A\. Shenkman, J\. Bian, and Y\. Wu \(2022\)A large language model for electronic health records\.npj Digital Medicine5\(1\),pp\. 194\.External Links:[Document](https://dx.doi.org/10.1038/s41746-022-00742-2)Cited by:[§B\.1](https://arxiv.org/html/2605.26433#A2.SS1.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p3.1),[§2](https://arxiv.org/html/2605.26433#S2.p1.1)\.
- S\. Zeng, J\. Zhang, P\. He, Y\. Liu, Y\. Xing, H\. Xu, J\. Ren, Y\. Chang, S\. Wang, D\. Yin, and J\. Tang \(2024\)The good and the bad: exploring privacy issues in retrieval\-augmented generation \(RAG\)\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 4505–4524\.External Links:[Link](https://aclanthology.org/2024.findings-acl.267/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.267)Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p1.1),[§3](https://arxiv.org/html/2605.26433#S3.SS0.SSS0.Px1.p1.1)\.
- B\. H\. Zhang, B\. Lemoine, and M\. Mitchell \(2018\)Mitigating unwanted biases with adversarial learning\.InProceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society,pp\. 335–340\.External Links:[Document](https://dx.doi.org/10.1145/3278721.3278779)Cited by:[§2](https://arxiv.org/html/2605.26433#S2.p3.1),[§3\.3](https://arxiv.org/html/2605.26433#S3.SS3.p3.1)\.
- H\. Zhang, A\. X\. Lu, M\. Abdalla, M\. McDermott, and M\. Ghassemi \(2020a\)Hurtful words: quantifying biases in clinical contextual word embeddings\.InProceedings of the ACM Conference on Health, Inference, and Learning,pp\. 110–120\.Cited by:[§B\.2](https://arxiv.org/html/2605.26433#A2.SS2.p1.1),[§I\.1](https://arxiv.org/html/2605.26433#A9.SS1.SSS0.Px1.p1.2)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2020b\)BERTScore: evaluating text generation with BERT\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SkeHuCVFDr)Cited by:[§3\.5](https://arxiv.org/html/2605.26433#S3.SS5.p3.1)\.
- Z\. Zhu, N\. Shao, D\. Lian, C\. Wu, Z\. Liu, Y\. Yang, and E\. Chen \(2024\)Understanding privacy risks of embeddings induced by large language models\.External Links:2404\.16587Cited by:[Appendix A](https://arxiv.org/html/2605.26433#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.26433#S1.p2.1),[§2](https://arxiv.org/html/2605.26433#S2.p2.1),[§5](https://arxiv.org/html/2605.26433#S5.p1.1)\.
## Appendix AExtended Introduction: Threat Model, Probing, and Operational Scope
Figure 2:Vector\-artifact audit for summarization systems\.Left: a summarization pipeline may retain or reuse derived vector artifacts under controls that differ from those governing the protected source document\. A downstream component, analyst, or auditor with vector access but no source\-text access may train a post\-hoc probe to infer a sensitive attribute, creating a residual information\-disclosure risk\. Right: an audited pipeline specifies the exact exported artifactz=fθ\(x\)z=f\_\{\\theta\}\(x\), evaluates it against a leakage budget, and applies artifact\-aligned mitigation and checkpoint selection\. In our empirical case study, the source documents are clinical notes and the audited sensitive attribute is EHR\-recorded race\. This illustration motivates our focus on auditing the concrete exported vector artifact rather than making model\-wide privacy claims\.#### Artifact\-specific leakage and mitigation\.
Leakage mitigation in deployed systems should be attached to the*stored or reused*vector artifact\. Operationally, this is testable: sanitization on one representation \(e\.g\.,lasttok\) does not imply sanitization on another plausible exported representation \(e\.g\.,meanpool\)\. When adversarial pressure is applied to a single exported vector \(the last\-token prompt state\), race\-associated signal can remain highly extractable from other exported representations\. For example, retrieval embeddings are often produced by a dedicated encoder; when deployments reuse the same LLM backbone for embedding, mean\-pooled hidden states are a common proxy\. As a result, near\-chance leakage on the targeted vector can coexist with substantial leakage on an alternative exported artifact\. This challenges the implicit assumption that suppressing leakage on a canonical token or pooling choice implies broader safety, and it can create a false sense of security if deployments export an untargeted artifact\. Accordingly, we frame leakage mitigation as attack\-surface reduction on the exported vector: audit the artifact that is stored or reused, and target that same artifact during mitigation when needed\.
#### What probing does \(and does not\) imply\.
Throughout, we use probing to quantify extractable sensitive\-attribute signal under a specified attacker class\(Belinkov and Glass,[2019](https://arxiv.org/html/2605.26433#bib.bib14); Hewitt and Liang,[2019](https://arxiv.org/html/2605.26433#bib.bib15); Pimentelet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib16); Elazar and Goldberg,[2018](https://arxiv.org/html/2605.26433#bib.bib32)\)\. High probe accuracy indicates a larger attack surface on an exposed representation, but it does not by itself establish that the generator causally uses the sensitive attribute to produce summaries\. We therefore interpret probing as a capability audit over a concrete exported vector and complement it with stronger probes and output\-based attackers\. In our empirical case study, the probed attribute is EHR\-recorded race\.
#### Operational scope and threat model\.
Our scope is motivated by an access\-boundary mismatch between protected source documents and derived representation artifacts\. A summarization system may keep the original document within a tightly controlled environment while still storing, logging, indexing, caching, or reusing derived vectors for operational workflows such as retrieval, monitoring, drift analysis, incident review, or quality analytics\(Lewiset al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib6); Karpukhinet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib7); Wanget al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib72); Douzeet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib67); Zenget al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib41)\)\. We do not assume that source documents are publicly released, nor that all systems share embeddings externally\. The risk is that a component, service provider, analyst, or auditor with access to derived artifacts under one governance boundary may not be authorized to inspect the original document or structured sensitive attributes, yet may still infer sensitive\-attribute information from the vectors\(Morriset al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib34); Liet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib35); Zhuet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib54); Huanget al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib63); Chenet al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib55); Donget al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib58)\)\.
Once such vectors are stored or reused outside the immediate generation computation, attribute inference can become feasible even when generated summaries contain no explicit demographic mentions\. We therefore study representation\-level leakage and adopt attack\-surface reduction as the engineering objective: reduce recoverable sensitive\-label signal from a deployment\-relevant representation while preserving summarization utility\. In our empirical case study, the sensitive label is EHR\-recorded race\.
We distinguish two common vector\-export settings\. First, retrieval embeddings support similarity search in RAG and dense retrieval systems\(Lewiset al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib6); Karpukhinet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib7)\); such embeddings are commonly stored or queried through vector\-indexing infrastructure\(Wanget al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib72); Douzeet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib67)\)\. These are often produced by a dedicated embedding encoder, but when deployments reuse the same LLM backbone for embedding, mean\-pooled hidden states are a common proxy\. Second, runtime logging features may record one vector per request for throughput\-efficient monitoring, gating, auditing, caching, or drift dashboards\. SurfaceLoRA primarily targets this second setting by attaching adversarial pressure to a single prompt\-level feature, the last\-token state, intended to protect real\-time operational logs\. We treat retrieval embeddings as a distinct attack surface: if a deployment requires mean\-pooled embeddings for retrieval, the adversarial objective should be attached to that specific exported vector during training\.
These distinctions align with common vector\-retention scenarios\. In RAG\-based assistants, embeddings may be stored in vector databases or similarity\-search indexes\(Lewiset al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib6); Karpukhinet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib7); Wanget al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib72); Douzeet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib67)\)\. Separately, derived vectors may be used in internal, vendor\-supported, or research/governance analytics \(e\.g\., clustering, drift monitoring, cohort discovery, or quality monitoring\) without giving analysts direct access to the original source text\. In such settings, the governance requirement is often asymmetric: the shared representation should preserve utility\-relevant structure while limiting recoverability of sensitive attributes\. Finally, while our main audit uses EHR\-recorded race labels, such supervision is plausible in clinical environments because demographics are commonly recorded as structured registration/EHR fields and may be available for authorized auditing\(Johnsonet al\.,[2023b](https://arxiv.org/html/2605.26433#bib.bib71)\); moreover, partial disclosure, weak supervision, or linkage via auxiliary metadata can provide sufficient supervision to train attribute\-inference probes\.
#### Scope, setting, and boundary\.
This paper asks a targeted question: can we reduce sensitive\-attribute signal in a retained or exposed internal representation while preserving summarization utility? Our goal is not to “wash” generated text to eliminate all correlates of a sensitive attribute; such correlates may be task\-relevant, and aggressively sanitizing outputs can harm usefulness\. Instead, we focus on reducing extractable sensitive\-attribute information from a commonly reused internal feature: a prompt representation\. In our empirical case study, the sensitive label is EHR\-recorded race and the summarization task is BHC generation\.
We study brief hospital course \(BHC\) summarization in a setting motivated by common vector\-use patterns in LLM systems: derived representations may be stored or reused for retrieval, monitoring, or analytics; the summarizer can be adapted locally using parameter\-efficient fine\-tuning \(PEFT\) with LoRA; and representation\-level leakage can be mitigated with gradient\-reversal adversarial training\. Concretely, we evaluate leakage via representation probing \(linear and nonlinear probes\) and utility via ROUGE plus clinical utility proxies \(BERTScore and concept overlap\)\. Consistent with the operational framing above, we interpret improvements as reductions in what is recoverable from the chosen representation under specified attacker classes, rather than as information\-theoretic removal or a claim that the generator causally uses the audited sensitive label\.
#### Novelty: an artifact\-specific deployment recipe\.
Our contribution is not a new adversarial objective; GRL\-style adversarial learning and LoRA are established components\. Instead, we contribute an artifact\-specific deployment recipe: \(i\) the exported vector defines the attack surface, \(ii\) leakage suppression is artifact\-specific and should match the deployed vector, and \(iii\) the best utility–leakage trade\-off is checkpoint\-dependent rather than monotonic in training\. SurfaceLoRA instantiates this view with an exported\-vector\-targeted training\-time discriminator under PEFT and a validation\-based checkpoint selection rule\.
## Appendix BExtended Related Work
This work studies sensitive\-attribute extractability from reusable internal representations as a practical audit issue in summarization systems\. We use clinical summarization and EHR\-recorded race as a high\-stakes empirical case study\. We briefly situate our approach in three areas and emphasize how our artifact\-specific framing differs from much of the prior literature\.
### B\.1Clinical Summarization of Discharge Narratives
Clinical summarization aims to condense long\-form notes \(e\.g\., discharge summaries and progress notes\) into concise narratives that preserve the patient trajectory, major interventions, and outcomes\(Adamset al\.,[2021](https://arxiv.org/html/2605.26433#bib.bib43); Searleet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib68); Aaliet al\.,[2025b](https://arxiv.org/html/2605.26433#bib.bib56)\)\. Modern systems are largely dominated by Transformer\-based sequence\-to\-sequence models and instruction\-tuned LLMs adapted to clinical text\(Yanget al\.,[2022](https://arxiv.org/html/2605.26433#bib.bib38); Chenet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib39)\)\. In discharge settings, summarization is challenging due to \(i\) long contexts, \(ii\) redundant and templated sections, and \(iii\) the high cost of factual errors \(hallucinations\) and omissions, motivating careful dataset curation and evaluation beyond surface overlap metrics\(Asgariet al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib61); Chunget al\.,[2025](https://arxiv.org/html/2605.26433#bib.bib62)\)\. MIMIC\-based corpora have become a common testbed due to their scale and linkage to structured EHR signals\(Johnsonet al\.,[2016](https://arxiv.org/html/2605.26433#bib.bib5),[2023a](https://arxiv.org/html/2605.26433#bib.bib4); Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2)\)\. We focus on brief hospital course \(BHC\) generation, a clinically meaningful narrative slice of discharge documentation, and evaluate both summarization utility and recorded\-attribute proxy information in model representations\.
### B\.2Sensitive\-Attribute Leakage and Attribute Inference via Probing
Neural representations can encode sensitive attributes even when those attributes are not explicitly required for the task\(Elazar and Goldberg,[2018](https://arxiv.org/html/2605.26433#bib.bib32); Zhanget al\.,[2020a](https://arxiv.org/html/2605.26433#bib.bib37)\)\. A standard measurement paradigm is probing: fitting a lightweight classifier \(often linear/logistic regression\) on frozen representations to quantify how easily an attribute can be recovered\(Belinkov and Glass,[2019](https://arxiv.org/html/2605.26433#bib.bib14)\)\. Probe accuracy provides an interpretable signal of extractable information, but it does not guarantee invariance to stronger adversaries or different attack surfaces, and results can depend on the representation choice and probe capacity\(Hewitt and Liang,[2019](https://arxiv.org/html/2605.26433#bib.bib15); Pimentelet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib16)\)\. In clinical NLP, recorded\-attribute recoverability is a high\-stakes instance of this broader problem because demographic fields and clinical text can be correlated in complex ways\(Gichoyaet al\.,[2022](https://arxiv.org/html/2605.26433#bib.bib36); Obermeyeret al\.,[2019](https://arxiv.org/html/2605.26433#bib.bib42); Rajkomaret al\.,[2018](https://arxiv.org/html/2605.26433#bib.bib50)\)\. We adopt a controlled probing protocol on prompt representations and report deviation from chance in a balanced 5\-class EHR\-recorded race setting, complemented by a stronger nonlinear MLP probe and output\-based attackers\.
### B\.3Adversarial Invariance, Representation Sanitization, and PEFT
Adversarial representation learning is a common strategy for reducing nuisance information in learned features\. A widely used mechanism is the gradient reversal layer \(GRL\), originally popularized for domain\-adversarial training, which encourages representations to be predictive for the main task while uninformative for an auxiliary discriminator\(Ganinet al\.,[2016](https://arxiv.org/html/2605.26433#bib.bib11)\)\. Related ideas appear across protected\-attribute leakage and privacy settings, where an adversary predicts a protected attribute and the backbone is trained \(often via GRL\) to reduce attribute predictability on a chosen representation\(Edwards and Storkey,[2015](https://arxiv.org/html/2605.26433#bib.bib12); Madraset al\.,[2018](https://arxiv.org/html/2605.26433#bib.bib13)\)\. Beyond GRL\-style objectives, other sanitization approaches include iterative nullspace projection and linear removal techniques that aim to reduce protected information from representations\(Ravfogelet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib17),[2022](https://arxiv.org/html/2605.26433#bib.bib40)\)\. Separately, parameter\-efficient fine\-tuning \(PEFT\) adapts large pretrained models with a small number of additional parameters, enabling faster experimentation and lower storage/compute overhead\. LoRA injects low\-rank updates into attention projections while keeping base weights frozen\(Huet al\.,[2022](https://arxiv.org/html/2605.26433#bib.bib18)\)\.
### B\.4Positioning and mitigation choices
Our focus is representation\-level extractability from deployment\-relevant exported vectors, evaluated via probing and complemented by stronger probes and output\-based attackers\. Compared with full\-model adversarial training, we study a micro\-adversarial setup: a single lightweight discriminator attached to a specific prompt representation under LoRA\. Beyond the method itself, we highlight a systems\-level observation: the leakage–utility trade\-off can be non\-monotonic under training dynamics, motivating validation\-based checkpoint selection as a deployment primitive\.
### B\.5Post\-hoc vs\. training\-time mitigation
INLP\-style and related linear removal methods can reduce linear attribute predictability via post\-processing\(Ravfogelet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib17)\)\. However, nonlinear extractability can persist on the same exported vector after post\-hoc sanitization \(Section[4\.2](https://arxiv.org/html/2605.26433#S4.SS2)\), motivating training\-time, artifact\-targeted mitigation and validation\-based checkpoint selection\.
## Appendix CDataset Details and Access
### C\.1Dataset: MIMIC\-IV\-Ext\-BHC
We use MIMIC\-IV\-Ext\-BHC v1\.2\.0, a curated clinical\-notes dataset released on PhysioNet\(Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2)\)\. It is derived from MIMIC\-IV\-Note, which contains deidentified free\-text clinical notes from Beth Israel Deaconess Medical Center \(2008–2019\) and is linkable to the broader MIMIC\-IV EHR database\(Johnsonet al\.,[2023a](https://arxiv.org/html/2605.26433#bib.bib4); Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2)\)\. Each record pairs \(i\) aninputdischarge summary with the BHC section removed and \(ii\) atargetconsisting of the corresponding cleaned BHC section\(Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2),[b](https://arxiv.org/html/2605.26433#bib.bib56)\)\. The released CSV \(mimic\-iv\-bhc\.csv\) additionally providesnote\_idas well as descriptive token\-length metadata \(input\_tokens,target\_tokens\) computed with a GPT\-4 tokenizer\(Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2)\)\.
The dataset authors extract the substring under the “Brief Hospital Course” heading using regular expressions, discard notes without a BHC section, and exclude note–BHC pairs with BHC length<100<100characters\(Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2),[b](https://arxiv.org/html/2605.26433#bib.bib56)\)\. They standardize formatting \(e\.g\., whitespace cleanup and header normalization\) and retain notes containing a “Sex” section to support downstream subgroup analyses\(Aaliet al\.,[2025a](https://arxiv.org/html/2605.26433#bib.bib2)\)\.
#### Access and compliance\.
MIMIC\-IV\-Ext\-BHC is a credentialed\-access dataset distributed under PhysioNet’s credentialed health data license\(Goldbergeret al\.,[2000](https://arxiv.org/html/2605.26433#bib.bib3)\)\. Access requires completing the required training and signing a data use agreement \(DUA\)\. All experiments are conducted under the dataset’s usage constraints; we do not release any protected text\.
Table 3:Summary statistics of MIMIC\-IV\-Ext\-BHC v1\.2\.0\. Token counts are provided by the dataset and computed with a GPT\-4 tokenizer for descriptive purposes only; all training and truncation use the base model’s tokenizer\.
## Appendix DOur Preprocessing and Split Construction
#### Raw tables, linkage, and de\-duplication\.
Starting from paired discharge\-note inputs and BHC targets, we augment each example with patient\- and encounter\-level metadata by joining identifiers:note\_id→\\rightarrow\(subject\_id,hadm\_id\), thenhadm\_id→\\rightarrowrace\. We de\-duplicate bynote\_idand drop rows with missing identifiers, demographics, or text fields\.
#### Race normalization\.
We map raw race strings into five coarse groups,White,Black,Hispanic,Asian, andOther, and dropUnknown\(e\.g\., unknown/declined/unable\) to avoid ambiguous subgroup assignment\. Concretely, we assignWhiteif the raw string contains “WHITE” or “PORTUGUESE”;Blackif it contains “BLACK” or “AFRICAN”;Hispanicif it contains “HISPANIC” or “LATINO”;Asianif it contains “ASIAN”; andOtherotherwise\. Following common practice in prior MIMIC\-IV studies, we group “WHITE” variants \(e\.g\., “WHITE – PORTUGUESE”, “WHITE – BRAZILIAN”\) intoWhiteto align with dataset conventions and enable comparable subgroup evaluation\(Mohammedet al\.,[2023](https://arxiv.org/html/2605.26433#bib.bib1)\)\. We retainOtherto preserve coverage of records that do not fall into the four named groups\.
#### Text cleaning and filtering\.
We apply conservative heuristics to remove low\-quality examples and reduce irrelevant long tails\. Input truncation: we truncate the input at the first occurrence of report\-heavy block headers such asIMAGING,MICROBIOLOGY,\(DISCHARGE\) LABS,FINAL REPORT,RESULTS, orDATA; we additionally cap inputs at 4,000 characters and collapse whitespace\. Length thresholds: we require the cleaned input to have at least 200 characters and the target BHC to have at least 80 characters\. Instruction\-like target removal: we remove targets that resemble patient\-facing discharge instructions, detected via strong phrases \(e\.g\., “DISCHARGE INSTRUCTIONS”, “FOLLOWUP INSTRUCTIONS”, “Please call”, “Return to the emergency room”, “call 911”\) and weak second\-person templates \(e\.g\., “You were admitted”\) combined with action words \(e\.g\., “please”, “appointment”, “return”, “call”, “take”\)\.
#### Resulting cleaned dataset\.
After cleaning, we retain 239,467 note–BHC pairs\. The race\-group distribution is:White171,584 \(71\.65%\),Black36,923 \(15\.42%\),Hispanic13,426 \(5\.61%\),Asian7,873 \(3\.29%\), andOther9,661 \(4\.03%\)\.
Table 4:Race\-group distribution in the cleaned dataset after dropping unknown or ambiguous race entries\.
#### Leakage\-free patient split\.
To prevent patient\-level leakage, we perform a group split bysubject\_idusing a fixed random seed\. We allocate 10% of subjects to the test set\. From the remaining subjects, we allocate 10% to validation and use the rest for training\. This yields 193,470 train \(80\.79%\), 21,552 validation \(9\.00%\), and 24,445 test \(10\.21%\) examples\. We verify thatsubject\_idis disjoint across train/validation/test\.
#### Balanced leakage subsets\.
For controlled comparisons, we construct balanced race\-stratified subsets within each split by sampling a fixed number of examples per race group \(White,Black,Hispanic,Asian,Other\) and shuffling afterward\. We denote these subsets asbal\_train,bal\_val, andtest\_balanced; each subset is drawn only from its corresponding train/validation/test split\. Concretely: \(i\)test\_balanced\(final reporting\): 500 per class from the test split, totaling 2,500 examples; \(ii\)bal\_train\(stable tuning / auxiliary training\): 4,000 per class from the training split, totaling 20,000 examples; \(iii\)bal\_val\(checkpoint selection / model selection\): 500 per class from the validation split, totaling 2,500 examples\.
Table 5:Dataset sizes with patient\-disjoint splits bysubject\_id\. Balanced subsets are constructed within each split for race\-audit training/evaluation\.
## Appendix EExtended Method Details
### E\.1Scope\- and Vector\-Targeted Adversarial Training
Our adversarial intervention is deliberately scope\-limited and anchored to a specific exported vector, rather than applied to all internal states or all generated outputs\. Concretely, the setup has three design constraints: \(1\) a lightweight adversary, a single linear 5\-way classifier head; \(2\) vector\-targeted pressure, adversarial gradients are applied only to a single deployment\-relevant prompt vector at the generation boundary \(the last non\-padding prompt token state\), rather than to all token states or generated tokens; and \(3\) minimal disruption under PEFT, we update only the LoRA adapters \(and the small adversary head\) while keeping the backbone frozen\. Importantly, this scope limitation does not imply zero overhead: under our synchronous 1:1 schedule, each step includes one SFT forward pass and one additional prompt\-only forward pass to computeℒadv\\mathcal\{L\}\_\{\\text\{adv\}\}\.
### E\.2Exactlasttokrepresentation definition
#### Layer indexing\.
LetLLdenote the number of Transformer blocks\. We use 0\-indexed blocks; thusL−1L\{\-\}1denotes the final block \(Python index\-1in HuggingFacehidden\_states\),L−2L\{\-\}2the second\-to\-last, etc\.
Letccdenote the source clinical context, and letxpromptx\_\{\\mathrm\{prompt\}\}denote the rendered input prompt used immediately before generation\. It is obtained by applying the model\-specific chat template to the system instruction and the user message containingcc, with the assistant generation header appended\. Thus,xpromptx\_\{\\mathrm\{prompt\}\}is the exact token sequence seen by the model immediately before it starts generating the BHC\. It contains the system instruction, source clinical context, and assistant generation header, but no target BHC tokens, generated tokens, or race label\.
Unless otherwise stated,lasttokis extracted from this prompt\-only input\. With left padding, leti⋆=max\{i:attention\_maski=1\}i^\{\\star\}=\\max\\\{i:\\texttt\{attention\\\_mask\}\_\{i\}=1\\\}denote the index of the final non\-padding token inxpromptx\_\{\\mathrm\{prompt\}\}\. We definelasttok\_L\-1as the final\-block hidden state at this boundary position:
lasttok\_L\-1\(xprompt\)=𝐡i⋆\(L−1\)\(xprompt\)\.\\texttt\{lasttok\\\_L\-1\}\(x\_\{\\mathrm\{prompt\}\}\)=\\mathbf\{h\}^\{\(L\-1\)\}\_\{i^\{\\star\}\}\(x\_\{\\mathrm\{prompt\}\}\)\.Because the chat template injects role delimiters and an assistant generation header,lasttokrepresents the model state at the generation boundary, immediately before decoding begins\.
#### Boundary choice \(after\-header vs\. user\-end\)\.
Our main definition corresponds to a boundary\-after\-header representation: the exported vector is taken after the assistant generation header is appended, immediately before decoding begins\. As a sensitivity check, we also consider a boundary\-at\-user\-end variant by rendering the same system and user messages without the assistant generation header and taking the last non\-padding token state from that sequence\. We observe the same qualitative conclusions under both boundary definitions: SurfaceLoRA drives the targetedlasttokartifact toward chance\-level recovery at the validation\-selected checkpoint, while alternative pooled representations such asmeanpoolremain substantially probeable\.
### E\.3Representation variants
Unless otherwise stated, all representation variants are computed from the same rendered promptxpromptx\_\{\\mathrm\{prompt\}\}defined above\. That prompt contains the system instruction, the source clinical context, and the assistant generation header, but no target BHC tokens, generated tokens, or race label\. We run a prompt\-only forward pass onxpromptx\_\{\\mathrm\{prompt\}\}withoutput\_hidden\_states=True\.
Let𝐡i\(ℓ\)\(xprompt\)∈ℝd\\mathbf\{h\}^\{\(\\ell\)\}\_\{i\}\(x\_\{\\mathrm\{prompt\}\}\)\\in\\mathbb\{R\}^\{d\}denote the hidden state at token positioniifrom transformer blockℓ\\ell\. With left padding, leti⋆=max\{i:attention\_maski=1\}i^\{\\star\}=\\max\\\{i:\\texttt\{attention\\\_mask\}\_\{i\}=1\\\}be the index of the last non\-padding token inxpromptx\_\{\\mathrm\{prompt\}\}\. We study several operational representations that could plausibly be stored or reused by downstream systems\.
Last\-token representations \(generation\-boundary\)\.For compact display,L4m\\mathrm\{L4m\}denotes the last\-four\-layer mean variant \(L\-4mean\)\.
lasttokL−1\(xprompt\)\\displaystyle\\mathrm\{lasttok\}\_\{L\-1\}\(x\_\{\\mathrm\{prompt\}\}\)=𝐡i⋆\(L−1\)\(xprompt\),\\displaystyle=\\mathbf\{h\}^\{\(L\-1\)\}\_\{i^\{\\star\}\}\(x\_\{\\mathrm\{prompt\}\}\),lasttokL−2\(xprompt\)\\displaystyle\\mathrm\{lasttok\}\_\{L\-2\}\(x\_\{\\mathrm\{prompt\}\}\)=𝐡i⋆\(L−2\)\(xprompt\),\\displaystyle=\\mathbf\{h\}^\{\(L\-2\)\}\_\{i^\{\\star\}\}\(x\_\{\\mathrm\{prompt\}\}\),lasttokL4m\(xprompt\)\\displaystyle\\mathrm\{lasttok\}\_\{\\mathrm\{L4m\}\}\(x\_\{\\mathrm\{prompt\}\}\)=14∑k=03\\displaystyle=\\frac\{1\}\{4\}\\sum\_\{k=0\}^\{3\}𝐡i⋆\(L−1−k\)\(xprompt\)\.\\displaystyle\\quad\\mathbf\{h\}^\{\(L\-1\-k\)\}\_\{i^\{\\star\}\}\(x\_\{\\mathrm\{prompt\}\}\)\.
Mean pooling over prompt tokens\.Let𝒫=\{i:attention\_maski=1\}\\mathcal\{P\}=\\\{i:\\texttt\{attention\\\_mask\}\_\{i\}=1\\\}be the set of non\-padding token positions inxpromptx\_\{\\mathrm\{prompt\}\}\. We define:
meanpoolL−1\(xprompt\)\\displaystyle\\mathrm\{meanpool\}\_\{L\-1\}\(x\_\{\\mathrm\{prompt\}\}\)=1\|𝒫\|∑i∈𝒫𝐡i\(L−1\)\(xprompt\),\\displaystyle=\\frac\{1\}\{\|\\mathcal\{P\}\|\}\\sum\_\{i\\in\\mathcal\{P\}\}\\mathbf\{h\}^\{\(L\-1\)\}\_\{i\}\(x\_\{\\mathrm\{prompt\}\}\),meanpoolL−2\(xprompt\)\\displaystyle\\mathrm\{meanpool\}\_\{L\-2\}\(x\_\{\\mathrm\{prompt\}\}\)=1\|𝒫\|∑i∈𝒫𝐡i\(L−2\)\(xprompt\),\\displaystyle=\\frac\{1\}\{\|\\mathcal\{P\}\|\}\\sum\_\{i\\in\\mathcal\{P\}\}\\mathbf\{h\}^\{\(L\-2\)\}\_\{i\}\(x\_\{\\mathrm\{prompt\}\}\),meanpoolL4m\(xprompt\)\\displaystyle\\mathrm\{meanpool\}\_\{\\mathrm\{L4m\}\}\(x\_\{\\mathrm\{prompt\}\}\)=1\|𝒫\|∑i∈𝒫𝐡¯i\(xprompt\),\\displaystyle=\\frac\{1\}\{\|\\mathcal\{P\}\|\}\\sum\_\{i\\in\\mathcal\{P\}\}\\bar\{\\mathbf\{h\}\}\_\{i\}\(x\_\{\\mathrm\{prompt\}\}\),𝐡¯i\(xprompt\)\\displaystyle\\bar\{\\mathbf\{h\}\}\_\{i\}\(x\_\{\\mathrm\{prompt\}\}\)=14∑k=03𝐡i\(L−1−k\)\(xprompt\)\.\\displaystyle=\\frac\{1\}\{4\}\\sum\_\{k=0\}^\{3\}\\mathbf\{h\}^\{\(L\-1\-k\)\}\_\{i\}\(x\_\{\\mathrm\{prompt\}\}\)\.
#### Meanpool\-targeted training\.
When meanpool\-targeted training is enabled, the adversary is attached tomeanpool\_L\-1computed from the same rendered promptxpromptx\_\{\\mathrm\{prompt\}\}\.
Training target vs\. evaluation\.SurfaceLoRA applies adversarial pressure to a*single*deployment\-relevant exported vector by default, typicallylasttok\_L\-1\. When a deployment exports mean\-pooled embeddings, for example for retrieval or vector\-store sharing, the adversarial objective should instead be attached tomeanpool\_L\-1computed from the same exported prompt artifact\.
### E\.4Leakage metric \(deviation from chance\)
We measure demographic extractability in a balanced 5\-way race setting using probe accuracyProbeAcc∈\[0,1\]\\text\{ProbeAcc\}\\in\[0,1\]\. Because the evaluation subsets are perfectly balanced across five classes, chance accuracy is0\.20\.2\. We defineLeakageGapas the absolute deviation from chance:
LeakageGap=\|ProbeAcc−0\.2\|\.\\text\{LeakageGap\}=\\left\|\\text\{ProbeAcc\}\-0\.2\\right\|\.\(6\)For compactness, we denote LeakageGap as Gap in table headers\.
### E\.5Joint training schedule \(synchronous 1:1\)
At each training step, we sample \(i\) one SFT batch from the main training set and \(ii\) one prompt\-only balanced leakage batch frombal\_train, using the same per\-device batch size for both loaders \(a strict 1:1 batch ratio\)\. We computeℒutil\\mathcal\{L\}\_\{\\text\{util\}\}on the SFT batch andℒadv\\mathcal\{L\}\_\{\\text\{adv\}\}on the balanced batch within the same step, then perform a single backward pass and a single optimizer step synchronously\. Equivalently,θ\\theta\(LoRA adapters\) is optimized to minimizeℒutil\\mathcal\{L\}\_\{\\text\{util\}\}while maximizingℒadv\\mathcal\{L\}\_\{\\text\{adv\}\}via GRL, andϕ\\phi\(the adversary head\) is optimized to minimizeℒadv\\mathcal\{L\}\_\{\\text\{adv\}\}\.
### E\.6Meanpool\-targeted SurfaceLoRA
SurfaceLoRA is artifact\-specific: the GRL adversary is attached to the*same*representation artifact that will be exported in deployment\. In our default setting, the adversary input islasttok\_L\-1computed fromxpromptx\_\{\\mathrm\{prompt\}\}\. Formeanpool\-targetedtraining, we instead attach the adversary tomeanpool\_L\-1, computed by averaging the final\-block hidden states over all non\-padding tokens in the same rendered promptxpromptx\_\{\\mathrm\{prompt\}\}:
fθmp\(xprompt\)\\displaystyle f\_\{\\theta\}^\{\\mathrm\{mp\}\}\(x\_\{\\mathrm\{prompt\}\}\):=meanpoolL−1\(xprompt\),\\displaystyle=\\mathrm\{meanpool\}\_\{L\-1\}\(x\_\{\\mathrm\{prompt\}\}\),\(7\)uλ\\displaystyle u\_\{\\lambda\}:=GRLλ\(fθmp\(xprompt\)\),\\displaystyle=\\mathrm\{GRL\}\_\{\\lambda\}\\\!\\left\(f\_\{\\theta\}^\{\\mathrm\{mp\}\}\(x\_\{\\mathrm\{prompt\}\}\)\\right\),ℒadv\\displaystyle\\mathcal\{L\}\_\{\\mathrm\{adv\}\}=CE\(gϕ\(uλ\),r\)\.\\displaystyle=\\operatorname\{CE\}\\\!\\left\(g\_\{\\phi\}\(u\_\{\\lambda\}\),\\,r\\right\)\.
All other components \(utility objective, LoRA parameterization, probing protocol, and checkpoint selection\) remain unchanged, except that the meanpool\-targeted stress test reported in Appendix[L\.2](https://arxiv.org/html/2605.26433#A12.SS2)uses a two\-layer MLP discriminator\. Thus, the targeted representation differs from the defaultlasttoksetting, and the training\-time adversary is strengthened for this pooled\-artifact stress test\.
### E\.7Parameter\-efficient tuning \(LoRA\)
We update the base model using LoRA adapters\(Huet al\.,[2022](https://arxiv.org/html/2605.26433#bib.bib18)\)\(rankr=16r\{=\}16,α=32\\alpha\{=\}32, dropout0\.050\.05\) applied to attention projection modulesq\_projandv\_proj\. All other base parameters remain frozen\. In the defaultlasttok\-targeted race experiments, the adversary is a single linear layer on top of the hidden state; stress\-test variants using a two\-layer MLP discriminator are noted explicitly\.
## Appendix FTraining, Tokenization, and Inference Details
### F\.1Base Model, Tokenizer, and Chat Template
#### Base model\.
Unless otherwise stated, the main results are obtained by fine\-tuningLlama\-3\.1\-8B\-Instructas a local instruction\-tuned causal language model \(CLM\)\(Llama Team, AI @ Meta,[2024](https://arxiv.org/html/2605.26433#bib.bib26)\)\. We load the model from a local checkpoint and do not use any hosted APIs\.
#### Tokenizer and chat template\.
We use the corresponding HuggingFace tokenizer \(AutoTokenizer, fast tokenizer enabled\) and the model\-provided chat template to construct prompts\. For each example, we create a two\-message list with rolessystemanduser, and render it viatokenizer\.apply\_chat\_template\(messages, add\_generation\_prompt=True\)\. Using the tokenizer\-provided template ensures that special tokens and role delimiters follow the exact Llama\-3\.1\-Instruct formatting specified by the tokenizer configuration, which matters because hidden states depend on these tokens\. We use left padding for both training\-time batching and evaluation\-time generation\. If the tokenizer has no explicit pad token, we setpad\_token = eos\_token\.
#### Truncation\.
Prompt\-only inputs are truncated to at most 1,024 tokens by keeping the most recent tokens \(left truncation of the oldest context\)\. For SFT \(prompt \+ target\), sequences are truncated to at most 1,536 tokens using the same left\-truncation rule to fit GPU memory constraints\.
### F\.2Training Configuration
#### Sweep\.
Forlasttok\-targeted SurfaceLoRA, we sweep GRL coefficientsλ∈\{0\.0,0\.02,0\.05,0\.10,0\.20,0\.50\}\\lambda\\in\\\{0\.0,0\.02,0\.05,0\.10,0\.20,0\.50\\\}with a fixed random seed\. Each run trains for 2,000 steps using AdamW with learning ratelr=2×10−4\\mathrm\{lr\}=2\\times 10^\{\-4\}\. For the meanpool\-targeted sweep, we useλ∈\{0\.0,0\.05,0\.10,0\.30,1\.0\}\\lambda\\in\\\{0\.0,0\.05,0\.10,0\.30,1\.0\\\}and evaluate checkpoints up to step 1,400 under the same checkpoint selection procedure\. This sweep uses a two\-layer MLP discriminator attached tomeanpool\_L\-1\.
#### Optimization\.
We use per\-device batch size 4 and gradient accumulation 32, resulting in an effective batch size of 128\. We apply gradient clipping with max norm 1\.0\. Unless otherwise stated, AdamW uses PyTorch defaults \(betas\(0\.9,0\.999\)\(0\.9,0\.999\),ϵ=10−8\\epsilon=10^\{\-8\}, and weight decay0\.010\.01\)\. We do not use a learning\-rate scheduler or warmup \(fixed learning rate for all steps\)\.
#### Joint optimization schedule \(simultaneous update; 1:1 batch ratio\)\.
We optimize the utility objective and the adversarial objective jointly \(not alternating separate steps\)\. At each training step, we sample \(i\) one SFT batch from the main training set and \(ii\) one prompt\-only balanced leakage batch frombal\_train, using the same per\-device batch size for both loaders \(a strict 1:1 batch ratio\)\. We computeℒutil\\mathcal\{L\}\_\{\\text\{util\}\}on the SFT batch andℒadv\\mathcal\{L\}\_\{\\text\{adv\}\}on the balanced batch within the same step, and aggregate them as:
uλ\\displaystyle u\_\{\\lambda\}:=GRLλ\(fθ\(xprompt\)\),\\displaystyle=\\mathrm\{GRL\}\_\{\\lambda\}\\\!\\left\(f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)\\right\),\(8\)ℒtotal\\displaystyle\\mathcal\{L\}\_\{\\mathrm\{total\}\}=ℒutil\+CE\(gϕ\(uλ\),r\)\.\\displaystyle=\\mathcal\{L\}\_\{\\mathrm\{util\}\}\+\\operatorname\{CE\}\\\!\\left\(g\_\{\\phi\}\(u\_\{\\lambda\}\),\\,r\\right\)\.followed by a single backward pass and a single optimizer step that updates all trainable parameters synchronously\.
#### Equivalent optimization view\.
With GRL, the synchronous update is equivalent to
minθ\\displaystyle\\min\_\{\\theta\}\\quadℒutil\(θ\)−λ𝔼\[ℒadv\],\\displaystyle\\mathcal\{L\}\_\{\\mathrm\{util\}\}\(\\theta\)\-\\lambda\\,\\mathbb\{E\}\[\\mathcal\{L\}\_\{\\mathrm\{adv\}\}\],minϕ\\displaystyle\\min\_\{\\phi\}\\quad𝔼\[ℒadv\]\.\\displaystyle\\mathbb\{E\}\[\\mathcal\{L\}\_\{\\mathrm\{adv\}\}\]\.Here,ℒadv=ℒadv\(θ,ϕ\)\\mathcal\{L\}\_\{\\mathrm\{adv\}\}=\\mathcal\{L\}\_\{\\mathrm\{adv\}\}\(\\theta,\\phi\)\. The GRL reverses and scales gradients flowing intofθ\(xprompt\)f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)and thus intoθ\\theta, while gradients with respect toϕ\\phiare unchanged\.
#### Optimizer sharing andλ\\lambdaschedule\.
We use a single shared AdamW optimizer over the union of trainable LoRA parameters and adversary\-head parameters \(no separate optimizers or time\-sharing updates\)\. Unless otherwise stated,λ\\lambdais held constant throughout training \(no warm\-up or annealing\), and we sweepλ\\lambdaacross runs to control adversarial strength\.
#### Precision and memory\.
On GPU, we run in bfloat16 precision by loading the base model withtorch\_dtype=bfloat16\. We enable gradient checkpointing to reduce memory usage\. We allow TF32 for CUDA matmul/cuDNN to improve throughput where supported\. We attempt to enable FlashAttention\-2 viaattn\_implementation=flash\_attention\_2when available\(Dao,[2024](https://arxiv.org/html/2605.26433#bib.bib31)\); otherwise we fall back to the default attention implementation\. We do not use DeepSpeed/ZeRO; each sweep run trains in a single GPU process\.
#### Sequence lengths and generation\.
We set the maximum SFT sequence length \(xpromptx\_\{\\mathrm\{prompt\}\}\+ target BHC tokens\) to 1,536 tokens and the maximum prompt\-only length forxpromptx\_\{\\mathrm\{prompt\}\}to 1,024 tokens\. At evaluation time, we use greedy decoding \(no sampling\) and cap the output to 256 new tokens\.
#### Inference settings \(fixed across all methods\)\.
For all reported ROUGE results, we use greedy decoding withdo\_sample=False,max\_new\_tokens=256, and the default repetition penalty\. We keep the decoding configuration identical across the baseline and all SurfaceLoRA runs to ensure comparability\.
#### Evaluation stability \(out\-of\-memory \(OOM\) avoidance\)\.
To reduce evaluation\-time OOM risk, we generate with small micro\-batches \(e\.g\., 8 prompts per batch\) and left padding\. To mitigate CUDA memory fragmentation, we setPYTORCH\_CUDA\_ALLOC\_CONF=expandable\_segments:True\.
## Appendix GRepresentation Leakage Evaluation Details
### G\.1Representation Leakage Probe Protocol
We quantify EHR\-recorded race extractability from prompt representations using probe classifiers\. For each example, we render the source clinical context intoxpromptx\_\{\\mathrm\{prompt\}\}and extractfθ\(xprompt\)f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\), the exported prompt representation under the specified representation choice\. Under the default setting, this is thelasttok\_L\-1artifact\. We then fit a 5\-way multinomial logistic regression probe onbal\_train\(20,000; balanced\)\. We evaluate the trained probe onbal\_val\(2,500; balanced\) for model selection and ontest\_balanced\(2,500; balanced\) for final reporting\. LetProbeAcc∈\[0,1\]\\text\{ProbeAcc\}\\in\[0,1\]denote probe accuracy on balanced 5\-way subsets; chance accuracy is therefore0\.20\.2\. We reportLeakageGapusing the deviation\-from\-chance definition in Eq\.[6](https://arxiv.org/html/2605.26433#A5.E6)\.
#### Linear probe implementation\.
Unless otherwise stated, the linear probe is a 5\-way multinomial logistic regression trained onbal\_trainwith fixed hyperparameters:solver=saga,penalty=l2,multi\_class=multinomial,max\_iter=800,tol=1e\-3, andn\_jobs=16, using a fixed random seed for reproducibility\. We then evaluate the probe onbal\_valortest\_balancedand report probe accuracy along withLeakageGap=\|ProbeAcc−0\.2\|\\texttt\{LeakageGap\}=\|\\texttt\{ProbeAcc\}\-0\.2\|\.
#### Probed representations\.
Unless otherwise stated, the main tables probe the default exported artifact,lasttok\_L\-1\. For analysis, we additionally probe alternative operational representations \(mean pooling over prompt tokens and last\-4\-layer averaging; see Section[4\.4](https://arxiv.org/html/2605.26433#S4.SS4)and Appendix[E](https://arxiv.org/html/2605.26433#A5)\) to assess whether leakage reduction is global or confined to the targeted artifact\.
#### Stronger probe: MLP classifier\.
To address the possibility that a linear probe underestimates extractable demographic information, we additionally evaluate a 2\-layer MLP probe \(hidden dimension 128, ReLU, dropout 0\.3\)\. We train the MLP onbal\_trainusing AdamW \(learning rate10−310^\{\-3\}, batch size 128\) and select the best checkpoint onbal\_valvia early stopping \(patience 5 epochs; maximum 50 epochs\)\. This probe is strictly more expressive than logistic regression and can capture nonlinear structure in the representation space\.
## Appendix HCheckpoint Selection Policy
#### Model selection is artifact\-specific\.
We select deployable checkpoints based on validation performance*on the exported artifact*zz\. For a given exported representationzz\(e\.g\.,lasttok\_L\-1ormeanpool\_L\-1\), checkpoint selection uses the logistic\-regression probe as the pre\-specified validation attacker\. We first apply the LR\-based validation leakage budget and then maximize validation ROUGE\-L among feasible checkpoints\. If no checkpoint satisfies the budget, we select a Pareto\-optimal checkpoint with minimumLeakageGapLR\(c;z\)\\textit\{LeakageGap\}\_\{\\mathrm\{LR\}\}\(c;z\)\. We additionally report Pareto\-optimal checkpoints under minimizingLeakageGapLR\(c;z\)\\textit\{LeakageGap\}\_\{\\mathrm\{LR\}\}\(c;z\)and maximizing ROUGE\-L\(c\)\(c\)\. Operationally, this treats thebal\_valaudit as a deployment decision rule: the deployable model is the selected checkpoint under the*same exported representation*that the system uses\.
Checkpoint selection as a deployment primitive \(*validation\-only; artifact\-specific*\)\.Input: checkpoints saved during training\.Choose exported representationzz\(e\.g\.,lasttok\_L\-1ormeanpool\_L\-1, computed fromxpromptx\_\{\\mathrm\{prompt\}\}unless otherwise stated\)\.Audit: for each checkpointcc, computeLeakageGapLR\(c;z\)\\textit\{LeakageGap\}\_\{\\mathrm\{LR\}\}\(c;z\)using an LR probe fit onbal\_trainwith representationzzand evaluated onbal\_valwith the same representation; also compute validationROUGE\-L\(c\)\(c\)onbal\_val\.Constraint: retain checkpoints withLeakageGapLR\(c;z\)≤ϵ\\textit\{LeakageGap\}\_\{\\mathrm\{LR\}\}\(c;z\)\\leq\\epsilon\.Selection: among feasible checkpoints, choose the checkpoint with the highest validationROUGE\-L\.Fallback: if no checkpoint is feasible underzz, choose a Pareto\-optimal checkpoint with minimum validationLeakageGapLR\(c;z\)\\textit\{LeakageGap\}\_\{\\mathrm\{LR\}\}\(c;z\)\.Test use: evaluate the selected checkpoint once ontest\_balanced\.
We setϵ=0\.025\\epsilon=0\.025as an engineering leakage budget for the LR validation attacker \(a small deviation allowance under the balanced 5\-way audit\), and make statistical near\-chance claims only on the final test evaluation with patient\-level cluster bootstrap confidence intervals\. While early stopping is one implementation of this rule, our use of checkpoint selection is not test\-set cherry\-picking: the rule is specified before test evaluation and uses only the held\-out LR\-based validation audit onbal\_val\. We intentionally log the full trajectory to expose non\-monotonic rebound and to make the selection decision auditable\.
## Appendix IOutput\-Based Privacy Evaluation Details
### I\.1Output\-Based Privacy: Text Attacker and Mention Rate
Representation probing characterizes recorded\-category signal in internal features, but it does not directly measure leakage from generated text\. We therefore add complementary output\-based evaluations on the same balancedtest\_balancedsplit\.
#### Text attacker and mention rate\.
Given each generated summaryy^\\hat\{y\}, we train a 5\-way race text attacker by fine\-tuning Bio\_ClinicalBERT\(Alsentzeret al\.,[2019](https://arxiv.org/html/2605.26433#bib.bib21)\)as a sequence classifier on generations frombal\_trainand selecting the best checkpoint onbal\_valusing Macro\-F1\. We then evaluate attacker Accuracy and Macro\-AUROC ontest\_balanced\. This attacker models an attribute\-inference scenario in which generated summaries are exposed to downstream analytics or monitoring, complementing prior work on demographic and protected\-attribute leakage from text representations\(Elazar and Goldberg,[2018](https://arxiv.org/html/2605.26433#bib.bib32); Zhanget al\.,[2020a](https://arxiv.org/html/2605.26433#bib.bib37)\)\. To reduce the possibility that the attacker succeeds only by exploiting explicit demographic mentions, we additionally compute a mention rate ony^\\hat\{y\}using \(i\) strict patterns and \(ii\) loose patterns that also match contextual mentions\.
#### Uncertainty\.
For output\-based metrics, we compute cluster bootstrap confidence intervals by resampling patients \(subject\_id\) with replacement\. In the main tables, we report point estimates for readability; per\-metric bootstrap summaries are saved alongside the results\.
### I\.2Full Results: Bio\_ClinicalBERT Attacker and Clinical Proxies
Table 6:Output\-based privacy and clinical utility proxies ontest\_balanced\.SLoRAdenotesSurfaceLoRA\.Text attacker:a Bio\_ClinicalBERT classifier fine\-tuned on generated summaries frombal\_trainand selected onbal\_val; chance for balanced 5\-way classification is 0\.20\.Clinical proxies:BERTScore\-F1 uses Bio\_Discharge\_Summary\_BERT \(no\-IDF\); Concept\-F1 uses i2b2\-2010 NER \(Problem/Test/Treatment\) via normalized concept\-set overlap between reference and generation\.Boldindicates the validation\-selected checkpoint\. Mention\-rate analyses are reported separately in Appendix[K\.3](https://arxiv.org/html/2605.26433#A11.SS3)to avoid conflating race\-group terms with meta\-terms \(e\.g\., “race”, “ethnicity”\) induced by prompts\.
## Appendix JBaselines and Statistical Methods
### J\.1Prompt Engineering Baselines \(No Fine\-Tuning\)
To test whether prompting alone can mitigate recorded\-attribute recoverability, we evaluate two prompt\-only baselines using the same base instruction\-tuned model \(no LoRA, no GRL training\): BASE, a standard clinical summarization system instruction, and NEUTRAL, which adds an explicit directive to remain neutral and avoid including race\-related cues\. We use the same probing protocol asSurfaceLoRA: render each source clinical context intoxpromptx\_\{\\mathrm\{prompt\}\}, extract prompt representationsfθ\(xprompt\)f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)from a prompt\-only forward pass, fit the LR and MLP post\-hoc probes onbal\_train, and evaluate ProbeAcc and LeakageGap ontest\_balanced\. For utility, we report ROUGE ontest\_balanced\.
### J\.2Post\-hoc Representation Sanitization Baselines \(No Fine\-Tuning\)
In addition to prompt engineering, we evaluate post\-hoc representation sanitization methods that operate on cached prompt vectors without modifying the generator\. We focus on the exported representationlasttok\_L\-1and apply each transform to extracted prompt vectors frombal\_train,bal\_val, andtest\_balanced\.
#### Baselines\.
We include four representative post\-hoc methods and a no\-removal reference: \(i\) PCA removal \(pca\_removal\_topk\); \(ii\) random subspace removal \(random\_removal\_topk\); \(iii\) one\-shot linear removal \(linear\_removal\_oneshot\); \(iv\) INLP\(Ravfogelet al\.,[2020](https://arxiv.org/html/2605.26433#bib.bib17)\); and \(v\) no removal \(no\_removal\)\. All post\-hoc transforms are learned onbal\_trainand applied tobal\_valandtest\_balancedwithout accessing test labels\.
#### Evaluation protocol\.
For each transformed representation, we train the same linear probe and nonlinear probe onbal\_trainand evaluate accuracy and LeakageGap onbal\_valandtest\_balanced\. Because these baselines do not affect generation, they isolate how much leakage can be reduced purely by post\-processing the exported representation\.
Table 7:Post\-hoc and training\-time leakage mitigation baselines ontest\_balancedforlasttok\_L\-1under balanced five\-way race prediction \(chance=0\.20=0\.20\)\. Gap \(LeakageGap\) is\|Acc−0\.2\|\|\\text\{Acc\}\-0\.2\|\(lower is better\)\. Post\-hoc transforms are learned onbal\_trainand applied totest\_balancedwithout accessing test labels\.Abbrev\.LR: logistic\-regression probe; MLP: multi\-layer perceptron probe; rm: removal; ft: fine\-tuned; s1600: step 1600; Rand: random; Lin: linear; PCA: principal component analysis\.
#### Matched\-dimension protocol and fixed hyperparameters\.
To ensure comparability across post\-hoc baselines, we match removal strength by aligning the removed dimensionality\. We first run INLP onbal\_trainrepresentations and define the total removed dimensionality ask:=∑t=1Trtk:=\\sum\_\{t=1\}^\{T\}r\_\{t\}, wherertr\_\{t\}is the rank of the row space of thett\-th classifier weight matrix \(estimated by SVD\)\. We then set PCA and random removal to remove exactlykkdimensions \(top\-kkprincipal components for PCA; a randomkk\-dimensional orthonormal subspace for random removal\), and apply the resulting linear projection tobal\_valandtest\_balanced\.
INLP is run for at mostT=20T=20iterations\. At each iteration, we fit a 5\-way multinomial logistic regression classifier onbal\_trainwith fixed hyperparameters:solver=saga,penalty=l2,multi\_class=multinomial,max\_iter=800,tol=1e\-3, andn\_jobs=16\(with a fixed random seed\)\. Given the classifier weight matrixWtW\_\{t\}, we compute its SVD and definertr\_\{t\}as the number of singular values greater than10−610^\{\-6\}; we then remove the corresponding row\-space directions by projecting to the orthogonal complement and repeat on the projected representations\. We usebal\_valto monitor leakage viaLeakageGap=\|Acc−0\.2\|\\mathrm\{LeakageGap\}=\|\\mathrm\{Acc\}\-0\.2\|\(5\-way chance is0\.20\.2\), and early\-stop INLP ifLeakageGap≤0\.003\\mathrm\{LeakageGap\}\\leq 0\.003; otherwise we run the full 20 iterations\. As a boundary\-case fallback, if INLP removes zero dimensions \(k=0k=0\), we setkkto the rank of the one\-shot linear removal classifier\.
In our main setting, INLP does not trigger early stopping and runs the full 20 iterations, removingk=100k=100dimensions; therefore PCA and random removal usek=100k=100\. The one\-shot linear removal removes 5 dimensions in this setting\.
### J\.3Statistical Uncertainty and Significance
To quantify uncertainty on the fixedtest\_balancedsplit \(2,500 examples; 500 per race\), we attach confidence intervals \(CIs\) to both utility \(ROUGE\) and leakage \(probe accuracy\)\.
#### ROUGE confidence intervals\.
For ROUGE, we use a paired bootstrap protocol\(Koehn,[2004](https://arxiv.org/html/2605.26433#bib.bib23); Efron and Tibshirani,[1993](https://arxiv.org/html/2605.26433#bib.bib24)\)\. After caching one deterministic generation per example \(greedy decoding; no sampling\), we compute per\-example ROUGE\-1/2/L F1 scores by comparing each hypothesis to its reference\. We drawB=10,000B\{=\}10\{,\}000bootstrap resamples of sizeN=2,500N\{=\}2\{,\}500with replacement from the test indices\. For each resample, we recompute corpus\-level ROUGE as the mean of per\-example ROUGE scores in the resampled set\. We report 95% CIs using the percentile method\. For pairwise comparisons, we use paired bootstrap and treat a difference as significant atα=0\.05\\alpha\{=\}0\.05if the 95% CI of the difference excludes zero\.
#### Probe accuracy confidence intervals\.
For probing\-based leakage metrics, we report probe accuracy \(ProbeAcc\) on the fixedtest\_balancedsplit\. Because multiple examples may originate from the same patient \(subject\_id\), example\-level binomial/Wilson intervals can be anti\-conservative\. We therefore compute 95% CIs using a patient\-level stratified cluster bootstrap onsubject\_id, preserving the balanced 5\-way composition\.
Concretely, letℐc\\mathcal\{I\}\_\{c\}denote the set of indices intest\_balancedwith race labelc∈\{1,…,5\}c\\in\\\{1,\\dots,5\\\}, and let𝒮c\\mathcal\{S\}\_\{c\}be the set of uniquesubject\_ids appearing inℐc\\mathcal\{I\}\_\{c\}\. For each bootstrap replicate \(B=10,000B\{=\}10\{,\}000\), we construct a balanced resample by: \(i\) sampling 500subject\_ids with replacement from𝒮c\\mathcal\{S\}\_\{c\}for each classcc; \(ii\) for each sampled patient, uniformly sampling one index from that patient’s indices inℐc\\mathcal\{I\}\_\{c\}\. Concatenating across the five classes yields a bootstrap sample of size 2,500 \(500 per class\), on which we compute ProbeAcc\. We report 95% CIs using the percentile method \(2\.5/97\.5 percentiles over theBBbootstrap ProbeAcc values\)\.
For pairwise comparisons of ProbeAcc between two methods, we use a paired patient\-level bootstrap: within each replicate we reuse the same sampled patients and within\-patient sampled indices for both methods\. We treat a difference as significant atα=0\.05\\alpha\{=\}0\.05if the 95% CI of the accuracy difference excludes zero\. We report ProbeAcc as consistent with random guessing when the chance level \(0\.200\.20\) falls within the patient\-level bootstrap 95% CI\.
## Appendix KAdditional Output\-Based Leakage Analyses
This appendix reports additional evaluations that quantify generated\-text leakage for the case\-study sensitive label, EHR\-recorded race, as opposed to internal representation leakage\. All results are computed on the balancedtest\_balancedsplit \(n=2500n\{=\}2500; 500 per race group\), with patient\-level \(cluster\) bootstrap confidence intervals obtained by resamplingsubject\_idwith replacement\.
### K\.1Generated\-Text Leakage Attacker: Term Frequency–Inverse Document Frequency \(TF\-IDF\) \+ Logistic Regression
#### Attacker setup\.
We train a bag\-of\-words attacker on generated summaries using TF\-IDF features with 1–2 grams and a multinomial logistic regression classifier\. The attacker is trained onbal\_trainand tuned onbal\_valby selecting the regularization strengthCCthat maximizes validation Macro\-F1 \(tie\-broken by Accuracy\)\. We then evaluate ontest\_balancedand report Accuracy, Macro\-F1, and Macro\-AUROC \(one\-vs\-rest\)\.
#### Why include a simple attacker?
Compared with neural attackers that may exploit deeper semantic correlates, TF\-IDF\+LR provides a transparent baseline that primarily captures lexical and short\-context cues\. Strong performance from this model indicates that race\-associated information is recoverable from relatively surface\-level lexical signals\.
#### Results\.
Table[8](https://arxiv.org/html/2605.26433#A11.T8)summarizes leakage from generated text\. The TF\-IDF attacker yields slightly different absolute accuracies from the Bio\_ClinicalBERT attacker in Table[6](https://arxiv.org/html/2605.26433#A9.T6), but the qualitative conclusion is the same: generated summaries remain above chance, and representation\-level mitigation does not eliminate output\-level demographic predictability\. The attacker remains consistently above chance \(0\.20\) across methods, with prompt\-only baselines exhibiting the largest leakage\. SurfaceLoRA reduces leakage relative to prompt\-only baselines, although it does not reduce the attacker to chance\.
Table 8:Generated\-text leakage ontest\_balanced\(n=2500n\{=\}2500; balanced\)\. Attacker is TF\-IDF \+ Logistic Regression trained onbal\_trainand tuned onbal\_val\. Confidence intervals are patient\-level cluster bootstrap \(95%\)\. SLoRA denotes SurfaceLoRA;sXXXXsXXXXindicates the training step\. Men\(S/L\) are legacy mention\-rate metrics \(Strict/Loose; prior to the meta/group decomposition in Table[9](https://arxiv.org/html/2605.26433#A11.T9)\)\.Boldindicates the validation\-selected checkpoint\. Chance is 0\.20\.
### K\.2Reference Attacker as a Diagnostic Baseline
To contextualize output leakage, we also train the same TF\-IDF\+LR attacker on the gold reference targets \(BHC\) as a diagnostic baseline for demographic inference from the narrative itself\. Ontest\_balanced, the reference attacker achieves Acc=0\.3284=0\.3284, Macro\-F1=0\.3255=0\.3255, and Macro\-AUROC=0\.6566=0\.6566\. These results indicate that correlates of the case\-study sensitive label are present even in the ground\-truth clinical narrative\. Accordingly, output\-based inference of EHR\-recorded race may persist even when models avoid explicit mentions, and stronger attackers could potentially achieve higher performance than this baseline\.
### K\.3Decomposed Mention Rate: Meta Terms vs\. Group Terms
#### Motivation and definitions\.
A naive mention\-rate metric can be confounded by meta terms \(e\.g\., “race”, “ethnicity”, “demographics”\) that appear due to prompt wording rather than demographic disclosure\. We therefore decompose mention signals into: \(i\) meta\-term rate \(generic references to race/ethnicity\) and \(ii\) group\-term rate \(explicit group identifiers such as “Hispanic” or “Asian”\)\. For ambiguous terms such asblack/white, we apply a strict local\-context filter to reduce false positives \(e\.g\., “black stool”, “white blood cell count”\) and count these tokens only when identity cues \(e\.g\., “patient”, “race”, “ethnicity”\) are present and medical\-color cues are absent\.
#### Results ontest\_balanced\.
Table[9](https://arxiv.org/html/2605.26433#A11.T9)reports decomposed mention rates\. The neutral prompt baseline exhibits an extremely high meta\-term rate \(46\.44%\) driven by prompt\-induced wording, while its group\-term rate remains low \(0\.48%\)\. In contrast, other methods have very low meta\-term rates \(≤0\.12%\\leq 0\.12\\%\) and low group\-term rates \(≤0\.96%\\leq 0\.96\\%\), suggesting that text\-attacker leakage is not primarily driven by overt demographic mentions\.
Table 9:Decomposed mention rate ontest\_balanced\(n=2500n\{=\}2500\)\. Meta counts generic references to race/ethnicity \(e\.g\.,race,racial,ethnicity,demographics\)\. Group counts explicit group identifiers \(e\.g\.,Hispanic,Asian\) with strict context filtering forblack/white\. Meta\-only and Group\-only exclude overlaps\.Boldindicates the validation\-selected checkpoint\.MethodGroup\(%\)Meta\(%\)Meta\-only\(%\)Group\-only\(%\)Both\(%\)prompt\_base0\.960\.080\.080\.960\.00prompt\_neutral0\.4846\.4446\.160\.200\.28SurfaceLoRA\_lam0\.00\_step20000\.480\.040\.040\.480\.00SurfaceLoRA\_lam0\.02\_step12000\.560\.040\.040\.560\.00SurfaceLoRA\_lam0\.02\_step20000\.480\.040\.040\.480\.00SurfaceLoRA\_lam0\.05\_step8000\.480\.040\.040\.480\.00SurfaceLoRA\_lam0\.05\_step20000\.440\.080\.080\.440\.00SurfaceLoRA\_lam0\.10\_step20000\.440\.040\.040\.440\.00SurfaceLoRA\_lam0\.20\_step6000\.480\.040\.040\.480\.00SurfaceLoRA\_lam0\.20\_step20000\.520\.120\.080\.480\.04SurfaceLoRA\_lam0\.50\_step20000\.440\.040\.040\.440\.00
### K\.4Takeaways
Across methods, the TF\-IDF\+LR attacker remains above chance on generated text, indicating that demographic inference can be feasible from relatively surface lexical cues even when explicit demographic mentions are rare\. The decomposed mention\-rate analysis supports this interpretation: for most methods, overt race\-group identifiers occur infrequently, so residual output leakage likely arises from implicit correlates in clinical narratives \(e\.g\., comorbidities, social history, medications, or care patterns\) rather than direct demographic statements\. Finally, the reference attacker trained on gold BHC targets suggests that race\-associated correlates are present in the underlying narrative distribution, providing context for why output\-based inference may persist even when representation\-level leakage is reduced\.
## Appendix LAdditional Validation and Group\-Wise Results
### L\.1Validation\-Time Pareto Analysis and Best Checkpoints
During training, we evaluate every 200 steps and record validation ROUGE and LR probe accuracy, where the LR probe is fit onbal\_trainand evaluated onbal\_val\. For deployment selection, we apply the checkpoint selection rule in Appendix[H](https://arxiv.org/html/2605.26433#A8): choose the highest\-ROUGE\-L checkpoint among those satisfying the LR\-based validation leakage budget\. For diagnostic reporting, we also compute the Pareto front under minimizing LRLeakageGapand maximizing ROUGE\-L\. The validation\-selected checkpoint and Pareto\-optimal validation points are summarized in Table[10](https://arxiv.org/html/2605.26433#A12.T10)\.
Table 10:Validation checkpoints and Pareto front for thelasttok\_L\-1representation\. R\-1/R\-2/R\-L are ROUGE\-1/2/L, LR Acc is logistic\-regression probe accuracy, and LR Gap is\|LR Acc−0\.2\|\|\\text\{LR Acc\}\-0\.2\|\. The validation LR probe is fit onbal\_trainand evaluated onbal\_val\.Boldindicates the validation\-selected checkpoint chosen by maximizing validation ROUGE\-L subject to the LR\-based validation leakage budget\.
### L\.2Meanpool\-Targeted Validation Pareto Analysis
We repeat the same checkpoint selection procedure using themeanpool\_L\-1representation, where mean pooling is computed over all non\-padding tokens inxpromptx\_\{\\mathrm\{prompt\}\}\. Across evaluated checkpoints, no model satisfies the LR\-based validation leakage budget for this representation; indeed, all evaluated meanpool checkpoints remain well above chance \(Gap\>0\.10\>0\.10\)\. We therefore report Pareto\-optimal checkpoints and select the minimum\-gap point as the fallback checkpoint\. Table[11](https://arxiv.org/html/2605.26433#A12.T11)summarizes the best diagnostic checkpoint for eachλ\\lambdaafter the LR\-based leakage\-budget rule fails formeanpool\_L\-1; the bold row marks the fallback checkpoint with the smallest validation LeakageGap\. Table[12](https://arxiv.org/html/2605.26433#A12.T12)reports the validation\-time Pareto front under minimizing Gap and maximizing ROUGE\-L\.
Table 11:Diagnostic validation checkpoints perλ\\lambdaunder themeanpool\_L\-1representation, computed fromxpromptx\_\{\\mathrm\{prompt\}\}\. Because no evaluated meanpool checkpoint satisfies the leakage budget, rows are shown for diagnostic comparison and theboldrow indicates the fallback checkpoint with minimum validation LeakageGap\.Table 12:Validation\-time Pareto front under themeanpool\_L\-1representation, computed fromxpromptx\_\{\\mathrm\{prompt\}\}, optimizing lower validation LeakageGap and higher validation ROUGE\-L\. Because no checkpoint satisfies the leakage budget, theboldrow indicates the fallback checkpoint with minimum validation LeakageGap\.#### Takeaway\.
Compared withlasttok\-targeted training, meanpool\-targeted runs do not reach chance\-level recovery under the validation audit: the best observed Gap is0\.1080\.108\(ProbeAcc=0\.308=0\.308\)\. This reinforces the deployment rule that mitigation should be evaluated on the exact exported vector, and that mean\-pooled prompt embeddings constitute a distinct and harder\-to\-sanitize artifact if used in deployment\.
### L\.3Group\-Wise Utility Stability Analysis
To assess whether leakage reduction comes at the cost of subgroup\-specific utility degradation, we evaluate group\-wise ROUGE\-L on the balancedtest\_balancedsplit \(500 examples per race group\)\. We report per\-group ROUGE\-L means, the absolute gap \(max−\-min\), the standard deviation across groups, worst\-group performance, and the relative gap \(gap divided by the overall mean\)\.
Table 13:Group\-wise ROUGE\-L on the balancedtest\_balancedsplit \(500 examples per race group;n=2500n\{=\}2500\)\. Gap = max−\-min across groups; Std = population standard deviation across groups; Worst = minimum group performance; RelGap = Gap / Overall\. Lower Gap, Std, and RelGap indicate more consistent utility across groups; higher Worst indicates better worst\-group utility\.Boldindicates the validation\-selected checkpoint\.#### Group\-wise utility remains stable\.
The selected checkpoint \(λ=0\.02\\lambda\{=\}0\.02, step 1200\) achieves overall ROUGE\-L 14\.51 with a gap of 0\.40 points \(relative gap 2\.8%\)\. Performance across the five race groups is tightly clustered \(White 14\.46, Black 14\.31, Hispanic 14\.37, Asian 14\.68, Other 14\.71\)\. This gap is comparable to the vanilla LoRA full\-run checkpoint \(λ=0\.00\\lambda\{=\}0\.00, step 2,000: gap 0\.42, relative gap 2\.8%\), suggesting that exported\-vector\-targeted adversarial training does not introduce large group\-conditional differences in summarization quality\.
#### Degradation under largeλ\\lambdais primarily global\.
Asλ\\lambdaincreases in full\-run training, the main change is a decrease in overall ROUGE\-L \(e\.g\., 15\.02 atλ=0\.00\\lambda\{=\}0\.00vs\. 7\.28 atλ=0\.50\\lambda\{=\}0\.50\), while absolute group gaps remain modest \(0\.39–0\.65\)\. The larger relative gaps at low utility \(e\.g\.,λ=0\.50\\lambda\{=\}0\.50\) are driven largely by the smaller overall denominator\. Overall, these results suggest that the fragility observed in Sections[4\.5](https://arxiv.org/html/2605.26433#S4.SS5)–[4\.1](https://arxiv.org/html/2605.26433#S4.SS1)manifests primarily as global utility loss rather than amplified cross\-group variation\.
#### Prompting alone does not achieve the same trade\-off\.
While NEUTRAL exhibits slightly smaller absolute group gaps than some fine\-tuned models, its overall ROUGE\-L is substantially lower than the best SurfaceLoRA checkpoint, and its representations remain highly probeable for race \(Table[1](https://arxiv.org/html/2605.26433#S4.T1)\)\. Prompt\-only methods therefore do not simultaneously achieve strong utility and low representation\-level leakage\.
#### Summary\.
Across all evaluated methods, group\-wise ROUGE\-L gaps remain small in absolute terms \(≤0\.65\\leq 0\.65points\) and comparable to non\-adversarial baselines\. SurfaceLoRA reduces representational EHR\-recorded race recoverability \(Table[1](https://arxiv.org/html/2605.26433#S4.T1)\) without increasing cross\-group utility variation, supporting its use as a practical leakage\-mitigation approach in clinical summarization\.
## Appendix MAdditional Attribute Stress Test: EHR\-Recorded Gender
To assess whether the representation\-specific pattern is limited to EHR\-recorded race, we run an additional stress test using binary EHR\-recorded gender\. As with EHR\-recorded race, this variable should be interpreted as a recorded administrative field rather than a complete account of sex or gender identity\(Heidariet al\.,[2016](https://arxiv.org/html/2605.26433#bib.bib73)\)\. This experiment uses the same exported representations and the same post\-hoc logistic\-regression probing protocol used in the main audit, but replaces the training\-time adversarial discriminator with a stronger two\-layer MLP discriminator\. Because the audit uses a balanced binary gender subset, the chance baseline is0\.500\.50; these results should not be interpreted as chance\-level gender sanitization\. Instead, they test whether mitigation remains artifact\-specific under another recorded attribute\.
Table[14](https://arxiv.org/html/2605.26433#A13.T14)shows that gender recoverability from the targetedlasttokartifact decreases under adversarial training, while generation utility remains comparable\. For example, the best final\-step MLP\-discriminator run reduceslasttokprobe accuracy from0\.7690\.769to0\.7000\.700with similar ROUGE\-L \(15\.51615\.516vs\.15\.41715\.417\)\. However,lasttokrecoverability remains well above the0\.500\.50chance baseline\. At the same time,meanpoolremains near\-saturated \(0\.9950\.995–0\.9970\.997\), producing a large gap across exported artifacts\. Thus, the gender stress test supports the main finding that leakage and mitigation are representation\-specific, while also showing that mitigation difficulty is attribute\-dependent\.
Table 14:Additional EHR\-recorded gender stress test using the same exported\-representation audit\. The training\-time adversarial discriminator is a two\-layer MLP, while evaluation uses the same post\-hoc logistic\-regression probing protocol as the main experiments\. Chance accuracy for the balanced binary gender audit is0\.500\.50\. Gap is\|Lasttok Acc−0\.50\|\|\\text\{Lasttok Acc\}\-0\.50\|, andΔ\\Deltaismeanpoolaccuracy minuslasttokaccuracy\. The validation\-selected row usesλ=0\.015\\lambda\{=\}0\.015at step 1600; the best final\-step row usesλ=0\.030\\lambda\{=\}0\.030at the final training step\. These results are not evidence of chance\-level gender sanitization; rather, they show that even when recoverability decreases on the targetedlasttokartifact, an untargeted pooled representation can remain near\-saturated\.
## Appendix NAdditional Discussion: Representation Mismatch and Audit Rule
### N\.1System Warning: Representation Mismatch and the Need to Audit the Exported Artifact
Our representation\-sensitivity results \(Section[4\.4](https://arxiv.org/html/2605.26433#S4.SS4)\) highlight a deployment pitfall: mitigation on one exported vector may not transfer to another\. A model can appear mitigated at one representation choice \(e\.g\., the last\-token prompt vector\) while remaining strongly vulnerable to attribute inference at another \(e\.g\., mean\-pooled embeddings\)\. This indicates that leakage reduction is not generally transferable across aggregation strategies\. Consequently, model\-level claims about mitigation can create a false sense of security if the deployment pipeline exports a representation that differs from the artifact targeted during training\.
#### Audit rule: audit the artifact\.
Deployers should not only ask whether a model is “leakage\-mitigated,” but whether the specific exported representation \(the artifact that is stored, logged, indexed, cached, or reused\) is mitigated under the attacker classes of interest\. If a deployment exports mean\-pooled embeddings, common in retrieval pipelines, the adversarial objective should be attached to that representation; sanitizing the last\-token representation alone is insufficient\.
## Appendix OImplementation Details
#### Prompt construction \(exact template\)\.
We construct prompts using the base model’s tokenizer\-defined chat template rather than a hand\-written string format\. For each example, letccdenote the source clinical context\. In the primary dataset,ccis the discharge\-note input with the BHC section removed\. We create \(i\) a system message \(“You are a clinical assistant\. Summarize the note into a Brief Hospital Course\.”\) and \(ii\) a user message containingcc\. We then render these messages withtokenizer\.apply\_chat\_template\(\.\.\., add\_generation\_prompt=True\), which appends the assistant generation header and injects model\-specific special tokens and role delimiters\. The resulting rendered prompt is denotedxpromptx\_\{\\mathrm\{prompt\}\}\. Because hidden states depend on these special tokens, using the tokenizer\-provided template is important for reproducibility\.
#### Padding and truncation\.
We use left padding for batching\. Ifpad\_token\_idis undefined, we setpad\_token=eos\_token\. For prompt\-only forward passes used in probing and adversarial training, we truncatexpromptx\_\{\\mathrm\{prompt\}\}to at most 1,024 tokens by keeping the most recent tokens\. For SFT, we concatenatexpromptx\_\{\\mathrm\{prompt\}\}with the target BHC tokens and truncate the full prompt–target sequence to at most 1,536 tokens using the same rule\.
#### Representation for adversary and probing\.
Letfθ\(xprompt\)f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)denote the exported representation of the rendered prompt\. Unless otherwise stated, we run a*prompt\-only*forward pass onxpromptx\_\{\\mathrm\{prompt\}\}withoutput\_hidden\_states=True\. With left padding, leti⋆=max\{i:attention\_maski=1\}i^\{\\star\}=\\max\\\{i:\\texttt\{attention\\\_mask\}\_\{i\}=1\\\}be the index of the last non\-padding token inxpromptx\_\{\\mathrm\{prompt\}\}, and let𝐡i\(ℓ\)\(xprompt\)\\mathbf\{h\}^\{\(\\ell\)\}\_\{i\}\(x\_\{\\mathrm\{prompt\}\}\)denote the hidden state at positioniifrom blockℓ\\ell\.
Default representation \(lasttok\_L\-1\):
fθ\(xprompt\)=𝐡i⋆\(L−1\)\(xprompt\)\.f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)=\\mathbf\{h\}^\{\(L\-1\)\}\_\{i^\{\\star\}\}\(x\_\{\\mathrm\{prompt\}\}\)\.Meanpool representation \(meanpool\_L\-1, when used\): letting𝒫=\{i:attention\_maski=1\}\\mathcal\{P\}=\\\{i:\\texttt\{attention\\\_mask\}\_\{i\}=1\\\}be the set of non\-padding token positions inxpromptx\_\{\\mathrm\{prompt\}\},
fθ\(xprompt\)=1\|𝒫\|∑i∈𝒫𝐡i\(L−1\)\(xprompt\)\.f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)=\\frac\{1\}\{\|\\mathcal\{P\}\|\}\\sum\_\{i\\in\\mathcal\{P\}\}\\mathbf\{h\}^\{\(L\-1\)\}\_\{i\}\(x\_\{\\mathrm\{prompt\}\}\)\.For meanpool\-targeted experiments, the GRL adversary and the probing classifiers are attached to this samemeanpool\_L\-1representation; otherwise they uselasttok\_L\-1\.
#### Generation and ROUGE evaluation\.
We generate summaries with greedy decoding \(do\_sample=False\) andmax\_new\_tokens=256\. To reduce OOM risk during validation/testing, we generate in micro\-batches \(e\.g\., 8 prompts per batch\) and clear the CUDA cache between batches\. ROUGE is computed offline at the word level after lowercasing and whitespace normalization\.
#### MLP probe training details\.
The 2\-layer MLP probe has the form: Linear\(hidden\_dim→\\rightarrow128\)→\\rightarrowReLU→\\rightarrowDropout\(0\.3\)→\\rightarrowLinear\(128→\\rightarrow5\)\. We train with AdamW \(learning rate10−310^\{\-3\}, weight decay0\.010\.01, batch size 128\) for up to 50 epochs with early stopping \(patience 5 epochs\) based onbal\_valaccuracy\. We select the checkpoint with the best validation accuracy and evaluate it ontest\_balanced\. All MLP probes are trained independently for each model checkpoint\.
#### Prompt\-only baselines\.
For prompt\-only baselines \(BASE/NEUTRAL\), we keep the underlying model weights frozen \(no LoRA and no adversary training\) and vary only the instruction text in the system message\. We extract the same prompt representationfθ\(xprompt\)f\_\{\\theta\}\(x\_\{\\mathrm\{prompt\}\}\)and evaluate it using the same LR and MLP post\-hoc probing protocol asSurfaceLoRA\.
#### Post\-hoc and decorrelation baselines\.
To contextualizeSurfaceLoRA, we additionally evaluate leakage\-mitigation baselines in Appendix[J\.2](https://arxiv.org/html/2605.26433#A10.SS2)\. The post\-hoc baselines operate on cachedlasttok\_L\-1vectors without changing the generator: PCA removal, random subspace removal, one\-shot linear removal, and INLP\. We also include XCov as a training\-time decorrelation baseline, which adds a cross\-covariance penalty between exported representations and one\-hot race labels during fine\-tuning\. All baselines are evaluated with the same balanced five\-way probing protocol used forSurfaceLoRA\.
#### Post\-hoc sanitization details\.
For post\-hoc baselines \(Section[J\.2](https://arxiv.org/html/2605.26433#A10.SS2)\), we learn a linear transform onbal\_trainrepresentations and apply it tobal\_valandtest\_balanced\. PCA removal projects onto the orthogonal complement of a top\-kkPCA subspace estimated frombal\_train\. Random removal removes a matchedkk\-dimensional random subspace\. One\-shot linear removal applies a single supervised nullspace projection derived from a linear classifier\. INLP iteratively fits linear classifiers and composes nullspace projections as inRavfogelet al\.\([2020](https://arxiv.org/html/2605.26433#bib.bib17)\)\. Hyperparameters \(e\.g\.,kkand the number of INLP iterations\) are fixed across splits and determined onbal\_val\.
## Appendix PAdditional Backbone Results: Qwen Replication
This appendix reports a backbone replication onQwen\-2\.5\-7B\-Instruct\(Qwen Teamet al\.,[2024](https://arxiv.org/html/2605.26433#bib.bib27)\)using the same dataset splits, balanced leakage subsets, probing protocol, and decoding settings as the main experiments\. We report a unified evaluation ontest\_balancedthat includes prompt\-only baselines, validation\-selected checkpoints, and train\-to\-end checkpoints\.
Table 15:Qwen replication ontest\_balanced\. LeakageGap is defined as\|ProbeAcc−0\.2\|\|\\text\{ProbeAcc\}\-0\.2\|under balanced 5\-way race classification\. We report prompt\-only baselines, validation\-selected checkpoints, and train\-to\-end checkpoints\.NNis the number of evaluated examples;test\_balancedcontains 2,500 examples\.Boldindicates the validation\-selected checkpoint chosen by maximizing validation ROUGE\-L subject to the validation LeakageGap budget on the exported artifact\.
## Appendix QCross\-Dataset Stress Test on Discharge Me
### Q\.1How Discharge Me differs from MIMIC\-IV\-Ext\-BHC
Compared to MIMIC\-IV\-Ext\-BHC \(discharge\-summary input with BHC removed\), Discharge Me packages inputs in an ED\-centric format that can include structured fields \(chief complaint/diagnosis codes\) and radiology report text, and provides extracted targets \(BHC and discharge instructions\)\. This changes the input composition and information density, and we therefore treat Discharge Me as a dataset\-construction stress test of the same summarization objective\.
#### Dataset overview\.
We evaluate robustness onDischarge Me, a BioNLP@ACL’24 shared\-task dataset hosted on PhysioNet \(v1\.3\)\(Xu,[2024](https://arxiv.org/html/2605.26433#bib.bib33); Goldbergeret al\.,[2000](https://arxiv.org/html/2605.26433#bib.bib3)\)\. Discharge Me is derived from MIMIC\-IV\-Note and MIMIC\-IV\-ED\(Johnsonet al\.,[2023a](https://arxiv.org/html/2605.26433#bib.bib4); Xu,[2024](https://arxiv.org/html/2605.26433#bib.bib33)\); each admission includes ED chief complaint and diagnosis codes, at least one radiology report, and a discharge summary with extracted targets for Brief Hospital Course \(BHC\) and Discharge Instructions\(Xu,[2024](https://arxiv.org/html/2605.26433#bib.bib33)\)\. The official split sizes are train \(68,785\), validation \(14,719\), phase\-I test \(14,702\), and phase\-II test \(10,962\)\(Xu,[2024](https://arxiv.org/html/2605.26433#bib.bib33)\)\.
#### Positioning and interpretation\.
Because Discharge Me is also derived from MIMIC\-IV sources\(Johnsonet al\.,[2023a](https://arxiv.org/html/2605.26433#bib.bib4); Xu,[2024](https://arxiv.org/html/2605.26433#bib.bib33)\), it is not intended as cross\-institution generalization\. Instead, we use it as a construction/packaging stress test: relative to our primary corpus, Discharge Me packages inputs in an ED\-centric form \(chief complaint, ICD codes, radiology reports\) and defines targets via a shared\-task extraction pipeline\(Xu,[2024](https://arxiv.org/html/2605.26433#bib.bib33)\)\. We therefore interpret results as robustness to dataset formulation and task packaging rather than fully independent external validation\.
Table 16:Discharge Me \(balanced 5\-way race\) GRL sweep on thelasttokrepresentation\. Validation metrics are reported at the selected checkpoint \(BestStep\) chosen by the validation selection rule: maximize validation ROUGE\-L subject to the validation LeakageGap budget on the exported artifact; if no checkpoint satisfies the budget, select a Pareto\-optimal checkpoint with minimum LeakageGap\. Test metrics are obtained by evaluating the same selected checkpoint on the balanced test split \(n=1210n\{=\}1210\)\. Chance baseline for race prediction is0\.20\.2\.
#### Our derived ED\-inputs→\\rightarrowBHC benchmark \(balanced 5\-way race\)\.
To align with our threat model and avoid confounding from label imbalance, we construct an ED\-inputs→\\rightarrowBHC subset from Discharge Me as follows\. We join the provided tables \(edstays,triage,diagnosis,radiology,target\) usingstay\_idandhadm\_id\. We form the input by concatenating: \(i\) chief complaint \(triage\.chiefcomplaint\), \(ii\) ICD diagnosis codes \(diagnosis\.icd\_codewithicd\_version\), and \(iii\) radiology report text \(radiology\.text\)\. The target istarget\.brief\_hospital\_course\. We then create balanced 5\-class race subsets with labels \{ASIAN,BLACK,HISPANIC,OTHER,WHITE\}, using equal counts per class in each split: trainn=9705n\{=\}9705\(1941 per class\), validationn=1215n\{=\}1215\(243 per class\), and testn=1210n\{=\}1210\(242 per class\)\. This balanced construction sets the chance baseline for race prediction to1/5=0\.21/5=0\.2, matching our leakage metric\.
#### Preprocessing \(provided by the dataset\)\.
Discharge Me provides structured source tables and extracted discharge\-summary targets for the shared task, including Brief Hospital Course and Discharge Instructions sections\(Xu,[2024](https://arxiv.org/html/2605.26433#bib.bib33)\)\. The dataset is derived from MIMIC\-IV\-Note and MIMIC\-IV\-ED, and each admission includes ED chief complaint information, diagnosis codes, at least one radiology report, and a discharge summary containing the two target sections\(Xu,[2024](https://arxiv.org/html/2605.26433#bib.bib33)\)\. We use the provided Brief Hospital Course target field and construct the input from the shared\-task source tables as described above\.
#### Evaluation protocol\.
We use the same method and hyperparameters as in the main experiments unless noted otherwise\. Utility is measured by ROUGE \(we report ROUGE\-L\), using greedy decoding with a fixedmax\_new\_tokens\. Leakage is measured by a post\-hoc linear probe \(multinomial logistic regression\) trained to predict race from the exported representation\. Following our threat model, the representation is the hidden state at the last non\-padding prompt token \(generation boundary\); we denote probe accuracy asProbeAccand define the deviation\-from\-chance metricGap=\|ProbeAcc−0\.2\|\\textit\{Gap\}=\|\\textit\{ProbeAcc\}\-0\.2\|\. We sweep GRL strengthλ∈\{0,0\.02,0\.05,0\.10,0\.20\}\\lambda\\in\\\{0,0\.02,0\.05,0\.10,0\.20\\\}with the same representation type \(lasttok\)\.
#### Results and conclusions\.
Table[16](https://arxiv.org/html/2605.26433#A17.T16)reports the fullλ\\lambdasweep on our balanced Discharge Me benchmark\. Across this dataset construction shift, we observe the same qualitative behavior as in our primary corpus: a small adversarial strength \(λ=0\.02\\lambda=0\.02\) yields the best trade\-off, substantially reducing extractable race signal on the last\-token representation while maintaining competitive ROUGE\-L\. In particular,λ=0\.02\\lambda=0\.02reaches near\-chance probe performance \(best validationProbeAcc≈0\.2066\\textit\{ProbeAcc\}\\approx 0\.2066; final testProbeAcc=0\.2000\\textit\{ProbeAcc\}=0\.2000\) with only a modest ROUGE\-L decrease relative toλ=0\\lambda=0\. Largerλ\\lambdavalues degrade utility substantially without consistently improving probe results, suggesting that over\-regularization can harm generation while not guaranteeing more stable representation sanitization\.
#### Access and compliance\.
Discharge Me is a credentialed\-access PhysioNet dataset; we accessed the data under the PhysioNet Credentialed Health Data Use Agreement\. We did not send any dataset content to third\-party APIs, consistent with the dataset rules\.Similar Articles
State Contamination in Memory-Augmented LLM Agents
This paper identifies and studies 'memory laundering' in LLM agents, where toxic or adversarial context compressed into memory summaries evades standard toxicity detectors while still influencing future generations. It introduces the sub-threshold propagation gap (SPG) to measure hidden downstream influence and shows that sanitizing toxic state before summarization is more effective than post-hoc cleaning.
SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector
Proposes SAGE, a post-hoc method to sanitize the final unlearning vector in LLMs, improving the retain-forget trade-off without rerunning the unlearning pipeline.
Surrogate modeling for interpreting black-box LLMs in medical predictions
Researchers propose a surrogate modeling framework to quantify and interpret latent medical knowledge encoded in black-box LLMs, revealing both valid associations and persistent racial biases.
On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability
Systematic study shows LLM-based dense retrievers outperform BERT baselines on typos and poisoning but remain vulnerable to semantic perturbations, with embedding geometry predicting robustness.
Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law
This paper empirically studies LLMs' legal reasoning in tax law, showing that data contamination inflates performance and that neuro-symbolic hybrid systems offer more reliable and robust generalization than monolithic LLMs.