Membership Inference Attacks on Discrete Diffusion Language Models

arXiv cs.LG Papers

Summary

This paper studies membership inference attacks (MIA) on fine-tuned masked diffusion language models (MDLMs). It proposes a white-box attack using a 46-dimensional feature vector from the model's reconstruction loss at varying masking ratios, achieving high AUC scores and showing MDLMs are more vulnerable than previously thought.

arXiv:2605.16445v1 Announce Type: new

Cached at: 05/19/26, 06:43 AM

# Membership Inference Attacks on Discrete Diffusion Language Models
Source: [https://arxiv.org/html/2605.16445](https://arxiv.org/html/2605.16445)
###### Abstract

Masked Diffusion Language Models (MDLMs) replace autoregressive generation with iterative demasking, and their privacy properties are largely unstudied. We study membership inference attacks (MIA) on fine-tuned MDLMs and show they are significantly more vulnerable than current grey-box baselines suggest. We extract a 46-dimensional feature vector from the model’s reconstruction loss at four masking ratios and train XGBoost and MLP classifiers on top. On the MIMIR benchmark across six text domains, XGBoost achieves mean AUC 0.878 ($\pm$0.046), peaking at 0.930 on Pile-CC, and beats the SAMA grey-box baseline by +0.062 AUC on average. A leave-one-signal-out ablation shows that the ELBO trajectory alone drives most of this (mean drop 0.130 when removed), while attention features add almost nothing ($<$0.003). We also design a shadow-model transfer attack where $K=3$ surrogate MDLMs, trained on data from unrelated domains, generate classifier labels with no access to the target domain. This achieves 0.858 mean AUC, within 0.020 of the white-box oracle, and establishes shadow-model transfer as a practical and near-equally effective attack path.

## 1 Introduction

Large language models trained or fine-tuned on private data can memorise parts of that data (Carlini et al., [2021](https://arxiv.org/html/2605.16445#bib.bib7); Shokri et al., [2017](https://arxiv.org/html/2605.16445#bib.bib9)). Membership inference attacks (MIA) try to detect this: given a text and some access to a model, can an attacker tell whether that text was in the training set? This is the standard empirical test for privacy leakage, and it matters practically whenever models are fine-tuned on sensitive corpora like medical records, legal documents, or private codebases (Carlini et al., [2022](https://arxiv.org/html/2605.16445#bib.bib8)).

Most MIA work targets autoregressive (AR) models, where a per-token log-probability is directly available and serves as a natural score (Duan et al., [2024](https://arxiv.org/html/2605.16445#bib.bib12); Yeom et al., [2018](https://arxiv.org/html/2605.16445#bib.bib10)). Masked Diffusion Language Models (Austin et al., [2021](https://arxiv.org/html/2605.16445#bib.bib1); Sahoo et al., [2024](https://arxiv.org/html/2605.16445#bib.bib2); Nie et al., [2025](https://arxiv.org/html/2605.16445#bib.bib4)) work differently: they learn to reconstruct randomly masked tokens rather than predict the next one. There is no single unconditional likelihood to read off. Instead, the loss depends on which tokens are masked and at what ratio, and this dependency turns out to carry membership information that scalar-based attacks miss.

We focus on the ELBO trajectory: how the masked reconstruction loss changes as the masking ratio $\alpha$ increases from 0.05 to 0.50. A model that has memorised a text reconstructs it cheaply at any masking level, producing a lower and more concave loss curve than for unseen text. Collecting these values at four masking levels and adding a few supporting signals (predictive entropy, hidden-state norms) gives a 46-dimensional feature vector that a gradient-boosted classifier can use to separate members from non-members.

Prior work on MDLMs includes SAMA (Chen et al., [2026](https://arxiv.org/html/2605.16445#bib.bib5)), a grey-box attack that aggregates NLL scores across random token subsets. Our white-box trajectory features beat it by +0.062 AUC on average. For autoregressive models, Duan et al. ([2024](https://arxiv.org/html/2605.16445#bib.bib12)) show that standard loss-based attacks largely fail on pretrained models but remain viable after fine-tuning. MIA for diffusion image models is studied in Dockhorn et al. ([2023](https://arxiv.org/html/2605.16445#bib.bib13)). No prior work applies white-box trajectory extraction to discrete diffusion LMs.

We make four contributions:

(1) White-box trajectory MIA. We extract a 46-dim feature vector at four masking ratios and train XGBoost, LightGBM, and MLP classifiers via 5-fold CV. Mean AUC of 0.878 across six MIMIR domains, peaking at 0.930.

(2) Shadow-model transfer. We train classifiers on $K=3$ shadow MDLMs fine-tuned on surrogate data from different domains, then apply them directly to the target domain with no retraining. This achieves 0.858 mean AUC with no target-domain labels, closing within 0.020 of the white-box oracle. Shadow transfer is a practical, near-equally effective attack path.

(3) Feature ablation. A LOSO analysis across all six domains isolates the ELBO trajectory as the primary signal (mean drop 0.130 when removed). All four attention-based signal groups contribute less than 0.003 each.

(4) Attention features do not transfer. When only attention features are used in the shadow-model setting, AUC drops to 0.525, confirming they are domain-specific and not reusable across domains. ELBO-based signals are.

## 2 Background

### 2.1 Masked Diffusion Language Models

An MDLM defines a forward process that masks each token of a clean sequence $x_0$ independently with probability $t$, where $t \sim \mathcal{U}(\varepsilon, 1)$:

$$q(x_t \mid x_0) = \prod_{i=1}^{L} \bigl[\, t \cdot \mathbf{1}[x_t^{(i)} = \texttt{[M]}] + (1-t) \cdot \mathbf{1}[x_t^{(i)} = x_0^{(i)}] \,\bigr]. \tag{1}$$

The model $p_\theta$ learns to fill in the masked positions, trained by minimising

$$\mathcal{L}_\theta(x_0) = \mathbb{E}_{t,\, x_t \sim q(\cdot \mid x_0)}\!\left[ -\log p_\theta(x_0 \mid x_t)_{\mathrm{masked}} \right]. \tag{2}$$

The model used throughout is dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1, a 0.6B-parameter MDLM on the Qwen3 backbone trained under the MDLM framework (Sahoo et al., [2024](https://arxiv.org/html/2605.16445#bib.bib2); Qwen Team, [2025](https://arxiv.org/html/2605.16445#bib.bib14)).
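The forward process in equation (1) and a Monte Carlo estimate of the loss in equation (2) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `token_prob` is a hypothetical stand-in for $p_\theta$, the loss is averaged over masked positions (the normalisation is an assumption), and `uniform_model` is a placeholder that assigns every token probability $1/50$.

```python
import math
import random

MASK = "[M]"

def forward_mask(tokens, t, rng):
    """Equation (1): mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

def masked_nll(tokens, noised, token_prob):
    """Mean negative log-probability of the original tokens at masked positions."""
    masked = [i for i, tok in enumerate(noised) if tok == MASK]
    if not masked:
        return 0.0  # nothing was masked, so there is nothing to reconstruct
    total = sum(math.log(token_prob(noised, i, tokens[i])) for i in masked)
    return -total / len(masked)

def mc_loss(tokens, token_prob, n_samples=64, eps=1e-3, seed=0):
    """Monte Carlo estimate of equation (2) with t drawn from U(eps, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(eps, 1.0)
        noised = forward_mask(tokens, t, rng)
        total += masked_nll(tokens, noised, token_prob)
    return total / n_samples

def uniform_model(noised, position, target, vocab_size=50):
    """Hypothetical model: every token is equally likely at every position."""
    return 1.0 / vocab_size

loss = mc_loss("the quick brown fox jumps".split(), uniform_model)
```

With the uniform placeholder, the estimate can never exceed $\log 50$; a model that has memorised the text would assign higher probability to the true tokens and produce a lower value.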

### 2.2 Membership Inference Attacks

An MIA is a binary classifier that decides whether a text $x$ was in the training set, given some access to the model. Access level determines the attack category:

Black-box attacks use only scalar outputs. Loss (Yeom et al., [2018](https://arxiv.org/html/2605.16445#bib.bib10)) thresholds on NLL directly. Zlib normalises NLL by compressed text length. Ratio divides the fine-tuned model’s NLL by the base model’s NLL.

Grey-box attacks use model NLL for arbitrary masked inputs but not internal states. SAMA (Chen et al., [2026](https://arxiv.org/html/2605.16445#bib.bib5)) falls here.

White-box attacks have full access to weights, hidden states, and attention maps. Our attack is white-box on the target model, but we also study a shadow-model variant that requires no target-domain labels.
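The three black-box scores are simple enough to write down directly. The sketch below is illustrative only: the NLL values are assumed to be supplied by the caller, and negating them so that a higher score means "more likely a member" is a convention chosen here, not something the text specifies.

```python
import zlib

def loss_score(nll_ft):
    """Loss attack: a lower fine-tuned NLL suggests membership."""
    return -nll_ft

def zlib_score(nll_ft, text):
    """Zlib attack: NLL normalised by the zlib-compressed length of the text in bytes."""
    compressed_len = len(zlib.compress(text.encode("utf-8")))
    return -nll_ft / compressed_len

def ratio_score(nll_ft, nll_base):
    """Ratio attack: fine-tuned NLL divided by base-model NLL."""
    return -nll_ft / nll_base
```

Each function returns a single scalar per text, which is why these attacks cannot see how the loss varies with the masking pattern.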

We evaluate on the MIMIR benchmark (Duan et al., [2024](https://arxiv.org/html/2605.16445#bib.bib12)), which provides membership-labelled samples across six text domains with a controlled n-gram split to prevent trivial lexical overlap.

### 2.3 SAMA Baseline

SAMA (Chen et al., [2026](https://arxiv.org/html/2605.16445#bib.bib5)) runs in two phases. Phase I applies progressive cumulative masking over $T$ steps. At each step $s$ it draws $N$ random token subsets $\{U_n\}$ and computes $\beta_s = \frac{1}{N}\sum_{n} \mathbf{1}[\ell_R(U_n) > \ell_T(U_n)]$, where $\ell_R$ and $\ell_T$ are subset NLL from the base and fine-tuned models. Phase II aggregates with harmonic weights:

$$\phi(x) = \sum_{s=1}^{T} \frac{1/s}{\sum_{j=1}^{T} 1/j}\, \beta_s. \tag{3}$$

We run SAMA with $T=4$, $N=128$, $M=10$ tokens per subset. It is designed for moderate memorisation; Section [5](https://arxiv.org/html/2605.16445#S5) shows where it falls short compared to trajectory-based features.
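The two SAMA phases reduce to a vote fraction per step followed by a harmonic-weighted average, as in equation (3). The sketch below assumes the per-subset NLL values have already been computed; the function names are hypothetical and the subset sampling itself is omitted.

```python
def beta(base_nlls, ft_nlls):
    """Fraction of subsets where the base-model NLL exceeds the fine-tuned NLL."""
    wins = sum(1 for r, t in zip(base_nlls, ft_nlls) if r > t)
    return wins / len(base_nlls)

def sama_score(betas):
    """Equation (3): harmonic-weighted average of the per-step vote fractions."""
    steps = len(betas)
    norm = sum(1.0 / j for j in range(1, steps + 1))
    return sum((1.0 / s) / norm * b for s, b in enumerate(betas, start=1))
```

Because the weights $1/s$ decay with the step index, early steps dominate: with $T=4$, a text that wins every vote at step 1 and none afterwards still scores $1 / (1 + 1/2 + 1/3 + 1/4) = 0.48$.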

## 3 Method

### 3.1 Threat Model

We consider two attack settings. The first is full white-box: the attacker has the fine-tuned model weights and can run forward passes with any masking pattern, reading hidden states, attention maps, and gradient tensors. This models a scenario where weights are published or leaked, which is realistic for models shared on public hubs. The second is shadow-model transfer: the attacker does not have the target model’s membership labels, but does have the architecture and fine-tuning recipe, and can train surrogate models on unrelated data. Both settings are evaluated in Section [5](https://arxiv.org/html/2605.16445#S5).

### 3.2 Memorisation Verification

Before running attacks we check that fine-tuning has actually induced memorisation. We compute the ELBO gap:

$$\Delta_{\mathrm{mem}}(x) = \mathcal{L}_{\mathrm{base}}(x) - \mathcal{L}_{\mathrm{FT}}(x), \tag{4}$$

where $\mathcal{L}_{\mathrm{base}}$ is the base model ELBO and $\mathcal{L}_{\mathrm{FT}}$ is the fine-tuned model ELBO on the same text. A positive gap means fine-tuning lowered the reconstruction cost for that text. Across our six domains, member gaps run from 2.19 to 2.75 nats and non-member gaps from 1.39 to 2.12 nats (Figure [1](https://arxiv.org/html/2605.16445#S3.F1)). The separation confirms memorisation is present and measurable.

### 3.3 Diffusion-Trajectory Feature Extraction

Each challenge text $x$ is evaluated at masking ratios $\boldsymbol{\alpha} = \{0.05,\, 0.20,\, 0.35,\, 0.50\}$ ($T=4$ levels), with $K=8$ independent random masks per level to reduce variance. At each level $\alpha_j$ we compute 11–12 scalar signals, giving a 46-dimensional feature vector $\phi(x)$ after concatenation. The main signal groups are:

ELBO trajectory (4 dims): the masked NLL $\mathcal{L}(x; \alpha_j)$ at each ratio. Members have lower values and a more concave curve.

ELBO derivatives (8 dims): finite-difference estimates of $dL/dt$ and $d^2L/dt^2$.

Predictive entropy (4 dims): mean $-\sum_{v} p_\theta(v \mid x_{\alpha_j}) \log p_\theta(v \mid x_{\alpha_j})$ over masked positions at each level.

Mask consistency (4 dims): agreement between two independent mask realisations at the same $\alpha_j$.

Hidden-state statistics (8 dims): mean $\ell_2$-norm and cosine similarity between the fine-tuned and base model hidden states.

Attention signals (16 dims): attention entropy, cross-layer correlation, transport barycenter, and attention perturbation at each masking level.

Cross-model cosine (1 dim) and ELBO variance (1 dim) round out the 46 dimensions. The full breakdown is in Appendix [C](https://arxiv.org/html/2605.16445#A3).
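The ELBO trajectory and derivative groups (4 + 8 = 12 of the 46 dimensions) can be sketched as follows. The exact finite-difference scheme is not specified in the text, so the one used here (central differences inside, one-sided differences at the two end levels, applied twice for the second derivative) is an assumption, and the input is taken to be the four masked-NLL values already averaged over the $K=8$ masks.

```python
ALPHAS = [0.05, 0.20, 0.35, 0.50]

def first_derivative(values, xs):
    """Central differences at interior points, one-sided differences at the ends."""
    n = len(values)
    out = []
    for j in range(n):
        lo, hi = max(j - 1, 0), min(j + 1, n - 1)
        out.append((values[hi] - values[lo]) / (xs[hi] - xs[lo]))
    return out

def trajectory_features(elbo):
    """Concatenate the trajectory, dL/dt and d2L/dt2 estimates (4 + 4 + 4 dims)."""
    d1 = first_derivative(elbo, ALPHAS)
    d2 = first_derivative(d1, ALPHAS)
    return list(elbo) + d1 + d2
```

For a loss curve that is exactly linear in the masking ratio, the first-derivative block is constant and the second-derivative block is zero; a concave member curve would show up as negative entries in the last four dimensions.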

### 3.4 Attack Classifiers

We train three classifiers on $\phi(x)$: XGBoost (Chen and Guestrin, [2016](https://arxiv.org/html/2605.16445#bib.bib18)) (200 trees, depth 4, lr 0.05, subsample 0.8), LightGBM (same settings), and an MLP with hidden layers (256, 128, 64), dropout 0.3, and early stopping on a 20% validation split. All three use 5-fold stratified cross-validation; out-of-fold probabilities feed the final AUC computation. Bootstrap CIs use 1000 resamples.

### 3.5 Shadow-Model Transfer

For settings where the attacker has no membership labels in the target domain, we train $K_s = 3$ shadow MDLMs on surrogate data from the ngram_13_0.2 MIMIR split, pooling across all six domains ($\approx$5,695 members total, disjoint from the evaluation set). Features are extracted from each shadow model’s training and held-out sequences in the same 46-dimensional space, giving labelled pairs $(\phi_{\mathrm{shadow}}, y_{\mathrm{shadow}})$. An XGBoost classifier trained on this shadow data is then applied directly to features from the target fine-tuned model, with no retraining.

We evaluate five conditions to map out the attack surface (Table [2](https://arxiv.org/html/2605.16445#S5.T2)):

A (Oracle): 5-fold CV using target membership labels. This is the white-box upper bound.

B (Shadow-46): Full 46-dim shadow transfer, no target labels.

C (ELBO+H): 8-dim subset: ELBO and entropy features only.

D (Attn-only): 16-dim attention features only (negative control).

E (Pruned-30): 46-dim minus the 16 attention dimensions.

![Refer to caption](https://arxiv.org/html/2605.16445v1/x1.png)

Figure 1: Memorisation verification: mean ELBO trajectory for member and non-member texts across six domains, evaluated at masking ratios $\alpha \in \{0.05, 0.20, 0.35, 0.50\}$. Member curves sit consistently lower and more concave than non-member curves, confirming that fine-tuning induced measurable memorisation in all six domains.

## 4 Experimental Setup

#### Dataset.

We use the MIMIR benchmark (Duan et al., [2024](https://arxiv.org/html/2605.16445#bib.bib12)), ngram_13_0.8 split, across six domains: arXiv, GitHub, HackerNews, Pile-CC, PubMed Central, and Wikipedia. This split enforces that members and non-members share at most 20% of their unique 13-grams, which prevents trivial lexical overlap from inflating scores. For each domain we use 300 members and 300 non-members as the classifier evaluation set, with 1,000 members used for fine-tuning. Texts are tokenised with Qwen2Tokenizer and truncated to 256 tokens.

#### Fine-tuning.

We fine-tune dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1 separately per domain using AdamW ($\eta = 10^{-4}$, weight decay 0.01), 5 epochs, batch size 8, bfloat16. Training runs on a single NVIDIA L4 GPU (24 GB) and takes roughly 6–7 minutes per domain. We verify memorisation via the ELBO gap before running attacks (Section [3.2](https://arxiv.org/html/2605.16445#S3.SS2)).

#### Baselines.

Black-box: Loss (fine-tuned NLL), Zlib (NLL / compressed length), Ratio (NLL$_{\mathrm{FT}}$ / NLL$_{\mathrm{base}}$). Grey-box: SAMA (Chen et al., [2026](https://arxiv.org/html/2605.16445#bib.bib5)) ($T=4$ steps, $N=128$ subsets, $M=10$ tokens).

#### Metrics\.

AUC\-ROC with 95% bootstrap CIs \(1,000 resamples\), and TPR at fixed FPR thresholds \{0\.1%, 1%, 10%\}\. Low\-FPR TPR is the practically relevant metric: a real auditor wants a small false\-alarm rate while finding as many members as possible\.
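The three metrics can be computed from raw scores without any library support. The following is a minimal sketch, assuming higher scores indicate membership and labels are 1 for members and 0 for non-members; the percentile bootstrap is one common choice and is an assumption here, since the text does not say which bootstrap variant is used. The pairwise AUC is quadratic in sample size, which is acceptable for 600 samples per domain but slow for large sets.

```python
import random

def auc_roc(scores, labels):
    """Probability that a random member outscores a random non-member (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(scores, labels, max_fpr):
    """TPR at the loosest threshold whose false-positive rate stays within max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    allowed = int(max_fpr * len(neg))  # number of false positives tolerated
    # Flag a text only if it scores strictly above the (allowed + 1)-th best non-member.
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    return sum(p > threshold for p in pos) / len(pos)

def bootstrap_ci(scores, labels, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc_roc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_resamples)]
    hi = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi
```

With 300 non-members per domain, a 0.1% FPR budget tolerates zero false positives, so TPR at that threshold is determined entirely by how many members outscore the single highest-scoring non-member.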

## 5 Results

### 5.1 TPR at Low FPR

Figure [2](https://arxiv.org/html/2605.16445#S5.F2) shows TPR at 0.1% and 1% FPR across all six domains. These are the operationally relevant thresholds: a real privacy auditor needs near-zero false-alarm rate while catching as many actual training samples as possible.

At 1% FPR, XGBoost recovers 15–40% of members depending on domain, compared to 3–10% for SAMA. On Pile-CC, XGBoost identifies 37.7% of members at this threshold; on HackerNews it reaches 40.1%. The improvement over SAMA is 4–8$\times$ across domains.

At 0.1% FPR, XGBoost still recovers 4–18% of members. This threshold means one false alarm per 1,000 non-members tested. SAMA cannot operate reliably at this resolution with $N=128$ subsets, while the trajectory classifier does so without degradation.

![Refer to caption](https://arxiv.org/html/2605.16445v1/x2.png)

Figure 2: TPR at 0.1% FPR (left) and 1% FPR (right) by domain and method. XGBoost and MLP substantially outperform SAMA and all black-box methods at both thresholds. SAMA is absent from the 0.1% panel because its score distribution does not resolve at that resolution.
### 5.2 AUC-ROC Overview

Table [1](https://arxiv.org/html/2605.16445#S5.T1) and Figure [3](https://arxiv.org/html/2605.16445#S5.F3) give the aggregate picture. XGBoost achieves mean AUC 0.878 ($\pm$0.046) and MLP achieves 0.882, both well above the SAMA grey-box baseline at 0.816 ($\pm$0.037). The +0.062 gap over SAMA comes from capturing the full shape of the ELBO curve rather than a scalar per-subset NLL comparison. Black-box attacks (Loss, Zlib, Ratio) cluster between 0.625 and 0.722 and gain nothing from the diffusion structure.

![Refer to caption](https://arxiv.org/html/2605.16445v1/x3.png)

Figure 3: AUC-ROC across six methods and six MIMIR domains. XGBoost and MLP (bottom two rows) consistently reach the highest values. SAMA (fourth row) sits between black-box and white-box attacks.

Table 1: AUC-ROC and TPR@1%FPR across six MIMIR domains. 95% bootstrap CIs shown for XGBoost only (1000 resamples). Bold: best AUC per domain. Black-box: Loss, Zlib, Ratio. Grey-box: SAMA. White-box (ours): XGBoost, MLP.
### 5.3 Shadow-Model Transfer

Table 2: Shadow-model transfer results by condition and domain. A: Oracle (white-box upper bound, 5-fold CV on target labels). B: Shadow-46 (full 46-dim transfer, no target labels). C: ELBO+Entropy (8-dim subset only). D: Attention-only (16-dim, negative control). E: Pruned-30 (attention dims removed). Mean $\pm$ std computed across six domains. “–” for SAMA at 0.1% FPR: SAMA’s score distribution does not resolve below 1% FPR reliably with $N=128$ subsets.

The shadow-model attack achieves results close to the white-box oracle across all six domains (Figure [4](https://arxiv.org/html/2605.16445#S5.F4), Table [2](https://arxiv.org/html/2605.16445#S5.T2)). Oracle AUC averages 0.878; Shadow-46 (Condition B) averages 0.858, a gap of only 0.020. More strikingly, at 1% FPR the shadow attack recovers a mean of 19.5% of members with no target-domain labels, compared to the oracle’s 27.4%. The gap in absolute detection rate narrows in the domains where ELBO signal is strongest: on Pile-CC, Shadow-46 actually exceeds the oracle’s TPR@1%FPR (40.4% vs. 37.7%), an artefact of the cross-domain calibration.

Reducing to just 8 ELBO+entropy dimensions (Condition C) costs only 0.015 AUC and stays within 0.2 percentage points of Shadow-46 at 1% FPR on average, while extracting features roughly 6$\times$ faster. Pruned-30 (Condition E, no attention dims) closely matches Shadow-46 throughout.

Condition D \(attention only\) collapses to AUC 0\.525 and TPR@1%FPR of 1\.6% across domains – essentially random\. Attention patterns are domain\-specific and carry no transferable membership signal\. ELBO\-based signals transfer; attention signals do not\.

![Refer to caption](https://arxiv.org/html/2605.16445v1/x4.png)

Figure 4: Shadow-model AUC by condition and domain. Dashed lines show the domain-specific SAMA baseline. Oracle (A) and Shadow-46 (B) are within 0.020 of each other in every domain. Attn-only (D) falls to near-random.

### 5.4 Domain Analysis

![Refer to caption](https://arxiv.org/html/2605.16445v1/x5.png)

Figure 5: Aggregate ROC curves (mean $\pm$ std across six domains). XGBoost and MLP dominate across the full FPR range.

Attack difficulty varies by domain (Figure [5](https://arxiv.org/html/2605.16445#S5.F5)). Pile-CC (AUC 0.930, TPR@1%FPR 37.7%) and HackerNews (0.928, 40.1%) are most vulnerable: high lexical diversity means the fine-tuned model’s reconstruction advantage concentrates on its training texts. Wikipedia and arXiv are intermediate. GitHub (AUC 0.801, TPR@1%FPR 15.9%) is hardest: source code is structurally regular, so member and non-member files produce similar ELBO values even though the raw member gap is the largest of all domains (2.75 nats). Per-domain ROC and PR curves are in Appendix [F](https://arxiv.org/html/2605.16445#A6).

### 5.5 Feature Ablation

![Refer to caption](https://arxiv.org/html/2605.16445v1/x6.png)

Figure 6: Left: mean LOSO AUC drop per feature group across six domains. Right: mean solo AUC per group using only that group. The ELBO trajectory dominates; all attention groups are near-zero.

Figure [6](https://arxiv.org/html/2605.16445#S5.F6) shows the leave-one-signal-out (LOSO) analysis. Removing the ELBO trajectory group drops mean AUC by 0.130, exceeding the entire gap between XGBoost and SAMA. Predictive entropy is the next most useful group (LOSO drop 0.043). All four attention groups contribute less than 0.006 each. Used alone, the ELBO trajectory achieves solo AUC between 0.67 and 0.84 across domains, approaching SAMA’s performance without any subset enumeration.

The shadow conditions reinforce this: ELBO+H (8 dims) reaches 0.843 AUC and 19.2% TPR@1%FPR, while attention-only collapses to 0.525 AUC and 1.6% TPR. Full per-domain tables are in Appendix [D](https://arxiv.org/html/2605.16445#A4).
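The LOSO and solo protocols are generic loops over feature groups. The sketch below shows the bookkeeping only: `evaluate` is a hypothetical callable that retrains the classifier on the given columns and returns an AUC, and the group names, column indices, and toy scoring function in the usage lines are invented for illustration, not taken from the paper's feature layout.

```python
def loso_drops(groups, evaluate):
    """AUC drop caused by removing each feature group in turn.

    groups:   dict mapping group name -> list of column indices
    evaluate: callable taking a list of column indices and returning an AUC
    """
    all_cols = sorted(c for cols in groups.values() for c in cols)
    full_auc = evaluate(all_cols)
    drops = {}
    for name, cols in groups.items():
        removed = set(cols)
        kept = [c for c in all_cols if c not in removed]
        drops[name] = full_auc - evaluate(kept)
    return drops

def solo_aucs(groups, evaluate):
    """AUC obtained when the classifier sees only one feature group."""
    return {name: evaluate(sorted(cols)) for name, cols in groups.items()}

# Toy usage: a made-up scorer where only columns 0 and 4 carry signal.
toy_groups = {"elbo": [0, 1, 2, 3], "entropy": [4, 5], "attention": [6, 7]}
toy_signal = {0: 0.30, 4: 0.05}

def toy_evaluate(cols):
    return 0.5 + sum(toy_signal.get(c, 0.0) for c in cols)

toy_drops = loso_drops(toy_groups, toy_evaluate)
toy_solo = solo_aucs(toy_groups, toy_evaluate)
```

A group can have a near-zero LOSO drop either because it carries no signal or because another group duplicates its information, which is why the solo AUC is reported alongside it.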

## 6 Conclusion

Fine-tuned MDLMs are significantly more vulnerable to membership inference than grey-box baselines suggest. Extracting a 46-dimensional diffusion-trajectory feature vector at four masking levels and training gradient-boosted classifiers on top achieves mean AUC 0.878 across six MIMIR domains, +0.062 over SAMA. The ELBO trajectory drives most of this: removing it alone drops AUC by 0.130, while all attention features together contribute less than 0.006.

The shadow-model attack makes this threat practical. With $K=3$ surrogate MDLMs trained on data from different domains, we reach 0.858 mean AUC with no target-domain labels at all. The oracle-to-shadow gap of 0.020 is small enough to consider shadow transfer a viable standalone attack, not just an approximation. Reducing to 8 ELBO+entropy dimensions (Condition C) costs only 0.015 more AUC and makes extraction roughly 6$\times$ faster, offering a good trade-off for resource-constrained auditing.

#### Limitations\.

All experiments use a single model family \(Qwen3\-0\.6B MDLM\) in a fine\-tuned setting with small training sets \(300–1,000 samples per domain\)\. Behaviour on larger models or in the pretraining regime may differ\. The white\-box threat model is also optimistic; deployments that restrict internal access would limit the attack to the shadow\-model path\.

#### Future directions\.

On the defence side, differential privacy during fine-tuning (Abadi et al., [2016](https://arxiv.org/html/2605.16445#bib.bib17)) and explicit regularisation of the ELBO trajectory are natural candidates. On the attack side, extending to LLaDA and other MDLM variants, and estimating the ELBO trajectory from generation-time outputs for fully black-box settings, are open problems.

## References

- M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016). Deep learning with differential privacy. In ACM CCS.
- J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg (2021). Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, Vol. 34, pp. 17981–17993.
- N. Carlini, S. Chien, M. Nasr, S. Song, A. Terzis, and F. Tramèr (2022). Membership inference attacks from first principles. In IEEE Symposium on Security and Privacy.
- N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel (2021). Extracting training data from large language models. In USENIX Security Symposium.
- T. Chen and C. Guestrin (2016). XGBoost: a scalable tree boosting system. arXiv preprint arXiv:1603.02754.
- Y. Chen, K. Zhang, Y. Du, E. Stoppa, C. Fleming, A. Kundu, B. Ribeiro, and N. Li (2026). Membership inference attacks against fine-tuned diffusion language models. In International Conference on Learning Representations. Note: arXiv:2601.20125.
- T. Dockhorn, T. Cao, A. Vahdat, and K. Kreis (2023). Differentially private diffusion models. In Transactions on Machine Learning Research.
- M. Duan, A. Suri, N. Mireshghallah, S. Min, W. Shi, L. Zettlemoyer, Y. Tsvetkov, Y. Choi, D. Evans, and H. Hajishirzi (2024). Do membership inference attacks work on large language models? In Conference on Language Modeling.
- S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025). LLaDA: large language diffusion with masking. arXiv preprint arXiv:2502.09992.
- Qwen Team (2025). Qwen3 technical report. arXiv preprint.
- S. S. Sahoo, M. Arriola, A. Gokaslan, E. Marroquin, A. M. Rush, Y. Schiff, J. T. Chiu, and V. Kuleshov (2024). Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems.
- R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017). Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy, pp. 3–18.
- S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha (2018). Privacy risk in machine learning: analyzing the connection to overfitting. In IEEE Computer Security Foundations Symposium, pp. 268–282.

## Appendix A Proof-of-Concept: Run 1 (Single-Domain Overfitting)

Before running the full six\-domain pipeline we validated the attack framework on a single controlled experiment: intentionally overfit the MDLM on a small dataset and verify that white\-box trajectory features can recover membership under extreme memorisation\.

### Setup

We fine-tuned Qwen3-0.6B-diffusion-mdlm-v0.1 on 300 GitHub member texts for 15 epochs (AdamW, $\eta = 10^{-4}$, batch 8, bfloat16). The intent was to push memorisation hard and stress-test the feature extraction pipeline before scaling to six domains.

### Memorisation Verification

The mean member ELBO gap reached 4.625 nats: the fine-tuned model reconstructs training texts at almost no reconstruction cost at any masking level. The non-member gap was $-0.130$ nats (base and fine-tuned model are essentially equivalent on unseen text). Figure [7](https://arxiv.org/html/2605.16445#A1.F7) shows the distribution.

![Refer to caption](https://arxiv.org/html/2605.16445v1/x7.png)

Figure 7: ELBO gap distribution from Run 1 (GitHub, 15 epochs). Member gap mean: 4.625 nats; non-member gap mean: $-0.130$ nats. The two distributions are nearly fully separated.
### Attack Results

### Discussion

White\-box trajectory features cleanly recover membership even under extreme overfitting\. The ELBO curve is consistently lower and more concave for member texts regardless of how low the absolute reconstruction values get, so the classifier can separate the two populations even when the member gap is nearly 5 nats above zero\.

This experiment confirmed that the feature extraction pipeline is sound. It also showed that 15 epochs is too many for the main experiments: the model over-memorises in a way that collapses the discrimination into a trivial task and moves the operating point far outside the regime of practical deployments. The main experiments use 5 epochs (see Appendix [B](https://arxiv.org/html/2605.16445#A2)), which produces member ELBO gaps of 2.19–2.75 nats across domains and keeps the attack in a more realistic range.

## Appendix B Hyperparameter Search: Run 2

Run 2 extended the proof-of-concept by introducing all six attack methods (Loss, Zlib, Ratio, SAMA, XGBoost, MLP) on the GitHub domain and varying the number of fine-tuning epochs to find the operating point with the best membership discriminability.

### Epoch Ablation

The central finding is that discriminability is *non-monotone* in training epochs. With too few epochs the model has not memorised sufficiently to create an ELBO gap; with too many epochs (15 as in Run 1), over-memorisation causes the SAMA baseline to collapse and makes the domain generally “easy” regardless of membership status. The 5-epoch operating point produced ELBO member gaps of 2.19–2.75 nats across domains, within SAMA’s designed operating range, while maintaining strong XGBoost discriminability.

### Final Hyperparameter Configuration

All experiments in Run 3 and Run 4 use these exact hyperparameters\.

## Appendix C Full Feature Taxonomy (46 Dimensions)

Table [3](https://arxiv.org/html/2605.16445#A3.T3) lists all 46 features in the order they appear in the feature vector $\phi(x)$. Features are extracted at each of four masking ratios $\alpha_j \in \{0.05, 0.20, 0.35, 0.50\}$ unless otherwise noted (marked “global”). The *Expected direction* column indicates the sign of the difference $\phi_{\text{member}} - \phi_{\text{non-member}}$ predicted by the memorisation hypothesis; this is used as a sanity check but not enforced during training.

Table 3: Complete 46-dimensional feature vector description. $K=8$ independent random masks are averaged per $(\text{feature}, \alpha_j)$ pair.

#### Notes.

Features in the ELBO Variance, $dL/dt$, and $d^2L/dt^2$ groups show near-zero LOSO drops (Table [4](https://arxiv.org/html/2605.16445#A4.T4)) despite representing distinct mathematical quantities. This is because at $T=4$ masking levels, the finite-difference estimates of derivatives are nearly collinear with the raw ELBO trajectory values at those four points; removing either set leaves the other to compensate almost perfectly. Extending to $T \geq 8$ levels may break this collinearity and reveal independent signal in the derivative features.

## Appendix D Full Ablation Study (Run 3)

### D.1 LOSO AUC Drop: All Domains

Table 4: Leave-One-Signal-Out (LOSO) AUC drop per feature group and domain. Positive values indicate that removing the group hurts performance. Bold: largest drop (most important). “–” denotes a negligible or negative drop ($<0.001$), indicating the feature group is redundant or slightly harmful. Full-model AUC: arXiv 0.877, GitHub 0.801, HackerNews 0.928, Pile-CC 0.930, PubMed 0.842, Wikipedia 0.892.

Table [4](https://arxiv.org/html/2605.16445#A4.T4) shows per-domain LOSO drops alongside the mean. Full-model AUC baseline: arXiv 0.877, GitHub 0.801, HackerNews 0.928, Pile-CC 0.930, PubMed 0.842, Wikipedia 0.892.
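The LOSO drop is the full-model AUC minus the AUC obtained after removing one feature group. The sketch below shows only that bookkeeping. The function name `loso_drops`, the `evaluate_auc` callback, the group names, and the toy per-column gains are all illustrative assumptions, not the paper's code or numbers; in practice the callback would retrain and score a classifier on the kept columns.

```python
from typing import Callable, Dict, List, Sequence


def loso_drops(
    groups: Dict[str, List[int]],
    evaluate_auc: Callable[[Sequence[int]], float],
) -> Dict[str, float]:
    """Return full-model AUC minus the AUC with each group's columns removed."""
    all_cols = sorted(c for cols in groups.values() for c in cols)
    full_auc = evaluate_auc(all_cols)
    drops = {}
    for name, cols in groups.items():
        removed = set(cols)
        kept = [c for c in all_cols if c not in removed]
        drops[name] = full_auc - evaluate_auc(kept)
    return drops


# Toy stand-in for "train a classifier on these columns and return its AUC".
# The gains are invented purely to exercise the function.
toy_groups = {"elbo_trajectory": [0, 1, 2, 3], "attention": [4, 5]}
toy_gain = {0: 0.10, 1: 0.10, 2: 0.05, 3: 0.05, 4: 0.0, 5: 0.0}


def toy_evaluate(cols: Sequence[int]) -> float:
    return 0.5 + sum(toy_gain[c] for c in cols)


drops = loso_drops(toy_groups, toy_evaluate)
```

A positive entry in `drops` means the group carried signal that the remaining groups could not replace; an entry near zero means the group was either uninformative or redundant.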

### D.2 Solo AUC: All Domains

Table 5: Solo AUC per feature group (using only that group’s dimensions) per domain. Random baseline is 0.500.

### D.3 Discussion

Derivative features show near-zero LOSO drop. $dL/dt$ and $d^2L/dt^2$ are finite-difference estimates computed from the same four ELBO values as the trajectory group. At $T=4$ levels, $dL/dt|_{t_j} \approx \mathcal{L}(t_{j+1}) - \mathcal{L}(t_{j-1})$, which is a linear combination of the trajectory values. The two groups are nearly collinear, so removing either one leaves the other to compensate. Extending to $T \geq 8$ masking levels may break this collinearity.
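The redundancy can be checked on a single four-point trajectory. The sketch below is a simplified illustration, not the paper's feature extractor: it assumes unscaled central and second differences at the two interior points only, and the ELBO values are hypothetical. Under those assumptions the derivative features determine the trajectory up to one additive offset, which is why either group can stand in for the other.

```python
import math
from typing import List, Sequence


def central_first_diff(traj: Sequence[float]) -> List[float]:
    # L(t_{j+1}) - L(t_{j-1}) at interior points only (endpoint handling is assumed).
    return [traj[j + 1] - traj[j - 1] for j in range(1, len(traj) - 1)]


def second_diff(traj: Sequence[float]) -> List[float]:
    # L(t_{j+1}) - 2 L(t_j) + L(t_{j-1}) at interior points only.
    return [traj[j + 1] - 2.0 * traj[j] + traj[j - 1] for j in range(1, len(traj) - 1)]


def shape_from_derivatives(d1: Sequence[float], d2: Sequence[float]) -> List[float]:
    """For T=4, recover [0, L1-L0, L2-L0, L3-L0] from the derivative features."""
    b = d1[0]                    # L2 - L0
    a = (d1[0] - d2[0]) / 2.0    # L1 - L0, since d2[0] = (L2 - L0) - 2 (L1 - L0)
    c = d1[1] + a                # L3 - L0, since d1[1] = L3 - L1
    return [0.0, a, b, c]


elbo = [3.1, 2.4, 2.0, 1.9]  # hypothetical ELBO values at the four masking ratios
d1 = central_first_diff(elbo)
d2 = second_diff(elbo)
recovered = [elbo[0] + offset for offset in shape_from_derivatives(d1, d2)]
matches = all(math.isclose(r, v, abs_tol=1e-9) for r, v in zip(recovered, elbo))
```

Only the overall level `elbo[0]` is lost when the trajectory group is removed; the shape of the curve survives in the derivative features.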

Attention features are domain-specific. Solo AUC for all attention groups is 0.47–0.53 across most domains, near random. In MDLMs, the masked denoising objective produces attention patterns that depend heavily on the specific masking configuration and domain vocabulary, making them inconsistent across the $K=8$ mask realisations. This is why they also fail under shadow-model transfer (Condition D, AUC 0.525): they contain no signal that generalises across domains.
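AUC values around 0.50 count as random because AUC equals the probability that a randomly chosen member receives a higher score than a randomly chosen non-member, with ties counted as one half. The following is a minimal pairwise implementation of that definition; the function name `roc_auc` and the toy scores are illustrative, and the quadratic loop is meant for clarity rather than for large evaluation sets.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as P(member score > non-member score), with ties counted as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], labels)        # members always ranked higher
uninformative = roc_auc([0.5, 0.5, 0.5, 0.5], labels)  # score carries no information
inverted = roc_auc([0.1, 0.2, 0.8, 0.9], labels)       # members always ranked lower
```

A feature group whose score is unrelated to membership lands at the `uninformative` value, which is the 0.500 baseline that the solo AUC numbers are compared against.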

Minimal effective attack. ELBO trajectory and predictive entropy together (8 dimensions) achieve mean AUC 0.843 in shadow transfer. This is cheaper than the full 46-dim vector and still above SAMA.

![Refer to caption](https://arxiv.org/html/2605.16445v1/x8.png)

Figure 8: LOSO AUC drop (left) and solo AUC (right), averaged across six domains.

## Appendix E Shadow Model: Full Results (Run 4)

### E.1 Experimental Details

Three shadow MDLMs ($K_s=3$) are fine-tuned on surrogate data from the `ngram_13_0.2` MIMIR split, which is disjoint from the `ngram_13_0.8` evaluation set. The surrogate pool covers all six domains and contains roughly 5,695 member sequences total, with about 1,000 per shadow model. Non-member texts come from Wikipedia and Common Crawl subsets not used in evaluation. Each shadow model uses the same hyperparameters as the target: 5 epochs, AdamW with $\eta=10^{-4}$, bfloat16.

A single XGBoost classifier is trained on features pooled from all three shadow models and applied to the target domain without any retraining. No target-domain labels are used.
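The protocol can be sketched in a few lines: pool labelled rows from every shadow model, fit once, then score target samples with the frozen classifier. To keep the example dependency-free, a one-dimensional threshold rule is swapped in for XGBoost, and each sample is reduced to a single hypothetical loss-like feature on which members are assumed to score lower. All names and numbers here are illustrative assumptions, not the paper's implementation or data.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, int]  # (scalar feature, membership label: 1 = member)


def pool(shadow_sets: Sequence[Sequence[Row]]) -> List[Row]:
    """Concatenate labelled rows from all shadow models into one training set."""
    return [row for rows in shadow_sets for row in rows]


def fit_threshold(rows: Sequence[Row]) -> float:
    """Stand-in for the classifier: pick the cut maximising accuracy
    for the rule 'member if feature < cut'."""
    values = sorted({x for x, _ in rows})
    midpoints = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    candidates = [values[0] - 1.0] + midpoints + [values[-1] + 1.0]

    def accuracy(cut: float) -> float:
        return sum((x < cut) == bool(y) for x, y in rows) / len(rows)

    return max(candidates, key=accuracy)


def predict(cut: float, features: Sequence[float]) -> List[int]:
    return [int(x < cut) for x in features]


# Invented shadow data: three shadow models, each with known membership labels.
shadow_a = [(1.0, 1), (1.2, 1), (3.5, 0), (3.9, 0)]
shadow_b = [(0.8, 1), (1.5, 1), (3.2, 0), (4.1, 0)]
shadow_c = [(1.1, 1), (1.4, 1), (3.0, 0), (3.6, 0)]

cut = fit_threshold(pool([shadow_a, shadow_b, shadow_c]))

# Target features are scored with the frozen rule; no target labels are consulted.
target_predictions = predict(cut, [1.3, 3.4, 0.9, 2.9])
```

The point of the sketch is the data flow: labels exist only on the shadow side, and nothing is refit once the classifier is applied to the target.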

### E.2 Full Condition × Domain Results

Table [2](https://arxiv.org/html/2605.16445#S5.T2) in the main text gives the complete AUC matrix. Figure [9](https://arxiv.org/html/2605.16445#A5.F9) shows the same data as a heatmap.

![Refer to caption](https://arxiv.org/html/2605.16445v1/x9.png)

Figure 9: Shadow-model AUC by condition and domain. Condition D (attention only) is near-random universally. Conditions B, C, E all sit above SAMA in most domains.

### E.3 Per-Group Transfer AUC

Figure [10](https://arxiv.org/html/2605.16445#A5.F10) shows the shadow transfer AUC when training on one feature group at a time, analogous to the solo AUC analysis in Appendix [D](https://arxiv.org/html/2605.16445#A4) but in the transfer setting.

![Refer to caption](https://arxiv.org/html/2605.16445v1/x10.png)

Figure 10: Per-group shadow transfer AUC (mean $\pm$ std across six domains). ELBO trajectory is the only group that transfers above 0.70. All attention groups sit near 0.50.

### E.4 Discussion

Why attention features do not transfer. Attention heads in MDLMs specialise to the local syntactic patterns of each domain: mathematical notation in arXiv, Python indentation in GitHub, informal pronouns in HackerNews. A classifier trained on these patterns in the shadow domain is applied to the wrong pattern space in the target. ELBO trajectory, by contrast, depends mainly on the model’s degree of memorisation, which is set by the training recipe and generalises across text domains when the architecture is fixed.

ELBO+H as a minimal attack. Condition C (ELBO+H, 8 dimensions) achieves 0.843 mean AUC at roughly 6× lower extraction cost than the full 46-dim vector. For practical auditing this offers a strong accuracy-to-cost ratio: 0.843 AUC with no target labels, compared to SAMA’s 0.816 AUC, which does require black-box model access per sample.

## Appendix F Per-Domain ROC and PR Curves

![Refer to caption](https://arxiv.org/html/2605.16445v1/x11.png)

Figure 11: Per-domain ROC curves for all five evaluated methods. The random baseline diagonal is dotted. AUC values appear in the legend of the arXiv panel; method colours are consistent across panels.

![Refer to caption](https://arxiv.org/html/2605.16445v1/x12.png)

Figure 12: Per-domain precision-recall curves. The horizontal dotted line is the random classifier baseline (class balance = 0.5).

#### Domain notes.

arXiv. XGBoost and MLP track closely above SAMA at all FPR thresholds.

GitHub. SAMA (AUC 0.795) nearly matches XGBoost (0.801). This is the only domain where SAMA closes to within 0.01 of the white-box classifier, likely because source code has low perplexity and SAMA’s NLL comparison works reasonably well.

HackerNews. Largest raw gap: XGBoost 0.928 vs SAMA 0.833. Short, informal sentences with distinctive vocabulary produce particularly clean ELBO trajectories for member texts.

Pile-CC. XGBoost achieves its highest TPR at 10% FPR (0.803) here. SAMA also performs well (0.885) relative to other domains.

PubMed Central. Highly formulaic biomedical text reduces the ELBO curvature difference between members and non-members.

Wikipedia. MLP slightly outperforms XGBoost (0.902 vs 0.892). This is the only domain where MLP takes the top spot; encyclopedic text’s consistent structure may favour the MLP’s learned feature interactions.
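The TPR at 10% FPR quoted in the Pile-CC note is a single operating point on the ROC curve rather than an area. A minimal sketch of how such a value can be read off a set of scores follows; the function name `tpr_at_fpr`, the rule "predict member if score is at or above the threshold", and the toy scores are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float = 0.10) -> float:
    """Highest TPR reachable by a threshold whose FPR does not exceed max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    best = 0.0
    # Lowering the threshold can only raise FPR and TPR, so stop at the first violation.
    for thr in sorted(set(scores), reverse=True):
        fpr = sum(s >= thr for s in neg) / len(neg)
        if fpr > max_fpr:
            break
        best = max(best, sum(s >= thr for s in pos) / len(pos))
    return best


# Invented scores: 4 members and 10 non-members, so 10% FPR allows one false positive.
member_scores = [0.9, 0.8, 0.6, 0.3]
non_member_scores = [0.7, 0.5, 0.45, 0.4, 0.35, 0.25, 0.2, 0.15, 0.1, 0.05]
scores = member_scores + non_member_scores
labels = [1] * len(member_scores) + [0] * len(non_member_scores)

tpr_10 = tpr_at_fpr(scores, labels, max_fpr=0.10)
tpr_0 = tpr_at_fpr(scores, labels, max_fpr=0.0)
```

Reporting this alongside AUC is useful for auditing because it describes how many members are exposed when false accusations are held to a fixed, low rate.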

![Refer to caption](https://arxiv.org/html/2605.16445v1/x13.png)

Figure 13: KDE of the four most discriminative features for arXiv. ELBO values at low masking ratios show the clearest separation.
