Membership Inference Attacks on Discrete Diffusion Language Models
Summary
This paper studies membership inference attacks (MIA) on fine-tuned masked diffusion language models (MDLMs). It proposes a white-box attack using a 46-dimensional feature vector from the model's reconstruction loss at varying masking ratios, achieving high AUC scores and showing MDLMs are more vulnerable than previously thought.
# Membership Inference Attacks on Discrete Diffusion Language Models
Source: [https://arxiv.org/html/2605.16445](https://arxiv.org/html/2605.16445)
###### Abstract
Masked Diffusion Language Models (MDLMs) replace autoregressive generation with iterative demasking, and their privacy properties are largely unstudied. We study membership inference attacks (MIA) on fine-tuned MDLMs and show they are significantly more vulnerable than current grey-box baselines suggest. We extract a 46-dimensional feature vector from the model's reconstruction loss at four masking ratios and train XGBoost and MLP classifiers on top. On the MIMIR benchmark across six text domains, XGBoost achieves mean AUC 0.878 ($\pm$0.046), peaking at 0.930 on Pile-CC, and beats the SAMA grey-box baseline by +0.062 AUC on average. A leave-one-signal-out ablation shows that the ELBO trajectory alone drives most of this (mean drop 0.130 when removed), while attention features add almost nothing ($<$0.003). We also design a shadow-model transfer attack where $K=3$ surrogate MDLMs, trained on data from unrelated domains, generate classifier labels with no access to the target domain. This achieves 0.858 mean AUC, within 0.020 of the white-box oracle, and establishes shadow-model transfer as a practical and near-equally effective attack path.
## 1 Introduction
Large language models trained or fine-tuned on private data can memorise parts of that data (Carlini et al., [2021](https://arxiv.org/html/2605.16445#bib.bib7); Shokri et al., [2017](https://arxiv.org/html/2605.16445#bib.bib9)). Membership inference attacks (MIA) try to detect this: given a text and some access to a model, can an attacker tell whether that text was in the training set? This is the standard empirical test for privacy leakage, and it matters practically whenever models are fine-tuned on sensitive corpora like medical records, legal documents, or private codebases (Carlini et al., [2022](https://arxiv.org/html/2605.16445#bib.bib8)).
Most MIA work targets autoregressive (AR) models, where a per-token log-probability is directly available and serves as a natural score (Duan et al., [2024](https://arxiv.org/html/2605.16445#bib.bib12); Yeom et al., [2018](https://arxiv.org/html/2605.16445#bib.bib10)). Masked Diffusion Language Models (Austin et al., [2021](https://arxiv.org/html/2605.16445#bib.bib1); Sahoo et al., [2024](https://arxiv.org/html/2605.16445#bib.bib2); Nie et al., [2025](https://arxiv.org/html/2605.16445#bib.bib4)) work differently: they learn to reconstruct randomly masked tokens rather than predict the next one. There is no single unconditional likelihood to read off. Instead, the loss depends on which tokens are masked and at what ratio, and this dependency turns out to carry membership information that scalar-based attacks miss.
We focus on the ELBO trajectory: how the masked reconstruction loss changes as the masking ratio $\alpha$ increases from 0.05 to 0.50. A model that has memorised a text reconstructs it cheaply at any masking level, producing a lower and more concave loss curve than for unseen text. Collecting these values at four masking levels and adding a few supporting signals (predictive entropy, hidden-state norms) gives a 46-dimensional feature vector that a gradient-boosted classifier can use to separate members from non-members.
Prior work on MDLMs includes SAMA (Chen et al., [2026](https://arxiv.org/html/2605.16445#bib.bib5)), a grey-box attack that aggregates NLL scores across random token subsets. Our white-box trajectory features beat it by +0.062 AUC on average. For autoregressive models, Duan et al. ([2024](https://arxiv.org/html/2605.16445#bib.bib12)) show that standard loss-based attacks largely fail on pretrained models but remain viable after fine-tuning. MIA for diffusion image models is studied in Dockhorn et al. ([2023](https://arxiv.org/html/2605.16445#bib.bib13)). No prior work applies white-box trajectory extraction to discrete diffusion LMs.
We make four contributions:
(1) White-box trajectory MIA. We extract a 46-dim feature vector at four masking ratios and train XGBoost, LightGBM, and MLP classifiers via 5-fold CV. Mean AUC of 0.878 across six MIMIR domains, peaking at 0.930.
(2) Shadow-model transfer. We train classifiers on $K=3$ shadow MDLMs fine-tuned on surrogate data from different domains, then apply them directly to the target domain with no retraining. This achieves 0.858 mean AUC with no target-domain labels, closing within 0.020 of the white-box oracle. Shadow transfer is a practical, near-equally effective attack path.
(3) Feature ablation. A LOSO analysis across all six domains isolates the ELBO trajectory as the primary signal (mean drop 0.130 when removed). All four attention-based signal groups contribute less than 0.003 each.
(4) Attention features do not transfer. When only attention features are used in the shadow-model setting, AUC drops to 0.525, confirming they are domain-specific and not reusable across domains. ELBO-based signals are.
## 2 Background
### 2.1 Masked Diffusion Language Models
An MDLM defines a forward process that masks each token of a clean sequence $x_0$ independently with probability $t$, where $t \sim \mathcal{U}(\varepsilon, 1)$:
$$q(x_t \mid x_0) = \prod_{i=1}^{L} \bigl[\, t \cdot \mathbf{1}[x_t^{(i)} = \texttt{[M]}] + (1-t) \cdot \mathbf{1}[x_t^{(i)} = x_0^{(i)}] \,\bigr]. \tag{1}$$
The model $p_\theta$ learns to fill in the masked positions, trained by minimising
$$\mathcal{L}_\theta(x_0) = \mathbb{E}_{t,\, x_t \sim q(\cdot \mid x_0)} \left[ -\log p_\theta(x_0 \mid x_t)_{\mathrm{masked}} \right]. \tag{2}$$
The model used throughout is dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1, a 0.6B-parameter MDLM on the Qwen3 backbone trained under the MDLM framework (Sahoo et al., [2024](https://arxiv.org/html/2605.16445#bib.bib2); Qwen Team, [2025](https://arxiv.org/html/2605.16445#bib.bib14)).
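To make Eqs. (1) and (2) concrete, the following is a minimal sketch of the forward masking step and a Monte Carlo estimate of the masked reconstruction loss in the unweighted form written above. The `token_prob` callable and the toy "model" are hypothetical stand-ins for a real MDLM forward pass and are not part of the paper's code.

```python
import math
import random

MASK = "[M]"  # placeholder mask symbol, matching [M] in Eq. (1)

def forward_mask(x0, t, rng):
    """Sample x_t ~ q(. | x0): mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in x0]

def masked_nll(x0, xt, token_prob):
    """Sum of -log p(x0[i] | xt) over the masked positions only.

    token_prob(xt, i, token) is a hypothetical stand-in for the model's
    predicted probability of `token` at position i given the masked input.
    """
    return sum(
        -math.log(token_prob(xt, i, tok))
        for i, tok in enumerate(x0)
        if xt[i] == MASK
    )

def estimate_loss(x0, token_prob, n_samples=64, eps=1e-3, seed=0):
    """Monte Carlo estimate of Eq. (2) with t drawn from U(eps, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(eps, 1.0)
        xt = forward_mask(x0, t, rng)
        total += masked_nll(x0, xt, token_prob)
    return total / n_samples

# Toy stand-in: a "model" that assigns probability 0.5 to every true token.
def uniform_half(xt, i, tok):
    return 0.5

tokens = "the quick brown fox jumps".split()
loss = estimate_loss(tokens, uniform_half)
```

With this toy model each masked token costs $\log 2$ nats, so the estimate is bounded by the sequence length times $\log 2$; a memorised text would correspond to a `token_prob` much closer to 1 on its own tokens.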
### 2.2 Membership Inference Attacks
An MIA is a binary classifier that decides whether a textxxwas in the training set, given some access to the model\. Access level determines the attack category:
Black-box attacks use only scalar outputs. Loss (Yeom et al., [2018](https://arxiv.org/html/2605.16445#bib.bib10)) thresholds on NLL directly. Zlib normalises NLL by compressed text length. Ratio divides the fine-tuned model's NLL by the base model's NLL.
Grey-box attacks use model NLL for arbitrary masked inputs but not internal states. SAMA (Chen et al., [2026](https://arxiv.org/html/2605.16445#bib.bib5)) falls here.
White-box attacks have full access to weights, hidden states, and attention maps. Our attack is white-box on the target model, but we also study a shadow-model variant that requires no target-domain labels.
We evaluate on the MIMIR benchmark (Duan et al., [2024](https://arxiv.org/html/2605.16445#bib.bib12)), which provides membership-labelled samples across six text domains with a controlled n-gram split to prevent trivial lexical overlap.
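The three black-box scores described above can be sketched in a few lines. This is an illustrative version only: the sign convention (higher score means "more likely a member") and the function names are assumptions, and the NLL values are taken as given inputs.

```python
import zlib

def loss_score(nll_ft):
    """Loss attack: lower fine-tuned NLL suggests membership, so negate it."""
    return -nll_ft

def zlib_score(nll_ft, text):
    """Zlib attack: NLL divided by the zlib-compressed length of the text in bytes."""
    compressed_len = len(zlib.compress(text.encode("utf-8")))
    return -nll_ft / compressed_len

def ratio_score(nll_ft, nll_base):
    """Ratio attack: fine-tuned NLL over base NLL; smaller ratios suggest membership."""
    return -nll_ft / nll_base

# Example with made-up NLL values for one text.
sample = "def add(a, b):\n    return a + b\n"
scores = (loss_score(3.0), zlib_score(3.0, sample), ratio_score(3.0, 4.0))
```

All three reduce the model's behaviour to one scalar per text, which is the limitation the trajectory features are meant to address.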
### 2.3 SAMA Baseline
SAMA (Chen et al., [2026](https://arxiv.org/html/2605.16445#bib.bib5)) runs in two phases. Phase I applies progressive cumulative masking over $T$ steps. At each step $s$ it draws $N$ random token subsets $\{U_n\}$ and computes $\beta_s = \frac{1}{N} \sum_n \mathbf{1}[\ell_R(U_n) > \ell_T(U_n)]$, where $\ell_R$ and $\ell_T$ are subset NLL from the base and fine-tuned models. Phase II aggregates with harmonic weights:
$$\phi(x) = \sum_{s=1}^{T} \frac{1/s}{\sum_{j=1}^{T} 1/j}\, \beta_s. \tag{3}$$
We run SAMA with $T=4$, $N=128$, $M=10$ tokens per subset. It is designed for moderate memorisation; Section [5](https://arxiv.org/html/2605.16445#S5) shows where it falls short compared to trajectory-based features.
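The Phase II aggregation in Eq. (3) is simple enough to check by hand. The sketch below computes $\beta_s$ from paired subset NLLs and applies the harmonic weights; the function names are illustrative, and the subset NLL lists are assumed to be computed elsewhere.

```python
def harmonic_weights(T):
    """Weights (1/s) / sum_j (1/j) for s = 1..T, as in Eq. (3)."""
    norm = sum(1.0 / j for j in range(1, T + 1))
    return [(1.0 / s) / norm for s in range(1, T + 1)]

def beta(base_nll, ft_nll):
    """Fraction of subsets where the base-model NLL exceeds the fine-tuned NLL."""
    if len(base_nll) != len(ft_nll):
        raise ValueError("need one base and one fine-tuned NLL per subset")
    wins = sum(1 for r, t in zip(base_nll, ft_nll) if r > t)
    return wins / len(base_nll)

def sama_score(betas):
    """Harmonic-weighted sum of the per-step win rates beta_1..beta_T."""
    weights = harmonic_weights(len(betas))
    return sum(w * b for w, b in zip(weights, betas))

# With T = 4 the weights are 12/25, 6/25, 4/25, 3/25, so early steps dominate.
weights_t4 = harmonic_weights(4)
```

Because the weights sum to one, the score stays in $[0, 1]$, and a text that only "wins" at the first step already receives a score of 0.48.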
## 3 Method
### 3.1 Threat Model
We consider two attack settings. The first is full white-box: the attacker has the fine-tuned model weights and can run forward passes with any masking pattern, reading hidden states, attention maps, and gradient tensors. This models a scenario where weights are published or leaked, which is realistic for models shared on public hubs. The second is shadow-model transfer: the attacker does not have the target model's membership labels, but does have the architecture and fine-tuning recipe, and can train surrogate models on unrelated data. Both settings are evaluated in Section [5](https://arxiv.org/html/2605.16445#S5).
### 3.2 Memorisation Verification
Before running attacks we check that fine\-tuning has actually induced memorisation\. We compute the ELBO gap:
$$\Delta_{\mathrm{mem}}(x) = \mathcal{L}_{\mathrm{base}}(x) - \mathcal{L}_{\mathrm{FT}}(x), \tag{4}$$
where $\mathcal{L}_{\mathrm{base}}$ is the base model ELBO and $\mathcal{L}_{\mathrm{FT}}$ is the fine-tuned model ELBO on the same text. A positive gap means fine-tuning lowered the reconstruction cost for that text. Across our six domains, member gaps run from 2.19 to 2.75 nats and non-member gaps from 1.39 to 2.12 nats (Figure [1](https://arxiv.org/html/2605.16445#S3.F1)). The separation confirms memorisation is present and measurable.
### 3.3 Diffusion-Trajectory Feature Extraction
Each challenge text $x$ is evaluated at masking ratios $\boldsymbol{\alpha} = \{0.05,\, 0.20,\, 0.35,\, 0.50\}$ ($T=4$ levels), with $K=8$ independent random masks per level to reduce variance. At each level $\alpha_j$ we compute 11–12 scalar signals, giving a 46-dimensional feature vector $\phi(x)$ after concatenation. The main signal groups are:
ELBO trajectory (4 dims): the masked NLL $\mathcal{L}(x; \alpha_j)$ at each ratio. Members have lower values and a more concave curve.
ELBO derivatives (8 dims): finite-difference estimates of $dL/dt$ and $d^2L/dt^2$.
Predictive entropy (4 dims): mean $-\sum_v p_\theta(v \mid x_{\alpha_j}) \log p_\theta(v \mid x_{\alpha_j})$ over masked positions at each level.
Mask consistency (4 dims): agreement between two independent mask realisations at the same $\alpha_j$.
Hidden-state statistics (8 dims): mean $\ell_2$-norm and cosine similarity between the fine-tuned and base model hidden states.
Attention signals (16 dims): attention entropy, cross-layer correlation, transport barycenter, and attention perturbation at each masking level.
Cross-model cosine (1 dim) and ELBO variance (1 dim) round out the 46 dimensions. The full breakdown is in Appendix [C](https://arxiv.org/html/2605.16445#A3).
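As a rough illustration of the ELBO trajectory and derivative groups, the sketch below averages a masked NLL over $K=8$ random masks at each of the four ratios and then takes finite differences on the $\alpha$ grid. The `masked_nll` callable is a hypothetical stand-in for a model forward pass, and the differencing scheme is an assumption: it yields 3 first and 2 second differences, not the 8 derivative dimensions listed above, whose exact construction is given in Appendix C.

```python
import random
import statistics

ALPHAS = (0.05, 0.20, 0.35, 0.50)  # masking ratios from the text
K = 8                              # random masks per level, as in the text

def random_mask(length, alpha, rng):
    """Pick round(alpha * length) distinct positions to mask (at least one)."""
    n = max(1, round(alpha * length))
    return sorted(rng.sample(range(length), n))

def elbo_trajectory(tokens, masked_nll, seed=0):
    """Mean masked NLL at each ratio, averaged over K random masks.

    masked_nll(tokens, positions) is a hypothetical callable returning the
    model's NLL for the given masked positions.
    """
    rng = random.Random(seed)
    trajectory = []
    for alpha in ALPHAS:
        values = [
            masked_nll(tokens, random_mask(len(tokens), alpha, rng))
            for _ in range(K)
        ]
        trajectory.append(statistics.fmean(values))
    return trajectory

def finite_differences(traj, alphas=ALPHAS):
    """Forward first differences and the matching second differences."""
    d1 = [
        (traj[j + 1] - traj[j]) / (alphas[j + 1] - alphas[j])
        for j in range(len(traj) - 1)
    ]
    d2 = [
        2.0 * (d1[j + 1] - d1[j]) / (alphas[j + 2] - alphas[j])
        for j in range(len(d1) - 1)
    ]
    return d1, d2

# Toy stand-in: NLL grows linearly with the fraction of masked tokens.
def toy_nll(tokens, positions):
    return 2.0 * len(positions) / len(tokens)

traj = elbo_trajectory(["tok"] * 20, toy_nll)
d1, d2 = finite_differences(traj)
```

For this linear toy curve the second differences vanish; a memorised text would instead be expected to show a lower curve with non-zero curvature, which is the shape information the classifier uses.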
### 3.4 Attack Classifiers
We train three classifiers on $\phi(x)$: XGBoost (Chen and Guestrin, [2016](https://arxiv.org/html/2605.16445#bib.bib18)) (200 trees, depth 4, lr 0.05, subsample 0.8), LightGBM (same settings), and an MLP with hidden layers (256, 128, 64), dropout 0.3, and early stopping on a 20% validation split. All three use 5-fold stratified cross-validation; out-of-fold probabilities feed the final AUC computation. Bootstrap CIs use 1000 resamples.
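The out-of-fold protocol described above can be sketched without committing to a particular classifier. In the version below, `fit` and `predict` are pluggable callables; the midpoint-threshold "model" is a deliberately trivial stand-in used only to make the example runnable, not the XGBoost, LightGBM, or MLP classifiers from the paper.

```python
import random

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample a fold index so every fold keeps the class balance."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for rank, i in enumerate(idx):
            fold_of[i] = rank % n_folds
    return fold_of

def out_of_fold_scores(features, labels, fit, predict, n_folds=5, seed=0):
    """Score every sample with a model that never saw it during training."""
    fold_of = stratified_folds(labels, n_folds, seed)
    scores = [None] * len(labels)
    for k in range(n_folds):
        train = [i for i, f in enumerate(fold_of) if f != k]
        held_out = [i for i, f in enumerate(fold_of) if f == k]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            scores[i] = predict(model, features[i])
    return scores

# Trivial stand-in model: threshold halfway between the class means of feature 0.
def fit_midpoint(X, y):
    members = [x[0] for x, lab in zip(X, y) if lab == 1]
    others = [x[0] for x, lab in zip(X, y) if lab == 0]
    return (sum(members) / len(members) + sum(others) / len(others)) / 2.0

def predict_midpoint(threshold, x):
    return threshold - x[0]  # lower loss gives a higher membership score

# Toy data: members have lower "ELBO" values than non-members.
X = [[1.0 + 0.01 * i] for i in range(10)] + [[2.0 + 0.01 * i] for i in range(10)]
y = [1] * 10 + [0] * 10
oof = out_of_fold_scores(X, y, fit_midpoint, predict_midpoint)
```

Pooling the out-of-fold scores and computing a single AUC, rather than averaging per-fold AUCs, matches the description that out-of-fold probabilities feed the final AUC computation.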
### 3.5 Shadow-Model Transfer
For settings where the attacker has no membership labels in the target domain, we train $K_s=3$ shadow MDLMs on surrogate data from the ngram_13_0.2 MIMIR split, pooling across all six domains ($\approx$5,695 members total, disjoint from the evaluation set). Features are extracted from each shadow model's training and held-out sequences in the same 46-dimensional space, giving labelled pairs $(\phi_{\mathrm{shadow}}, y_{\mathrm{shadow}})$. An XGBoost classifier trained on this shadow data is then applied directly to features from the target fine-tuned model, with no retraining.
We evaluate five conditions to map out the attack surface (Table [2](https://arxiv.org/html/2605.16445#S5.T2)):
A (Oracle): 5-fold CV using target membership labels. This is the white-box upper bound.
B (Shadow-46): Full 46-dim shadow transfer, no target labels.
C (ELBO+H): 8-dim subset: ELBO and entropy features only.
D (Attn-only): 16-dim attention features only (negative control).
E (Pruned-30): 46-dim minus the 16 attention dimensions.
Figure 1: Memorisation verification: mean ELBO trajectory for member and non-member texts across six domains, evaluated at masking ratios $\alpha \in \{0.05, 0.20, 0.35, 0.50\}$. Member curves sit consistently lower and more concave than non-member curves, confirming that fine-tuning induced measurable memorisation in all six domains.
## 4 Experimental Setup
#### Dataset\.
We use the MIMIR benchmark (Duan et al., [2024](https://arxiv.org/html/2605.16445#bib.bib12)), ngram_13_0.8 split, across six domains: arXiv, GitHub, HackerNews, Pile-CC, PubMed Central, and Wikipedia. This split enforces that members and non-members share at most 20% of their unique 13-grams, which prevents trivial lexical overlap from inflating scores. For each domain we use 300 members and 300 non-members as the classifier evaluation set, with 1,000 members used for fine-tuning. Texts are tokenised with Qwen2Tokenizer and truncated to 256 tokens.
#### Fine\-tuning\.
We fine-tune dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1 separately per domain using AdamW ($\eta = 10^{-4}$, weight decay 0.01), 5 epochs, batch size 8, bfloat16. Training runs on a single NVIDIA L4 GPU (24 GB) and takes roughly 6–7 minutes per domain. We verify memorisation via the ELBO gap before running attacks (Section [3.2](https://arxiv.org/html/2605.16445#S3.SS2)).
#### Baselines.
Black-box: Loss (fine-tuned NLL), Zlib (NLL / compressed length), Ratio (NLL$_{\mathrm{FT}}$ / NLL$_{\mathrm{base}}$). Grey-box: SAMA (Chen et al., [2026](https://arxiv.org/html/2605.16445#bib.bib5)) ($T=4$ steps, $N=128$ subsets, $M=10$ tokens).
#### Metrics\.
AUC\-ROC with 95% bootstrap CIs \(1,000 resamples\), and TPR at fixed FPR thresholds \{0\.1%, 1%, 10%\}\. Low\-FPR TPR is the practically relevant metric: a real auditor wants a small false\-alarm rate while finding as many members as possible\.
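The metrics above can be computed directly from scores and labels. The following is a small reference sketch in plain Python, assuming higher scores mean "member"; it uses a quadratic-time AUC and a percentile bootstrap, which are illustrative choices and may differ from the implementation behind the reported numbers.

```python
import random

def auc_roc(scores, labels):
    """Probability that a random member outscores a random non-member (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(scores, labels, max_fpr):
    """TPR at a score threshold whose FPR does not exceed max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    allowed = int(max_fpr * len(neg))  # false positives we can tolerate
    # Predict "member" when score > threshold; at most `allowed` negatives pass.
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    return sum(s > threshold for s in pos) / len(pos)

def bootstrap_ci(scores, labels, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one of the classes
            stats.append(auc_roc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_resamples)]
    hi = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_auc = auc_roc(toy_scores, toy_labels)
```

Note that with 300 non-members per domain, a 0.1% FPR budget allows zero false positives, so the threshold sits at the highest-scoring non-member.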
## 5 Results
### 5.1 TPR at Low FPR
Figure [2](https://arxiv.org/html/2605.16445#S5.F2) shows TPR at 0.1% and 1% FPR across all six domains. These are the operationally relevant thresholds: a real privacy auditor needs near-zero false-alarm rate while catching as many actual training samples as possible.
At 1% FPR, XGBoost recovers 15–40% of members depending on domain, compared to 3–10% for SAMA. On Pile-CC, XGBoost identifies 37.7% of members at this threshold; on HackerNews it reaches 40.1%. The improvement over SAMA is 4–8$\times$ across domains.
At 0.1% FPR, XGBoost still recovers 4–18% of members. This threshold means one false alarm per 1,000 non-members tested. SAMA cannot operate reliably at this resolution with $N=128$ subsets, while the trajectory classifier does so without degradation.
Figure 2:TPR at 0\.1% FPR \(left\) and 1% FPR \(right\) by domain and method\. XGBoost and MLP substantially outperform SAMA and all black\-box methods at both thresholds\. SAMA is absent from the 0\.1% panel because its score distribution does not resolve at that resolution\.
### 5.2 AUC-ROC Overview
Table [1](https://arxiv.org/html/2605.16445#S5.T1) and Figure [3](https://arxiv.org/html/2605.16445#S5.F3) give the aggregate picture. XGBoost achieves mean AUC 0.878 ($\pm$0.046) and MLP achieves 0.882, both well above the SAMA grey-box baseline at 0.816 ($\pm$0.037). The +0.062 gap over SAMA comes from capturing the full shape of the ELBO curve rather than a scalar per-subset NLL comparison. Black-box attacks (Loss, Zlib, Ratio) cluster between 0.625 and 0.722 and gain nothing from the diffusion structure.
Figure 3: AUC-ROC across six methods and six MIMIR domains. XGBoost and MLP (bottom two rows) consistently reach the highest values. SAMA (fourth row) sits between black-box and white-box attacks.
Table 1: AUC-ROC and TPR@1%FPR across six MIMIR domains. 95% bootstrap CIs shown for XGBoost only (1000 resamples). Bold: best AUC per domain. Black-box: Loss, Zlib, Ratio. Grey-box: SAMA. White-box (ours): XGBoost, MLP.
### 5.3 Shadow-Model Transfer
Table 2: Shadow-model transfer results by condition and domain. A: Oracle (white-box upper bound, 5-fold CV on target labels). B: Shadow-46 (full 46-dim transfer, no target labels). C: ELBO+Entropy (8-dim subset only). D: Attention-only (16-dim, negative control). E: Pruned-30 (attention dims removed). Mean $\pm$ std computed across six domains. "–" for SAMA at 0.1% FPR: SAMA's score distribution does not resolve below 1% FPR reliably with $N=128$ subsets.
The shadow-model attack achieves results close to the white-box oracle across all six domains (Figure [4](https://arxiv.org/html/2605.16445#S5.F4), Table [2](https://arxiv.org/html/2605.16445#S5.T2)). Oracle AUC averages 0.878; Shadow-46 (Condition B) averages 0.858, a gap of only 0.020. More strikingly, at 1% FPR the shadow attack recovers a mean of 19.5% of members with no target-domain labels, compared to the oracle's 27.4%. The gap in absolute detection rate narrows in the domains where ELBO signal is strongest: on Pile-CC, Shadow-46 actually exceeds the oracle's TPR@1%FPR (40.4% vs. 37.7%), an artefact of the cross-domain calibration.
Reducing to just 8 ELBO+entropy dimensions (Condition C) costs only 0.015 AUC and stays within 0.2 percentage points of Shadow-46 at 1% FPR on average, while extracting features roughly 6$\times$ faster. Pruned-30 (Condition E, no attention dims) closely matches Shadow-46 throughout.
Condition D \(attention only\) collapses to AUC 0\.525 and TPR@1%FPR of 1\.6% across domains – essentially random\. Attention patterns are domain\-specific and carry no transferable membership signal\. ELBO\-based signals transfer; attention signals do not\.
Figure 4:Shadow\-model AUC by condition and domain\. Dashed lines show the domain\-specific SAMA baseline\. Oracle \(A\) and Shadow\-46 \(B\) are within 0\.020 of each other in every domain\. Attn\-only \(D\) falls to near\-random\.
### 5.4 Domain Analysis
Figure 5: Aggregate ROC curves (mean $\pm$ std across six domains). XGBoost and MLP dominate across the full FPR range.
Attack difficulty varies by domain (Figure [5](https://arxiv.org/html/2605.16445#S5.F5)). Pile-CC (AUC 0.930, TPR@1%FPR 37.7%) and HackerNews (0.928, 40.1%) are most vulnerable: high lexical diversity means the fine-tuned model's reconstruction advantage concentrates on its training texts. Wikipedia and arXiv are intermediate. GitHub (AUC 0.801, TPR@1%FPR 15.9%) is hardest: source code is structurally regular, so member and non-member files produce similar ELBO values even though the raw member gap is the largest of all domains (2.75 nats). Per-domain ROC and PR curves are in Appendix [F](https://arxiv.org/html/2605.16445#A6).
### 5.5 Feature Ablation
Figure 6: Left: mean LOSO AUC drop per feature group across six domains. Right: mean solo AUC per group using only that group. The ELBO trajectory dominates; all attention groups are near-zero.
Figure [6](https://arxiv.org/html/2605.16445#S5.F6) shows the leave-one-signal-out (LOSO) analysis. Removing the ELBO trajectory group drops mean AUC by 0.130, exceeding the entire gap between XGBoost and SAMA. Predictive entropy is the next most useful group (LOSO drop 0.043). All four attention groups contribute less than 0.006 each. Used alone, the ELBO trajectory achieves solo AUC between 0.67 and 0.84 across domains, approaching SAMA's performance without any subset enumeration.
The shadow conditions reinforce this: ELBO+H (8 dims) reaches 0.843 AUC and 19.2% TPR@1%FPR, while attention-only collapses to 0.525 AUC and 1.6% TPR. Full per-domain tables are in Appendix [D](https://arxiv.org/html/2605.16445#A4).
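The LOSO and solo analyses share one loop over feature groups. The sketch below shows that loop with a pluggable `evaluate(columns)` callable, a hypothetical stand-in for "train the classifier on these columns and return its AUC"; the toy gains and group layout are illustrative values only, not results from the paper.

```python
def ablate(groups, evaluate):
    """Return (full_auc, loso_drop, solo_auc) for named feature groups.

    groups maps a group name to the list of column indices it owns.
    evaluate(columns) is a hypothetical callable that trains and scores a
    classifier restricted to those columns and returns its AUC.
    """
    all_cols = sorted(c for cols in groups.values() for c in cols)
    full_auc = evaluate(all_cols)
    loso_drop, solo_auc = {}, {}
    for name, cols in groups.items():
        owned = set(cols)
        kept = [c for c in all_cols if c not in owned]
        loso_drop[name] = full_auc - evaluate(kept)  # positive: removal hurts
        solo_auc[name] = evaluate(sorted(owned))     # the group on its own
    return full_auc, loso_drop, solo_auc

# Toy stand-in for training: each column adds a fixed amount of AUC above chance.
GAIN = {0: 0.10, 1: 0.08, 2: 0.05, 3: 0.02}  # illustrative values only

def toy_evaluate(cols):
    return 0.5 + sum(GAIN.get(c, 0.0) for c in cols)

toy_groups = {"elbo": [0, 1], "entropy": [2], "attention": [3, 4, 5]}
full_auc, loso_drop, solo_auc = ablate(toy_groups, toy_evaluate)
```

In this additive toy the LOSO drop and the solo gain above chance coincide. With a real classifier they can diverge when groups are correlated, which is how a group can have a useful solo AUC and a near-zero LOSO drop at the same time.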
## 6 Conclusion
Fine-tuned MDLMs are significantly more vulnerable to membership inference than grey-box baselines suggest. Extracting a 46-dimensional diffusion-trajectory feature vector at four masking levels and training gradient-boosted classifiers on top achieves mean AUC 0.878 across six MIMIR domains, +0.062 over SAMA. The ELBO trajectory drives most of this: removing it alone drops AUC by 0.130, while all attention features together contribute less than 0.006.
The shadow-model attack makes this threat practical. With $K=3$ surrogate MDLMs trained on data from different domains, we reach 0.858 mean AUC with no target-domain labels at all. The oracle-to-shadow gap of 0.020 is small enough to consider shadow transfer a viable standalone attack, not just an approximation. Reducing to 8 ELBO+entropy dimensions (Condition C) costs only 0.015 more AUC and makes extraction roughly 6$\times$ faster, offering a good trade-off for resource-constrained auditing.
#### Limitations\.
All experiments use a single model family \(Qwen3\-0\.6B MDLM\) in a fine\-tuned setting with small training sets \(300–1,000 samples per domain\)\. Behaviour on larger models or in the pretraining regime may differ\. The white\-box threat model is also optimistic; deployments that restrict internal access would limit the attack to the shadow\-model path\.
#### Future directions\.
On the defence side, differential privacy during fine-tuning (Abadi et al., [2016](https://arxiv.org/html/2605.16445#bib.bib17)) and explicit regularisation of the ELBO trajectory are natural candidates. On the attack side, extending to LLaDA and other MDLM variants, and estimating the ELBO trajectory from generation-time outputs for fully black-box settings, are open problems.
## References
- M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016). Deep learning with differential privacy. In ACM CCS.
- J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg (2021). Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, Vol. 34, pp. 17981–17993.
- N. Carlini, S. Chien, M. Nasr, S. Song, A. Terzis, and F. Tramèr (2022). Membership inference attacks from first principles. In IEEE Symposium on Security and Privacy.
- N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel (2021). Extracting training data from large language models. In USENIX Security Symposium.
- T. Chen and C. Guestrin (2016). XGBoost: a scalable tree boosting system. arXiv preprint arXiv:1603.02754.
- Y. Chen, K. Zhang, Y. Du, E. Stoppa, C. Fleming, A. Kundu, B. Ribeiro, and N. Li (2026). Membership inference attacks against fine-tuned diffusion language models. In International Conference on Learning Representations. arXiv:2601.20125.
- T. Dockhorn, T. Cao, A. Vahdat, and K. Kreis (2023). Differentially private diffusion models. In Transactions on Machine Learning Research.
- M. Duan, A. Suri, N. Mireshghallah, S. Min, W. Shi, L. Zettlemoyer, Y. Tsvetkov, Y. Choi, D. Evans, and H. Hajishirzi (2024). Do membership inference attacks work on large language models? In Conference on Language Modeling.
- S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025). LLaDA: large language diffusion with masking. arXiv preprint arXiv:2502.09992.
- Qwen Team (2025). Qwen3 technical report. arXiv preprint.
- S. S. Sahoo, M. Arriola, A. Gokaslan, E. Marroquin, A. M. Rush, Y. Schiff, J. T. Chiu, and V. Kuleshov (2024). Simple and effective masked diffusion language models. In Advances in Neural Information Processing Systems.
- R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017). Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy, pp. 3–18.
- S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha (2018). Privacy risk in machine learning: analyzing the connection to overfitting. In IEEE Computer Security Foundations Symposium, pp. 268–282.
## Appendix A Proof-of-Concept: Run 1 (Single-Domain Overfitting)
Before running the full six\-domain pipeline we validated the attack framework on a single controlled experiment: intentionally overfit the MDLM on a small dataset and verify that white\-box trajectory features can recover membership under extreme memorisation\.
### Setup
We fine-tuned Qwen3-0.6B-diffusion-mdlm-v0.1 on 300 GitHub member texts for 15 epochs (AdamW, $\eta = 10^{-4}$, batch 8, bfloat16). The intent was to push memorisation hard and stress-test the feature extraction pipeline before scaling to six domains.
### Memorisation Verification
The mean member ELBO gap reached 4.625 nats: the fine-tuned model reconstructs training texts at almost no reconstruction cost at any masking level. The non-member gap was $-0.130$ nats (base and fine-tuned model are essentially equivalent on unseen text). Figure [7](https://arxiv.org/html/2605.16445#A1.F7) shows the distribution.
Figure 7: ELBO gap distribution from Run 1 (GitHub, 15 epochs). Member gap mean: 4.625 nats; non-member gap mean: $-0.130$ nats. The two distributions are nearly fully separated.
### Attack Results
### Discussion
White\-box trajectory features cleanly recover membership even under extreme overfitting\. The ELBO curve is consistently lower and more concave for member texts regardless of how low the absolute reconstruction values get, so the classifier can separate the two populations even when the member gap is nearly 5 nats above zero\.
This experiment confirmed that the feature extraction pipeline is sound. It also showed that 15 epochs is too many for the main experiments: the model over-memorises in a way that collapses the discrimination into a trivial task and moves the operating point far outside the regime of practical deployments. The main experiments use 5 epochs (see Appendix [B](https://arxiv.org/html/2605.16445#A2)), which produces member ELBO gaps of 2.19–2.75 nats across domains and keeps the attack in a more realistic range.
## Appendix B Hyperparameter Search: Run 2
Run 2 extended the proof-of-concept by introducing all six attack methods (Loss, Zlib, Ratio, SAMA, XGBoost, MLP) on the GitHub domain and varying the number of fine-tuning epochs to find the operating point with the best membership discriminability.
### Epoch Ablation
The central finding is that discriminability is *non-monotone* in training epochs. With too few epochs the model has not memorised sufficiently to create an ELBO gap; with too many epochs (15 as in Run 1), over-memorisation causes the SAMA baseline to collapse and makes the domain generally "easy" regardless of membership status. The 5-epoch operating point produced ELBO member gaps of 2.19–2.75 nats across domains, which is within SAMA's designed operating range, while maintaining strong XGBoost discriminability.
### Final Hyperparameter Configuration
All experiments in Run 3 and Run 4 use these exact hyperparameters\.
## Appendix C Full Feature Taxonomy (46 Dimensions)
Table [3](https://arxiv.org/html/2605.16445#A3.T3) lists all 46 features in the order they appear in the feature vector $\phi(x)$. Features are extracted at each of four masking ratios $\alpha_j \in \{0.05, 0.20, 0.35, 0.50\}$ unless otherwise noted (marked "global"). The *Expected direction* column indicates the sign of the difference $\phi_{\text{member}} - \phi_{\text{non-member}}$ predicted by the memorisation hypothesis; this is used as a sanity check but not enforced during training.
Table 3: Complete 46-dimensional feature vector description. $K=8$ independent random masks are averaged per $(\text{feature}, \alpha_j)$ pair.
#### Notes.
Features in the ELBO Variance, $dL/dt$, and $d^2L/dt^2$ groups show near-zero LOSO drops (Table [4](https://arxiv.org/html/2605.16445#A4.T4)) despite representing distinct mathematical quantities. This is because at $T=4$ masking levels, the finite-difference estimates of derivatives are nearly collinear with the raw ELBO trajectory values at those four points; removing either set leaves the other to compensate almost perfectly. Extending to $T \geq 8$ levels may break this collinearity and reveal independent signal in the derivative features.
## Appendix D Full Ablation Study (Run 3)
### D.1 LOSO AUC Drop: All Domains
Table 4: Leave-One-Signal-Out (LOSO) AUC drop per feature group and domain. Positive values indicate removing that group hurts performance. Bold: largest drop (most important). "–" denotes negligible or negative drop ($<$0.001), indicating the feature group is redundant or slightly harmful. Full-model AUC: arXiv 0.877, GitHub 0.801, HackerNews 0.928, Pile-CC 0.930, PubMed 0.842, Wikipedia 0.892.
Table [4](https://arxiv.org/html/2605.16445#A4.T4) shows per-domain LOSO drops alongside the mean. Full-model AUC baseline: arXiv 0.877, GitHub 0.801, HackerNews 0.928, Pile-CC 0.930, PubMed 0.842, Wikipedia 0.892.
### D.2 Solo AUC: All Domains
Table 5:Solo AUC per feature group \(using only that group’s dimensions\) per domain\. Random baseline is 0\.500\.
### D\.3Discussion
Derivative features show near\-zero LOSO drop\.dL/dtdL/dtandd2L/dt2d^\{2\}L/dt^\{2\}are finite\-difference estimates computed from the same four ELBO values as the trajectory group\. AtT=4T=4levels,dL/dt\|tj≈ℒ\(tj\+1\)−ℒ\(tj−1\)dL/dt\|\_\{t\_\{j\}\}\\approx\\mathcal\{L\}\(t\_\{j\+1\}\)\-\\mathcal\{L\}\(t\_\{j\-1\}\), which is a linear combination of the trajectory values\. The two groups are nearly collinear, so removing either one leaves the other to compensate\. Extending toT≥8T\\geq 8masking levels would break this collinearity\.
Attention features are domain-specific. Solo AUC for all attention groups is 0.47–0.53 across most domains, near random. In MDLMs the masked denoising objective produces attention patterns that depend heavily on the specific masking configuration and domain vocabulary, making them inconsistent across the $K=8$ mask realisations. This is why they also fail under shadow-model transfer (Condition D, AUC 0.525): they contain no signal that generalises across domains.
Minimal effective attack. ELBO trajectory and predictive entropy together (8 dimensions) achieve mean AUC 0.843 in shadow transfer. This is cheaper than the full 46-dim vector and still above SAMA.
Figure 8: LOSO AUC drop (left) and solo AUC (right), averaged across six domains.
## Appendix E Shadow Model: Full Results (Run 4)
### E.1 Experimental Details
Three shadow MDLMs ($K_{s}=3$) are fine-tuned on surrogate data from the `ngram_13_0.2` MIMIR split, which is disjoint from the `ngram_13_0.8` evaluation set. The surrogate pool covers all six domains and contains roughly 5,695 member sequences total, with about 1,000 per shadow model. Non-member texts come from Wikipedia and Common Crawl subsets not used in evaluation. Each shadow model uses the same hyperparameters as the target: 5 epochs, AdamW with $\eta=10^{-4}$, bfloat16.
A single XGBoost classifier is trained on features pooled from all three shadow models and applied to the target domain without any retraining. No target-domain labels are used.
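The pooling-then-transfer protocol can be outlined as follows. This is a rough sketch under loud assumptions: `synth_features` is a hypothetical stand-in for extracting features from a fine-tuned model, the three `shift` values mimic shadow models with slightly different memorisation strength, and a plain gradient-descent logistic regression replaces XGBoost so the example runs with the standard library alone.

```python
import math
import random

def synth_features(n, shift, rng, dim=4):
    """Toy stand-in for per-sample feature extraction from one model.
    Members (label 1) get reconstruction-loss features lowered by `shift`."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(-shift * label, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y

def fit_logistic(X, y, epochs=200, lr=0.1):
    """Batch gradient-descent logistic regression (stand-in for XGBoost)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for k, xi in enumerate(x):
                gw[k] += (p - t) * xi
            gb += p - t
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b

def auc(scores, labels):
    """Rank-based AUC: chance that a random member outscores a random non-member."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)

# 1. Each shadow model yields labelled features, since the attacker knows its members.
shadow_sets = [synth_features(300, shift, rng) for shift in (0.8, 1.0, 1.2)]

# 2. Pool the features from all shadow models and fit ONE classifier.
X_pool = [x for X, _ in shadow_sets for x in X]
y_pool = [t for _, y in shadow_sets for t in y]
clf = fit_logistic(X_pool, y_pool)

# 3. Apply to the target without retraining; target labels are used only for scoring.
X_tgt, y_tgt = synth_features(400, 1.0, rng)
transfer_auc = auc([clf(x) for x in X_tgt], y_tgt)
```

The point of the structure is that step 2 never sees target-domain labels: whatever the classifier learns about member versus non-member features has to come from the shadow models alone.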
### E.2 Full Condition $\times$ Domain Results
Table [2](https://arxiv.org/html/2605.16445#S5.T2) in the main text gives the complete AUC matrix. Figure [9](https://arxiv.org/html/2605.16445#A5.F9) shows the same data as a heatmap.
Figure 9: Shadow-model AUC by condition and domain. Condition D (attention only) is near-random universally. Conditions B, C, E all sit above SAMA in most domains.
### E.3 Per-Group Transfer AUC
Figure [10](https://arxiv.org/html/2605.16445#A5.F10) shows the shadow transfer AUC when training on one feature group at a time, analogous to the solo AUC analysis in Appendix [D](https://arxiv.org/html/2605.16445#A4) but in the transfer setting.
Figure 10: Per-group shadow transfer AUC (mean $\pm$ std across six domains). ELBO trajectory is the only group that transfers above 0.70. All attention groups sit near 0.50.
### E.4 Discussion
Why attention features do not transfer. Attention heads in MDLMs specialise to local syntactic patterns of each domain: mathematical notation in arXiv, Python indentation in GitHub, informal pronouns in HackerNews. A classifier trained on these patterns in the shadow domain applies to the wrong pattern space in the target. ELBO trajectory, by contrast, depends mainly on the model's memorisation degree, which is set by the training recipe and generalises across text domains when the architecture is fixed.
ELBO+H as a minimal attack. Condition C (ELBO+H, 8 dimensions) achieves 0.843 mean AUC with roughly $6\times$ less extraction cost than the full 46-dim vector. For practical auditing this offers a strong accuracy-to-cost ratio: 0.843 AUC with no target labels, compared to SAMA's 0.816 AUC which does require black-box model access per sample.
## Appendix F Per-Domain ROC and PR Curves
Figure 11: Per-domain ROC curves for all five evaluated methods. The random baseline diagonal is dotted. AUC values appear in the legend of the arXiv panel; method colours are consistent across panels.
Figure 12: Per-domain precision-recall curves. The horizontal dotted line is the random classifier baseline (class balance = 0.5).
#### Domain notes.
arXiv. XGBoost and MLP track closely above SAMA at all FPR thresholds.
GitHub. SAMA (AUC 0.795) nearly matches XGBoost (0.801). This is the only domain where SAMA closes to within 0.01 of the white-box classifier, likely because source code has low perplexity and SAMA's NLL comparison works reasonably well.
HackerNews. Largest raw gap: XGBoost 0.928 vs SAMA 0.833. Short, informal sentences with distinctive vocabulary produce particularly clean ELBO trajectories for member texts.
Pile-CC. XGBoost achieves its highest TPR at 10% FPR (0.803) here. SAMA also performs well (0.885) relative to other domains.
PubMed Central. Highly formulaic biomedical text reduces the ELBO curvature difference between members and non-members.
Wikipedia. MLP slightly outperforms XGBoost (0.902 vs 0.892). This is the only domain where MLP takes the top spot; encyclopedic text's consistent structure may favour the MLP's learned feature interactions.
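The domain notes quote TPR at a fixed 10% FPR alongside AUC. As a minimal sketch of how such an operating point can be read off attack scores, the hypothetical helper below picks the loosest score threshold whose false-positive rate does not exceed the target; the scores and labels are toy values, not results from the paper.

```python
def tpr_at_fpr(scores, labels, target_fpr=0.10):
    """TPR at the loosest threshold whose FPR stays <= target_fpr.
    Convention: higher score = more member-like; label 1 = member."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    allowed_fp = int(target_fpr * len(neg))  # non-members we may wrongly flag
    # Flag everything strictly above the (allowed_fp + 1)-th highest non-member score,
    # so at most `allowed_fp` non-members are flagged.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else float("-inf")
    return sum(s > threshold for s in pos) / len(pos)

# Toy attack scores: 5 members followed by 10 non-members.
member_scores = [0.90, 0.80, 0.60, 0.55, 0.30]
non_member_scores = [0.70, 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]
scores = member_scores + non_member_scores
labels = [1] * len(member_scores) + [0] * len(non_member_scores)

print(tpr_at_fpr(scores, labels, target_fpr=0.10))  # 0.8 (one false positive allowed)
print(tpr_at_fpr(scores, labels, target_fpr=0.0))   # 0.4 (no false positives allowed)
```

Reporting this number next to AUC is useful because two attacks with similar AUC can behave very differently in the low-FPR region, which is where an auditor acting on individual predictions would operate.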
Figure 13: KDE of the four most discriminative features for arXiv. ELBO values at low masking ratios show the clearest separation.