
# On Privacy Leakage in Tabular Diffusion Models: Influential Factors, Attacker Knowledge, and Metrics
Source: [https://arxiv.org/html/2605.06835](https://arxiv.org/html/2605.06835)
Masoumeh Shafieinejad¹, D. B. Emerson¹,∗, Behnoosh Zamanlooy²,†, Elaheh Bassak³,†, Fatemeh Tavakoli¹, Sara Kodeiri¹, Marcelo Lotif¹, Xi He¹,⁴

¹Vector Institute ²McMaster University ³University of Toronto ⁴University of Waterloo

{masoumeh, david.emerson, fatemeh.tavakoli}@vectorinstitute.ai, {sara.kodeiri, marcelo.lotif}@vectorinstitute.ai, zamanlob@mcmaster.ca, e.bassak@mail.utoronto.ca, xi.he@uwaterloo.ca

† Work done while at The Vector Institute.

###### Abstract

Tabular data plays an important role in many fields and industries, including those with elevated privacy considerations and risks. As such, there is a rising interest in generating high-quality synthetic proxies for real tabular data as a means of reducing privacy risk and proprietary data exposure. With tabular diffusion models (TDMs) demonstrating leading performance in synthesizing such data, understanding and measuring the privacy risks associated with these models is imperative. Leveraging state-of-the-art membership inference attacks for TDMs in both black- and white-box settings, this work quantifies the impact of training setup, synthesis choices, and attacker knowledge on privacy leakage. Moreover, the results demonstrate that adversaries need not have perfect knowledge of the training setup, identical data distributions, or massive compute resources to construct successful attacks. Finally, the pitfalls associated with applying heuristic privacy metrics, such as distance-to-closest record, are revealed.

## 1 Introduction

With the accelerating importance of machine learning (ML) models in all aspects of society, there has been a corresponding rise in interest in using advanced models to generate structured synthetic data [[25](https://arxiv.org/html/2605.06835#bib.bib77), [54](https://arxiv.org/html/2605.06835#bib.bib76), [19](https://arxiv.org/html/2605.06835#bib.bib75), [50](https://arxiv.org/html/2605.06835#bib.bib74)]. One of the most prominent modalities is tabular data, given its ubiquity across industries. High-fidelity data synthesis has many applications, including training data augmentation [[18](https://arxiv.org/html/2605.06835#bib.bib56), [14](https://arxiv.org/html/2605.06835#bib.bib81), [63](https://arxiv.org/html/2605.06835#bib.bib80), [28](https://arxiv.org/html/2605.06835#bib.bib79), [12](https://arxiv.org/html/2605.06835#bib.bib78)], improving model robustness [[5](https://arxiv.org/html/2605.06835#bib.bib82), [61](https://arxiv.org/html/2605.06835#bib.bib83), [44](https://arxiv.org/html/2605.06835#bib.bib84), [39](https://arxiv.org/html/2605.06835#bib.bib85)], and exploring hypothetical or novel circumstances. However, one of the primary drivers of interest in synthetic tabular data is its potential as a proxy for real data with reduced privacy risks. In many settings, existing and emerging privacy legislation, such as GDPR, HIPAA, and PIPEDA [[17](https://arxiv.org/html/2605.06835#bib.bib51), [42](https://arxiv.org/html/2605.06835#bib.bib44), [55](https://arxiv.org/html/2605.06835#bib.bib52)], regulates the collection, use, and release of sensitive information and datasets. Moreover, many institutions strictly manage internal risk associated with privacy breaches or data misuse. As such, synthetically generated data is regarded as a prospective solution to privacy risks. While synthetic data has many advantages and can raise barriers to privacy breaches, it is not intrinsically privacy preserving [[52](https://arxiv.org/html/2605.06835#bib.bib3)], motivating significant research in privacy-preserving synthetic generation and privacy auditing techniques [[51](https://arxiv.org/html/2605.06835#bib.bib25), [31](https://arxiv.org/html/2605.06835#bib.bib86), [59](https://arxiv.org/html/2605.06835#bib.bib87), [47](https://arxiv.org/html/2605.06835#bib.bib88), [22](https://arxiv.org/html/2605.06835#bib.bib89)].

Recent work has shown that tabular diffusion models (TDMs) achieve leading performance in tabular synthesis along a number of metrics [[40](https://arxiv.org/html/2605.06835#bib.bib6), [32](https://arxiv.org/html/2605.06835#bib.bib16), [49](https://arxiv.org/html/2605.06835#bib.bib57)]. While some research suggests that diffusion models are robust to overfitting [[6](https://arxiv.org/html/2605.06835#bib.bib58)], other work has shown that TDMs still leak information [[58](https://arxiv.org/html/2605.06835#bib.bib59)], sometimes at higher rates than their GAN-based counterparts [[67](https://arxiv.org/html/2605.06835#bib.bib60)]. Differentially private (DP) training can reduce privacy risks [[29](https://arxiv.org/html/2605.06835#bib.bib63), [4](https://arxiv.org/html/2605.06835#bib.bib64)] but often comes with steep utility and efficiency tradeoffs, especially without large-scale public pretraining [[13](https://arxiv.org/html/2605.06835#bib.bib61), [35](https://arxiv.org/html/2605.06835#bib.bib62), [67](https://arxiv.org/html/2605.06835#bib.bib60)]. As such, it is important to identify and quantify the mechanisms that impact privacy leakage in TDMs to guide practical and safe use of such models.

In this work, state-of-the-art membership inference attacks (MIAs) for TDMs are applied as a lens to quantify privacy leakage. Such attacks have emerged as a primary privacy auditing technique for ML models [[7](https://arxiv.org/html/2605.06835#bib.bib21), [24](https://arxiv.org/html/2605.06835#bib.bib28), [37](https://arxiv.org/html/2605.06835#bib.bib32), [45](https://arxiv.org/html/2605.06835#bib.bib30), [51](https://arxiv.org/html/2605.06835#bib.bib25)]. Moreover, successful MIAs can be used to build even deeper privacy attacks [[46](https://arxiv.org/html/2605.06835#bib.bib65), [3](https://arxiv.org/html/2605.06835#bib.bib66)]. Leveraging leading white- and black-box MIAs, this research investigates the impact of training choices, architecture configurations, and generation decisions on privacy leakage. Furthermore, the influence of attacker knowledge, compute power, and data access on the success of MIAs is examined. Finally, the efficacy of distance-to-closest-record (DCR) [[34](https://arxiv.org/html/2605.06835#bib.bib24)], and other widely used pseudo-privacy metrics, in identifying rising or falling privacy risk is compared to MIAs.

The results show that some traditional levers, such as training steps or dataset size, remain important for privacy in TDMs. Others depend on the MIA technique, reinforcing the need to evaluate against a diversity of attack approaches, and some are surprisingly unimportant. We also show that such models remain highly vulnerable to attacks even when attackers have limited compute, imperfect knowledge, or corrupted data. Lastly, experiments demonstrate a nuanced and unreliable relationship between pseudo-privacy metrics, such as DCR, and MIA success. For example, in certain settings, DCR is well positioned to estimate rising MIA risk, while in others, it fails entirely. These findings have broad implications for the practical use of TDMs and reinforce the importance of MIA auditing in synthetic data generation.

## 2 Related Work

MIAs have been established as a foundational technique for quantifying privacy risk for data used to train ML models [[51](https://arxiv.org/html/2605.06835#bib.bib25), [24](https://arxiv.org/html/2605.06835#bib.bib28), [45](https://arxiv.org/html/2605.06835#bib.bib30), [37](https://arxiv.org/html/2605.06835#bib.bib32), [8](https://arxiv.org/html/2605.06835#bib.bib31), [7](https://arxiv.org/html/2605.06835#bib.bib21)]. The success of diffusion models has driven advances in MIAs specifically designed for such models [[15](https://arxiv.org/html/2605.06835#bib.bib34)]. In particular, MIAs for TDMs have grown considerably. Successful, model-agnostic, black-box methods include DOMIAS [[56](https://arxiv.org/html/2605.06835#bib.bib22)], RMIA [[65](https://arxiv.org/html/2605.06835#bib.bib92)], EPT [[21](https://arxiv.org/html/2605.06835#bib.bib67)], and others [[58](https://arxiv.org/html/2605.06835#bib.bib59), [43](https://arxiv.org/html/2605.06835#bib.bib26)]. For white-box settings, the SecMI approach [[15](https://arxiv.org/html/2605.06835#bib.bib34)] has been adapted for tabular models [[9](https://arxiv.org/html/2605.06835#bib.bib68), [60](https://arxiv.org/html/2605.06835#bib.bib37), [62](https://arxiv.org/html/2605.06835#bib.bib49)]. In this work, we focus on state-of-the-art approaches from the MIDST Challenge [[48](https://arxiv.org/html/2605.06835#bib.bib33)], ensuring the experimental framework leverages strong MIAs in both white- and black-box settings.

Some existing work has considered factors that impact MIA success rates and utility, both for ML models broadly and for generative models for tabular data in controlled settings in particular. Salem et al. [[45](https://arxiv.org/html/2605.06835#bib.bib30)] propose dropout and model stacking as defenses against MIAs for classification models. Therein, for the simple class of models studied, it is shown that shadow models need not have identical architectures or data distributions for successful MIAs. Another study [[23](https://arxiv.org/html/2605.06835#bib.bib35)] investigates the effect of input-output ratio on the statistical similarity and predictive utility of synthetic data. In [[26](https://arxiv.org/html/2605.06835#bib.bib36)], overfitting is examined as a key driver of MIA success across different models, including GAN- and VAE-based models. A taxonomy of existing defenses is provided, with the majority of defenses for generative models falling under DP techniques. The authors also show that traditional methods for combating overfitting are not necessarily sufficient. Finally, for simple classification models, studies have shown that training data diversity plays an important role, with MIA success decreasing as training data grows [[51](https://arxiv.org/html/2605.06835#bib.bib25)]. The present work focuses on a wide range of mechanisms and attack regimes that influence privacy leakage in TDMs, going well beyond previous studies.

Several studies investigate the interplay of DP and MIA success in the context of general ML models. Theoretical results demonstrate that small-$\epsilon$ DP provides an upper bound on MIA success [[16](https://arxiv.org/html/2605.06835#bib.bib69), [64](https://arxiv.org/html/2605.06835#bib.bib70)], while empirical studies suggest that larger $\epsilon$ values offer practical MIA protection [[36](https://arxiv.org/html/2605.06835#bib.bib71), [30](https://arxiv.org/html/2605.06835#bib.bib72)]. In [[27](https://arxiv.org/html/2605.06835#bib.bib20)], the impact of data dependencies on the success of DP in protecting data against MIAs for ML classifiers is investigated. The results indicate that DP does not fully protect against MIAs due to the correlative nature of the data distribution. While DP can provide privacy leakage protection, it can also significantly degrade synthetic data quality in diffusion models [[67](https://arxiv.org/html/2605.06835#bib.bib60)]. Moreover, understanding the mechanisms that reduce leakage in TDMs, as undertaken herein, provides a path to combining the benefits of high-$\epsilon$ DP with such levers to minimize privacy risk while maintaining maximum utility.

The impact of training decisions and attacker configurations on proxy metrics for privacy risk in synthetic data, especially distance-based measures, has been considered for TDMs. Experiments in [[20](https://arxiv.org/html/2605.06835#bib.bib55)] investigate the influence of factors such as training epochs, dataset size, and model type on a thresholded form of NNDR [[53](https://arxiv.org/html/2605.06835#bib.bib73)]. Other work has examined whether DCR is a faithful indicator of privacy risk for such models [[62](https://arxiv.org/html/2605.06835#bib.bib49)]. It is demonstrated that, for a limited setup, DCR-based tests fail to reveal leakage uncovered by MIAs. The experiments in this work significantly expand these studies in several directions. We investigate the relationship of several heuristic privacy metrics to MIA success across a variety of setups, revealing a nuanced and unreliable relationship wherein these metrics reflect rising privacy risk in some settings and fail dramatically in others. Further, the impact of model training, attack configuration, and adversary knowledge is explored with respect to MIA success. Finally, other black-box MIAs that do not require internal parameters are considered.

## 3 Methodology

This research focuses on single-table synthesis and the ClavaDDPM model [[40](https://arxiv.org/html/2605.06835#bib.bib6)], a state-of-the-art adaptation of TabDDPM [[32](https://arxiv.org/html/2605.06835#bib.bib16)] and representative of data-space diffusion generators. The primary lens through which privacy is assessed is that of MIAs. Provided a set of data points and a trained target model, the goal of an MIA is to accurately distinguish between points that were members of the model's training dataset and those that were not. In the domain of synthetic data, such methods are commonly broken into two main categories: white- and black-box attacks. Black-box attacks strictly operate on the synthetic outputs of the target model, potentially with access to related public data. White-box attacks, on the other hand, are afforded full model access, including weights. Experiments are conducted with the Tartan Federer (TF) [[60](https://arxiv.org/html/2605.06835#bib.bib37)] and Ensemble [[33](https://arxiv.org/html/2605.06835#bib.bib90)] approaches from the MIDST Challenge [[48](https://arxiv.org/html/2605.06835#bib.bib33)]. The TF method won the competition in both white- and black-box settings, while the Ensemble technique placed second for black-box attacks.¹

¹ Model training and attack code: [https://github.com/VectorInstitute/midst-toolkit](https://github.com/VectorInstitute/midst-toolkit)

The Ensemble and TF attacks are fundamentally different in kind. The Ensemble method combines several traditional distance-based approaches, probability mass estimation from DOMIAS, and pairwise likelihood ratio tests from RMIA. The approach only requires access to synthetic data generated by the target model and real, similarly distributed background data as a reference. As such, it is not directly tied to diffusion architectures. However, its success can rely heavily on auxiliary data availability. Alternatively, whether applied as a black- or white-box attack, the TF approach is specifically tailored to diffusion models and adapts the SecMI algorithm. Briefly, given a data point $x_0$, a forward process producing noisy samples at time $t$ such that $x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon$ for some sampled Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, and a backward-process model $m_\theta(x_t, t)$ that predicts and removes the noise in $x_t$, the TF attack extracts loss values for a candidate $x_0$ as $\|(m_\theta(x_t, t) - x_0) - \epsilon\|_2^2$ across a wide range of $\epsilon$ and $t$ values. Leveraging shadow models, these losses are used to train a dense neural network (DNN) for membership classification. In the white-box setting, the target model is directly available to produce such loss values at inference time. For the black-box attack, proxies are constructed by training shadow models on the target's synthetic data, from which the loss values are then extracted.
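To make the feature extraction concrete, the following is a minimal sketch of the loss computation described above, assuming a PyTorch model `m_theta(x_t, t)` that predicts the denoised record and a tensor `alphas` holding the cumulative noise schedule; the function and argument names are illustrative rather than the released toolkit's API.

```python
import torch

def tf_loss_features(m_theta, x0, alphas, t_grid, n_noise=8):
    """Sketch of TF/SecMI-style feature extraction for one candidate x0.

    Assumes m_theta(x_t, t) predicts the denoised record from the noisy
    sample x_t, and alphas holds the cumulative schedule so that
    x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps.
    Returns one loss per (t, eps) draw, used as classifier input features.
    """
    features = []
    for t in t_grid:
        a_t = alphas[t]
        for _ in range(n_noise):
            eps = torch.randn_like(x0)                        # eps ~ N(0, I)
            x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * eps  # forward process
            # Loss as written above: ||(m_theta(x_t, t) - x0) - eps||_2^2
            resid = (m_theta(x_t, t) - x0) - eps
            features.append(resid.pow(2).sum().item())
    return torch.tensor(features)
```

Feature vectors of this form, collected from shadow models whose membership labels are known, supply the training data for the membership-classification DNN; in the white-box setting, the target model itself provides `m_theta` at inference time.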

### 3.1 Attack Setup and Metrics

For a given dataset $\mathcal{D}$, points are split into two subsets. The first, denoted $\mathcal{D}_t = \bigcup_{i=1}^{n_t} \{\mathcal{D}_t^i\}$, is used to train a set of target models, $\{m_\theta^i\}_{i=1}^{n_t}$. This dataset is not known or available to an attacker. The second, $\mathcal{D}_h$, is a set of reference data available, in some form, to the attacker for the development of shadow models or for calculating statistics comparing synthetic and real data. In both black- and white-box setups, attackers are provided with synthetic datasets, $\mathcal{D}_s^i$, generated by $m_\theta^i$. Adversaries are also assumed to have knowledge of the target model architecture and training procedure, including hyperparameters. Traditionally, oracle knowledge of training and architecture elements is assumed. However, in the experiments to follow, settings with imperfect knowledge are also considered. In the white-box framework, all internal parameters of each target model $i$ are also revealed to attackers.

For each target model, a set of membership challenge points, $\mathcal{C}_i$, is drawn from $\mathcal{D}$, half of which are part of $\mathcal{D}_t^i$ and half of which are not. An MIA aims to accurately distinguish between such points across all target models. Following [[7](https://arxiv.org/html/2605.06835#bib.bib21)], MIA success is measured by computing the true-positive rate (TPR) at a false-positive rate (FPR) of 0.1. Throughout, this is also referred to as the MIA success rate and, more generally, privacy risk. The expected success of a random classifier under this metric is 0.1. Several pseudo-privacy metrics are also measured, including DCR, Nearest-Neighbor Distance Ratio (NNDR), Hitting Rate (HR), and Epsilon Identifiability Risk (EIR). DCR is reported in the main results as a representative metric, while definitions and results for the others appear in Appendix [E](https://arxiv.org/html/2605.06835#A5). Define $\mathcal{D}_h^i \subset \mathcal{D}_h$, a holdout dataset for model $i$. Let $x_s \in \mathcal{D}_s^i$, and denote by $x_t \in \mathcal{D}_t^i$ and $x_h \in \mathcal{D}_h^i$ the points in the respective subsets closest, in some measure $d(\cdot, \cdot)$, to $x_s$. DCR is computed as the proportion of $x_s$ where $d(x_s, x_t) < d(x_s, x_h)$. An ideal DCR is given as $|\mathcal{D}_t^i| / |\mathcal{D}_t^i \cup \mathcal{D}_h^i|$.
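For reference, the success metric and DCR as defined above can be computed as in the following sketch, which assumes numeric feature arrays and Euclidean distance; the helper names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.neighbors import NearestNeighbors

def tpr_at_fpr(labels, scores, target_fpr=0.1):
    """MIA success metric: TPR at a fixed FPR (0.1) over challenge points."""
    fpr, tpr, _ = roc_curve(labels, scores)
    return float(np.interp(target_fpr, fpr, tpr))

def dcr(synthetic, train, holdout):
    """DCR: proportion of synthetic points whose nearest real neighbor lies
    in the training set rather than the holdout set. With equally sized
    train/holdout splits, the ideal value is 0.5."""
    dist_t, _ = NearestNeighbors(n_neighbors=1).fit(train).kneighbors(synthetic)
    dist_h, _ = NearestNeighbors(n_neighbors=1).fit(holdout).kneighbors(synthetic)
    return float(np.mean(dist_t[:, 0] < dist_h[:, 0]))
```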

In addition to the privacy metrics, several synthetic tabular-data quality metrics are computed, including $\alpha$-precision and $\beta$-recall [[2](https://arxiv.org/html/2605.06835#bib.bib54)], Kolmogorov-Smirnov statistics and Total Variation Distance [[11](https://arxiv.org/html/2605.06835#bib.bib95)], correlation and mutual information differences [[68](https://arxiv.org/html/2605.06835#bib.bib93), [41](https://arxiv.org/html/2605.06835#bib.bib94)], and machine-learning efficacy [[38](https://arxiv.org/html/2605.06835#bib.bib96)]. Detailed definitions appear in Appendix [F](https://arxiv.org/html/2605.06835#A6). The primary function of these metrics is to consider utility degradation as a factor in MIA success and to confirm that synthetic data quality does not collapse.

### 3.2 Datasets and Default Settings

Below, two large-scale tabular datasets are used to explore the key factors in MIA success. The first is the Transactions table from the Berka dataset [[57](https://arxiv.org/html/2605.06835#bib.bib38)], consisting of financial records from a Czech bank collected in 1999. The table has four numerical and four categorical columns, with over $10^6$ rows corresponding to 5300 individuals. The second table is the Diabetes dataset [[10](https://arxiv.org/html/2605.06835#bib.bib97)], incorporating records derived from 130 different US hospitals from 1999-2008. Each row represents a patient diagnosed with diabetes and includes both numerical and categorical columns associated with demographic information, clinical tests, and other health data. The dataset is smaller than Berka, containing just over $10^5$ records, but incorporates more features, 47 in total. Meaningfully, these datasets are larger than many investigated in previous work, providing a challenging setting for attack success. Preprocessing details appear in Appendix [G](https://arxiv.org/html/2605.06835#A7).

Throughout the experiments, a default configuration for each dataset, adapted from [[40](https://arxiv.org/html/2605.06835#bib.bib6)], is varied to demonstrate, for example, the impact of training choices, model architectures, and attacker knowledge on MIA success. For both datasets, it is assumed that $\mathcal{D}_h$ is sampled from the same distribution as $\mathcal{D}_t$, and that adversaries have oracle knowledge of model architecture and training recipe, unless otherwise specified. The target models are DNN-based diffusion architectures with six layers and dimension sequence [512, 1024, 1024, 1024, 1024, 512]. Each model is trained for 200000 steps with a batch size of 4096. The number of diffusion timesteps is 2000. For Berka, each target model is trained on 20000 distinct data points and generates an equal number of synthetic points such that $|\mathcal{D}_t^i| = |\mathcal{D}_s^i| = 20000$. A total of $n_t = 10$ target models are trained, each with a unique challenge set of size $|\mathcal{C}_i| = 200$. Alternatively, for the Diabetes dataset, target model training and synthesized datasets comprise 10000 samples each. There are three target models, each with $|\mathcal{C}_i| = 1000$.

In the TF attack default setting, adversaries train 20 shadow models using 20000 points drawn without replacement from $\mathcal{D}_h$ for the Berka dataset. For Diabetes, seven shadow models are trained using 10000 points each. By default, the Ensemble attack trains 1 target shadow model and 16 RMIA shadow models. It draws samples of 20000 points without replacement from $\mathcal{D}_h$ for Berka, and 10000 samples for Diabetes, to train the attack classifier. Additional details for the TF and Ensemble attacks are found in Appendices [B](https://arxiv.org/html/2605.06835#A2) and [C](https://arxiv.org/html/2605.06835#A3), respectively.
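For quick reference, the defaults above can be collected into a single configuration sketch; the dictionary keys are illustrative and do not correspond to the released toolkit's configuration format.

```python
# Default experimental configuration (Section 3.2), expressed as plain
# dictionaries for reference only.
DEFAULTS = {
    "target_model": {
        "hidden_dims": [512, 1024, 1024, 1024, 1024, 512],  # six-layer DNN
        "diffusion_timesteps": 2000,
        "training_steps": 200_000,
        "batch_size": 4096,
    },
    "berka": {
        "train_size": 20_000,       # |D_t^i| = |D_s^i|
        "n_target_models": 10,
        "challenge_size": 200,      # |C_i|
        "tf_shadow_models": 20,     # 20000 points each
        "ensemble_shadow_models": {"target": 1, "rmia": 16},
    },
    "diabetes": {
        "train_size": 10_000,
        "n_target_models": 3,
        "challenge_size": 1000,
        "tf_shadow_models": 7,      # 10000 points each
        "ensemble_shadow_models": {"target": 1, "rmia": 16},
    },
}
```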

### 3.3 Experiment Configurations: Training, Synthesis, and Attacker Knowledge

For both datasets and MIA methods, the influence of training setup, generation choices, and attacker knowledge on privacy risk is quantified through an extensive set of experiments. Target model hyperparameters associated with training iterations, diffusion timesteps, and batch size are independently varied from the default settings above. In addition, modifications to training set size and target model architecture are also applied. Specifically, the DNN backbone is modified such that the hidden dimensions of the network are smaller or larger (narrow or wide), the depth of the network is shallower or deeper, or the target model is subject to fewer or more training steps than the default (short or long). For details of these adjustments, see Appendix [A](https://arxiv.org/html/2605.06835#A1). The ratio of data synthesis volume to training set size with respect to black-box attack success is also evaluated. In these experiments, target models generate synthetic data at various multiples of their training data, providing attackers with pools of synthetic data that grow relative to the model's training set size.

Each of the setups above considers a standard MIA framework wherein attackers are assumed to have the resources necessary to train many large shadow models, perfect knowledge of the target model training setup for shadow model training, and access to identically distributed data. In the experiments to follow, these assumptions are weakened in various ways. First, attack success as a function of the number of trained shadow models is considered, simulating resource constraints. In the second set of experiments, attackers train shadow models with configurations that differ from the target model in several ways. Shadow models are trained for fewer or more iterations, incorporate fewer diffusion timesteps, or differ in size compared to the target models.

Finally, we consider settings where the data available to the adversary differs statistically from the target model training data. For Berka, four scenarios are explored. In the first, adversary and target model data are drawn from disjoint user accounts. In the second, the adversary observes data from a later time period, while target models are trained on earlier transactions. In the third scenario, adversary data is corrupted such that half of the categorical and numerical features are replaced with draws from a uniform distribution across their respective ranges. In the final setup, independent sampling from one-way marginal distributions is simulated to construct attacker data. For Diabetes, the final two scenarios are again used to corrupt adversary data. Appendix [A](https://arxiv.org/html/2605.06835#A1) provides additional details about the shadow model and dataset mismatch experiments, including quantifying the statistical divergence of the adversary datasets.
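A minimal sketch of the two corruption scenarios shared by both datasets is given below, assuming pandas DataFrames; the function names and the fixed seed are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def noisy_corruption(df, num_cols, cat_cols):
    """Noisy-data scenario: replace half of the numerical and categorical
    features with uniform draws over their observed ranges/categories."""
    out = df.copy()
    for col in rng.choice(num_cols, size=len(num_cols) // 2, replace=False):
        out[col] = rng.uniform(df[col].min(), df[col].max(), size=len(df))
    for col in rng.choice(cat_cols, size=len(cat_cols) // 2, replace=False):
        out[col] = rng.choice(df[col].unique(), size=len(df))
    return out

def marginal_sampling(df):
    """One-way marginal scenario: sample each column independently from its
    empirical marginal, destroying all cross-feature correlations."""
    return pd.DataFrame({c: rng.choice(df[c].to_numpy(), size=len(df))
                         for c in df.columns})
```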

Beyond the settings above, two aspects specific to the Ensemble attack are also analyzed. Various components of the Ensemble attack are ablated to understand which most heavily contribute to success. In addition, variations in RMIA shadow model training are performed to understand the sensitivity of the algorithm to such choices. Results are reported in Appendix [D](https://arxiv.org/html/2605.06835#A4).

## 4 Main Results

The main results for each dataset and MIA technique are reported in this section. The experiments are structured to systematically probe the relationship between privacy leakage and training setup, attacker knowledge, and compute power. Where appropriate, the connection, or lack thereof, between MIA success and DCR is also exhibited. Unless otherwise specified, when varying different aspects, such as training steps, all other settings remain fixed to the default values discussed in Section [3](https://arxiv.org/html/2605.06835#S3).

![Refer to caption](https://arxiv.org/html/2605.06835v1/x1.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x2.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x3.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x4.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x5.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x6.png)

Figure 1: Training and synthesis levers that influence TF MIA success and DCR for the Berka dataset in white-box (WB) and black-box (BB) settings.

### 4.1 Berka and Tartan Federer Results

Along the top row of Figure [1](https://arxiv.org/html/2605.06835#S4.F1), the number of training iterations, the size of the training dataset, and the diffusion model architecture are varied. Despite previous research suggesting that diffusion models are more robust to memorization [[6](https://arxiv.org/html/2605.06835#bib.bib58)], the number of training steps and the size of the training data have a stark impact on privacy leakage. Increasing the number of training steps increases MIA success nearly monotonically, and larger training data collections markedly reduce it. This relationship is also well captured by DCR, which diverges from and converges to its ideal value in each respective setting. In line with previous work on overfitting, model architectures incorporating more parameters and longer training runs produce rising MIA success. However, this is decidedly not captured by DCR. As the model and training time scale, DCR remains largely unchanged despite rapidly growing privacy risk.

In the bottom row of Figure [1](https://arxiv.org/html/2605.06835#S4.F1), the number of diffusion timesteps, the training batch size, and the ratio of synthetic data to training data are varied. As the number of timesteps increases, MIA success tends to follow, though it eventually plateaus. The DCR metric in this experiment, on the other hand, is fairly misleading. It does rise for the initial set of values, but then drops, suggesting that any such risk has subsided. Variations in batch size or in the amount of data synthetically generated by the model have little effect on DCR. However, while not as impactful as other levers, larger batch sizes do increase MIA success. Moreover, the effect of synthetic data size is strongly tied to model training size. For models trained on small data pools, synthesizing 10 times more data than they were trained on produces a drastic increase in privacy risk, whereas models with larger training sets do not suffer the same scaling.

![Refer to caption](https://arxiv.org/html/2605.06835v1/x7.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x8.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x9.png)

Figure 2: Variations in attacker computing power (left), shadow model mismatch (middle), and shadow-model data mismatch (right) vs. white- (WB) and black-box (BB) TF success for Berka.

Figure [2](https://arxiv.org/html/2605.06835#S4.F2) displays the impact of variations in attacker compute power and imperfect knowledge or data access. Surprisingly, the TF attack is quite robust to reductions in the number of shadow models trained. Training a single shadow model reduces success in both the white- and black-box settings by less than 0.04. MIA success is also minimally affected by mismatches in shadow model size relative to the target model. Training shadow models for significantly fewer iterations or with far fewer timesteps does noticeably reduce attack effectiveness. However, training such models for more iterations or with only marginally misaligned timesteps has little impact. The final experiment shows that the TF attack is effective even when an attacker's data distribution differs in substantial ways. The black-box version of the attack shows only small degradations in any of the scenarios, while the white-box attack is marginally impacted by the statistical synthesis and noisy data modifications.

### 4.2 Berka and Ensemble Results

The results of applying the Ensemble attack to the Berka dataset reinforce those of Section [4.1](https://arxiv.org/html/2605.06835#S4.SS1). In Figure [3](https://arxiv.org/html/2605.06835#S4.F3), as training steps increase, Ensemble MIA success rises. On the other hand, larger training pools reduce privacy risk. Ensemble MIA success rates are also impacted similarly by target model variations. That is, larger models, trained for longer, tend towards greater privacy leakage. This latter trend is, again, not reflected in the DCR measurements, despite a closer relationship to the attack construction. In the bottom row of Figure [3](https://arxiv.org/html/2605.06835#S4.F3), increasing the number of diffusion steps, batch size, or amount of synthetic data generated relative to training size all increase MIA success above random guessing. As in the TF attack, this relationship is generally not reflected in the DCR metric. The growth in Ensemble MIA success with respect to changes in diffusion steps is more tempered than for the TF attack and is somewhat more aligned with the stagnant DCR. This is likely due to the mechanism of loss reconstruction in the TF attack, which is heavily influenced by the timestep scale of the diffusion model, whereas the Ensemble method operates solely on the properties of the generated data. Recall that the Ensemble technique is a black-box and data-driven approach with a non-trivial reliance on distance-, density-, and canary-based measures. As such, the results demonstrate that, while DCR and other pseudo-privacy measures provide a flawed estimate of privacy risk (see Appendix [E](https://arxiv.org/html/2605.06835#A5)), more sophisticated use of distance-based metrics can produce more meaningful and reliable privacy leakage evaluations. This is also reflected in the Ensemble ablation studies reported in Appendix [D](https://arxiv.org/html/2605.06835#A4).

![Refer to caption](https://arxiv.org/html/2605.06835v1/x10.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x11.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x12.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x13.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x14.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x15.png)

Figure 3: Training and synthesis levers vs. Ensemble MIA success and DCR for the Berka dataset.

The default configuration for the Ensemble attack constructs a single shadow model for meta-classifier training. Figure [4](https://arxiv.org/html/2605.06835#S4.F4) demonstrates that this setup is sufficient and that attackers need not have significant compute resources to train many shadow models to produce successful MIAs. Similar to the TF attack, perfect knowledge of target model training and architecture is not critical for Ensemble attack success. Longer training, fewer timesteps, or larger shadow models minimally degrade MIA success. Finally, Figure [4](https://arxiv.org/html/2605.06835#S4.F4) shows that the Ensemble attack is robust to certain kinds of distribution divergence, especially disjoint accounts and statistically synthesized data. However, as a data-driven approach, it is more sensitive than the TF method with respect to temporal shifts and data noise.

![Refer to caption](https://arxiv.org/html/2605.06835v1/x16.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x17.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x18.png)

Figure 4: Variations in attacker computing power (left), shadow model mismatch (middle), and shadow-model data mismatch (right) vs. Ensemble MIA success for the Berka dataset.
### 4.3 Diabetes and Tartan Federer Results

Figure [5](https://arxiv.org/html/2605.06835#S4.F5) presents the changes in TF attack success when varying training and synthesis configurations for the Diabetes dataset. There are a number of similarities between these results and those of Section [4.1](https://arxiv.org/html/2605.06835#S4.SS1). Increasing training iterations increases attack success. Generally, this is reflected in the DCR metric, but it fails to identify the dramatic inflection point of the white-box attack. On the other hand, all metrics agree that scaling training set size reduces privacy leakage and that larger models, trained for longer, leak more information. Along the bottom of the figure, white-box attack success, while elevated, is surprisingly unaffected by the number of diffusion steps. However, the black-box setting is quite sensitive to step counts, which is not captured by DCR. In a departure from the Berka results, DCR appears to be well calibrated to the rising privacy risk associated with the black-box attack as batch size increases. It does not, however, capture the sharp early increase in the white-box setting. Finally, synthesizing data well above the training set size also scales privacy leakage, despite DCR remaining flat. As with Berka, the impact of over-synthesizing tempers as training size expands.

![Refer to caption](https://arxiv.org/html/2605.06835v1/x19.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x20.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x21.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x22.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x23.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x24.png)

Figure 5: Training and synthesis levers that influence TF MIA success and DCR for the Diabetes dataset in white-box (WB) and black-box (BB) settings.

The results in Figure [6](https://arxiv.org/html/2605.06835#S4.F6) largely agree with the analogous Berka results. That is, attackers need not apply heavy compute to construct highly successful TF attacks. Even with one shadow model, both white- and black-box attacks are nearly equivalent. Furthermore, attackers need not have exact knowledge of model architecture or training setup to perform successful membership inference. Finally, the attack is quite insensitive to perturbations in adversary data distribution similarity. As in the Berka setting, noisy adversary data and one-way marginal sampling simulation show only mild, if any, degradation in attack success.

![Refer to caption](https://arxiv.org/html/2605.06835v1/x25.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x26.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x27.png)

Figure 6: Variations in attacker computing power (left), shadow model mismatch (middle), and shadow-model data mismatch (right) vs. white- (WB) and black-box (BB) TF success for Diabetes.
### 4.4 Diabetes and Ensemble Results

For the Diabetes dataset, Ensemble attack success and DCR, shown in Figure [7](https://arxiv.org/html/2605.06835#S4.F7), are actually fairly well correlated across training steps, training set size, diffusion steps, batch size, and model variation, which is not always the case in the previous results. However, the DCR metric remains notably deficient in capturing the drastic increases in privacy leakage when synthesizing increasingly large quantities of data. As in the Berka results, Figure [8](https://arxiv.org/html/2605.06835#S4.F8) shows that a single target shadow model is sufficient for Ensemble attack success and that the attack is quite robust to incomplete knowledge of target model size or other training configurations. As with previous experiments, the approach is more sensitive to undershooting training time or model size. Finally, the attack is resilient to certain kinds of data mismatch, but is impacted by heavily noised data.

![Refer to caption](https://arxiv.org/html/2605.06835v1/x28.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x29.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x30.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x31.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x32.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x33.png)

Figure 7: Training and synthesis levers vs. Ensemble MIA success and DCR for the Diabetes dataset.

![Refer to caption](https://arxiv.org/html/2605.06835v1/x34.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x35.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x36.png)

Figure 8: Variations in attacker computing power (left), shadow model mismatch (middle), and shadow-model data mismatch (right) vs. Ensemble MIA success for the Diabetes dataset.

## 5 Conclusions, Limitations, and Future Work

In this work, factors that influence privacy leakage in TDMs are quantified through controlled experimentation. Such factors include training choices, synthesis volume, and limitations on attacker compute, knowledge, and data access. Through the lens of leading white- and black-box MIAs specifically designed for TDMs, we show that access to large compute resources, perfect knowledge of hyperparameters and model architecture, and access to identically distributed data are not prerequisites for constructing successful attacks. Moreover, DCR measurements, and other widely used pseudo-privacy risk metrics, are shown to be ineffective gauges of privacy risk in many scenarios, going well beyond previous studies. A potential limitation of this work is the focus on ClavaDDPM for evaluating privacy leakage. That said, ClavaDDPM is a state-of-the-art model, and the mechanisms exploited by the attacks are shared across data-space TDMs. The development of strong MIAs for latent-space diffusion models like TabSyn remains an open problem [[9](https://arxiv.org/html/2605.06835#bib.bib68), [66](https://arxiv.org/html/2605.06835#bib.bib17), [48](https://arxiv.org/html/2605.06835#bib.bib33)], making such an evaluation non-trivial and beyond the scope of this work. Future work will focus on developing such attacks and conducting a similarly extensive analysis of privacy leakage for these models. Further, we plan to apply the insights from this work to improve the utility-privacy tradeoffs in DP training of diffusion models.

## References

- [1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019). Optuna: a next-generation hyperparameter optimization framework. arXiv:1907.10902.
- [2] A. M. Alaa, F. Van Breugel, E. Saveliev, and M. van der Schaar (2022). How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In Proceedings of the 39th International Conference on Machine Learning (ICML), PMLR Vol. 162, pp. 290–306.
- [3] M. S. M. S. Annamalai, A. Gadotti, and L. Rocher (2024). A linear reconstruction approach for attribute inference attacks against synthetic data. In Proceedings of the 33rd USENIX Conference on Security Symposium (SEC '24), USA.
- [4] M. S. M. S. Annamalai, G. Ganev, and E. De Cristofaro (2024). "What do you want from theory alone?" Experimenting with tight auditing of differentially private synthetic data generation. In Proceedings of the 33rd USENIX Conference on Security Symposium (SEC '24), USA.
- [5] M. Bartolo, T. Thrush, R. Jia, S. Riedel, P. Stenetorp, and D. Kiela (2021). Improving question answering model robustness with synthetic adversarial data generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 8830–8848.
- [6] T. Bonnaire, R. Urfin, G. Biroli, and M. Mézard (2026). Why diffusion models don't memorize: the role of implicit dynamical regularization in training. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Red Hook, NY, USA.
- [7] N. Carlini, S. Chien, M. Nasr, S. Song, A. Terzis, and F. Tramèr (2022). Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pp. 1897–1914.
- [8] D. Chen, N. Yu, Y. Zhang, and M. Fritz (2020). GAN-leaks: a taxonomy of membership inference attacks against generative models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS '20), New York, NY, USA, pp. 343–362.
- [9] P. Cheng and A. Bahmani (2025). Membership inference over diffusion-models-based synthetic tabular data. arXiv:2510.16037.
- [10] J. Clore, K. Cios, J. DeShazo, and B. Strack (2014). Diabetes 130-US Hospitals for Years 1999-2008. UCI Machine Learning Repository.
- [11] F. K. Dankar, M. K. Ibrahim, and L. Ismail (2022). A multi-dimensional evaluation of synthetic data generators. IEEE Access 10, pp. 11147–11158.
- [12] F. K. Dankar and M. Ibrahim (2021). Fake it till you make it: guidelines for effective synthetic data generation. Applied Sciences 11(5).
- [13] T. Dockhorn, T. Cao, A. Vahdat, and K. Kreis (2023). Differentially private diffusion models. Transactions on Machine Learning Research. https://openreview.net/forum?id=ZPpQk7FJXF
- [14] M. Du, F. Wang, W. Yan, J. Guo, L. Liu, P. Lv, Y. He, X. Feng, and Y. Wang (2025). Improving food safety: synthetic data augmentation for accurate mushroom species identification in complex environments. Applied Food Research 5(1), pp. 101039.
- [15] J. Duan, F. Kong, S. Wang, X. Shi, and K. Xu (2023). Are diffusion models vulnerable to membership inference attacks? In Proceedings of the 40th International Conference on Machine Learning (ICML '23).
- [16] Ú. Erlingsson, I. Mironov, A. Raghunathan, and S. Song (2020). That which we call private. arXiv:1908.03566.
- [17] European Parliament and Council of the European Union (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council. OJ L 119, 4.5.2016, pp. 1–88. https://data.europa.eu/eli/reg/2016/679/oj
- [18] V. A. Fajardo, D. Findlay, C. Jaiswal, X. Yin, R. Houmanfar, H. Xie, J. Liang, X. She, and D. B. Emerson (2021). On oversampling imbalanced data with deep conditional generative models. Expert Systems with Applications 169, pp. 114463.
- [19] X. Fang, W. Xu, F. A. Tan, Z. Hu, J. Zhang, Y. Qi, S. H. Sengamedu, and C. Faloutsos (2024). Large language models (LLMs) on tabular data: prediction, generation, and understanding - a survey. Transactions on Machine Learning Research. https://openreview.net/forum?id=IZnrCGF9WI
- [20] Z. Fang, Z. Jiang, H. Chen, X. Li, and J. Li (2025). Understanding and mitigating memorization in diffusion models for tabular data. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025).
- [21] E. German, D. Samira, Y. Elovici, and A. Shabtai (2025). MIA-EPT: membership inference attack via error prediction for tabular data. arXiv:2509.13046.
- [22] D. Ghatak and K. Sakurai (2022). A survey on privacy preserving synthetic data generation and a discussion on a privacy-utility trade-off problem. In Science of Cyber Security - SciSec 2022 Workshops, Singapore, pp. 167–180.
- [23] C. D. Gobbo (2025). A comparative study of open-source libraries for synthetic tabular data generation: SDV vs. SynthCity. arXiv:2506.17847.
- [24] J. Hayes, L. Melis, G. Danezis, and E. D. Cristofaro (2019). LOGAN: membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies 2019(1), pp. 133–152.
- [25] M. Hernandez, P. A. Osorio-Marulanda, M. Catalina, L. Loinaz, G. Epelde, and N. Aginako (2025). Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees. Frontiers in Digital Health 7, pp. 1576290.
- [26] H. Hu, Z. Salcic, L. Sun, G. Dobbie, P. S. Yu, and X. Zhang (2022). Membership inference attacks on machine learning: a survey. ACM Computing Surveys 54(11s).
- [27] T. Humphries, S. Oya, L. Tulloch, M. Rafuse, I. Goldberg, U. Hengartner, and F. Kerschbaum (2020). Investigating membership inference attacks under data dependencies. 2023 IEEE 36th Computer Security Foundations Symposium (CSF), pp. 473–488.
- [28] H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller (2018). Data augmentation using synthetic data for time series classification with deep residual networks. In International Workshop on Advanced Analytics and Learning on Temporal Data, ECML PKDD.
- [29] B. Jayaraman and D. Evans (2019). Evaluating differentially private machine learning in practice. In Proceedings of the 28th USENIX Conference on Security Symposium (SEC '19), USA, pp. 1895–1912.
- [30] B. Jayaraman, L. Wang, D. E. Evans, and Q. Gu (2020). Revisiting membership inference under realistic assumptions. Proceedings on Privacy Enhancing Technologies 2021, pp. 348–368.
- [31] M. Kazmi, H. Lautraite, A. Akbari, Q. Tang, M. Soroco, T. Wang, S. Gambs, and M. Lécuyer (2024). PANORAMIA: privacy auditing of machine learning models without retraining. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=5atraF1tbg
- [32] A. Kotelnikov, D. Baranchuk, I. Rubachev, and A. Babenko (2020). TabDDPM: modelling tabular data with diffusion models. In International Conference on Machine Learning, pp. 473–488.
- [33] H. Lautraite, L. Herbault, Y. Qi, J. Rajotte, and S. Gambs (2025). Ensemble-MIA. GitHub. https://github.com/CRCHUM-CITADEL/ensemble-mia
- [34] A. D. Lautrup, T. Hyrup, A. Zimek, and P. Schneider-Kamp (2024). SynthEval: a framework for detailed utility and privacy evaluation of tabular synthetic data. https://arxiv.org/abs/2404.15821
- [35] B. Liu, P. Wang, and S. Ge (2024). Learning differentially private diffusion models via stochastic adversarial distillation. In Computer Vision - ECCV 2024: 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part VII, Berlin, Heidelberg, pp. 55–71.
- [36] A. Lowy, Z. Li, J. Liu, T. Koike-Akino, K. Parsons, and Y. Wang (2024). Why does differential privacy with large epsilon defend against practical membership inference attacks? arXiv:2402.09540.
- [37] P. Lu, P. Wang, and C. Yu (2019). Empirical evaluation on synthetic data generation with generative adversarial network. In Proceedings of the 9th International Conference on Web Intelligence, Mining and Semantics (WIMS2019), New York, NY, USA.
- [38] G. Maheshwari, D. Ivanov, and K. E. Haddad (2024). Efficacy of synthetic data as a benchmark. arXiv:2409.11968.
- [39] S. L. Moroianu, C. Bluethgen, P. Chambon, M. Cherti, J. Delbrouck, M. Paschali, B. Price, J. Gichoya, J. Jitsev, C. P. Langlotz, and A. S. Chaudhari (2025). Improving performance, robustness, and fairness of radiographic AI models with finely-controllable synthetic data. arXiv:2508.16783.
- [40] W. Pang, M. Shafieinejad, L. Liu, S. Hazlewood, and X. He (2025). ClavaDDPM: multi-relational data synthesis with cluster-guided diffusion models. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NIPS '24), Red Hook, NY, USA.
- [41] H. Ping, J. Stoyanovich, and B. Howe (2017). DataSynthesizer: privacy-preserving synthetic datasets. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM '17), New York, NY, USA.
- [42] Government of Canada (2000). Canada's Personal Information Protection and Electronic Documents Act (PIPEDA). Statutes of Canada 2000, c. 5. https://laws-lois.justice.gc.ca/eng/acts/P-8.6/
- [43] M. Platzer and T. Reutterer (2021). Holdout-based empirical assessment of mixed-type synthetic data. Frontiers in Big Data 4, pp. 679939.
- [44] R. Ruiz-Torrubiano, G. Kormann-Hainzl, and S. Paudel (2024). Using synthetic data for improving robustness and resilience in ML-based smart services. In Smart Services Summit, Progress in IS, pp. 3–13.
- [45] A. Salem, Y. Zhang, M. Humbert, M. Fritz, and M. Backes (2019). ML-leaks: model and data independent membership inference attacks and defenses on machine learning models. In Network and Distributed Systems Security Symposium (NDSS) 2019, California, USA.
- [46] A. Salem, G. Cherubin, D. Evans, B. Kopf, A. Paverd, A. Suri, S. Tople, and S. Zanella-Beguelin (2023). SoK: Let the privacy games begin! A unified treatment of data inference privacy in machine learning. In 2023 IEEE Symposium on Security and Privacy (SP), Los Alamitos, CA, USA, pp. 327–345.
- [47] P. Sanchez-Serrano, R. Rios, and I. Agudo (2025). A decision framework for privacy-preserving synthetic data generation. Computers and Electrical Engineering 126, pp. 110468.
- [48] M. Shafieinejad, X. He, M. Alinoori, J. Jewell, S. Ayromlou, W. Pang, V. Chatrath, G. Sharma, and D. Pandya (2026). MIDST Challenge at SaTML 2025: membership inference over diffusion-models-based synthetic tabular data. arXiv:2603.19185.
- [49] J. Shi, M. Xu, H. Hua, H. Zhang, S. Ermon, and J. Leskovec (2025). TabDiff: a mixed-type diffusion model for tabular data generation. In The Thirteenth International Conference on Learning Representations.
- [50] R. Shi, Y. Wang, M. Du, X. Shen, Y. Chang, and X. Wang (2025). A comprehensive survey of synthetic tabular data generation. arXiv:2504.16506.
- [51] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017). Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, pp. 3–18.
- [52] T. Stadler, B. Oprisanu, and C. Troncoso (2022). Synthetic data - anonymisation groundhog day. In 31st USENIX Security Symposium (USENIX Security 22), USA, pp. 1451–1468.
- [53] A. Steier, L. Ramaswamy, A. Manoel, and A. Haushalter (2025). Synthetic data privacy metrics. arXiv:2501.03941.
- [54] M. C. Stoian, E. Giunchiglia, and T. Lukasiewicz (2026). A survey on deep learning approaches for tabular data generation: utility, alignment, fidelity, privacy, diversity, and beyond. Transactions on Machine Learning Research. https://openreview.net/forum?id=RoShSRQQ67
- [55] U.S. Congress (1996). Health Insurance Portability and Accountability Act of 1996 (HIPAA). Public Law 104-191, 110 Stat. 1936. https://www.govinfo.gov/content/pkg/PLAW-104publ191/pdf/PLAW-104publ191.pdf
- [56] B. van Breugel, H. Sun, Z. Qian, and M. van der Schaar (2023). Membership inference attacks against synthetic data through overfitting detection. In International Conference on Artificial Intelligence and Statistics, PMLR Vol. 206, pp. 3493–3514.
- [57] M. Ventura (2020). The Berka dataset. Kaggle. https://www.kaggle.com/datasets/marceloventura/the-berka-dataset (accessed 2025-12-10).
- \[58\]J\. Ward, Y\. Yang, C\. Wang, and G\. Cheng\(2026\)Ensembling membership inference attacks against tabular generative models\.InProceedings of the 18th ACM Workshop on Artificial Intelligence and Security,AISec ’25,New York, NY, USA,pp\. 182–193\.Cited by:[§1](https://arxiv.org/html/2605.06835#S1.p2.1),[§2](https://arxiv.org/html/2605.06835#S2.p1.1)\.
- \[59\]C\. Wen, Y\. Yue, and Z\. Wang\(2025\)The application of membership inference in privacy auditing of large language models based on fine\-tuning method\.InProceedings of the 2025 2nd International Conference on Generative Artificial Intelligence and Information Security,GAIIS ’25,New York, NY, USA,pp\. 473–479\.External Links:ISBN 9798400713453Cited by:[§1](https://arxiv.org/html/2605.06835#S1.p1.1)\.
- \[60\]X\. Wu, Y\. Pang, T\. Liu, and S\. Wu\(2025\)Winning the MIDST challenge: new membership inference attacks on diffusion models for tabular data synthesis\.arXiv preprint\.External Links:2503\.12008,[Link](https://arxiv.org/abs/2503.12008)Cited by:[Appendix B](https://arxiv.org/html/2605.06835#A2.p1.12),[§2](https://arxiv.org/html/2605.06835#S2.p1.1),[§3](https://arxiv.org/html/2605.06835#S3.p1.1)\.
- \[61\]Y\. Yang, C\. Malaviya, J\. Fernandez, S\. Swayamdipta, R\. Le Bras, J\. Wang, C\. Bhagavatula, Y\. Choi, and D\. Downey\(2020\-11\)Generative data augmentation for commonsense reasoning\.InFindings of the Association for Computational Linguistics: EMNLP 2020,T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 1008–1025\.Cited by:[§1](https://arxiv.org/html/2605.06835#S1.p1.1)\.
- \[62\]Z\. Yao, N\. Krčo, G\. Ganev, and Y\. de Montjoye\(2025\)The DCR delusion: measuring the privacy risk of synthetic data\.External Links:2505\.01524,[Link](https://arxiv.org/abs/2505.01524)Cited by:[§2](https://arxiv.org/html/2605.06835#S2.p1.1),[§2](https://arxiv.org/html/2605.06835#S2.p4.1)\.
- \[63\]M\. Ye\-Bin, N\. Hyeon\-Woo, W\. Choi, N\. Kim, S\. Kwak, and T\. Oh\(2025\)SYNAuG: exploiting synthetic data for data imbalance problems\.Pattern Recognition Letters193,pp\. 115–121\.Cited by:[§1](https://arxiv.org/html/2605.06835#S1.p1.1)\.
- \[64\]S\. Yeom, I\. Giacomelli, M\. Fredrikson, and S\. Jha\(2017\)Privacy risk in machine learning: analyzing the connection to overfitting\.2018 IEEE 31st Computer Security Foundations Symposium \(CSF\),pp\. 268–282\.Cited by:[§2](https://arxiv.org/html/2605.06835#S2.p3.3)\.
- \[65\]S\. Zarifzadeh, P\. Liu, and R\. Shokri\(2024\-21–27 Jul\)Low\-cost high\-power membership inference attacks\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 58244–58282\.Cited by:[§2](https://arxiv.org/html/2605.06835#S2.p1.1)\.
- \[66\]H\. Zhang, J\. Zhang, B\. Srinivasan, Z\. Shen, X\. Qin, C\. Faloutsos, H\. Rangwala, and G\. Karypis\(2024\)Mixed\-type tabular data synthesis with score\-based diffusion in latent space\.InThe twelfth International Conference on Learning Representations,Cited by:[§5](https://arxiv.org/html/2605.06835#S5.p1.1)\.
- \[67\]C\. Zhu, J\. Tang, J\. F\. Pérez, M\. van Dijk, and L\. Y\. Chen\(2025\)DP\-TLDM: differentially private tabular latent diffusion model\.External Links:2403\.07842,[Link](https://arxiv.org/abs/2403.07842)Cited by:[§1](https://arxiv.org/html/2605.06835#S1.p2.1),[§2](https://arxiv.org/html/2605.06835#S2.p3.3)\.
- \[68\]Y\. Zhu, Z\. Zhao, R\. Birke, and L\. Y\. Chen\(2022\)Permutation\-invariant tabular data synthesis\.In2022 IEEE International Conference on Big Data \(Big Data\),Vol\.,pp\. 5855–5864\.External Links:[Document](https://dx.doi.org/10.1109/BigData55660.2022.10020639)Cited by:[§3\.1](https://arxiv.org/html/2605.06835#S3.SS1.p3.2)\.

## Appendix A Training and Attacker Configuration Experiment Details

In experiments modifying the diffusion model architecture, the changes for each setting are as follows. The “narrow, shallow, short” setting uses a DNN with hidden-layer dimensions [256, 512, 512, 256] and trains for 100,000 iterations. The “narrow, shallow” setting uses the same DNN but trains for 200,000 iterations. The “wider, deeper, long” configuration increases the DNN to hidden-layer dimensions [1024, 2048, 2048, 2048, 2048, 2048, 2048, 1024] and trains for 300,000 iterations. The “wider, deeper” experiment uses the larger DNN but returns to 200,000 training iterations.
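
For reference, these variants can be summarized as a small configuration table; the naming and dictionary layout below are ours, not the paper's code:

```python
# Hidden-layer widths and training iterations for each architecture variant
# described above.
ARCH_VARIANTS = {
    "narrow_shallow_short": {"layers": [256, 512, 512, 256], "iterations": 100_000},
    "narrow_shallow":       {"layers": [256, 512, 512, 256], "iterations": 200_000},
    "wider_deeper":         {"layers": [1024] + [2048] * 6 + [1024], "iterations": 200_000},
    "wider_deeper_long":    {"layers": [1024] + [2048] * 6 + [1024], "iterations": 300_000},
}
```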

For experiments incorporating mismatches between target and shadow model setups, shorter training corresponds to just 5,000 steps, while longer training consumes 300,000 steps. When varying diffusion timesteps, the model with the “least” steps uses only 10 and the “less” setting uses 100. The small shadow models are DNN-based architectures with hidden-layer dimensions [256, 512, 512, 256], and the large models use [1024, 2048, 2048, 2048, 2048, 2048, 2048, 1024].

Table 1: Comparison of the statistical distributions of the training data for the target model and the adversary data used for training the shadow models under different distribution mismatch scenarios for the Berka dataset. “Corr. $\Delta$” is the Frobenius-norm difference between correlation matrices. “MI $\Delta$” is the Frobenius-norm difference between mutual-information matrices (categorical features only). “% Diff.” is the fraction of feature columns whose empirical marginals differ significantly ($\alpha = 0.05$). For numerical columns, distributions are compared using a Kolmogorov-Smirnov test, while categorical columns are compared using Total Variation Distance with significance established via a permutation test (1000 permutations).

For the Berka dataset, when simulating imperfect adversary data access via disjoint accounts, accounts, and their corresponding transactions, are randomly assigned to the attacker or target model datasets such that there is no overlap. When segmenting transactions temporally, target models are trained on transactions occurring on or before 1997-04-10 and adversary data is drawn from transactions occurring thereafter.

The scenario of statistics\-based synthesis considers the setting where the adversary only possesses knowledge of population\-level marginal feature distributions and can sample from them\. To simulate this, we randomly permute the rows of all columns independently\. This preserves the marginal distribution of each column but removes all inter\-feature dependencies\.
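
A minimal sketch of this permutation procedure, assuming a pandas representation of the adversary data (the function name is illustrative):

```python
import numpy as np
import pandas as pd

def marginal_preserving_shuffle(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Independently permute the rows of each column: every column keeps its
    exact empirical marginal, but all inter-feature dependencies are removed."""
    rng = np.random.default_rng(seed)
    shuffled = df.copy()
    for col in shuffled.columns:
        shuffled[col] = rng.permutation(shuffled[col].to_numpy())
    return shuffled
```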

Finally, the noisy adversary data scenario assumes the adversary has significantly less information: they know the unique values in each column but have no knowledge of the population\-level marginal feature distributions\. To simulate this, we replace each numerical column with draws from a uniform distribution over that column’s observed range, and each categorical column with uniform draws over its observed categories\. These perturbations induce substantial changes to the correlation structure, as reflected in the large correlation and MI matrix differences in Table[1](https://arxiv.org/html/2605.06835#A1.T1)\.
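
A sketch of this noising procedure under the same assumptions (pandas input, illustrative function name):

```python
import numpy as np
import pandas as pd

def noisy_adversary_data(df: pd.DataFrame, categorical_cols: list, seed: int = 0) -> pd.DataFrame:
    """Replace every column with uniform draws over its observed support,
    discarding both marginal shapes and inter-feature structure."""
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    n = len(df)
    for col in df.columns:
        if col in categorical_cols:
            # Uniform over observed categories, ignoring empirical frequencies.
            noisy[col] = rng.choice(df[col].unique(), size=n)
        else:
            # Uniform over the observed numerical range.
            noisy[col] = rng.uniform(df[col].min(), df[col].max(), size=n)
    return noisy
```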

For the Diabetes dataset, adversary data mismatches are induced via the same noisy and statistics-based synthesis strategies used for Berka. For both attacks, adversary data is modified by randomly selecting half of the numerical columns and half of the categorical columns and applying either the noisy or statistics-based synthesis strategy. Changes to the correlation structure for these manipulations are reported in Table [2](https://arxiv.org/html/2605.06835#A1.T2). In addition, for the TF attack, we also consider settings that noise 25% of each column type or apply statistics-based synthesis to 100% of the columns.

Table 2: Comparison of the statistical distributions of the training data for the target model and the adversary data used for training the shadow models under different distribution mismatch scenarios for the Diabetes dataset. “Corr. $\Delta$” is the Frobenius-norm difference between correlation matrices. “MI $\Delta$” is the Frobenius-norm difference between mutual-information matrices (categorical features only). “% Diff.” is the fraction of feature columns whose empirical marginals differ significantly ($\alpha = 0.05$). For numerical columns, distributions are compared using a Kolmogorov-Smirnov test, while categorical columns are compared using Total Variation Distance with significance established via a permutation test (1000 permutations).

In experiments varying training size for the Diabetes dataset, there is an important caveat when the training sizes were 15,000 and 20,000. Because the size of the Diabetes set is limited, only four and three shadow models, respectively, are trained for attack development, while two models are still reserved as target models. This implies that, in these settings, the attack has access to fewer shadow models. However, the results varying the number of shadow models for TF attack training in Figure [6](https://arxiv.org/html/2605.06835#S4.F6) demonstrate that this does not dramatically undermine attack success.

## Appendix B Tartan Federer Attack: Default Settings and Other Details

As described in Section [3](https://arxiv.org/html/2605.06835#S3), the TF attack constructs features for each input, $x_0$, by computing the loss $\|(m_\theta(x_t, t) - x_0) - \epsilon\|_2^2$ for a fixed set of $\epsilon$ and $t$ values. For the default number of diffusion steps, 2000, following [60](https://arxiv.org/html/2605.06835#bib.bib37), a static set of 300 initial noise values, $\epsilon$, is sampled from a standard Gaussian and the timesteps are $t \in \{5, 10, 20, 30, 40, 50, 100\}$. This yields 2100 features in total to be processed by the DNN for membership inference. When varying the number of timesteps in model architectures, models with timesteps in $\{500, 1000, 2000, 3000, 4000\}$ use the aforementioned set of timesteps. For models with $\{10, 20, 50, 80, 100\}$ timesteps, $t \in \{3, 4, 5, 6, 7, 8, 9\}$. These settings are used for both datasets.
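
The feature extraction can be sketched as follows; the model's $(x_t, t)$ call signature, the DDPM noising schedule, and tensor shapes are assumptions rather than details taken from the attack implementation:

```python
import torch

@torch.no_grad()
def tf_features(model, x0, alphas_cumprod, eps_bank, t_values):
    """Per-record TF loss features: for each fixed (eps, t) pair, diffuse x0
    forward with that eps, query the denoiser, and record the loss above."""
    feats = []
    for t in t_values:  # e.g., {5, 10, 20, 30, 40, 50, 100}
        a_bar = alphas_cumprod[t]
        for eps in eps_bank:  # e.g., 300 fixed standard-Gaussian draws
            x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
            pred = model(x_t, t)
            # Loss feature exactly as defined in the text.
            feats.append((((pred - x0) - eps) ** 2).sum().item())
    return torch.tensor(feats)  # 300 * 7 = 2100 features per record
```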

In all settings, the DNN used for membership classification has two hidden layers, each of dimension 200. Hidden layers have tanh activations and the output layer is a sigmoid. Each shadow model contributes 3000 member and 3000 non-member samples. As such, in the Berka experiments with 20 shadow models, this yields 120,000 training points for the network, whereas the Diabetes setting has seven shadow models, producing 42,000 training samples. The classifier is trained with standard binary cross-entropy loss, an Adam optimizer with default parameters, and a learning rate of 1e-04. Training proceeds for 5000 steps with batch sizes of 6000.
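
A minimal PyTorch rendering of this classifier, with the 2100-dimensional input matching the default TF feature count (the training loop is condensed to a single step):

```python
import torch
import torch.nn as nn

# Two hidden layers of width 200 with tanh activations and a sigmoid output.
classifier = nn.Sequential(
    nn.Linear(2100, 200), nn.Tanh(),
    nn.Linear(200, 200), nn.Tanh(),
    nn.Linear(200, 1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step over a batch of (feature, membership-label) pairs."""
    optimizer.zero_grad()
    loss = loss_fn(classifier(features).squeeze(-1), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```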

## Appendix C Ensemble Attack: Default Settings and Other Details

By default, the attack classifier uses 20,000 samples from the Berka dataset or 10,000 samples from the Diabetes dataset drawn from $\mathcal{D}_h$, to which the attacker has access. The entire sampled subset is used to train the target shadow model. Further, half of the sample set is then used to train each of the RMIA shadow models. More specifically, each sample is randomly assigned to be used as training data for half of the RMIA shadow models. To construct the DOMIAS features, the full population data, $\mathcal{D}_h$, available to the attacker is used as the reference dataset, in line with the attack setup for the MIDST competition [48](https://arxiv.org/html/2605.06835#bib.bib33). For Berka, this consists of approximately 800,000 samples, and approximately 70,000 samples for the Diabetes dataset.
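
The random member assignment can be sketched as a boolean membership matrix (an illustrative reconstruction, not the competition code):

```python
import numpy as np

def assign_members(n_samples: int, n_shadow: int, seed: int = 0) -> np.ndarray:
    """Each sample serves as training data ("member") for exactly half of the
    RMIA shadow models, chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    membership = np.zeros((n_samples, n_shadow), dtype=bool)
    for i in range(n_samples):
        chosen = rng.choice(n_shadow, size=n_shadow // 2, replace=False)
        membership[i, chosen] = True
    return membership

# Shadow model j trains on the samples with membership[:, j] == True.
```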

In the default settings, 16 different RMIA shadow models are trained, in agreement with the algorithm from [48](https://arxiv.org/html/2605.06835#bib.bib33). These are split into models with pre-train and fine-tune phases and those with strictly a single training phase. The first category of models is pre-trained on a subset of $\mathcal{D}_h$ not used to train the primary target shadow model constructed for the Ensemble attack. For Berka, this set is of size 60,000, and for Diabetes it has size 50,000. Thereafter, these models are fine-tuned on the constructed subset of member points, as described above, from the training data of the target shadow model. For all experiments, except those in the results of Appendix [D](https://arxiv.org/html/2605.06835#A4), two models are pre-trained and four models are fine-tuned from each of these base models, for a total of eight. In the second category, eight RMIA shadow models are simply trained directly on a subset of data points drawn from the training data of the primary target shadow model with no pre-training phase.

The Ensemble attack meta-classifier is an XGBoost model trained on the features constructed through distance-based, DOMIAS-based, and RMIA-based extraction. The hyperparameters of the XGBoost model are optimized with 5-fold cross-validation through Optuna [1](https://arxiv.org/html/2605.06835#bib.bib98). The parameters optimized are listed below; a sketch of the corresponding search loop follows the list.

- `eta = trial.suggest_float("eta", 0.0001, 0.1, log=True)`,
- `max_depth = trial.suggest_int("max_depth", 3, 10)`,
- `subsample = trial.suggest_float("subsample", 0.1, 1)`,
- `colsample_bytree = trial.suggest_float("colsample_bytree", 0.5, 1)`,
- `reg_alpha = trial.suggest_categorical("reg_alpha", [0, 0.1, 0.5, 1, 5, 10])`,
- `reg_lambda = trial.suggest_categorical("reg_lambda", [0, 0.1, 0.5, 1, 5, 10, 100])`.
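
Putting these together, the search loop might look as follows; the placeholder features/labels and the AUC scoring choice are our assumptions, not details from the paper:

```python
import numpy as np
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# Placeholder meta-classifier features/labels; replace with the real
# distance-, DOMIAS-, and RMIA-based features and membership labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))
y_train = rng.integers(0, 2, size=1000)

def objective(trial: optuna.Trial) -> float:
    model = xgb.XGBClassifier(
        learning_rate=trial.suggest_float("eta", 0.0001, 0.1, log=True),  # XGBoost's eta
        max_depth=trial.suggest_int("max_depth", 3, 10),
        subsample=trial.suggest_float("subsample", 0.1, 1),
        colsample_bytree=trial.suggest_float("colsample_bytree", 0.5, 1),
        reg_alpha=trial.suggest_categorical("reg_alpha", [0, 0.1, 0.5, 1, 5, 10]),
        reg_lambda=trial.suggest_categorical("reg_lambda", [0, 0.1, 0.5, 1, 5, 10, 100]),
    )
    return cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```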

For the Berka dataset, the meta-classifier training set is a random selection of 10,000 points from the target shadow model training data as member points and an equal number of samples drawn from $\mathcal{D}_h$ constituting non-members. The same is true for the Diabetes dataset, but with 5000 points for each.

## Appendix D Ensemble Attack Ablation Results

In this section, experimental results associated with ablations of the Ensemble MIA are shared. As previously discussed, the Ensemble attack combines three components, RMIA, DOMIAS, and distance-based features, through a trained meta-classifier. To understand the contributions of each component and their interactions, attacks are constructed excluding features associated with different combinations of the ensemble. The results are shown in the top row of Figure [9](https://arxiv.org/html/2605.06835#A4.F9). For the Berka dataset, there is a clear synergy among the components comprising the ensemble, as each one in isolation is insufficient to produce an MIA with significantly better than random performance. Interestingly, for the Diabetes dataset, the distance-based features alone are sufficient to produce a successful attack, whereas the RMIA and DOMIAS features alone are not effective. These results highlight that dataset composition plays a role in attack success and underscore the strength of an ensemble-based approach that captures diverse signals.

![Refer to caption](https://arxiv.org/html/2605.06835v1/x37.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x38.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x39.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x40.png)

Figure 9: Ensemble attack ablations (top) and RMIA shadow model variations (bottom) for the Berka (left) and Diabetes (right) datasets. In the RMIA shadow experiments, the first number is the count of single-phase training models, and the remaining numbers are counts of fine-tuned models for each pre-trained base model. For example, 4+2+2 denotes four single-phase models and two sets of two models, each set trained on a distinct pre-trained base.

As detailed in Appendix [C](https://arxiv.org/html/2605.06835#A3), the default construction of the RMIA approach trains 16 shadow models: eight single-stage and eight two-stage models. We experiment with alternative configurations to demonstrate robustness to this choice. The results appear in the bottom row of Figure [9](https://arxiv.org/html/2605.06835#A4.F9). While the best arrangement for both datasets is the 8+4+4 split, other shadow-model compositions perform well above random chance, even with half the number of models and a single stage of training.

## Appendix E Privacy Heuristics: DCR, NNDR, HR, and EIR

In this appendix, definitions of the auxiliary pseudo-privacy risk metrics and computation details are provided. When computing DCR metrics, $|\mathcal{D}_h^i| = 10{,}000$ for both the Berka and Diabetes datasets. The average DCR across all target models is reported. The distance measure used for DCR is the Euclidean norm with categorical variables one-hot encoded.
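
A minimal DCR computation under these conventions (inputs assumed already encoded and normalized):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def dcr(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    """Distance to closest record: for each synthetic row, the Euclidean
    distance to its nearest neighbor in the real data."""
    d = pairwise_distances(synthetic, real, metric="euclidean")
    return d.min(axis=1)
```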

Nearest Neighbor Distance Ratio (NNDR) is, for each synthetic data point, the ratio of the $\ell_2$ distance to its nearest neighbor in the training data over the distance to its second-nearest neighbor. This measure is intended to capture whether a synthetic sample lies in a sparsely populated region of the training distribution, with values closer to 0 indicating this. Using a holdout dataset comprising real data points not used to train the generative model, we recompute the NNDR and compare the two ratios. The difference between the train-based and holdout-based ratios yields a “privacy loss” score. In our experiments, we report 1 − NNDR values so that, consistent with the other metrics, lower values indicate that the synthetic data reveals less about the original training set.
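
A sketch of the per-point ratio (encoded numeric inputs assumed):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nndr(synthetic: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Distance to the closest reference record divided by the distance to the
    second-closest; values near 0 flag synthetic points unusually close to one
    specific record."""
    nn = NearestNeighbors(n_neighbors=2).fit(reference)
    dist, _ = nn.kneighbors(synthetic)
    return dist[:, 0] / np.maximum(dist[:, 1], 1e-12)  # guard divide-by-zero

# Privacy-loss style score: compare training-based and holdout-based ratios.
# loss = nndr(synth, train).mean() - nndr(synth, holdout).mean()
```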

Hitting Rate (HR) measures how often real data points are “replicated” in the synthetic data, based on whether numerical values fall within a specified percentage of the real data’s value range and categorical values match exactly; the hitting (exact match) rate is the percentage of real points that meet these criteria, and lower rates indicate better privacy protection.
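
One way to realize this definition (the 3% tolerance default and pandas interface are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def hitting_rate(real: pd.DataFrame, synth: pd.DataFrame,
                 categorical_cols: list, tol: float = 0.03) -> float:
    """Fraction of real rows "hit" by at least one synthetic row: numerical
    values within tol * (column range) and categorical values matching exactly."""
    num_cols = [c for c in real.columns if c not in categorical_cols]
    ranges = {c: real[c].max() - real[c].min() for c in num_cols}
    hits = 0
    for _, row in real.iterrows():
        close = np.ones(len(synth), dtype=bool)
        for c in num_cols:
            close &= (synth[c] - row[c]).abs().to_numpy() <= tol * ranges[c]
        for c in categorical_cols:
            close &= (synth[c] == row[c]).to_numpy()
        hits += bool(close.any())
    return hits / len(real)
```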

Epsilon Identifiability Risk (EIR) measures the percentage of training points for which a synthetic data point is closer than any other real point, indicating potential overfitting of the model that produced the synthetic data. Values closer to zero reflect lower privacy risk. When a holdout set is available, the same ratio is computed for holdout points, and the difference between the training and holdout ratios is reported; a value near zero or negative indicates that the synthetic data is not memorizing training examples. The metric also weights features by their inverse entropy, giving greater influence to rare attributes.
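
A sketch of the core EIR ratio; the inverse-entropy weights are assumed precomputed per feature, and the holdout comparison follows the same pattern:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def eir(train: np.ndarray, synth: np.ndarray, weights: np.ndarray) -> float:
    """Share of training points whose nearest synthetic record is closer than
    their nearest *other* real record, under per-feature weights."""
    tw, sw = train * weights, synth * weights
    # Distance to the nearest other real point (index 0 is the point itself).
    d_real = NearestNeighbors(n_neighbors=2).fit(tw).kneighbors(tw)[0][:, 1]
    # Distance to the nearest synthetic point.
    d_synth = NearestNeighbors(n_neighbors=1).fit(sw).kneighbors(tw)[0][:, 0]
    return float((d_synth < d_real).mean())
```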

![Refer to caption](https://arxiv.org/html/2605.06835v1/x41.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x42.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x43.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x44.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x45.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x46.png)

Figure 10: Training and synthesis levers that influence NNDR, HR, and EIR in comparison with TF MIA success for the Berka dataset.

Figures [10](https://arxiv.org/html/2605.06835#A5.F10) and [11](https://arxiv.org/html/2605.06835#A5.F11) show the effects of training and synthesis levers on the NNDR, HR, and EIR privacy heuristics in comparison with the TF MIA for the Berka and Diabetes datasets, respectively. Note that, for all of these metrics, lower values indicate lower privacy risk. The plots indicate that, similar to DCR, none of these privacy heuristics is reliable as a standalone metric for privacy leakage. Several observations support this argument. Some heuristics report high privacy risk even when the white-box MIA indicates no privacy risk, as HR does in most of the experiments of Figure [10](https://arxiv.org/html/2605.06835#A5.F10). Some follow a completely different pattern than the WB and BB MIAs, as EIR does in the Train Size experiment on the Diabetes dataset in Figure [11](https://arxiv.org/html/2605.06835#A5.F11). Furthermore, some show no change in privacy leakage under parameter variations, unlike the WB and BB MIAs, as NNDR does in the Batch Size and Synthetic Size experiments on the Diabetes dataset, as well as HR in almost all experiments of Figure [11](https://arxiv.org/html/2605.06835#A5.F11).

![Refer to caption](https://arxiv.org/html/2605.06835v1/x47.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x48.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x49.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x50.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x51.png)
![Refer to caption](https://arxiv.org/html/2605.06835v1/x52.png)

Figure 11: Training and synthesis levers that influence NNDR, HR, and EIR in comparison with TF MIA success for the Diabetes dataset.

## Appendix F Quality Metrics and Results

In this appendix, more information is provided on the quality metrics used in this work\. Further, the quality metrics associated with the target models considered in this work are reported\. The metrics offer insight into the experimental results of Section[4](https://arxiv.org/html/2605.06835#S4)in several ways\. First, they showcase the relationship of synthetic\-data quality to the various configuration changes and privacy risk\. For example, increasing training iterations improves quality with diminishing returns while privacy risk continues to scale\. Second, they demonstrate that the observed changes in MIA success are generally not tied to precipitous falloffs in synthetic quality\. In most cases, generated data quality shows, at most, moderate and smooth variation with changes in the experiment configurations\.

When computing quality metrics for model $i$, evaluation compares the synthetic data, $\mathcal{D}_s^i$, to the training data, $\mathcal{D}_t^i$, to measure how closely the synthetic data resembles the training data it is meant to mimic. For each setting below, five target models are chosen at random for quality measurement and the resulting metric values are averaged. Note that, depending on the experimental configuration, $|\mathcal{D}_s^i|$ is not always equal to $|\mathcal{D}_t^i|$. As fidelity and diversity measures of synthetic tabular data, the $\alpha$-precision and $\beta$-coverage (-recall) metrics from [2](https://arxiv.org/html/2605.06835#bib.bib54) are computed. The former measures how frequently generated samples fall within high-density regions of the real-data distribution. The latter measures the extent to which the synthetic distribution spans the diversity of the real data. To assess the extent to which generated data captures the statistical structure of the real data, differences in the one- and two-way marginals are computed. For numerical columns, a Kolmogorov-Smirnov (KS) test statistic is computed, comparing the synthetic and real dataset distributions for each column separately. Test statistics are then averaged over all columns. To compare real and synthetic categorical column distributions, the total variation distance (TVD) is computed, also yielding a test statistic, which is averaged over all columns. In both cases, larger statistics imply wider differences in column-wise distributions. As a measure of column dependency preservation, pair-wise correlation and mutual information (MI) matrices are computed for the synthetic and real tables, and the Frobenius norm of the difference between these matrices is calculated. Correlation matrices are computed only for numerical column pairs, while MI matrices include both column types and are computed using SynthEval [34](https://arxiv.org/html/2605.06835#bib.bib24). Metrics are computed against holdout sets of size 20K and 10K for the Berka and Diabetes datasets, respectively, where required.
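
The one-way marginal comparisons can be sketched as follows (pandas inputs assumed; SynthEval's preprocessing is applied upstream in the paper's pipeline):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def marginal_stats(real: pd.DataFrame, synth: pd.DataFrame, categorical_cols: list) -> dict:
    """Average KS statistic over numerical columns and average total variation
    distance over categorical columns; larger values mean wider discrepancies."""
    ks = [ks_2samp(real[c], synth[c]).statistic
          for c in real.columns if c not in categorical_cols]
    tvd = []
    for c in categorical_cols:
        p = real[c].value_counts(normalize=True)
        q = synth[c].value_counts(normalize=True)
        support = p.index.union(q.index)
        tvd.append(0.5 * np.abs(p.reindex(support, fill_value=0)
                                - q.reindex(support, fill_value=0)).sum())
    return {"avg_ks": float(np.mean(ks)), "avg_tvd": float(np.mean(tvd))}
```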

The final two metrics are so-called machine learning efficacy (MLE) measures. These estimate how well synthetic data preserves dataset utility by training models on both synthetic and real data and comparing their task performance. Following [40](https://arxiv.org/html/2605.06835#bib.bib6), separate models are trained to predict each column value from the others. For numerical columns, a random forest (RF) regressor is trained, and an RF classifier is constructed for categorical columns. For model $m_\theta^i$, MLE models are trained on either synthetic data generated by $m_\theta^i$ or the training data, $\mathcal{D}_t^i$. Regressor performance is measured via $R^2$ scores and classification with macro $F_1$. The $R^2$ or $F_1$ statistics are averaged across all columns in the respective categories and the difference in the averages is reported. For example, $\Delta F_1 = \text{avg}(F_1)_{\text{synthetic}} - \text{avg}(F_1)_{\text{real}}$. Larger values imply closer, or even improved, performance compared to real data. Reported values are for synthetic and training data from a single target model, and RF performance is evaluated on training data from a separate target model as holdout data.
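
A per-column sketch of the MLE procedure (fully numeric-encoded frames assumed; hyperparameters left at scikit-learn defaults):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import f1_score, r2_score

def column_mle(train_df, holdout_df, target_col, categorical_cols):
    """Train an RF to predict one column from the rest; score on holdout data
    with macro-F1 (categorical target) or R^2 (numerical target)."""
    X = train_df.drop(columns=[target_col]).to_numpy()
    y = train_df[target_col].to_numpy()
    Xh = holdout_df.drop(columns=[target_col]).to_numpy()
    yh = holdout_df[target_col].to_numpy()
    if target_col in categorical_cols:
        model = RandomForestClassifier().fit(X, y)
        return f1_score(yh, model.predict(Xh), average="macro")
    model = RandomForestRegressor().fit(X, y)
    return r2_score(yh, model.predict(Xh))

# Delta F1 / Delta R^2: average column_mle over columns with synthetic
# training data, minus the same average with real training data.
```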

### F.1 Berka Quality Results

Synthetic data quality measures for target models trained on the Berka dataset are reported in Tables [3](https://arxiv.org/html/2605.06835#A6.T3)–[7](https://arxiv.org/html/2605.06835#A6.T7). The primary takeaway from these tables is that synthetic data quality does vary with training configuration changes, but it does not sharply degrade in any of the scenarios. As such, the main privacy risk results presented in Sections [4.1](https://arxiv.org/html/2605.06835#S4.SS1) and [4.2](https://arxiv.org/html/2605.06835#S4.SS2) are based on models with reasonable utility and reflect legitimate modeling-privacy trade-offs.

Table 3: Quality metrics across training iterations for target models trained on the Berka dataset.

Table 4: Quality metrics across training dataset sizes for target models trained on the Berka dataset.

Table 5: Quality metrics across different target model configurations trained on the Berka dataset. Scenario (1) is “narrow, shallow, short,” (2) is “narrow, shallow,” (3) is “narrow,” (4) is default, (5) is “wide, deep,” (6) is “wide, deep, long.”

Table 6: Quality metrics across diffusion timesteps for target models trained on the Berka dataset.

Table 7: Quality metrics across batch sizes for target models trained on the Berka dataset.

For the Berka dataset, synthetic quality is impacted by training choices in different ways. In the ranges tested, some parameter changes, such as batch size or model architecture, have fairly limited impact on quality. Others, like training iterations and diffusion steps, generally improve the quality metrics at larger values. Finally, training dataset size has a more nuanced impact on quality, where some metrics improve with more data and others improve to a point and then degrade.

### F.2 Diabetes Quality Results

Tables[8](https://arxiv.org/html/2605.06835#A6.T8)–[12](https://arxiv.org/html/2605.06835#A6.T12)report the quality metrics for variations in the target model training setups on the Diabetes dataset\. As discussed in Section[3\.2](https://arxiv.org/html/2605.06835#S3.SS2), the Diabetes dataset is smaller, in total size, than Berka, but has a larger number of columns\. As such, there is more variation in the target model quality results with changes to the experimental configurations\. It should be noted that, with the increased number of columns, the correlation and MI matrices are also much larger, thereby inducing larger Froebenius norms\.

Similar to the Berka results, generated data quality does not collapse across a wide variety of training setups. However, there are a few instances, at the extreme ends of the hyperparameter spectra, where metrics indicate larger deterioration. For instance, when training for only a few thousand steps in Table [8](https://arxiv.org/html/2605.06835#A6.T8), many of the metrics are significantly worse. Similarly, very small training sizes also produce low $\beta$-coverage and large correlation differences.

Table 8: Quality metrics across training iterations for target models trained on the Diabetes dataset.

Table 9: Quality metrics across training dataset sizes for target models trained on the Diabetes dataset.

Table 10: Quality metrics across different target model configurations trained on the Diabetes dataset. Scenario (1) is “narrow, shallow, short,” (2) is “narrow, shallow,” (3) is “narrow,” (4) is default, (5) is “wide, deep,” (6) is “wide, deep, long.”

Table 11: Quality metrics across diffusion timesteps for target models trained on the Diabetes dataset.

Table 12: Quality metrics across batch sizes for target models trained on the Diabetes dataset.

The dependencies of the quality metrics for target models trained on the Diabetes dataset have some similarities and notable differences compared to the Berka-trained models. Quality metrics are inconsistently affected by changes in batch size, while increasing training steps generally improves them. On the other hand, for the Diabetes dataset, larger models and training dataset sizes mostly improve synthetic quality, while the impact of changing the number of diffusion steps is varied.

## Appendix G Dataset Preprocessing

Prior to training the target models in the experiments, both datasets undergo light preprocessing. Any columns corresponding to unique identifiers are dropped. Missing values for categorical columns are treated as a distinct category. For Diabetes, some categorical columns also have entries of “NaN” and “?”; these values are similarly treated as distinct categories for the columns in which they appear. For numerical columns in Berka, missing values are imputed as 0, while no missing numerical values are present in the Diabetes data. Finally, categorical values are ordinally encoded for uniform representation, with encoding performed over the entire dataset to avoid issues with less frequent values.
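
These steps roughly correspond to the following sketch (the column lists are assumed known per dataset):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def preprocess(df: pd.DataFrame, id_cols: list, categorical_cols: list) -> pd.DataFrame:
    """Drop identifier columns, treat missing categorical entries as their own
    category, impute missing numericals with 0, and ordinally encode
    categoricals over the full dataset."""
    out = df.drop(columns=id_cols)
    num_cols = [c for c in out.columns if c not in categorical_cols]
    out[num_cols] = out[num_cols].fillna(0)
    # "NaN" and "?" strings survive as distinct categories after this cast.
    out[categorical_cols] = out[categorical_cols].fillna("missing").astype(str)
    out[categorical_cols] = OrdinalEncoder().fit_transform(out[categorical_cols])
    return out
```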

When computing the metrics discussed in Appendix [F](https://arxiv.org/html/2605.06835#A6) and DCR, preprocessing differs depending on the metric computed. For $\alpha$-precision and $\beta$-coverage, categorical values are one-hot encoded and numerical columns are left unchanged. When computing KS and TVD statistics or correlation and MI matrix differences, the SynthEval [34](https://arxiv.org/html/2605.06835#bib.bib24) preprocessing approach is applied before downstream computations, which leaves categorical columns ordinally encoded and min-max transforms numerical values. For the MLE quality measures, min-max transforms are also applied to numerical values, but categorical columns are one-hot encoded. Finally, for DCR computations, categorical variables are one-hot encoded and numerical columns are normalized to the interval $[-1, 1]$.

## Appendix H Compute Resources

Generally, the experiments conducted in this work are quite computationally intensive. In addition to training the target models, both the TF and Ensemble attacks require training numerous shadow models, each of which is a large tabular diffusion model itself. In the default setting, the TF MIA trains 20 such models, while the Ensemble approach constructs 17: one primary shadow model and 16 RMIA models. Both attacks also need to train membership classifiers and require non-trivial feature extraction computations. For the TF attack, this involves extracting loss values via forward and backward diffusion processes for each data point over 2100 $(\epsilon, t)$ pairs. In the Ensemble MIA, computationally intensive distance-, DOMIAS-, and RMIA-based features must be derived.

The experiments in this work leverage both A40 and A100 GPUs with 48GB and 80GB of GPU memory, respectively. Depending on the experimental setup, end-to-end attack training and inference times vary. For example, attacks applied to larger models require more time compared to the default settings. However, a single default configuration of the TF approach requires approximately 2.5 hours for Berka and 1 hour for the Diabetes dataset on A40 GPUs. For an individual setup, the Ensemble attack takes around 16 hours to complete for Berka and 11 hours for Diabetes using A100s. This increases to 22 and 16 hours, respectively, on A40 GPUs.
