LLM-as-a-Discriminator: When Synthetic Tables Still Look Real

arXiv cs.LG Papers

Summary

This paper proposes an LLM-as-Discriminator method to audit privacy of synthetic tabular data by asking an LLM to classify samples as real or synthetic, showing that LLM discrimination can serve as a practical privacy audit signal.

arXiv:2606.09865v1 Announce Type: new Abstract: Privacy and data sharing are often in tension. Many organizations use synthetic data to reduce privacy risk and still share useful data. For tabular data, auditing privacy remains hard. In many cases, even humans cannot easily tell if a table is real or synthetic. In this paper, we propose a method based on LLM discrimination. We ask an LLM to classify each table sample as REAL or SYNTHETIC. We test two settings: C1 with table only, and C2 with table plus distributional metadata. We use LLaMA as an open model and Gemini as a reference model. In our experiments, we run three synthesis models, CTGAN, TVAE, and Gaussian Copula, on two public datasets, UCI Adult and ACS Census. We collect 451 valid trials. Our results show clear differences between models. On Adult, LLaMA reaches DRS=0% in reported cells, while Gemini reaches DRS=100% for CTGAN and TVAE. On Census, LLaMA predicts SYNTHETIC for most samples, while Gemini stays high in C1 but drops for CTGAN and TVAE in C2. We also compare with a classifier two-sample test (C2ST) and record linkage as distributional baselines, and with a human pilot of 2 annotators and 240 trials. Our results show that LLM discrimination is a practical privacy audit signal when model choice, per provider reporting, and data encoding are handled with care. For reproducibility, code and experiment scripts are available at https://github.com/SlokomManel/LLM-as-a-Discriminator.

Cached at: 06/10/26, 06:14 AM

# LLM-as-a-Discriminator: When Synthetic Tables Still Look Real
Source: [https://arxiv.org/html/2606.09865](https://arxiv.org/html/2606.09865)
Affiliations: \(1\) Vrije Universiteit, Amsterdam, email: manel\.slokom@live\.fr; \(2\) Equativ, Paris, email: slokom\.malek@livgmaile\.com; \(3\) EDICIA, Nantes, email: thierno\.kante@edicia\.fr

###### Abstract

Privacy and data sharing are often in tension\. Many organizations use synthetic data to reduce privacy risk and still share useful data\. For tabular data, auditing privacy remains hard\. In many cases, even humans cannot easily tell if a table is real or synthetic\. In this paper, we propose a method based on LLM discrimination\. We ask an LLM to classify each table sample as REAL or SYNTHETIC\. We test two settings: C1 with table only, and C2 with table plus distributional metadata\. We use LLaMA as an open model and Gemini as a reference model\. In our experiments, we run three synthesis models, CTGAN, TVAE, and Gaussian Copula, on two public datasets, UCI Adult and ACS Census\. We collect 451 valid trials\. Our results show clear differences between models\. On Adult, LLaMA reaches DRS=0% in reported cells, while Gemini reaches DRS=100% for CTGAN and TVAE\. On Census, LLaMA predicts SYNTHETIC for most samples, while Gemini stays high in C1 but drops for CTGAN and TVAE in C2\. We also compare with a classifier two\-sample test \(C2ST\) and record linkage as distributional baselines, and with a human pilot of 2 annotators and 240 trials\. Our results show that LLM discrimination is a practical privacy audit signal when model choice, per\-provider reporting, and data encoding are handled with care\. For reproducibility, code and experiment scripts are available at [https://github\.com/SlokomManel/LLM\-as\-a\-Discriminator](https://github.com/SlokomManel/LLM-as-a-Discriminator)\.

## 1 Introduction

Organizations increasingly turn to synthetic tabular data as a privacy\-preserving alternative to real microdata\. The key practical question is whether an informed adversary can still tell which records are real\. If so, the protection offered may be weaker than assumed\.

Classical Statistical Disclosure Control \(SDC\) tools, e\.g\., k\-anonymity, microaggregation, and record linkage \[[9](https://arxiv.org/html/2606.09865#bib.bib219),[23](https://arxiv.org/html/2606.09865#bib.bib250),[18](https://arxiv.org/html/2606.09865#bib.bib203)\], were designed for masked or perturbed data, without generative models in mind\. The rise of deep generative models has prompted a shift toward membership inference attacks \(MIAs\) as a privacy risk measure \[[21](https://arxiv.org/html/2606.09865#bib.bib279),[20](https://arxiv.org/html/2606.09865#bib.bib156)\]\. MIA is framed as a binary classification: given a target record and a synthetic release, did the record belong to the original training set? State\-of\-the\-art shadow modeling approaches achieve non\-trivial true positive rates even in the black\-box setting, though no single attack dominates across generators and datasets \[[19](https://arxiv.org/html/2606.09865#bib.bib261)\]\. Attribute inference \[[10](https://arxiv.org/html/2606.09865#bib.bib235)\] and linkage attacks further confirm that synthetic data faces privacy\-utility trade\-offs analogous to classical anonymization\. Yet all these methods share a common limitation: they do not capture whether a *human or AI observer* can directly distinguish a synthetic table from a real one\.

Large language models \(LLMs\) are increasingly used as automated evaluators\. In \[[27](https://arxiv.org/html/2606.09865#bib.bib304)\], the authors showed that strong LLM judges match human preferences at over 80%\. However, using an LLM as a black\-box *discriminator* for privacy auditing of synthetic tabular releases has not been studied systematically\. We address this gap with an *LLM\-as\-Discriminator* protocol under two threat conditions: table only \(C1\) and table plus distributional metadata \(C2\)\. This framing mirrors a realistic adversary with access to the release but not to generator internals, and yields interpretable outputs \(verdict, confidence\) that practitioners can inspect\. We make four contributions: \(i\) an *LLM\-as\-Discriminator* protocol for black\-box privacy auditing under both threat conditions \(code available at [https://github\.com/SlokomManel/LLM\-as\-a\-Discriminator](https://github.com/SlokomManel/LLM-as-a-Discriminator)\); \(ii\) an evaluation on 451 valid trials across two datasets, three synthesis methods, and two LLM families, showing strong model\- and encoding\-dependence; \(iii\) a balanced *Disclosure Risk Score* \(DRS\) compared with empirical baselines \(C2ST and record linkage\), with directional alignment reported; \(iv\) a controlled human pilot \(240 trials\) showing that LLaMA lags human annotators while Gemini matches or exceeds human performance\.

Our experimental evaluation is organized around five research questions, detailed in Section [5](https://arxiv.org/html/2606.09865#S5)\. The remainder of this paper is organized as follows\. Section [2](https://arxiv.org/html/2606.09865#S2) reviews related work on disclosure risk, data synthesis, and LLM\-based evaluation\. Section [3](https://arxiv.org/html/2606.09865#S3) describes the LLM\-as\-Discriminator protocol and the Disclosure Risk Score\. Section [4](https://arxiv.org/html/2606.09865#S4) presents the experimental setup\. Section [5](https://arxiv.org/html/2606.09865#S5) reports our results\. Section [6](https://arxiv.org/html/2606.09865#S6) discusses our findings and connects them to classical privacy measures\. Section [7](https://arxiv.org/html/2606.09865#S7) concludes and outlines directions for future work\. Additional analysis is provided in Appendix [0\.A](https://arxiv.org/html/2606.09865#Pt0.A1)\.

## 2 Background and Related Work

In this section, we review related work in three parts: \(1\) privacy risk measures, \(2\) synthesis methods, and \(3\) LLM\-as\-evaluator\.

### 2\.1 Privacy Risk Measures for Statistical Databases

Privacy risks take two main forms \[[24](https://arxiv.org/html/2606.09865#bib.bib288)\]: identity disclosure, where an adversary links external information to a released record, and attribute disclosure, where an adversary infers sensitive information about an individual\.

SDC\-based measures\. $k$\-anonymity \[[22](https://arxiv.org/html/2606.09865#bib.bib249)\], $\ell$\-diversity, and $t$\-closeness \[[24](https://arxiv.org/html/2606.09865#bib.bib288)\] measure identity and attribute disclosure risk through indistinguishability constraints\. CAP \[[13](https://arxiv.org/html/2606.09865#bib.bib191),[12](https://arxiv.org/html/2606.09865#bib.bib268)\] estimates the probability that an adversary correctly guesses a sensitive attribute value\. These measures are well\-established but were designed for masked or perturbed data, not generative models\.

Attack\-based measures\. MIAs \[[19](https://arxiv.org/html/2606.09865#bib.bib261)\] infer whether a target record belonged to the generator’s training set\. Attribute inference attacks \[[10](https://arxiv.org/html/2606.09865#bib.bib235),[17](https://arxiv.org/html/2606.09865#bib.bib21)\] target sensitive attribute values rather than membership\. Record linkage \[[6](https://arxiv.org/html/2606.09865#bib.bib287),[15](https://arxiv.org/html/2606.09865#bib.bib298)\] measures the risk of matching a synthetic record to an external database\. We use a classifier two\-sample test \(C2ST\) and record linkage as formal baselines \(Section 4\.5\)\.

To the best of our knowledge, none of these methods captures whether an informed observer can directly distinguish a synthetic table from a real one after release\. We address this gap\.

### 2\.2 Synthetic Tabular Data Generation

Synthesis methods for tabular data differ in how they model and sample from the underlying data distribution\. We provide a brief overview of the main approaches\.

Statistical methods\. In \[[14](https://arxiv.org/html/2606.09865#bib.bib233)\], the authors use Gaussian copulas and parametric marginal models to reproduce pairwise column correlations\. CART\-based synthesizers draw records from estimated conditional distributions via recursive partitioning \[[16](https://arxiv.org/html/2606.09865#bib.bib237),[5](https://arxiv.org/html/2606.09865#bib.bib37)\]\. Both approaches are transparent, but their outputs carry predictable statistical structure that attack\-based methods can exploit\.

GAN\-based methods\. CTGAN \[[25](https://arxiv.org/html/2606.09865#bib.bib177)\] adapts the conditional GAN paradigm to heterogeneous tabular data using mode\-specific normalization of continuous columns and conditional vector training\. TVAE applies a variational autoencoder to learn a latent embedding of the joint distribution\. Both methods capture complex inter\-column dependencies more faithfully than statistical baselines\.

Diffusion and language model methods\. TabDDPM \[[11](https://arxiv.org/html/2606.09865#bib.bib297)\] applies denoising diffusion probabilistic models to tabular data and achieves strong fidelity on standard benchmarks\. GReaT \[[1](https://arxiv.org/html/2606.09865#bib.bib306)\] serializes tabular rows as natural language sentences and fine\-tunes a large language model to generate new rows by token\-level completion\. Records produced by these methods can be semantically plausible and resist purely distributional detection, raising questions that statistical evaluation protocols were not designed to answer\.

### 2\.3 LLMs as evaluators

LLMs have shown strong performance across a wide range of tasks and have been extended to tabular data processing \[[26](https://arxiv.org/html/2606.09865#bib.bib305),[4](https://arxiv.org/html/2606.09865#bib.bib301)\]\. More recently, they have been used as automated evaluators replacing or complementing human annotation \[[8](https://arxiv.org/html/2606.09865#bib.bib299)\]\. Zheng et al\. \[[27](https://arxiv.org/html/2606.09865#bib.bib304)\] showed that GPT\-4 judgments agree with human preferences at over 80%, establishing LLM\-as\-a\-Judge as a scalable alternative to human annotation\. Chiang and Lee \[[3](https://arxiv.org/html/2606.09865#bib.bib303)\] found high LLM\-human agreement on well\-defined binary tasks and documented systematic biases such as position bias and verbosity preference\. We follow this binary\-task framing and apply it to a new setting: discriminating real from synthetic tabular records\.

Interpretability of LLM judgments\. LLM judges produce a verdict alongside a natural language rationale that explains their decision \[[7](https://arxiv.org/html/2606.09865#bib.bib302)\]\. This property is useful in privacy auditing, where practitioners need not only a discrimination signal but also an explanation of why a record appears synthetic\. Our proposed method elicits a verdict, a confidence score, and a rationale in a single prompt\. We use these rationales to analyze model reasoning across threat conditions\.

## 3 LLM\-as\-Discriminator Framework

We frame privacy auditing as a binary classification task\. Given a tabular sample $T$ drawn from either a real dataset $\mathcal{R}$ or a synthetic dataset $\mathcal{S}$, a discriminator $\mathcal{D}$ must output three things: a verdict $v\in\{\texttt{REAL},\texttt{SYNTHETIC}\}$, a confidence score $c\in[0,100]$, and a natural\-language rationale\. We instantiate $\mathcal{D}$ with large language models\. LLMs operate at scale, produce explicit and reproducible reasoning, and carry broad implicit knowledge of realistic data distributions acquired during pre\-training\.

The core privacy insight is simple\. If $\mathcal{D}$ achieves accuracy close to the chance baseline of 50%, the synthetic data is *perceptually indistinguishable* from real data\. This provides empirical evidence of privacy robustness under an informed adversary\. If $\mathcal{D}$ achieves substantially higher accuracy, discriminative artifacts are present in the synthetic data\. Such artifacts can be exploited to mount record linkage, membership inference, or attribute inference attacks\.

### 3\.1 Experimental Conditions

We define two threat conditions \(see Table [1](https://arxiv.org/html/2606.09865#S3.T1)\)\. Under C1 \(table only\), the discriminator receives a Markdown\-formatted sample of $N=20$ rows with column names and dataset dimensions, modelling an adversary with access to the data release only\. Under C2 \(table plus metadata\), the same sample is augmented with a structured metadata block computed from the full dataset, including per\-column descriptive statistics \(mean, standard deviation, quartiles, skewness, kurtosis\), Shapiro–Wilk results, top\-$k$ category frequencies, Shannon entropy, and a Pearson correlation matrix\. This models a scenario common in statistical agency releases, where summary statistics are published alongside the synthetic data\.

Table 1: Summary of the two experimental conditions\. Both conditions share the same 20\-row table sample\. They differ in the additional context provided to the discriminator\.
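
To make the C2 metadata block more concrete, the following is a minimal sketch of how a few of the listed statistics could be computed with the Python standard library. The function names and dictionary keys are hypothetical and not taken from the released code; Shapiro–Wilk is omitted because it needs a statistics package, and the use of population moments and base\-2 entropy is an assumption.

```python
import math
import statistics
from collections import Counter


def numeric_summary(values):
    """Descriptive statistics for one numeric column (population moments assumed)."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    q1, median, q3 = statistics.quantiles(values, n=4)
    if std == 0:
        # Constant column: standardized moments are undefined, report 0.0.
        skew, kurt = 0.0, 0.0
    else:
        skew = sum((x - mean) ** 3 for x in values) / (n * std ** 3)
        kurt = sum((x - mean) ** 4 for x in values) / (n * std ** 4) - 3.0
    return {"mean": mean, "std": std, "q1": q1, "median": median, "q3": q3,
            "skewness": skew, "excess_kurtosis": kurt}


def categorical_summary(values, k=3):
    """Top-k relative category frequencies and Shannon entropy in bits."""
    counts = Counter(values)
    n = len(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    top_k = [(category, c / n) for category, c in counts.most_common(k)]
    return {"top_k": top_k, "entropy_bits": entropy}


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Applying `numeric_summary` or `categorical_summary` per column and `pearson` per pair of numeric columns would give one possible metadata dictionary to serialize into the C2 prompt.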

### 3\.2 Prompt Design

Each trial consists of a system prompt and a user prompt\. The system prompt instructs the LLM to act as a data scientist and mandates a structured JSON response with five fields: verdict \(REAL or SYNTHETIC\), confidence \(0–100\), reasoning, red\_flags, and supporting\_evidence\. Temperature is fixed at zero for reproducibility\. The user prompt instantiates the C1 or C2 template with the 20\-row Markdown table and, for C2, the statistical metadata block\. The structured output allows unambiguous extraction of verdicts and makes LLM reasoning auditable\. We provide no few\-shot examples; discrimination accuracy reflects the properties of the synthetic data rather than task\-specific calibration\.
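
As an illustration of this trial structure, the sketch below renders a sample as a Markdown table, assembles a C1 or C2 user prompt, and validates the five\-field JSON response. The prompt wording and all function names are hypothetical; only the field names and value ranges come from the description above, and the actual templates in the released code may differ.

```python
import json

REQUIRED_FIELDS = ("verdict", "confidence", "reasoning",
                   "red_flags", "supporting_evidence")


def to_markdown(rows):
    """Render a list of dict rows (same keys in each) as a Markdown table."""
    columns = list(rows[0].keys())
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(row[c]) for c in columns) + " |" for row in rows]
    return "\n".join([header, divider, *body])


def build_user_prompt(rows, n_total_rows, metadata=None):
    """C1 prompt when metadata is None, C2 prompt otherwise (wording is illustrative)."""
    parts = [
        f"Dataset dimensions: {n_total_rows} rows x {len(rows[0])} columns.",
        "Sample:",
        to_markdown(rows),
    ]
    if metadata is not None:
        parts += ["Distributional metadata:", json.dumps(metadata, indent=2)]
    return "\n\n".join(parts)


def parse_verdict(raw):
    """Return the parsed response dict, or None if it is not a valid trial."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(f not in data for f in REQUIRED_FIELDS):
        return None
    if data["verdict"] not in ("REAL", "SYNTHETIC"):
        return None
    conf = data["confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return None
    if not 0 <= conf <= 100:
        return None
    return data
```

Rejecting malformed responses at parse time is one way to arrive at a count of valid trials; whether the released scripts apply exactly these checks is not stated here.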

### 3\.3 Disclosure Risk Score

We introduce the*Disclosure Risk Score*\(DRS\) to translate raw discrimination accuracy into a privacy\-oriented scalar:

$$\mathrm{DRS}=\min\left(\hat{p}_{\mathrm{REAL}},\;\hat{p}_{\mathrm{SYNTHETIC}}\right), \tag{1}$$

where $\hat{p}_{\mathrm{REAL}}$ is the fraction of REAL trials correctly labeled REAL, and $\hat{p}_{\mathrm{SYNTHETIC}}$ is the fraction of SYNTHETIC trials correctly labeled SYNTHETIC\. We take the minimum of the two per\-class accuracies to guard against label\-bias artifacts\. A discriminator that labels everything REAL achieves $\hat{p}_{\mathrm{REAL}}=1$ but $\hat{p}_{\mathrm{SYNTHETIC}}=0$, yielding $\mathrm{DRS}=0$\. This correctly reflects zero discriminative ability\. A DRS close to 0 indicates that the discriminator cannot distinguish real from synthetic data\. A DRS close to 0\.5 indicates chance\-level balanced discrimination\. A DRS above 0\.5 indicates above\-chance discrimination, which is the privacy\-concerning regime\.
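
Equation \(1\) can be sketched directly in code. The function below is a minimal illustration, assuming ground\-truth labels and verdicts are given as parallel lists of the strings REAL and SYNTHETIC; the function name and the choice to raise an error when a class is missing are assumptions.

```python
def disclosure_risk_score(true_labels, verdicts):
    """DRS = min(per-class accuracy on REAL, per-class accuracy on SYNTHETIC)."""
    real = [v for t, v in zip(true_labels, verdicts) if t == "REAL"]
    synth = [v for t, v in zip(true_labels, verdicts) if t == "SYNTHETIC"]
    if not real or not synth:
        raise ValueError("DRS needs at least one REAL and one SYNTHETIC trial")
    p_real = sum(v == "REAL" for v in real) / len(real)
    p_synthetic = sum(v == "SYNTHETIC" for v in synth) / len(synth)
    return min(p_real, p_synthetic)
```

For example, a discriminator that answers REAL on every trial gets a per\-class accuracy of 1 on REAL and 0 on SYNTHETIC, so the function returns 0, matching the label\-bias case described above.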

### 3\.4 Human Annotation Protocol

We provide a human discrimination baseline using the same C1/C2 trial structure \(screen recording: [https://github\.com/SlokomManel/LLM\-as\-a\-Discriminator/blob/main/results/human/streamlit\-human\_labeling\_app\-2026\-05\-31\-01\-57\-42\.webm](https://github.com/SlokomManel/LLM-as-a-Discriminator/blob/main/results/human/streamlit-human_labeling_app-2026-05-31-01-57-42.webm)\)\. Each session presents an annotator with a 20\-row Markdown table sampled from a real or synthetic dataset; under C1 only the table is shown, under C2 it is augmented with the metadata block from Section [3\.1](https://arxiv.org/html/2606.09865#S3.SS1)\. The annotator records a binary verdict \(REAL or SYNTHETIC\), a confidence score \(0–100\), and optional free\-text notes\. Sessions follow a paired C1 then C2 design: the same 10 samples are shown first without metadata, then with metadata, enabling within\-session measurement of the information effect without sample confounding\.

## 4 Experimental Setup

### 4\.1 Datasets

We use two census datasets with the same income prediction task but different column encodings\. UCI Adult census income \([https://archive\.ics\.uci\.edu/dataset/2/adult](https://archive.ics.uci.edu/dataset/2/adult)\) comprises 30,162 records and 14 attributes \(continuous, ordinal, and categorical\)\. The binary sensitive target is income\. Column values are stored as human\-readable strings \(e\.g\., occupation = “Craft\-repair”\), making the data directly interpretable by an LLM\. ACS Census Income 2018 comprises 32,561 records and 10 attributes\. We use the income\-above\-$50K prediction task\. Unlike UCI Adult, all categorical columns are stored as numeric integer codes\. This presents a different perceptual challenge to the LLM discriminator\.

### 4\.2 Synthetic Data Generation

We generate synthetic data using three methods via the SDV framework \([https://docs\.sdv\.dev/sdv](https://docs.sdv.dev/sdv)\) with default hyperparameters: CTGAN \[[25](https://arxiv.org/html/2606.09865#bib.bib177)\], a conditional GAN with mode\-specific normalization; TVAE \[[25](https://arxiv.org/html/2606.09865#bib.bib177)\], a variational autoencoder over the joint distribution; and Gaussian Copula \[[14](https://arxiv.org/html/2606.09865#bib.bib233)\], a parametric baseline using marginal\-specific transformations\. Each method produces a synthetic dataset of the same dimensions as the real data\.

### 4\.3 LLM models evaluated

### 4\.4 Trial design

For each synthesizer and condition pair we planned $T=200$ trials, stratified equally between real and synthetic tables\. Each trial samples 20 rows uniformly at random\. Trial order and labels are randomized to prevent position bias\. Each trial yields one structured JSON response containing a binary verdict, a confidence score, and free\-text reasoning\.
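
The trial construction for one synthesizer and condition cell might look like the sketch below. It assumes rows are sampled without replacement within a trial and that a fixed seed is used for reproducibility; neither detail is specified above, and `make_trials` is a hypothetical name.

```python
import random


def make_trials(real_rows, synthetic_rows, n_trials=200, rows_per_trial=20, seed=0):
    """Build a shuffled, label-balanced list of trials for one cell."""
    if n_trials % 2:
        raise ValueError("n_trials must be even for an equal REAL/SYNTHETIC split")
    rng = random.Random(seed)
    labels = ["REAL"] * (n_trials // 2) + ["SYNTHETIC"] * (n_trials // 2)
    rng.shuffle(labels)  # randomize trial order so labels follow no pattern
    trials = []
    for label in labels:
        pool = real_rows if label == "REAL" else synthetic_rows
        # Uniform sample of rows from the matching dataset (without replacement).
        trials.append({"label": label, "rows": rng.sample(pool, rows_per_trial)})
    return trials
```

Each returned trial keeps its ground\-truth label next to the sampled rows, so the label is available for scoring but is never placed in the prompt.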

API failures and final sample size\. For UCI Adult, we planned 3,600 trials in total \(3 synthesizers $\times$ 2 conditions $\times$ 200 trials per cell\)\. Of these, 3,263 calls \(90\.6%\) failed due to HTTP 429 quota errors and model deprecations\. The final UCI Adult sample comprises 337 valid verdicts: 287 from llama\-3\.1\-8b\-instant and 50 from gemini\-2\.5\-flash\. For ACS Census, we used the same two models with $T=10$ trials per cell, yielding 114 valid verdicts \(54 from Gemini, 60 from Groq\)\. Table [2](https://arxiv.org/html/2606.09865#S4.T2) reports the trial counts per cell\.

Table 2: Trial counts per synthesizer and condition cell after discarding API failures\.

### 4\.5 Formal privacy baselines

We validate the LLM\-based DRS against two empirical privacy measures\. Classifier two\-sample test \(C2ST\): We train a logistic regression classifier on a balanced pool of 70% of real and synthetic records and evaluate its discrimination accuracy on the held\-out 30%\. The C2ST score is the overall accuracy of this classifier, measuring whether the two populations are statistically separable\. This is a distributional detection test, not a membership inference attack: it asks whether real and synthetic records come from the same distribution, not whether any individual record was in the generator’s training set\. Record linkage risk: For each synthetic record, we compute the Euclidean distance to its nearest real record in feature space after min\-max normalization\. The linkage success rate is the fraction of synthetic records whose nearest\-neighbor distance falls within the 10th percentile threshold\.
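
The record linkage baseline can be sketched in plain Python as follows. This is an illustration under stated assumptions: features are already numeric, min\-max ranges are taken from the pooled real and synthetic records, and the distance threshold is passed in by the caller, because the reference distribution for the 10th\-percentile threshold is not spelled out above. All function names are hypothetical.

```python
import math


def min_max_normalize(real, synthetic):
    """Scale every feature to [0, 1] using ranges of the pooled data (assumption)."""
    pooled = real + synthetic
    lows = [min(col) for col in zip(*pooled)]
    highs = [max(col) for col in zip(*pooled)]

    def scale(row):
        return [(x - lo) / (hi - lo) if hi > lo else 0.0
                for x, lo, hi in zip(row, lows, highs)]

    return [scale(r) for r in real], [scale(s) for s in synthetic]


def nearest_real_distances(real, synthetic):
    """Euclidean distance from each synthetic record to its nearest real record."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


def linkage_rate(real, synthetic, threshold):
    """Fraction of synthetic records whose nearest real record lies within threshold."""
    real_n, synth_n = min_max_normalize(real, synthetic)
    distances = nearest_real_distances(real_n, synth_n)
    return sum(d <= threshold for d in distances) / len(distances)
```

The brute\-force nearest\-neighbor search is quadratic in the number of records, so a spatial index would be preferable for datasets of the size used here; the sketch favors readability over speed.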

### 4\.6 Human annotation study

Two human annotators \(HA1 and HA2\) independently completed six annotation sessions each: 2 datasets $\times$ 3 synthesis methods, covering both C1 and C2 conditions with 10 trials per session \(5 real and 5 synthetic\)\. The study yields 240 unique trials \(120 per annotator\)\. Both annotators achieved full coverage across methods, conditions, and datasets\. We apply the DRS formula \(Equation [1](https://arxiv.org/html/2606.09865#S3.E1)\) identically to human and LLM verdicts, enabling direct comparison\.

## 5 Experimental Results

We report results in five steps: baseline LLM discrimination, metadata effects, method ranking, alignment with formal baselines, and calibration against human judgment\. All DRS values are reported per provider\.

### 5\.1 RQ1: Baseline LLM Discrimination

Tables [3](https://arxiv.org/html/2606.09865#S5.T3) and [4](https://arxiv.org/html/2606.09865#S5.T4) report per\-provider DRS for both datasets by synthesis method and condition, under the minimum\-coverage matched protocol \($N\geq 8$ per cell, $N_{\text{Groq}}=N_{\text{Gemini}}$ within each included cell\)\. The two providers show contrasting and dataset\-dependent behavior\.

Table 3: UCI Adult DRS by provider, synthesis method, and condition \(minimum\-coverage matched protocol, $N_{\text{Groq}}=N_{\text{Gemini}}$ per cell\)\. Groq labels every table as Real \($\hat{p}_{\text{REAL}}=100\%$, $\hat{p}_{\text{SYN}}=0\%$\), collapsing DRS to 0%\. Gemini discriminates perfectly for CTGAN and TVAE under both conditions\. Gaussian Copula cells excluded \($N_{\text{Gemini}}<8$\)\.

Table 4: ACS Census DRS by provider, synthesis method, and condition \(minimum\-coverage matched protocol, $N_{\text{Groq}}=N_{\text{Gemini}}$ per cell\)\. Groq labels every table as Synthetic \($\hat{p}_{\text{REAL}}=0\%$, $\hat{p}_{\text{SYN}}=100\%$\), the inverse collapse from Adult\. Gemini achieves DRS $=100\%$ under C1 for all three methods; under C2 CTGAN drops to DRS $=0\%$ and TVAE to DRS $=25\%$\.

Groq \(LLaMA\-3\.1\-8b\)\. Groq yields DRS = 0% across all reported cells on both datasets via two opposing mechanisms\. On UCI Adult, it labels 98\.3% of tables as REAL, defaulting to the majority label when no clear artifact is present\. On ACS Census, the bias inverts: every table is labeled SYNTHETIC because genuine ACS properties such as heavy\-tailed income and many unique categorical values are flagged as synthetic artifacts, even though these appear in both real and synthetic records\. In both cases DRS = 0% reflects a collapsed discriminator, not good privacy protection\.

Google \(Gemini\-2\.5\-Flash\)\. On UCI Adult, Gemini achieves DRS = 100% for CTGAN and TVAE under both conditions\. On ACS Census, DRS = 100% under C1 for all three synthesizers but degrades under C2: CTGAN drops to 0%, TVAE to 25%, while Gaussian Copula holds at 100%\. This suggests that distributional metadata disrupts the detection signal when real and synthetic marginals are closely aligned\.

Overconfidence\. Mean self\-reported LLM confidence is 82–99% regardless of verdict correctness \[[2](https://arxiv.org/html/2606.09865#bib.bib300)\]\. Raw confidence scores should not be used as a privacy signal without calibration\.
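
One simple way to check whether self\-reported confidence carries any signal is to compare its mean on correct and incorrect verdicts. The sketch below is a hypothetical helper, not the analysis code used for the reported figures; it assumes confidences on the 0–100 scale and returns `None` for an empty group.

```python
from statistics import fmean


def confidence_by_correctness(true_labels, verdicts, confidences):
    """Return (mean confidence when correct, mean confidence when incorrect)."""
    triples = list(zip(true_labels, verdicts, confidences))
    correct = [c for t, v, c in triples if t == v]
    incorrect = [c for t, v, c in triples if t != v]
    return (fmean(correct) if correct else None,
            fmean(incorrect) if incorrect else None)
```

If the two means are close, or the mean on incorrect verdicts is higher, the raw confidence is uninformative about correctness, which is the overconfidence pattern described above.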

### 5\.2 RQ2 and RQ3: Metadata effects and method ranking

Because LLaMA \(via Groq\) collapses to DRS = 0% under all conditions and methods, neither the C1 to C2 metadata effect nor the cross\-method ranking is estimable from its outputs\. We therefore focus on Gemini\.

On UCI Adult, metadata has no effect on Gemini: DRS = 100% under both C1 and C2 for CTGAN and TVAE\. On ACS Census, metadata reduces discrimination for CTGAN \(100% to 0%\) and TVAE \(100% to 25%\), while Gaussian Copula is unaffected\. This shows that the effect of distributional metadata is dataset\- and method\-dependent, not uniformly risk\-increasing\.

For method ranking, Gemini achieves DRS = 100% in most cells, with condition\-sensitive reversals on Census C2 for CTGAN and TVAE\. No synthesis method dominates across all provider and condition combinations\. Privacy rankings are only meaningful when reported per provider and per condition\.

### 5\.3 RQ4: Alignment with formal privacy measures

Table [5](https://arxiv.org/html/2606.09865#S5.T5) compares Gemini DRS against two formal privacy baselines across both datasets and all synthesis methods\. We use Gemini only, as Groq provides no meaningful discrimination signal\. We observe a ceiling effect on the UCI Adult dataset\. Gemini achieves DRS = 100% for CTGAN and TVAE under both conditions\. This ceiling prevents any DRS\-based method ranking\. C2ST still differentiates: TVAE \(0\.725\) is well above chance, while CTGAN \(0\.538\) and Gaussian Copula \(0\.510\) are near chance\. Yet all three are equally detectable by Gemini\. On ACS Census, we see partial alignment\. Under C1, all methods reach DRS = 100%\. Under C2, DRS degrades for CTGAN \(100% to 0%\) and TVAE \(100% to 25%\), while Gaussian Copula holds at 100%\. Within C2, the DRS ordering \(Gaussian Copula $>$ TVAE $>$ CTGAN\) is inverted relative to C2ST \(TVAE $>$ CTGAN $>$ Gaussian Copula\)\. Gaussian Copula has a low C2ST score \(0\.542\) yet consistently high DRS, suggesting its distributional artifacts remain perceptually salient regardless of metadata context\.

Table 5: Gemini DRS per condition versus formal privacy baselines on UCI Adult and ACS Census\. Gaussian Copula on UCI Adult is excluded \($N_{\text{Gemini}}<8$\)\. C2ST = accuracy of a logistic regression separating real from synthetic records\. Linkage rate = fraction of synthetic records within the 10th\-percentile nearest\-neighbor distance to real data\.

### 5\.4 RQ5: Human annotation baseline

Figure [1](https://arxiv.org/html/2606.09865#S5.F1) reports the human and LLM comparison with explicit model labels, split by condition \(C1 and C2\), and broken down by dataset and synthesis method\.

![Refer to caption](https://arxiv.org/html/2606.09865v1/x1.png)

Figure 1: Cell\-level DRS by dataset, synthesis method, and condition \(C1/C2\) for HA1, HA2, and Gemini \(gemini\-2\.5\-flash\)\.

Several findings emerge\. First, human annotators outperform LLaMA\. Both annotators substantially exceed LLaMA\-3\.1\-8b across both datasets \(pooled DRS = 66\.4% vs\. LLaMA $\approx$ 0%\)\. On UCI Adult, LLaMA achieves DRS = 0% in five of six cells due to REAL\-label bias\. On ACS Census it achieves DRS = 0% in all six cells due to SYNTHETIC\-label bias\. Human annotators correctly identify synthetic records in both settings\. Second, Gemini matches or exceeds human performance\. For CTGAN on UCI Adult, Gemini achieves DRS = 100% under both conditions, above both annotators \(HA1: 50% and 33%, HA2: 60% and 100% under C1 and C2 respectively\)\. On ACS Census, Gemini achieves DRS = 100% for all three synthesizers under C1, exceeding human performance in most cells\. Frontier LLMs can match or surpass trained human judgment on this task\. Third, inter\-annotator agreement is moderate\. The mean absolute DRS difference between annotators is 20\.6 pp\. HA2 achieves consistently higher DRS \(77\.9%\) and accuracy \(80\.0%\)\. Both annotators show above\-chance confidence calibration: correct predictions carry a higher mean confidence than incorrect ones \(HA1: 62\.3% vs\. 53\.6%; HA2: 58\.9% vs\. 50\.5%\)\. Finally, both annotators independently surface the same structural artifacts as Gemini: broken education/education\_num mappings, gender and relationship contradictions, and implausible age and education combinations\. We confirm that structural constraint violations are the primary perceptual discriminator for both humans and frontier LLMs\.

## 6 Discussion

Our main takeaway is practical: LLM discrimination is a useful first\-pass privacy screen only when interpreted per model\. In several cells, discriminator capability dominates synthesis\-method effects\. We use DRS as a screening metric, not as a standalone privacy guarantee\.

Connecting DRS to traditional privacy measures\. Table [6](https://arxiv.org/html/2606.09865#S6.T6) situates LLM\-based DRS in the broader privacy evaluation landscape\. Each metric targets a different privacy dimension: DP gives worst\-case guarantees but requires white\-box access; k\-anonymity checks structural grouping but misses artifacts outside quasi\-identifiers; MIA \[[19](https://arxiv.org/html/2606.09865#bib.bib261)\] targets record\-level memorization but requires shadow\-model training; record linkage measures feature\-space proximity but not semantic plausibility\. DRS captures perceptual discriminability without model access and produces interpretable reasoning\. No single metric is sufficient\. We recommend DRS and C2ST as fast distributional screens, followed by MIA when record\-level memorization is a concern\.

Table 6: Comparison of privacy evaluation paradigms\.

LLM Reasoning Patterns\. We qualitatively analyzed all 337 valid UCI Adult verdicts\. Gemini cites one decisive structural violation per verdict\. LLaMA lists generic statistical flags that appear in both real and synthetic records, which explains its REAL\-label bias\.

#### Structural constraint violations\.

Gemini’s primary SYNTHETIC cue is the broken dependency between education and education\_num \(e\.g\., “Bachelors” $\to$ 13, “10th” $\to$ 6 in UCI Adult\)\. CTGAN and TVAE break this mapping; Gemini flags it as decisive: “‘education\_num’ does not maintain a 1\-to\-1 mapping with ‘education’; e\.g\., ‘10th’ grade is assigned 4, not 6\.” We also observe a marital status and relationship contradiction: CTGAN pairs “Never\-married” with “Husband” in 4\.5% of records\.
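
Both structural checks described here can also be run programmatically. The sketch below is illustrative only: it assumes rows are dictionaries, and the column names `marital_status` and `relationship` are guesses at the schema \(only `education` and `education_num` are named above\).

```python
from collections import defaultdict


def education_mapping_violations(rows):
    """Return education labels that map to more than one education_num value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["education"]].add(row["education_num"])
    return {edu: sorted(nums) for edu, nums in seen.items() if len(nums) > 1}


def marital_contradiction_rate(rows):
    """Fraction of rows pairing 'Never-married' with the 'Husband' relationship."""
    hits = sum(
        row["marital_status"] == "Never-married" and row["relationship"] == "Husband"
        for row in rows
    )
    return hits / len(rows)
```

The first check only detects inconsistency within the supplied rows; comparing against a reference mapping taken from the real data would additionally catch a label that is consistently mapped to the wrong number.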

#### Capital gain/loss anomalies\.

UCI Adult has highly sparse capital gain and loss fields \(over 90% zeros, ceiling of 99,999\)\. Synthetic generators fill in small non\-zero values; LLaMA flags: “High skewness in ‘capital\_gain’ and ‘capital\_loss’; large number of zeros in capital\_gain\.”

## 7 Conclusion, Limitations, and Future Work

We proposed and evaluated an LLM\-as\-Discriminator protocol for privacy auditing of synthetic tabular data \(451 valid trials, two datasets, three synthesis methods, two model families\)\. The core finding is simple: the discriminator model matters as much as the synthesizer\. On UCI Adult, Gemini achieves DRS = 100% for CTGAN and TVAE while LLaMA collapses to DRS = 0% via REAL\-label bias\. On ACS Census, the same split appears in reverse\. Pooled cross\-model DRS is misleading and should not be the primary reported result\. DRS shows directional consistency with formal baselines, and the human pilot calibrates the LLM signal: pooled human DRS is 66\.4%, well above LLaMA, with Gemini matching or exceeding human performance\. Model capability is the bottleneck, not the LLM\-as\-discriminator paradigm itself\.

Our study has four main limitations. First, 90.6% of planned trials failed due to API quota exhaustion, leaving per-cell $N$ as low as 9; future work should use locally hosted models. Second, most valid verdicts come from a single 8B-parameter model; more capable models may behave differently. Third, DRS is measured under a single zero-shot prompt; chain-of-thought or few-shot designs remain unexplored. Fourth, we cover three synthesis methods on two datasets; diffusion-based \[[11](https://arxiv.org/html/2606.09865#bib.bib297)\] and LLM-based synthesizers \[[1](https://arxiv.org/html/2606.09865#bib.bib306)\] remain unevaluated.

The most important next steps are: \(i\) encoding\-aware auditing via automatic column decoding; \(ii\) broader validation across more datasets and synthesis methods; \(iii\) stable local model deployments to avoid API sparsity; \(iv\) larger human annotation studies; and \(v\) joint privacy\-utility reporting\.

## Appendix 0.A

### 0.A.1 Matched-Sampling Robustness (Seed Sweep)

We swept 50 independent matched-downsampling seeds (1000–1049) and recomputed per-cell DRS across all included cells. Figure [2](https://arxiv.org/html/2606.09865#Pt0.A1.F2) reports empirical 95% intervals. Results are stable: 19 of 20 cells have effectively zero interval width. The single exception is Adult TVAE C2 under Groq (mean DRS = 22.1%, interval [0.0, 73.1] pp, $N=11$). Mean interval width is 0.0 pp for Gemini and 7.3 pp for Groq. The core finding (Gemini: high discriminability; Groq: collapse) is seed-independent. Low-$N$ Groq cells warrant uncertainty-aware interpretation.
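A seed sweep of this kind can be sketched as follows. This is an illustration under stated assumptions, not the released experiment script: DRS is assumed here to be the percentage of synthetic tables labelled SYNTHETIC within a cell, matched downsampling is assumed to mean drawing a fixed number of verdicts per cell without replacement, and the empirical 95% interval is taken as the 2.5th and 97.5th percentiles over seeds. All function names and the toy verdict lists are hypothetical.

```python
import random

def drs(verdicts):
    """Assumed DRS: percentage of synthetic tables labelled SYNTHETIC."""
    return 100.0 * sum(v == "SYNTHETIC" for v in verdicts) / len(verdicts)

def percentile(sorted_vals, q):
    """Linear-interpolation percentile for q in [0, 100] on a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def seed_sweep(verdicts, n_matched, seeds=range(1000, 1050)):
    """Recompute DRS on a matched-size subsample for every seed.

    Returns (mean DRS, lower bound, upper bound) of the empirical 95% interval.
    """
    scores = []
    for seed in seeds:
        sample = random.Random(seed).sample(verdicts, n_matched)
        scores.append(drs(sample))
    scores.sort()
    mean = sum(scores) / len(scores)
    return mean, percentile(scores, 2.5), percentile(scores, 97.5)

# Hypothetical cells: one unanimous, one mixed with few verdicts.
unanimous_cell = ["SYNTHETIC"] * 20
mixed_cell = ["SYNTHETIC"] * 5 + ["REAL"] * 15

stable = seed_sweep(unanimous_cell, n_matched=11)
unstable = seed_sweep(mixed_cell, n_matched=11)
print(stable, unstable)
```

A unanimous cell yields a zero-width interval regardless of seed, while a small mixed cell yields a wide one, which mirrors the pattern reported above for the stable cells and the low-$N$ exception.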

![Refer to caption](https://arxiv.org/html/2606.09865v1/x2.png)

Figure 2: Cell-level DRS stability over 50 random seeds. Points show mean DRS; whiskers show empirical 95% intervals. Each point is annotated with matched per-cell sample size $N$. Uncertainty is concentrated in Adult TVAE C2 for Groq.

### 0.A.2 Per-Model Accuracy and Label Bias

Table [7](https://arxiv.org/html/2606.09865#Pt0.A1.T7) reports overall accuracy and REAL-label prediction rate per model on UCI Adult. Figure [3](https://arxiv.org/html/2606.09865#Pt0.A1.F3) shows REAL-label prediction rate and mean confidence by model.
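The two per-model quantities can be computed from a list of trials as sketched below. The trial format (a dictionary with `truth` and `verdict` keys) and the function name `label_bias` are assumptions made for illustration and may differ from the released scripts.

```python
def label_bias(trials):
    """Return (accuracy, REAL-label prediction rate) for one model's trials."""
    n = len(trials)
    accuracy = sum(t["truth"] == t["verdict"] for t in trials) / n
    real_rate = sum(t["verdict"] == "REAL" for t in trials) / n
    return accuracy, real_rate

# Hypothetical model that answers REAL for every table on a balanced set.
always_real = [
    {"truth": "REAL", "verdict": "REAL"},
    {"truth": "REAL", "verdict": "REAL"},
    {"truth": "SYNTHETIC", "verdict": "REAL"},
    {"truth": "SYNTHETIC", "verdict": "REAL"},
]

accuracy, real_rate = label_bias(always_real)
print(accuracy, real_rate)
```

Reporting both numbers side by side shows why accuracy alone can mislead: a model that always answers REAL is right on every real table and wrong on every synthetic one, so on a balanced set its accuracy sits at chance.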

Table 7: Per-model discrimination accuracy and REAL-label prediction rate on UCI Adult (337 valid verdicts).

![Refer to caption](https://arxiv.org/html/2606.09865v1/x3.png)

Figure 3: REAL-label prediction rate (left) and mean LLM confidence (right) by model. LLaMA labels 98.3% of Adult tables as REAL; Gemini shows lower label bias and higher accuracy.

### 0.A.3 Reasoning Theme Prevalence

Table [8](https://arxiv.org/html/2606.09865#Pt0.A1.T8) reports the eight dominant reasoning themes identified in the qualitative analysis of 337 UCI Adult verdicts, with prevalence broken down by provider.

Table 8: Reasoning theme prevalence by provider (UCI Adult, 337 valid verdicts). Percentages show the fraction of each provider’s records in which the theme appears in red\_flags or reasoning text.

## References

- \[1\] V. Borisov, K. Sessler, T. Leemann, M. Pawelczyk, and G. Kasneci (2023) Language models are realistic tabular data generators. In The Eleventh International Conference on Learning Representations.
- \[2\] P. Chhikara. Mind the confidence gap: overconfidence, calibration, and distractor effects in large language models. Transactions on Machine Learning Research.
- \[3\] C. Chiang and H. Lee (2023) Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15607–15631.
- \[4\] X. Fang, W. Xu, F. A. Tan, Z. Hu, J. Zhang, Y. Qi, S. H. Sengamedu, and C. Faloutsos. Large language models (LLMs) on tabular data: prediction, generation, and understanding - a survey. Transactions on Machine Learning Research.
- \[5\] United Nations Economic Commission for Europe et al. (2023) Synthetic data for official statistics: a starter guide.
- \[6\] S. Garfinkel et al. (2015) De-identification of personal information. US Department of Commerce, National Institute of Standards and Technology.
- \[7\] S. Han, G. T. Junior, T. Balough, and W. Zhou (2025) Judge’s verdict: a comprehensive analysis of LLM judge capability through human agreement. arXiv preprint arXiv:2510.09738.
- \[8\] R. Hu, Y. Cheng, L. Meng, J. Xia, Y. Zong, X. Shi, and W. Lin (2025) Training an LLM-as-a-judge model: pipeline, insights, and practical lessons. In Companion Proceedings of the ACM on Web Conference 2025, WWW ’25, pp. 228–237. ISBN 9798400713316.
- \[9\] A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, E. S. Nordholt, K. Spicer, and P. De Wolf (2012) Statistical disclosure control. John Wiley & Sons.
- \[10\] B. Jayaraman and D. Evans (2022) Are attribute inference attacks just imputation? In Proceedings of the ACM International Conference on Computer and Communications Security, pp. 1569–1582. ISBN 9781450394505.
- \[11\] A. Kotelnikov, D. Baranchuk, I. Rubachev, and A. Babenko (2023) TabDDPM: modelling tabular data with diffusion models. In Proceedings of the 40th International Conference on Machine Learning, ICML’23.
- \[12\] C. Little, M. Elliot, and R. Allmendinger (2022) Comparing the utility and disclosure risk of synthetic data with samples of microdata. In Proceedings of the International Conference on Privacy in Statistical Databases, J. Domingo-Ferrer and M. Laurent (Eds.), pp. 234–249. ISBN 978-3-031-13944-4.
- \[13\] M. Elliot (2014) Final report on the disclosure risk associated with synthetic data produced by the SYLLS Team. Website: [http://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/reports/](http://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/reports/), last accessed 26 June 2022.
- \[14\] N. Patki, R. Wedge, and K. Veeramachaneni (2016) The synthetic data vault. In IEEE International Conference on Data Science and Advanced Analytics, pp. 399–410.
- \[15\] J. Powar and A. R. Beresford (2023) SoK: managing risks of linkage attacks on data privacy. Proceedings on Privacy Enhancing Technologies.
- \[16\] J. P. Reiter (2005) Using CART to generate partially synthetic public use microdata. Journal of Official Statistics 21(3), pp. 441.
- \[17\] S. Salamatian, A. Zhang, F. d. P. Calmon, S. Bhamidipati, N. Fawaz, B. Kveton, P. Oliveira, and N. Taft (2013) How to hide the elephant- or the donkey- in the room: practical privacy against statistical inference for large data. In Global Conference on Signal and Information Processing, pp. 269–272.
- \[18\] N. Shlomo (2022) How to measure disclosure risk in microdata? The Survey Statistician 86(2), pp. 13–21.
- \[19\] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy, pp. 3–18.
- \[20\] R. Shokri (2015) Quantifying and protecting location privacy. it-Information Technology 57(4), pp. 257–263.
- \[21\] T. Stadler, B. Oprisanu, and C. Troncoso (2020) Synthetic data – anonymisation groundhog day. In 29th USENIX Security Symposium.
- \[22\] L. Sweeney (2002) Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(05), pp. 571–588.
- \[23\] V. Torra (2017) Masking methods. In Data Privacy: Foundations, New Developments and the Big Data Challenge, pp. 191–238.
- \[24\] V. Torra (2017) Privacy models and disclosure risk measures. In Data Privacy: Foundations, New Developments and the Big Data Challenge, pp. 111–189. ISBN 978-3-319-57358-8.
- \[25\] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni (2019) Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alche-Buc, E. Fox, and R. Garnett (Eds.), pp. 7335–7345.
- \[26\] W. X. Zhao, K. Zhou, J. Li, T. Tang, Z. Dong, Y. Hou, B. Zhang, Y. Min, J. Zhang, P. Liu, et al. (2026) A survey of large language models. Frontiers of Computer Science 20(12), pp. 2012627.
- \[27\] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023) Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36, pp. 46595–46623.
