Geometry-Aware Tabular Diffusion

arXiv cs.LG 06/03/26, 04:00 AM Papers
Summary
Introduces Geometry-Aware Tabular Diffusion (GATD), which augments tabular diffusion denoisers with explicit pairwise geometric features. Achieves state-of-the-art performance on ten benchmarks while using significantly fewer parameters.
arXiv:2606.02607v1 Announce Type: new Abstract: Tabular synthesis is critical for privacy-preserving sharing and augmentation, yet diffusion models rely on implicit mechanisms to capture inter-column relationships. We introduce Geometry-Aware Tabular Diffusion (GATD), which augments tabular diffusion denoisers with pairwise angles and lengths computed from column value differences and used as inputs and auxiliary targets. Our MLP instantiation achieves state-of-the-art benchmark performance while using 3.5x fewer parameters on average (up to 25x for classification tasks): on ten datasets, it wins 8/10 Shape, 7/10 Trend, and 9/10 downstream utility (F1/RMSE), reducing Shape and Trend error by 27% and 20%. Default loss weights transfer to GNN and Transformer denoisers, improving Shape on 27/30 and Trend on 25/30 architecture-dataset cells. A matched ablation shows supervision (not extra inputs or capacity) drives the gain. This shows explicit relational supervision is a portable inductive bias for tabular diffusion.
Cached at: 06/03/26, 09:39 AM
# Geometry-Aware Tabular Diffusion
Source: [https://arxiv.org/html/2606.02607](https://arxiv.org/html/2606.02607)
###### Abstract

Tabular synthesis is critical for privacy-preserving sharing and augmentation, yet diffusion models rely on implicit mechanisms to capture inter-column relationships. We introduce Geometry-Aware Tabular Diffusion (GATD), which augments tabular diffusion denoisers with pairwise angles and lengths computed from column value differences and used as inputs and auxiliary targets. Our MLP instantiation achieves state-of-the-art benchmark performance while using $3.5\times$ fewer parameters on average (up to $25\times$ for classification tasks): on ten datasets, it wins 8/10 Shape, 7/10 Trend, and 9/10 downstream utility (F1/RMSE), reducing Shape and Trend error by 27% and 20%. Default loss weights transfer to GNN and Transformer denoisers, improving Shape on 27/30 and Trend on 25/30 architecture-dataset cells. A matched ablation shows supervision (not extra inputs or capacity) drives the gain. This shows explicit relational supervision is a portable inductive bias for tabular diffusion.

Diffusion Models, Tabular Data, Geometric Deep Learning, Data Synthesis

## 1 Introduction

Tabular data remains the dominant format in enterprise applications, healthcare, and scientific research. The ability to synthesize realistic tabular data enables privacy-preserving data sharing (Zhang et al., [2024](https://arxiv.org/html/2606.02607#bib.bib21)), augmentation of limited training sets (Kotelnikov et al., [2023](https://arxiv.org/html/2606.02607#bib.bib11)), and downstream model development without exposing sensitive records. However, tabular synthesis presents unique difficulties. Unlike images or text, tabular data exhibits heterogeneous column types, complex inter-column dependencies, and highly non-Gaussian marginal distributions, all properties that have challenged deep generative models.

Diffusion models have recently emerged as a promising approach to tabular synthesis. Methods such as TabDDPM (Kotelnikov et al., [2023](https://arxiv.org/html/2606.02607#bib.bib11)), STaSy (Kim et al., [2023](https://arxiv.org/html/2606.02607#bib.bib9)), TabSyn (Zhang et al., [2024](https://arxiv.org/html/2606.02607#bib.bib21)), and TabDiff (Shi et al., [2025](https://arxiv.org/html/2606.02607#bib.bib16)) adapt the denoising framework to mixed continuous-categorical data, achieving strong results on standard benchmarks. Among these, transformer-based architectures have become prevalent: self-attention mechanisms provide a flexible means to model relationships between columns, allowing the network to learn what should covary and how.

Yet this flexibility leaves inter-column structure to be inferred from the denoising objective alone. This raises a natural question: *can we provide explicit relational structure as an auxiliary supervision signal, and does the same signal transfer across denoising architectures within tabular diffusion?*

We answer affirmatively to both. This paper introduces Geometry-Aware Tabular Diffusion (GATD), which augments tabular diffusion denoisers with explicit pairwise geometric features computed directly from column values: an angle capturing the directional relationship between columns and a length capturing magnitude (Figure [1](https://arxiv.org/html/2606.02607#S2.F1); full definitions in Section [3.2](https://arxiv.org/html/2606.02607#S3.SS2)). These features are provided as model inputs, and crucially, the model is trained to predict them via auxiliary losses. The geometric representation provides an explicit encoding of inter-column structure that, we find, transfers across architecturally distinct diffusion denoisers.

Our claim is not that attention or message passing cannot learn such structure, but that explicit geometric supervision can reduce the burden on the denoiser and provide a portable relational inductive bias\.

A key finding is that geometric *supervision* is essential, not merely geometric *inputs*: an architecture-matched ablation shows that supplying geometric inputs and prediction heads without supervision yields no benefit (Cohen’s $d=-0.08$), while restoring supervision produces a large effect ($d=0.81$; Section [4.3](https://arxiv.org/html/2606.02607#S4.SS3)). The auxiliary prediction task forces the network to internalize inter-column structure; architectural machinery alone produces no benefit.

We evaluate the geometric signal as a drop-in module across three diffusion denoising backbones: a residual Diffusion-MLP, a GNN with Laplacian-eigenmap positional encoding, and a column-wise Transformer. All use the same default geometric loss weights, $(\lambda_{\theta},\lambda_{\ell},\lambda_{c})=(15,15,8)$, on ten benchmark datasets with 3 training seeds and 20 generation seeds per cell. Full per-architecture statistics and the MLP+Geom-vs-TabDiff comparison appear in Section [4.2](https://arxiv.org/html/2606.02607#S4.SS2) and Section [4.4](https://arxiv.org/html/2606.02607#S4.SS4). As a corollary of the cross-architecture portability claim, the compact MLP instantiation matches or exceeds TabDiff.

A previously-reported categorical-anchor mechanism on the MLP backbone ($\rho=0.70$, $p=0.025$) does not generalize across architectures (Section [4.5](https://arxiv.org/html/2606.02607#S4.SS5), Appendix [A.4](https://arxiv.org/html/2606.02607#A1.SS4)): we characterize categorical structure as one operating regime among several, not a necessary condition for +Geom.

##### Contributions.

We contribute: (1) pairwise angle/length features for tabular diffusion, used as inputs and auxiliary targets; (2) an architecture-matched supervision ablation showing InputsOnly is indistinguishable from NoGeom ($d=-0.08$) while supervised geometry is large-effect ($d=0.81$); (3) portability across MLP, GNN, and Transformer denoisers (27/30 Shape, 25/30 Trend wins) with shared defaults; (4) an efficient MLP instantiation matching/exceeding TabDiff; and (5) practical guidance on $O(d^{2})$ scaling, sampling, and loss weights.

## 2 Related Work

### 2.1 Diffusion Models for Tabular Data

Diffusion models have emerged as a powerful alternative to GAN- and VAE-based tabular generators such as CTGAN and TVAE, offering stable training and strong distributional coverage.

TabDDPM (Kotelnikov et al., [2023](https://arxiv.org/html/2606.02607#bib.bib11)) pioneered diffusion for tabular data, combining Gaussian diffusion for continuous columns with multinomial diffusion (Hoogeboom et al., [2021](https://arxiv.org/html/2606.02607#bib.bib7)) for categoricals. STaSy (Kim et al., [2023](https://arxiv.org/html/2606.02607#bib.bib9)) used score-based methods with self-paced learning. CoDi (Lee et al., [2023](https://arxiv.org/html/2606.02607#bib.bib12)) proposed co-evolving contrastive diffusion with separate models for continuous and categorical columns. TabSyn (Zhang et al., [2024](https://arxiv.org/html/2606.02607#bib.bib21)) introduced a VAE-then-diffusion approach, applying diffusion in a learned latent space.

TabDiff (Shi et al., [2025](https://arxiv.org/html/2606.02607#bib.bib16)) unifies continuous and categorical diffusion by combining EDM (Karras et al., [2022](https://arxiv.org/html/2606.02607#bib.bib8)) for continuous columns with masked diffusion (Austin et al., [2021](https://arxiv.org/html/2606.02607#bib.bib2)) for categoricals, adding learnable per-column noise schedules and a transformer architecture for modeling column relationships. TabDiff achieves state-of-the-art results, outperforming prior methods (TabDDPM, STaSy, CoDi, TabSyn, CTGAN, TVAE) on 17 of 21 measures for 3 core metrics across 7 benchmark datasets; we therefore adopt TabDiff as our primary baseline.

We use the same diffusion losses \(EDM for continuous, masked cross\-entropy for categorical\), holding the diffusion framework constant to isolate the contribution of explicit geometric supervision \(Section[3](https://arxiv.org/html/2606.02607#S3)\) and the reflection\-based boundary handling described below\. Geometry as a drop\-in signal also improves transformer\-based denoisers on these benchmarks, indicating it complements rather than replaces attention\.

### 2.2 Geometric Deep Learning

Geometric deep learning incorporates geometric structure into neural networks (Bronstein et al., [2021](https://arxiv.org/html/2606.02607#bib.bib3)). GNNs pass messages on graphs (Kipf & Welling, [2017](https://arxiv.org/html/2606.02607#bib.bib10); Veličković et al., [2018](https://arxiv.org/html/2606.02607#bib.bib19)); positional encodings in transformers (Vaswani et al., [2017](https://arxiv.org/html/2606.02607#bib.bib18); Su et al., [2024](https://arxiv.org/html/2606.02607#bib.bib17)) demonstrate how geometric information guides attention.

A key insight is that explicit geometric structure accelerates learning and improves generalization\. However, geometric deep learning has focused on inherently structured data—graphs, point clouds, molecules\. Tabular data, despite meaningful column relationships, has not benefited from geometric approaches\. We bridge this gap by constructing geometric features from the implicit relational structure in tabular data\.

### 2.3 Position in the Literature

To our knowledge, no prior tabular generator provides explicit pairwise geometric supervision. CTGAN/TVAE (Xu et al., [2019](https://arxiv.org/html/2606.02607#bib.bib20)), TabDDPM (Kotelnikov et al., [2023](https://arxiv.org/html/2606.02607#bib.bib11)), and CoDi (Lee et al., [2023](https://arxiv.org/html/2606.02607#bib.bib12)) use MLP backbones with no explicit relational modeling; TabSyn (Zhang et al., [2024](https://arxiv.org/html/2606.02607#bib.bib21)) and TabDiff (Shi et al., [2025](https://arxiv.org/html/2606.02607#bib.bib16)) use transformer backbones that learn column relationships implicitly through attention. Our pairwise angle and length features, used as both inputs and auxiliary prediction targets, enable strong performance across architecturally diverse backbones, including a compact MLP that matches or exceeds transformer-based SOTA. The same supervision signal also improves transformer-based denoisers (Section [4.2](https://arxiv.org/html/2606.02607#S4.SS2)), indicating geometric supervision and attention are complementary rather than substitutable inductive biases for relational modeling.

![Refer to caption](https://arxiv.org/html/2606.02607v1/x1.png)

Figure 1: Geometric Intuition. Inter-column relationships are encoded as pairwise angles $\theta_{ij}=\arctan(v_{j}-v_{i})$ and lengths $\ell_{ij}=\frac{1}{2}\log(1+(v_{j}-v_{i})^{2})$, providing explicit relational targets. Diagrams (b) and (c) show sample rows from Adult and Default.

## 3 Method

### 3.1 Preliminaries

##### Diffusion Framework.

We adopt TabDiff’s (Shi et al., [2025](https://arxiv.org/html/2606.02607#bib.bib16)) diffusion framework exactly: EDM (Karras et al., [2022](https://arxiv.org/html/2606.02607#bib.bib8)) for continuous columns and masked diffusion (Austin et al., [2021](https://arxiv.org/html/2606.02607#bib.bib2)) for categoricals (with learnable per-column $k$). For continuous columns, the denoised output is:

$$D_{\theta}(\mathbf{x};\sigma)=c_{\text{skip}}(\sigma)\,\mathbf{x}+c_{\text{out}}(\sigma)\,F_{\theta}\big(c_{\text{in}}(\sigma)\,\mathbf{x};\sigma\big), \tag{1}$$

where $F_{\theta}$ is the raw network and the preconditioning coefficients are $c_{\text{in}}=1/\sqrt{\sigma^{2}+\sigma_{\text{data}}^{2}}$, $c_{\text{skip}}=\sigma_{\text{data}}^{2}/(\sigma^{2}+\sigma_{\text{data}}^{2})$, and $c_{\text{out}}=\sigma\cdot\sigma_{\text{data}}/\sqrt{\sigma^{2}+\sigma_{\text{data}}^{2}}$ (Karras et al., [2022](https://arxiv.org/html/2606.02607#bib.bib8)). We use learnable per-column $\rho$ for noise scheduling (Shi et al., [2025](https://arxiv.org/html/2606.02607#bib.bib16)). This deliberate choice isolates the contribution of our geometric features from diffusion modifications.
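As a minimal sketch of the preconditioning in Eq. (1), the following pure-Python functions compute the three coefficients and apply them to a noisy row. The function names, the list-based row representation, and the default `sigma_data=0.5` are illustrative assumptions; this section does not state the value of $\sigma_{\text{data}}$ used in the paper.

```python
import math


def edm_preconditioning(sigma, sigma_data=0.5):
    """Return (c_in, c_skip, c_out) for noise level sigma, as in Eq. (1).

    sigma_data=0.5 is a placeholder assumption, not a value from the paper.
    """
    denom = sigma ** 2 + sigma_data ** 2
    c_in = 1.0 / math.sqrt(denom)
    c_skip = sigma_data ** 2 / denom
    c_out = sigma * sigma_data / math.sqrt(denom)
    return c_in, c_skip, c_out


def denoise(x, sigma, raw_network, sigma_data=0.5):
    """Apply D(x; sigma) = c_skip * x + c_out * F(c_in * x; sigma) elementwise.

    raw_network is any callable taking (scaled_row, sigma) and returning a
    list of the same length; it stands in for the trained network F_theta.
    """
    c_in, c_skip, c_out = edm_preconditioning(sigma, sigma_data)
    f = raw_network([c_in * xi for xi in x], sigma)
    return [c_skip * xi + c_out * fi for xi, fi in zip(x, f)]
```

At zero noise the skip path dominates (`c_skip` equals 1 and `c_out` equals 0), so the denoiser returns its input unchanged regardless of the network output.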

##### Notation.

Consider a tabular dataset with $d_{\text{cont}}$ continuous columns and $d_{\text{cat}}$ categorical columns, with $d=d_{\text{cont}}+d_{\text{cat}}$ total columns. We denote normalized column values as $v\in[-1,1]^{d}$.

### 3.2 Geometric Feature Representation

The core contribution of our work is augmenting the diffusion model with explicit pairwise geometric features that capture inter\-column relationships\.

#### 3.2.1 Pairwise Angles

For each pair of columns $(i,j)$ with $i<j$, we compute:

$$\theta_{ij}=\arctan(v_{j}-v_{i}) \tag{2}$$

This angle captures the *directional relationship* between columns. It is bounded ($\theta_{ij}\in(-\frac{\pi}{2},\frac{\pi}{2})$) and antisymmetric ($\theta_{ji}=-\theta_{ij}$).

#### 3.2.2 Pairwise Lengths

We also compute the log-length for each pair:

$$\ell_{ij}=\frac{1}{2}\log(1+(v_{j}-v_{i})^{2}) \tag{3}$$

This captures the *magnitude* of the difference between columns, with the logarithm compressing large differences.

An otherwise\-identical raw\-difference ablation gives similar but slightly weaker aggregate performance \(Appendix[A\.18](https://arxiv.org/html/2606.02607#A1.SS18)\), indicating that pairwise supervision is the primary mechanism while the bounded arctan parameterization provides more stable targets\.
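The angle and length definitions in Eqs. (2) and (3) can be sketched for a single normalized row as follows. The function name `pairwise_geometry` and the flat list ordering over pairs $i<j$ are assumptions made for illustration; the paper's actual feature layout is not specified here.

```python
import math
from itertools import combinations


def pairwise_geometry(v):
    """Return (angles, lengths) over all column pairs i < j of a row v.

    v is a list of normalized column values, assumed to lie in [-1, 1].
    """
    angles, lengths = [], []
    for i, j in combinations(range(len(v)), 2):
        diff = v[j] - v[i]
        angles.append(math.atan(diff))                 # Eq. (2)
        lengths.append(0.5 * math.log1p(diff * diff))  # Eq. (3)
    return angles, lengths


row = [-1.0, 0.0, 0.5]
angles, lengths = pairwise_geometry(row)
# Pairs are visited in the order (0, 1), (0, 2), (1, 2).
```

Taking `math.tan` of an angle returns the signed difference $v_j - v_i$, whereas the length is unchanged when the difference flips sign, which is why the angle alone determines the pair's relative position.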

#### 3.2.3 Handling Mixed Types

To compute geometric features across both continuous and categorical columns, we map all columns to a unified normalized space $v\in[-1,1]^{d}$:

$$v_{\text{cont}}=2\cdot\text{QuantileTransform}(x_{\text{cont}})-1 \tag{4}$$

$$v_{\text{cat}}=2\cdot\frac{\text{index}}{\max(\text{cardinality}-1,\,1)}-1 \tag{5}$$

This enables geometric computation across all $\binom{d}{2}$ column pairs, regardless of type.

Representing categorical variables as continuous features has precedent in target encoding (Micci-Barreca, [2001](https://arxiv.org/html/2606.02607#bib.bib13)), entity embeddings (Guo & Berkhahn, [2016](https://arxiv.org/html/2606.02607#bib.bib6)), and feature tokenization (Gorishniy et al., [2021](https://arxiv.org/html/2606.02607#bib.bib5)). Our approach uses a fixed deterministic mapping to $[-1,1]$ rather than learned embeddings, enabling direct computation of pairwise geometric features across all column types. A side benefit of this fixed mapping: ordinal columns (education levels, Likert scales) gain ordered reinforcement for free, since the deterministic position assignment preserves rank order, and small prediction errors yield neighboring categories rather than distant ones. For non-ordinal categoricals, the fixed ordering introduces a mild bias in the direction of error but does not affect predictive accuracy in our experiments.
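The mapping in Eqs. (4) and (5) might look like the sketch below. The categorical function follows Eq. (5) directly. The continuous function is a simple rank-based stand-in for a quantile transform, used here only so the example is self-contained; it is an assumption and not the transformer the paper uses, and both function names are hypothetical.

```python
def normalize_categorical(index, cardinality):
    """Map a category index in {0, ..., cardinality - 1} to [-1, 1], Eq. (5)."""
    return 2.0 * index / max(cardinality - 1, 1) - 1.0


def normalize_continuous(column):
    """Map a continuous column to [-1, 1] via empirical ranks, cf. Eq. (4).

    Stand-in for a quantile transform: each value is replaced by its rank
    scaled to [0, 1], then shifted to [-1, 1]. Ties are broken by position,
    which a real quantile transformer would handle differently.
    """
    n = len(column)
    order = sorted(range(n), key=lambda k: column[k])
    out = [0.0] * n
    for rank, k in enumerate(order):
        quantile = rank / max(n - 1, 1)
        out[k] = 2.0 * quantile - 1.0
    return out


education_levels = [normalize_categorical(i, 3) for i in range(3)]
incomes = normalize_continuous([10.0, 3.0, 7.0])
```

Because category positions are evenly spaced in index order, an ordinal column keeps its rank order after the mapping, which is the side benefit described above.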

![Refer to caption](https://arxiv.org/html/2606.02607v1/x2.png)

Figure 2: GeometryAwareMLP Architecture. The model receives noised data and geometric features (angles $\boldsymbol{\theta}$, lengths $\boldsymbol{\ell}$) as input. After processing through residual blocks, geometric heads predict angles and lengths, which are concatenated with the hidden state before the denoising heads. This augmentation path (red) encourages the model to leverage geometric structure. It is critical to note that the length head is supervised during training, but detached during generation. The model only sees angles during sampling. The impact of this design choice is shown in Figure [3](https://arxiv.org/html/2606.02607#S4.F3).

### 3.3 Architecture

Our architecture, GeometryAwareMLP, extends a residual MLP with geometric inputs and auxiliary prediction heads (Figure [2](https://arxiv.org/html/2606.02607#S3.F2)).

#### 3.3.1 Input Representation

The model receives a concatenated input:

$$\mathbf{z}_{\text{input}}=[\mathbf{t}_{\text{emb}};\mathbf{x}_{\text{cont}};\mathbf{x}_{\text{cat}}^{\text{oh}};\boldsymbol{\theta}_{\text{input}};\boldsymbol{\ell}_{\text{input}}] \tag{6}$$

where $\mathbf{t}_{\text{emb}}\in\mathbb{R}^{128}$ is a sinusoidal time embedding processed through a 2-layer MLP, $\mathbf{x}_{\text{cont}}$ contains the noised continuous values, $\mathbf{x}_{\text{cat}}^{\text{oh}}$ is the one-hot encoded categorical values, and $\boldsymbol{\theta}_{\text{input}},\boldsymbol{\ell}_{\text{input}}\in\mathbb{R}^{d(d-1)/2}$ are geometric features computed from the current state.
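To make the quadratic growth of the geometric inputs concrete, the sketch below counts the width of the concatenated input in Eq. (6). It assumes the one-hot block has exactly one slot per category (no extra mask slot), which this section does not confirm; the function names are hypothetical.

```python
def num_pairs(d):
    """Number of column pairs i < j, i.e. d(d-1)/2."""
    return d * (d - 1) // 2


def input_width(d_cont, cardinalities, time_emb_dim=128):
    """Width of z_input in Eq. (6) under the stated assumptions.

    cardinalities lists the number of categories of each categorical column.
    """
    d = d_cont + len(cardinalities)
    one_hot_width = sum(cardinalities)
    geometric_width = 2 * num_pairs(d)  # angles plus lengths
    return time_emb_dim + d_cont + one_hot_width + geometric_width


small = input_width(d_cont=2, cardinalities=[3, 4])
pairs_at_50_columns = num_pairs(50)
```

Doubling the number of columns roughly quadruples the geometric block, which is the $O(d^{2})$ scaling referred to in the contributions.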

#### 3.3.2 Network Backbone

The input is processed through:

1. Input projection: 2-layer MLP mapping to hidden dimension $d_{\text{model}}$
2. Residual blocks: $n_{\text{blocks}}$ residual MLP blocks with expansion factor 4 and dropout 0.1:

   $$\text{ResidualBlock}(\mathbf{h})=\mathbf{h}+\text{MLP}(\text{LayerNorm}(\mathbf{h})) \tag{7}$$

#### 3.3.3 Output Heads

Geometric heads predict from $\mathbf{h}$:

$$\hat{\boldsymbol{\theta}}=\frac{\pi}{2}\cdot\tanh(\text{MLP}_{\theta}(\mathbf{h})) \tag{8}$$

$$\hat{\boldsymbol{\ell}}=\text{MLP}_{\ell}(\mathbf{h}) \tag{9}$$

where each MLP includes LayerNorm, a hidden layer with GELU, and a linear output.

Augmented representation: Predicted angles are concatenated with the hidden state:

$$\mathbf{h}_{\text{aug}}=[\mathbf{h};\hat{\boldsymbol{\theta}}] \tag{10}$$

We use angles only (not lengths) because angles encode strictly more information: $v_{j}-v_{i}=\tan(\theta_{ij})$ recovers both sign and magnitude, whereas lengths lose sign due to squaring. The length head provides auxiliary supervision to regularize the backbone, but its predictions are not used in $\mathbf{h}_{\text{aug}}$ to avoid redundant, potentially inconsistent signals.

Denoising heads predict from $\mathbf{h}_{\text{aug}}$:

$$F_{\text{cont}}=\text{MLP}_{\text{cont}}(\mathbf{h}_{\text{aug}}) \tag{11}$$

$$\text{logits}_{c}=\text{Linear}(\text{LayerNorm}(\mathbf{h}_{\text{aug}})) \tag{12}$$

This augmentation encourages the model to leverage geometric structure when generating outputs.

#### 3.3.4 Architecture Comparison

The MLP instantiation differs from TabDiff along four axes: column encoding (direct concatenation vs. learned embeddings), backbone architecture (residual MLP vs. transformer encoder–decoder), relationship modeling (explicit pairwise geometric features vs. self-attention), and parameter count ($\sim$400K–6M vs. $\sim$10M; $3.5\times$ fewer on average, up to $25\times$ for classification). The full component-level breakdown including shared diffusion machinery appears in Appendix [A.9](https://arxiv.org/html/2606.02607#A1.SS9), Table [14](https://arxiv.org/html/2606.02607#A1.T14).

#### 3.3.5 Cross-Architecture Variants

The geometric-input, prediction-head, and augmentation-path mechanisms described above transfer to the GNN and Transformer diffusion denoisers evaluated in Section [4.2](https://arxiv.org/html/2606.02607#S4.SS2). The GNN variant replaces the residual MLP with edge-conditioned message passing on a complete column graph, with geometric features supplied as edge attributes, and pools node embeddings before the prediction heads. The Transformer variant uses a column-wise transformer encoder, with geometric features projected and concatenated to the denoiser representation. In all three diffusion backbones, the geometric prediction heads attach to the final pre-output representation and the augmentation path $\mathbf{h}_{\text{aug}}=[\mathbf{h};\hat{\boldsymbol{\theta}}]$ feeds the denoising heads. Full architectural diagrams and per-backbone wiring details appear in Appendix [A.2](https://arxiv.org/html/2606.02607#A1.SS2).

#### 3.3.6 Architecture Hyperparameters

We use $d_{\text{model}}=256$ throughout, with $n_{\text{blocks}}=0$ for classification and $n_{\text{blocks}}=8$ for regression (interpretation in Section [4.5](https://arxiv.org/html/2606.02607#S4.SS5)).

### 3.4 Training Objective

Our loss combines standard diffusion objectives with geometric supervision:

$$\mathcal{L}=\mathcal{L}_{\text{cont}}+\mathcal{L}_{\text{cat}}+\mathcal{L}_{\text{angle}}+\mathcal{L}_{\text{length}}+\mathcal{L}_{\text{consistency}} \tag{13}$$

In training, each term is scaled by the corresponding weight in Eqs. (14) and (15).

#### 3.4.1 Diffusion Losses

$$\mathcal{L}_{\text{diffusion}}=\lambda_{\epsilon}\mathcal{L}_{\text{cont}}+\lambda_{\text{cat}}\mathcal{L}_{\text{cat}} \tag{14}$$

$\mathcal{L}_{\text{cont}}$ is EDM-weighted MSE (Karras et al., [2022](https://arxiv.org/html/2606.02607#bib.bib8)) on the denoised prediction. $\mathcal{L}_{\text{cat}}$ is cross-entropy for categorical columns (Austin et al., [2021](https://arxiv.org/html/2606.02607#bib.bib2)), applied only to masked tokens and weighted by the inverse of the masking probability.

#### 3.4.2 Geometric Losses

$$\mathcal{L}_{\text{geometric}}=\lambda_{\theta}\mathcal{L}_{\text{angle}}+\lambda_{\ell}\mathcal{L}_{\text{length}}+\lambda_{c}\mathcal{L}_{\text{consistency}} \tag{15}$$

Angle/Length prediction: Direct supervision on geometric predictions:

$$\mathcal{L}_{\text{angle}}=\|\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}_{\text{true}}\|^{2} \tag{16}$$

$$\mathcal{L}_{\text{length}}=\|\hat{\boldsymbol{\ell}}-\boldsymbol{\ell}_{\text{true}}\|^{2} \tag{17}$$

Consistency loss: Ensures predicted geometry matches model outputs:

$$\mathcal{L}_{\text{consistency}}=\mathbb{E}[(1-t)^{2}]\cdot\big(\lVert\hat{\boldsymbol{\theta}}-\text{sg}(\boldsymbol{\theta}_{\text{pred}})\rVert^{2}+\lVert\hat{\boldsymbol{\ell}}-\text{sg}(\boldsymbol{\ell}_{\text{pred}})\rVert^{2}\big) \tag{18}$$

where $\boldsymbol{\theta}_{\text{pred}},\boldsymbol{\ell}_{\text{pred}}$ are computed from the denoised output and $\text{sg}(\cdot)$ denotes stop-gradient. The $(1-t)^{2}$ weighting emphasizes consistency at low noise levels, allowing the model to prioritize enforcing consistency when it is most valuable.

#### 3.4.3 Loss Weights

We use $\lambda_{\epsilon}=1.0$, $\lambda_{\text{cat}}=0.05$, $\lambda_{\theta}=15.0$, $\lambda_{\ell}=15.0$, and $\lambda_{c}=8.0$ for all datasets.
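A rough per-sample sketch of the weighted geometric loss in Eqs. (15) to (18), using the default weights above, is given below. It operates on plain Python lists, so stop-gradient has no effect here; in an autodiff framework the `theta_pred` and `ell_pred` arguments would be detached. Treating the squared norm as a sum over pairs (rather than a mean) and treating $t$ as a normalized time in $[0,1]$ are assumptions, as are all function and dictionary names.

```python
DEFAULT_WEIGHTS = {"angle": 15.0, "length": 15.0, "consistency": 8.0}


def squared_error(pred, target):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def geometric_loss(theta_hat, ell_hat, theta_true, ell_true,
                   theta_pred, ell_pred, t, weights=DEFAULT_WEIGHTS):
    """Weighted geometric loss for one sample, following Eq. (15).

    theta_hat / ell_hat: outputs of the geometric heads.
    theta_true / ell_true: geometry of the clean row.
    theta_pred / ell_pred: geometry recomputed from the denoised output
        (these would be wrapped in stop-gradient during training).
    t: assumed normalized noise time in [0, 1].
    """
    angle = squared_error(theta_hat, theta_true)          # Eq. (16)
    length = squared_error(ell_hat, ell_true)             # Eq. (17)
    consistency = (1.0 - t) ** 2 * (                      # Eq. (18)
        squared_error(theta_hat, theta_pred)
        + squared_error(ell_hat, ell_pred)
    )
    return (weights["angle"] * angle
            + weights["length"] * length
            + weights["consistency"] * consistency)
```

With these weights a unit of angle or length error costs 15 times as much as a unit of continuous diffusion loss ($\lambda_{\epsilon}=1.0$), which is the inverted hierarchy the loss weight analysis discusses.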

##### Loss Weight Analysis.

Our weighting inverts the typical loss hierarchy: at convergence, the weighted geometric terms ($\lambda_{\theta}\mathcal{L}_{\text{angle}}+\lambda_{\ell}\mathcal{L}_{\text{length}}+\lambda_{c}\mathcal{L}_{\text{consistency}}$) account for approximately 95% of total loss, with diffusion losses contributing only 5%. The inversion is essential: reducing geometric weights toward parity with diffusion degrades performance, suggesting heavy auxiliary supervision forces the network to internalize inter-column structure that transfers to improved generation. By contrast, TabDiff uses $\lambda_{\text{cat}}\approx 1.0$ (with optional annealing) and no geometric supervision.

### 3.5 Training Details

We train with AdamW and EMA (hyperparameters in Table [17](https://arxiv.org/html/2606.02607#A1.T17), Appendix [A.12](https://arxiv.org/html/2606.02607#A1.SS12)) for 20,000 epochs (vs. TabDiff’s 8,000), with best model selected after epoch 10,000. GATD benefits from extended training while TabDiff does not (Appendix [A.10](https://arxiv.org/html/2606.02607#A1.SS10)). Despite $2.5\times$ more epochs, the complete training run is $1.7\times$ faster on average in wall-clock time; this is an end-to-end training-cost comparison. Sampling uses 1000 steps vs. TabDiff’s 50 (Appendix [A.11](https://arxiv.org/html/2606.02607#A1.SS11)).

### 3.6 Sampling

At inference, we use EDM sampling \(Euler method, 1000 steps\) for continuous columns and iterative unmasking for categorical columns\. Geometric features are computed during sampling but no geometric supervision is applied\. The 1000\-step setting is our high\-fidelity operating point; reduced\-step sampling preserves much of the gain \(Appendix[A\.13](https://arxiv.org/html/2606.02607#A1.SS13)\)\.

##### Boundary Handling.

During inverse transformation, generated continuous values are first mapped from the model’s $[-1,1]$ scale to the quantile-transformer’s $[0,1]$ scale. Let $s$ denote such a scaled value. Rather than hard clipping, we use reflection: if $s>1$ we map $s\mapsto 2-s$, and if $s<0$ we map $s\mapsto -s$, repeating up to 10 times before a final numerical clip to $[0,1]$. This avoids concentrating mass at the boundaries.
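The reflection rule can be sketched as below. The affine map from $[-1,1]$ to $[0,1]$ is taken as the inverse of Eq. (4), and the function names are hypothetical; the paper's implementation may differ in detail.

```python
def to_unit_scale(v):
    """Map a model-scale value from [-1, 1] to the [0, 1] quantile scale."""
    return (v + 1.0) / 2.0


def reflect_to_unit_interval(s, max_reflections=10):
    """Fold s back into [0, 1] by reflecting at the boundaries.

    Values above 1 are mapped to 2 - s and values below 0 to -s, repeated up
    to max_reflections times, followed by a final clip for safety.
    """
    for _ in range(max_reflections):
        if s > 1.0:
            s = 2.0 - s
        elif s < 0.0:
            s = -s
        else:
            break
    return min(max(s, 0.0), 1.0)


overshoot = reflect_to_unit_interval(to_unit_scale(1.4))  # 1.2 on the unit scale
```

A value that overshoots the upper boundary by 0.2 lands 0.2 inside it, whereas hard clipping would place every overshooting sample exactly at 1.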

Table 1:Dataset statistics\. We reclassify low\-cardinality integers as categorical: education\.num \(Adult\), Administrative, Informational, SpecialDay \(Shoppers\), and n\_tokens\_title, n\_non\_stop\_words, num\_keywords \(News\)\. We keep TabDiff’s column classification the same for benchmarking reproducibility\.

## 4 Experiments

### 4.1 Experimental Setup

##### Datasets.

We evaluate on seven datasets from the TabDiff benchmark and three additional datasets, spanning five binary classification and five regression tasks (Table [1](https://arxiv.org/html/2606.02607#S3.T1)).

##### Metrics.

Following TabDiff, we evaluate: (1) Shape: marginal distribution fidelity via SDMetrics (Patki et al., [2016](https://arxiv.org/html/2606.02607#bib.bib15)) ColumnShapeSimilarity; (2) Trend: correlation preservation via ColumnPairTrendsSimilarity; (3) MLE (machine learning efficacy): downstream utility using XGBoost (Chen & Guestrin, [2016](https://arxiv.org/html/2606.02607#bib.bib4)) (AUROC/F1 for classification, $R^{2}$/RMSE for regression).

##### Baselines.

Our primary evaluation compares +Geom against the same diffusion denoising architecture without geometric features or losses, across three backbones: a residual MLP (the original GATD), a GNN with Laplacian-eigenmap positional encoding (GNN+LE), and a column-wise Transformer. We additionally compare against TabDiff (Shi et al., [2025](https://arxiv.org/html/2606.02607#bib.bib16)), the prior state-of-the-art tabular diffusion model, on the MLP track to anchor absolute performance against published numbers (Section [4.4](https://arxiv.org/html/2606.02607#S4.SS4)). For the supervision-vs-capacity ablation, we further include an InputsOnly configuration defined in Section [4.3](https://arxiv.org/html/2606.02607#S4.SS3).

##### Protocol.

We extend the TabDiff protocol: 20,000 training epochs \(vs\. TabDiff’s 8,000\), best model after epoch 10,000 \(vs\. 4,000\), 3 train seeds, 20 generation seeds per train seed\.

### 4.2 Main Results: Cross-Architecture Evaluation

Two complementary pieces of evidence support our methodological claim. First, direct supervision is the operative variable: holding architecture, parameters, and gradient topology constant, removing only the geometric loss weights collapses performance to the no-geometry baseline (Section [4.3](https://arxiv.org/html/2606.02607#S4.SS3)). Second, the same supervision signal ports across architecturally diverse diffusion backbones: Table [2](https://arxiv.org/html/2606.02607#S4.T2) reports pairwise Shape and Trend error of +Geom vs. its same-architecture baseline across three diffusion denoising backbones, applied as a drop-in module without architecture-specific tuning. Per-dataset MLE-1 and MLE-2 (downstream-utility) results appear in Tables [12](https://arxiv.org/html/2606.02607#A1.T12) and [13](https://arxiv.org/html/2606.02607#A1.T13) (Appendix [A.7](https://arxiv.org/html/2606.02607#A1.SS7)); aggregate MLE behavior is summarized in the population-level test paragraph below.

##### Statistical results.

+Geom wins 27/30 Shape and 25/30 Trend cells across the MLP, GNN, and Transformer diffusion backbones (per-architecture Shape 9/8/10 and Trend 8/9/8). Treating each architecture-dataset metric cell as a Bernoulli win/loss under a 50% null win rate, the 52 wins out of 60 Shape/Trend cells give a two-sided exact sign-test value of $p=5.21\times 10^{-9}$. On downstream utility, +Geom improves aggregate AUROC/$R^{2}$ on all three diffusion backbones; F1/RMSE improves on two of three. These results support the central claim that geometric supervision is portable across diffusion denoising architectures, while leaving non-diffusion generative frameworks to future work.
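The sign-test value quoted above can be reproduced with a short exact binomial calculation. The sketch below assumes the standard two-sided construction for a symmetric null (twice the upper-tail probability, capped at 1); the function name is hypothetical.

```python
from math import comb


def two_sided_sign_test(wins, n):
    """Exact two-sided sign-test p-value under a 50% null win rate."""
    k = max(wins, n - wins)
    # Upper-tail probability P(X >= k) for X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


p = two_sided_sign_test(52, 60)  # 27 Shape wins + 25 Trend wins out of 60 cells
```

For 52 wins out of 60 this evaluates to roughly $5.2\times 10^{-9}$, in line with the reported value. Note that the test treats the 60 cells as independent, although Shape and Trend cells for the same architecture and dataset share trained models.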

Table 2: Cross-architecture fidelity: pairwise Shape and Trend error of +Geom vs. its same-architecture baseline across three diffusion denoising backbones (MLP, GNN+LE, Transformer); 3 training seeds and 20 generation seeds per cell. Bold = better cell. $\Delta$ rows report relative improvement ($\uparrow$ better; $\downarrow$ worse). +Geom wins 27/30 Shape cells and 25/30 Trend cells.

### 4.3 Architecture-Matched Capacity Control: Inputs vs Supervision

The cross-architecture results could reflect added inputs/heads rather than supervision, so we compare three MLP configurations: NoGeom (no geometric inputs, heads, or losses), InputsOnly (full +Geom architecture and augmentation path, but $\lambda_{\theta} = \lambda_{\ell} = \lambda_{c} = 0$), and Full +Geom. InputsOnly and Full share parameter count, forward compute, and gradient topology; only direct geometric supervision differs. InputsOnly achieves lower total loss, yet its geometric losses remain 8–339$\times$ higher (Figure [8](https://arxiv.org/html/2606.02607#A1.F8)); aggregate Shape errors are 1.229/1.279/0.862 for NoGeom/InputsOnly/Full. Pooled effects are $d(\text{NoGeom} \to \text{InputsOnly}) = -0.08$, $d(\text{NoGeom} \to \text{Full}) = +0.79$, and $d(\text{InputsOnly} \to \text{Full}) = +0.81$. Thus supervision, not added capacity or augmentation path, drives the gain; Figure [3](https://arxiv.org/html/2606.02607#S4.F3) shows the corresponding cliff.

![Refer to caption](https://arxiv.org/html/2606.02607v1/x3.png) Figure 3: Loss-ablation distributions by method. Each panel plots 10 dataset values across 6 methods using KDE violins, median bars, and colored dataset dots tracked across positions. The shaded band marks geometric supervision: medians drop sharply from Inputs Only to supervised methods for Shape (1.07 $\to$ 0.85) and Trend (1.93 $\to$ 1.40), while similar supervised violins show Angle Only / Angle + Length / Full have small aggregate fidelity differences. MLE-1 isolates News (yellow) as the main consistency-loss beneficiary, falling from $\sim$50–57pp to 30–40pp; MLE-2 remains similar, matching saturated downstream utility. Length-only provides the best News utility at the cost of fidelity, pulling the challenging dataset’s MLE-1 score into the visible range $\sim$15–20pp.

Table 3: MLP+Geom vs. TabDiff summary. Errors and gaps are averaged over 10 datasets; lower is better. Full per-dataset values appear in Appendix [A.5](https://arxiv.org/html/2606.02607#A1.SS5).

### 4.4 State-of-the-Art Comparison: MLP+Geom vs TabDiff

As a corollary of the cross-architecture portability claim, we anchor the MLP instantiation of GATD against the prior state-of-the-art tabular diffusion model, TabDiff (Shi et al., [2025](https://arxiv.org/html/2606.02607#bib.bib16)). TabDiff results are reproduced from the official TabDiff codebase under its published 8,000-epoch protocol with three training seeds and twenty generation seeds per dataset. Two protocol differences could in principle confound the comparison; both are controlled in the appendix. Training-epoch budget: a matched 8,000-epoch run of GATD-MLP retains the gain (27/40 wins; 0.854 Shape vs. TabDiff-8k’s 1.187, a 28% reduction, vs. 27% at 20k; Appendix [A.15](https://arxiv.org/html/2606.02607#A1.SS15)). Sampling-step budget: at 50 steps GATD already beats TabDiff on 3 of 4 aggregate metrics, and wins 4 of 4 at 100 steps (Figure [9](https://arxiv.org/html/2606.02607#A1.F9), Appendix [A.13](https://arxiv.org/html/2606.02607#A1.SS13)).

##### Per\-dataset wins\.

MLP+Geom wins 8/10 on Shape, 7/10 on Trend, 6/10 on AUROC/$R^2$ (MLE-1), and 9/10 on F1/RMSE (MLE-2). Only MLE-2 reaches 10-cell sign-test significance ($p = 0.022$); Shape ($p = 0.11$), Trend ($p = 0.34$), and MLE-1 ($p = 0.75$) are directionally favorable but underpowered at $n = 10$.

##### Aggregate gap\-to\-real closure\.

Using train-on-real, test-on-real (TRTR) performance as the upper bound (Real rows of Tables [12](https://arxiv.org/html/2606.02607#A1.T12) and [13](https://arxiv.org/html/2606.02607#A1.T13)), MLP+Geom closes 58.2% of TabDiff’s aggregate AUROC/$R^2$ gap to real ($11.4\% \to 4.8\%$) and 42.0% of its F1/RMSE gap ($9.7\% \to 5.6\%$); per-dataset Shape and Trend errors are reduced by 27.3% and 19.6% on average. The largest wins (63.1pp on News for AUROC/$R^2$, 15.0pp on Beijing for F1/RMSE) substantially exceed the largest losses (0.2pp on Adult, 0.5pp on California).
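The closure percentages can be read as the fraction of the baseline's gap to real that the method removes. The sketch below assumes that definition, which the text implies but does not state; `gap_closure` is an illustrative name. Applied to the rounded gaps quoted above, it lands near, but not exactly on, the reported 58.2% and 42.0%, presumably because the reported figures were computed from unrounded gaps.

```python
def gap_closure(baseline_gap: float, method_gap: float) -> float:
    """Fraction of the baseline's gap-to-real removed by the method."""
    return (baseline_gap - method_gap) / baseline_gap


# Rounded aggregate gaps quoted in the text (percent gap to real).
mle1 = gap_closure(11.4, 4.8)  # AUROC/R^2
mle2 = gap_closure(9.7, 5.6)   # F1/RMSE
print(f"MLE-1 closure: {mle1:.1%}")
print(f"MLE-2 closure: {mle2:.1%}")
```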

Table 4: Protocol-control summary for MLP+Geom vs. TabDiff. Lower is better. MLE-1 and MLE-2 report gap to real-data performance. The 8k row matches TabDiff’s training budget; the 50/100-step rows test the sampling-budget concern. Step rows use the Appendix [A.13](https://arxiv.org/html/2606.02607#A1.SS13) protocol.

##### Calibrating the magnitude of improvement.

TabDiff’s reported gain over the prior SOTA (TabSyn) was a 13% Shape and 23% Trend reduction on 7 datasets with a single seed (Shi et al., [2025](https://arxiv.org/html/2606.02607#bib.bib16)); GATD-MLP’s 27.3%/19.6% reductions over TabDiff are on 10 datasets with $3 \times 20$ seeds. The cross-architecture evidence adds two additional diffusion backbones reaching SOTA-competitive Shape/Trend, neither previously demonstrated here as a drop-in tabular diffusion denoiser. The two Shape cells where MLP+Geom under-performs TabDiff differ greatly in magnitude: the Magic Shape gap (0.4 percentage points) is well within one standard deviation (Table [2](https://arxiv.org/html/2606.02607#S4.T2), $\sigma \approx 0.08$) and is a tie within noise; the California Shape gap (22.8 percentage points) is a real loss. The 3-seed protocol additionally surfaces substantial training-seed instability in TabDiff that single-seed reporting would obscure (Table [5](https://arxiv.org/html/2606.02607#S4.T5)); each +Geom backbone is more reproducible than TabDiff on the majority of datasets.

### 4.5 Analysis

##### Architecture\.

Classification in MLP architectures favors $n_{\text{blocks}} = 0$ (10/10 fidelity wins) while regression favors $n_{\text{blocks}} = 8$ (10/10 MLE wins), both at $d_{\text{model}} = 256$ (Appendix [A.3](https://arxiv.org/html/2606.02607#A1.SS3)). California and Powerplant (both 0 categorical columns) prefer 0 blocks even as regression tasks.

##### When Does Geometry Help Most?

On the MLP backbone, per-dataset Shape improvement is rank-correlated with categorical-column count ($\rho = 0.70$, $p = 0.025$); this is the empirical basis for the “categorical anchor” interpretation in the original GATD analysis. The same correlation is weaker and individually non-significant for the GNN and Transformer backbones ($\rho \in [0.38, 0.42]$, $p \geq 0.22$); pooled across the three diffusion backbones it reaches $\rho = 0.51$, $p = 0.004$. Continuous-only datasets such as Powerplant ($d = 5$, 0 cat. cols.) nevertheless show gains under geometric supervision, demonstrating that categorical structure is one operating regime rather than a necessary condition. Per-architecture correlations and the categorical-fraction control analysis are reported in Appendix [A.4](https://arxiv.org/html/2606.02607#A1.SS4).
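For readers who want to recompute a rank correlation of this kind from the per-dataset tables, the following is a minimal sketch of Spearman's $\rho$ with average ranks for ties (ties matter here because several datasets share a categorical-column count of 0). The helper names and the toy inputs are hypothetical; the numbers below are not values from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy inputs for illustration only (NOT from the paper).
cat_cols = [0, 0, 1, 2, 8, 9, 10, 12, 14, 26]
shape_gain = [0.05, 0.10, 0.02, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35, 0.50]
rho = spearman_rho(cat_cols, shape_gain)
print(f"rho = {rho:.2f}")
```

The sketch returns only the coefficient; the p-values quoted in the text would additionally require a permutation test or a t-approximation.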

##### Computational Profile\.

Geometric features add $O(d^2)$ tensor operations per row. In our benchmark regime ($d \leq 48$), this cost is dominated by backbone forward/backward passes, and the all-pairs structure is of lower or comparable order than column-wise self-attention. Combined with shallow networks for classification ($n_{\text{blocks}} = 0$), GATD-MLP’s end-to-end training run is 1.7$\times$ faster on average than TabDiff despite 2.5$\times$ more epochs; this is a total wall-clock training comparison, not an epoch/second throughput claim (Appendix [A.11](https://arxiv.org/html/2606.02607#A1.SS11)). The pairwise computation is already implemented as batched, GPU-accelerated tensor operations. On APS ($d = 171$), the profile becomes more memory-bound, but GATD remains faster per epoch and both models run on a single T4 (Appendix [A.14](https://arxiv.org/html/2606.02607#A1.SS14)).
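To make the quadratic growth concrete, here is a minimal per-row sketch of the pairwise features. The functional forms are assumptions made for illustration: the angle is taken as the arctangent of the column-value difference and the length as a log of one plus its absolute value. The paper's exact definitions are given in its method section and may differ. The loop form is for readability; the paper describes a batched tensor implementation.

```python
import math
from itertools import combinations


def pairwise_geometry(row):
    """Angles and log-lengths for every unordered column pair of one row.

    ASSUMED forms (hypothetical, for illustration only):
      theta_ij = atan(x_j - x_i)       bounded in (-pi/2, pi/2), antisymmetric
      ell_ij   = log(1 + |x_j - x_i|)  symmetric log-length
    """
    angles, lengths = [], []
    for i, j in combinations(range(len(row)), 2):
        diff = row[j] - row[i]
        angles.append(math.atan(diff))
        lengths.append(math.log1p(abs(diff)))
    return angles, lengths


row = [0.2, -1.0, 3.5, 0.0]  # one numeric row with d = 4 columns
angles, lengths = pairwise_geometry(row)
print(len(angles))  # 6, i.e. d * (d - 1) / 2 pairs
```

If unordered pairs are used as in this sketch, $d = 48$ gives 1,128 pairs per row and $d = 171$ gives 14,535, which is consistent with the profile becoming more memory-bound on wide tables.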

Table 5: Training-seed stability of +Geom backbones relative to TabDiff. Each cell reports the median across-train-seed CV ratio (TabDiff / method) over 10 datasets, with the count of datasets where the method’s std is smaller than TabDiff’s in parentheses. CV ratios $> 1$ and high parenthetical counts indicate the method is more stable than TabDiff.
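One way to read a Table 5 cell is sketched below. It assumes CV is the sample standard deviation across training seeds divided by the absolute mean; the caption does not say whether sample or population std is used. The function names and the inputs are hypothetical and are not values from the paper.

```python
from statistics import mean, median, stdev


def cv(values):
    """Coefficient of variation across training seeds (sample std / |mean|)."""
    return stdev(values) / abs(mean(values))


def stability_cell(tabdiff_runs, method_runs):
    """Return (median CV ratio TabDiff/method, #datasets where method std is smaller)."""
    pairs = list(zip(tabdiff_runs, method_runs))
    ratios = [cv(t) / cv(m) for t, m in pairs]
    more_stable = sum(stdev(m) < stdev(t) for t, m in pairs)
    return median(ratios), more_stable


# Hypothetical per-dataset errors over 3 training seeds (NOT from the paper).
tabdiff = [[1.0, 1.2, 0.8], [2.0, 2.1, 1.9]]
method = [[1.0, 1.05, 0.95], [2.0, 2.2, 1.8]]
ratio, count = stability_cell(tabdiff, method)
print(ratio, count)
```

A ratio above 1 means TabDiff's relative spread across seeds is larger than the method's on the median dataset.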

## 5 Conclusion

We introduced Geometry-Aware Tabular Diffusion (GATD), a portable relational inductive bias for tabular diffusion: explicit pairwise geometric features as both inputs and auxiliary prediction targets, evaluated as a drop-in module across MLP, GNN, and Transformer diffusion denoisers. Population-level statistical evidence appears in Section [4.2](https://arxiv.org/html/2606.02607#S4.SS2).

The architecture-matched ablation localizes the gain to direct supervision of the geometric heads (Section [4.3](https://arxiv.org/html/2606.02607#S4.SS3)); the MLP instantiation matches or exceeds TabDiff without an attention mechanism (Section [4.4](https://arxiv.org/html/2606.02607#S4.SS4)). Cross-architecture results (Section [4.2](https://arxiv.org/html/2606.02607#S4.SS2)) show that the same signal also improves transformer- and message-passing-based diffusion denoisers, indicating that geometric supervision, attention, and message passing are complementary rather than mutually exclusive inductive biases.

##### Limitations\.

$O(d^2)$ pairwise scaling. The geometric signal is quadratic in column count because it intentionally supervises all column pairs, including weak or uncorrelated pairs that provide negative relational evidence. This is a practical memory consideration rather than a methodological barrier: column-wise self-attention is also $O(d^2)$ in the number of columns, and in our evaluated regime ($d \leq 48$) the cost is dominated by backbone compute. Wider-table behavior, characterized on APS Failure at Scania Trucks ($d = 171$; Appendix [A.14](https://arxiv.org/html/2606.02607#A1.SS14)), shows the tradeoff directly: GATD’s memory and parameter count exceed TabDiff’s at this scale, but both models fit on a single T4 and minimal loss-weight tuning recovers 3 of 4 wins.

Smaller downstream-utility effects. F1/RMSE shows no reliable population-level cross-architecture signal; AUROC/$R^2$ reaches significance only under the magnitude-weighted Wilcoxon test. Some continuous-heavy regression cells with GNN or Transformer backbones regress in downstream utility, indicating that geometric supervision may require tuning when paired with relational backbones on continuous-heavy data.

Empirical portability scope. We evaluate three denoising backbones within the diffusion framework. Because the geometric features and losses are computed from data rather than from the diffusion mechanism itself, extending them to non-diffusion paradigms such as autoregressive models, VAEs, or normalizing flows is a plausible direction, but it is not directly supported by the experiments in this paper.

Heuristic loss weights and operating-point choices. The defaults $\lambda_{\theta} = \lambda_{\ell} = 15$, $\lambda_{c} = 8$ are heuristic; per-dataset tuning unlocks substantial gains on specific datasets (Appendix [A.17](https://arxiv.org/html/2606.02607#A1.SS17), Figure [10](https://arxiv.org/html/2606.02607#A1.F10), Appendix [A.16](https://arxiv.org/html/2606.02607#A1.SS16)). The 1000-step sampling budget is an operating point, not a requirement (Appendix [A.13](https://arxiv.org/html/2606.02607#A1.SS13)).

##### Future Work\.

Promising directions include: (1) higher-order geometric supervision, such as triangle-closure losses on column triples; (2) learnable geometric transformations beyond fixed arctan/log-length; (3) extension to conditional generation and non-diffusion generative paradigms, evaluated separately from the present diffusion study; (4) integration with differentially-private training, where auxiliary supervision may interact with gradient clipping and noise injection; (5) Bayesian formulations for learning angle and length priors, including radial/directional posterior decompositions over magnitude and orientation (Oh et al., [2020](https://arxiv.org/html/2606.02607#bib.bib14)); and (6) per-architecture loss-weight tuning.

## Impact Statement

This paper presents work on synthetic tabular data generation. While synthetic data is often motivated by privacy concerns, our method, like most generative models, provides no formal privacy guarantees. Organizations should not treat synthetic data as inherently privacy-preserving or as a substitute for rigorous techniques such as differential privacy. There is a risk that synthetic data generation could be used for “privacy-washing”: presenting a privacy-conscious image to stakeholders without providing meaningful protection. We encourage practitioners to pair synthetic data generation methods with formal, mathematical privacy frameworks and training protocols like DPSGD (Abadi et al., [2016](https://arxiv.org/html/2606.02607#bib.bib1)) when handling sensitive information.

## Acknowledgements

The author thanks the ICML reviewers and area chair for constructive feedback that improved the experimental controls, scalability discussion, and presentation\.

## References

- Abadi et al. (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In *Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security*, pp. 308–318, 2016.
- Austin et al. (2021) Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In *Advances in Neural Information Processing Systems*, volume 34, pp. 17981–17993, 2021.
- Bronstein et al. (2021) Bronstein, M. M., Bruna, J., Cohen, T., and Veličković, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. *arXiv preprint arXiv:2104.13478*, 2021.
- Chen & Guestrin (2016) Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, pp. 785–794. ACM, 2016. doi:10.1145/2939672.2939785.
- Gorishniy et al. (2021) Gorishniy, Y., Rubachev, I., Khrulkov, V., and Babenko, A. Revisiting deep learning models for tabular data. In *Advances in Neural Information Processing Systems*, volume 34, pp. 18932–18943, 2021.
- Guo & Berkhahn (2016) Guo, C. and Berkhahn, F. Entity embeddings of categorical variables. *arXiv preprint arXiv:1604.06737*, 2016.
- Hoogeboom et al. (2021) Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. In *Advances in Neural Information Processing Systems*, volume 34, pp. 12454–12465, 2021.
- Karras et al. (2022) Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. In *Advances in Neural Information Processing Systems*, volume 35, 2022.
- Kim et al. (2023) Kim, J., Lee, C., and Park, N. STaSy: Score-based tabular data synthesis. In *International Conference on Learning Representations*, 2023.
- Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In *International Conference on Learning Representations*, 2017.
- Kotelnikov et al. (2023) Kotelnikov, A., Baranchuk, D., Rubachev, I., and Babenko, A. TabDDPM: Modelling tabular data with diffusion models. In *International Conference on Machine Learning*, volume 202, pp. 17564–17579. PMLR, 2023.
- Lee et al. (2023) Lee, C., Kim, J., and Park, N. CoDi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis. In *International Conference on Machine Learning*, volume 202, pp. 18940–18956. PMLR, 2023.
- Micci-Barreca (2001) Micci-Barreca, D. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. *ACM SIGKDD Explorations Newsletter*, 3(1):27–32, 2001.
- Oh et al. (2020) Oh, C., Adamczewski, K., and Park, M. Radial and directional posteriors for Bayesian deep learning. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 34, pp. 5298–5305, 2020. doi:10.1609/aaai.v34i04.5976.
- Patki et al. (2016) Patki, N., Wedge, R., and Veeramachaneni, K. The synthetic data vault. In *IEEE International Conference on Data Science and Advanced Analytics*, pp. 399–410, 2016.
- Shi et al. (2025) Shi, J., Xu, M., Hua, H., Zhang, H., Ermon, S., and Leskovec, J. TabDiff: a mixed-type diffusion model for tabular data generation. In *International Conference on Learning Representations*, 2025. arXiv:2410.20626.
- Su et al. (2024) Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. *Neurocomputing*, 568:127063, 2024. doi:10.1016/j.neucom.2023.127063.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In *Advances in Neural Information Processing Systems*, volume 30, 2017.
- Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In *International Conference on Learning Representations*, 2018.
- Xu et al. (2019) Xu, L., Skoularidou, M., Cuesta-Infante, A., and Veeramachaneni, K. Modeling tabular data using conditional GAN. In *Advances in Neural Information Processing Systems*, volume 32, 2019.
- Zhang et al. (2024) Zhang, H., Zhang, J., Srinivasan, B., Shen, Z., Qin, X., Faloutsos, C., Rangwala, H., and Karypis, G. Mixed-type tabular data synthesis with score-based diffusion in latent space. In *International Conference on Learning Representations*, 2024.

## Appendix A Extended Results

### A.1 Full Loss Ablation

Table [11](https://arxiv.org/html/2606.02607#A1.T11) presents the complete loss ablation across the 10 benchmark datasets for Shape, Trend, and MLE metrics for the MLP architecture.

### A.2 Cross-Architecture Wiring Diagrams

Section [3.3.5](https://arxiv.org/html/2606.02607#S3.SS3.SSS5) describes how the geometric-input, prediction-head, and augmentation-path mechanisms transfer across the three diffusion denoising backbones evaluated in Section [4.2](https://arxiv.org/html/2606.02607#S4.SS2). Figure [2](https://arxiv.org/html/2606.02607#S3.F2) (main text) shows the Diffusion-MLP variant. This appendix provides the corresponding wiring diagrams for the GNN and Transformer variants and consolidates the behavioral details that apply to all three.

##### Shared mechanisms across all three diffusion backbones.

Three mechanisms are common to every +Geom instantiation, regardless of backbone family:

*Geometric inputs are computed from the current state at every forward pass.* At each training step the inputs are derived from the noised data; at each sampling step they are derived from the partially-denoised current state. The input pathway is therefore active throughout reverse diffusion, even though no geometric *supervision* is applied at sampling time (Section [3.6](https://arxiv.org/html/2606.02607#S3.SS6)).

*Angle head and length head play asymmetric roles.* Both heads attach to the final pre-output representation $\mathbf{h}$ and are supervised during training by $\mathcal{L}_{\text{angle}}$ and $\mathcal{L}_{\text{length}}$. Only the bounded angle prediction $\hat{\boldsymbol{\theta}} = (\pi/2)\tanh(\cdot)$ enters the augmentation path $\mathbf{h}_{\text{aug}} = [\mathbf{h}; \hat{\boldsymbol{\theta}}]$ that feeds the denoising heads (Section [3.3](https://arxiv.org/html/2606.02607#S3.SS3), “Augmented representation”). The length head provides auxiliary supervision only: its forward output is computed at every step but never consumed downstream. Consequently the angle head is essential at inference (the denoising heads cannot run without $\mathbf{h}_{\text{aug}}$), whereas the length head is effectively training-only at the architectural level.

*The denoising heads ($\mathbf{F}_{\text{cont}}$ and per-column categorical logits) always read from $\mathbf{h}_{\text{aug}}$, never from $\mathbf{h}$ directly.* This is what makes geometric supervision an inductive bias on the denoising path rather than a side-task: the gradient from the denoising loss flows back through $\hat{\boldsymbol{\theta}}$ into the angle head and from there into the backbone.
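The routing described by these three mechanisms can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: each head is reduced to a single dense layer, the weights are toy values, and the names `linear` and `forward_heads` are hypothetical.

```python
import math


def linear(weights, bias, x):
    """Plain dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def forward_heads(h, angle_head, length_head, denoise_head):
    """Route h through the angle head, then feed [h; theta_hat] to the denoiser."""
    # Bounded angle prediction: (pi/2) * tanh(.), as stated in the text.
    theta_hat = [(math.pi / 2) * math.tanh(z) for z in linear(*angle_head, h)]
    # Length head: supervised during training, never consumed below.
    ell_hat = linear(*length_head, h)
    # Augmentation path: list concatenation plays the role of [h; theta_hat].
    h_aug = h + theta_hat
    x_hat = linear(*denoise_head, h_aug)
    return x_hat, theta_hat, ell_hat


# Toy sizes (hypothetical): dim(h) = 2, one column pair, one denoised output.
h = [0.3, 0.1]
angle_head = ([[1.0, -1.0]], [0.0])
length_head = ([[0.5, 0.5]], [0.0])
denoise_head = ([[1.0, 1.0, 1.0]], [0.0])
x_hat, theta_hat, ell_hat = forward_heads(h, angle_head, length_head, denoise_head)
```

Because `x_hat` depends on `theta_hat`, a denoising loss on `x_hat` sends gradient into the angle head, whereas `ell_hat` influences training only through its own auxiliary loss.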

In the figures that follow, the augmentation path is highlighted in red and the standard \(denoising\-only\) path in grey\.

![Refer to caption](https://arxiv.org/html/2606.02607v1/x4.png) Figure 4: GNN backbone, no-geometry baseline. Standard graph denoiser: node embeddings (per-column projection + column-position embedding + time embedding) feed $n_{\text{rounds}}$ rounds of message passing with data-independent learned pair embeddings ($n_{\text{pairs}} \times 4$) as edge attributes; mean-pooled node embeddings feed the denoising heads.

#### A.2.1 Diffusion-MLP Variant

Figure [2](https://arxiv.org/html/2606.02607#S3.F2) (main text) describes the Diffusion-MLP variant. The +Geom instantiation concatenates the time embedding, noised continuous values, one-hot categoricals, pairwise input angles, and pairwise input lengths into a single flat vector that feeds an input-projection MLP and a stack of $n_{\text{blocks}}$ residual blocks. The NoGeom baseline removes the angle and length input slices, the angle head, the length head, and the augmentation path; the denoising heads then read from $\mathbf{h}$ directly.

#### A.2.2 GNN Variant

![Refer to caption](https://arxiv.org/html/2606.02607v1/x5.png) Figure 5: GNN+Geom architecture. Four-dimensional geometric edge features (angles $\theta_{ij}$, log-lengths $\ell_{ij}$, row-normalized endpoint values $\tilde{x}_{i}, \tilde{x}_{j}$) replace the data-independent learned pair embeddings of the baseline (Figure [4](https://arxiv.org/html/2606.02607#A1.F4)); they pass through Edge LayerNorm into the GNN layers. The augmentation path (red) routes the predicted angle $\hat{\boldsymbol{\theta}}$ through the concat operation $[\mathbf{h}; \hat{\boldsymbol{\theta}}] = \mathbf{h}_{\text{aug}}$ feeding the denoising heads. The length head provides auxiliary training supervision only and is not consumed at inference.

The GNN backbone replaces the residual MLP with edge-conditioned message passing on a complete column graph. Nodes correspond to columns and carry a per-column projection of the noised value (continuous: a learned scalar embedding; categorical: a learned embedding of the one-hot vector) summed with a learned column-position embedding and the time embedding broadcast across nodes. The graph is fully connected and a stack of $n_{\text{rounds}}$ message-passing layers (each: edge MLP $\to$ scatter-add aggregation $\to$ node MLP $\to$ residual + LayerNorm) propagates information; the angle direction is sign-flipped for the reverse-direction message to respect the antisymmetry $\theta_{ji} = -\theta_{ij}$. After message passing, node embeddings are mean-pooled into a graph-level vector $\mathbf{h}$ that feeds the prediction heads under the shared scheme described above.

The two variants differ in their edge attributes. The NoGeom baseline (Figure [4](https://arxiv.org/html/2606.02607#A1.F4)) supplies a learned, data-independent embedding per pair, of dimension $n_{\text{pairs}} \times 4$, a parameter-matched stand-in for the geometric features that lets the GNN distinguish pairs but provides no information about their current values. The +Geom variant (Figure [5](https://arxiv.org/html/2606.02607#A1.F5)) replaces these learned embeddings with four data-dependent edge features per pair: the pairwise angle $\theta_{ij}$, the pairwise log-length $\ell_{ij}$, and the row-normalized values $\tilde{x}_{i}, \tilde{x}_{j}$ of the two endpoints (centered and scaled by the per-row mean and standard deviation across columns). The four-dimensional edge tensor passes through a LayerNorm before entering the message-passing stack, and the angle and length prediction heads attach to $\mathbf{h}$ in the standard pattern.
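The four data-dependent edge features can be illustrated for a single row as follows. The sketch makes several assumptions that the text does not fix: the angle is taken as the arctangent of the value difference, the log-length as the log of one plus its absolute value, the row normalization uses the population standard deviation with a small `eps`, and the row is already fully numeric. The function name `edge_features` is hypothetical.

```python
import math
from statistics import mean, pstdev


def edge_features(row, eps=1e-8):
    """Directed edge features (theta_ij, ell_ij, x~_i, x~_j) for one row.

    ASSUMED forms (hypothetical): theta_ij = atan(x_j - x_i),
    ell_ij = log(1 + |x_j - x_i|), population std for row normalization.
    """
    mu, sigma = mean(row), pstdev(row)
    x_tilde = [(v - mu) / (sigma + eps) for v in row]
    edges = {}
    for i in range(len(row)):
        for j in range(len(row)):
            if i == j:
                continue  # complete graph without self-loops
            diff = row[j] - row[i]
            edges[(i, j)] = (
                math.atan(diff),        # antisymmetric: flips sign for (j, i)
                math.log1p(abs(diff)),  # symmetric in i and j
                x_tilde[i],
                x_tilde[j],
            )
    return edges


e = edge_features([1.0, 2.0, 4.0])
print(e[(0, 1)][0], e[(1, 0)][0])  # equal magnitude, opposite sign
```

Under these assumptions the reverse-direction sign flip falls out of the feature definition itself: the edge (j, i) carries the negated angle and the same log-length as (i, j).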

![Refer to caption](https://arxiv.org/html/2606.02607v1/x6.png) Figure 6: Transformer backbone, no-geometry baseline. Time embedding, noised continuous values, and one-hot categoricals concatenate into a flat vector that the input-projection MLP maps to $\mathbf{h}_{0}$; a linear expansion produces $n_{\text{tokens}}$ latent tokens with learned positional embeddings, processed by a pre-norm Transformer encoder of depth $n_{\text{blocks}}$. A linear contraction flattens the encoded tokens back to a single $d_{\text{model}}$ vector that feeds the denoising heads.

![Refer to caption](https://arxiv.org/html/2606.02607v1/x7.png) Figure 7: Transformer+Geom architecture. Pairwise input angles $\theta_{ij}$ and lengths $\ell_{ij}$ are concatenated alongside the standard inputs before the input-projection MLP; the latent-token Transformer backbone is unchanged from the baseline (Figure [6](https://arxiv.org/html/2606.02607#A1.F6)). The augmentation path (red) routes the predicted angle $\hat{\boldsymbol{\theta}}$ through the concat operation $[\mathbf{h}; \hat{\boldsymbol{\theta}}] = \mathbf{h}_{\text{aug}}$ feeding the denoising heads. The length head provides auxiliary training supervision only.

#### A.2.3 Transformer Variant

The Transformer variant uses the same flat input vector as the Diffusion-MLP: time embedding, noised continuous values, one-hot categoricals, and (under +Geom) pairwise input angles and lengths are concatenated into a single tensor that passes through the shared input-projection MLP to produce a $d_{\text{model}}$-dimensional vector $\mathbf{h}_{0}$. A linear map then expands $\mathbf{h}_{0}$ into a sequence of $n_{\text{tokens}}$ latent tokens of dimension $d_{\text{model}}$, a learned positional embedding is added, and a pre-norm Transformer encoder of depth $n_{\text{blocks}}$ with $n_{\text{heads}}$ attention heads processes the token sequence. A final linear map flattens the encoded tokens back to a single $d_{\text{model}}$ vector $\mathbf{h}$ that feeds the prediction heads under the shared scheme. Note that the tokens are not column tokens: they are a fixed-length latent set ($n_{\text{tokens}} = 4$ by default) that the encoder uses as a working sequence, with all column information already mixed into $\mathbf{h}_{0}$ before tokenization.

The NoGeom baseline (Figure [6](https://arxiv.org/html/2606.02607#A1.F6)) removes the angle and length input slices, the angle head, the length head, and the augmentation path. The +Geom variant (Figure [7](https://arxiv.org/html/2606.02607#A1.F7)) restores all four and routes the predicted angle into the augmentation path in the standard pattern.

### A.3 Architecture Ablation: Effect of Residual Blocks

Table [6](https://arxiv.org/html/2606.02607#A1.T6) compares 0 vs. 8 residual blocks across all datasets. The pattern is consistent: classification tasks favor 0 blocks for fidelity metrics (Shape, Trend), while regression tasks favor 8 blocks for downstream utility (MLE). Exceptions occur precisely in low-categorical datasets (California, Power with 0 categorical columns; Magic with 1), where the per-dataset gain from geometric supervision is smaller on the MLP backbone (Table [7](https://arxiv.org/html/2606.02607#A1.T7)).

Table 6: Ablation study: Effect of residual blocks on performance. Bold indicates better performance. For Shape/Trend, Improv. shows improvement from 0 to 8 blocks (Blue $\uparrow$ = 8 blocks better, Red $\downarrow$ = 0 blocks better). For MLE, Average shows % gap from Real ($\downarrow$ better). Dashes indicate missing data.

### A.4 Categorical-Anchor Analysis

Table 7: Per-architecture and pooled rank correlation between categorical-column count and per-dataset Shape improvement under +Geom across diffusion denoisers. The original “categorical anchor” hypothesis is supported on the MLP track ($\rho = 0.70$, $p = 0.025$) and as a weak aggregate effect across architectures, but does not reach Spearman significance for any individual non-MLP backbone. Trend correlations are not significant for any architecture.

When controlling for total dataset dimensionality by using categorical *fraction* (cat / total cols) instead of raw count, the MLP-track correlation weakens to $\rho = 0.49$ ($p = 0.15$), indicating that the original observation partly conflated “more categorical columns” with “larger overall dataset.” Continuous-only datasets such as Powerplant (0 categorical columns, $d = 5$) show Shape and Trend gains across multiple diffusion backbones, with the per-cell BH-FDR-corrected effect surviving at $q < 0.05$ for the evaluated diffusion architectures on Trend. We characterize the categorical-anchor pattern as one operating regime among several rather than a necessary condition for +Geom. Re-computing the same rank correlation across the loss-ablation configurations of Appendix [A.1](https://arxiv.org/html/2606.02607#A1.SS1) (Table [8](https://arxiv.org/html/2606.02607#A1.T8)) shows the pattern is supervision-dependent: *Inputs Only* produces no positive correlation, while every supervised configuration reproduces it.
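For reference, the Benjamini-Hochberg step-up rule behind a per-cell FDR correction can be sketched as below. This is the standard procedure, not code from the paper; the function name is illustrative and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Hypothetical per-cell p-values (NOT from the paper).
pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
print(benjamini_hochberg(pvals))  # [True, True, False, False, False]
```

Note the step-up behavior: a p-value that misses its own threshold is still rejected if any larger p-value meets its threshold.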

Table 8: Rank correlation between categorical-column count and per-dataset Shape improvement (vs NoGeom) on the MLP backbone, across loss-ablation configurations. The anchor pattern appears only under direct supervision; geometric inputs and architecture alone (Inputs Only) do not produce it. $n = 10$ datasets.

### A.5 MLP+Geom vs. TabDiff Per-Dataset Results

Table 9: MLP+Geom vs. TabDiff with mean $\pm$ std per dataset. Bold = better cell. $\Delta$ rows: MLP+Geom relative improvement ($\uparrow$ better; $\downarrow$ worse). For MLE, rightmost column is mean % gap to Real ($\downarrow$ better).

### A.6 Ablation: Loss-Component Decomposition

Table [10](https://arxiv.org/html/2606.02607#A1.T10) dissects the relative contribution of the three geometric loss components on the MLP backbone. Five configurations are compared: (1) *Inputs Only*: geometric features as input but $\lambda_{\theta} = \lambda_{\ell} = \lambda_{c} = 0$ (the architecture-matched supervision-removed configuration analyzed in Section [4.3](https://arxiv.org/html/2606.02607#S4.SS3)); (2) *$\ell$ Loss*: length loss only ($\lambda_{\ell} = 15$, others 0); (3) *$\theta$ Loss*: angle loss only ($\lambda_{\theta} = 15$, others 0); (4) *$\theta + \ell$ Loss*: angle and length ($\lambda_{\theta} = \lambda_{\ell} = 15$, $\lambda_{c} = 0$); (5) *Full*: complete method ($15, 15, 8$).
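The five configurations differ only in their loss weights, which can be written down as a small configuration table. The sketch assumes the training objective is the denoising loss plus a weighted sum of the three geometric terms; the additive form, the class name `GeomWeights`, and `total_loss` are illustrative assumptions rather than the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeomWeights:
    angle: float        # lambda_theta
    length: float       # lambda_ell
    consistency: float  # lambda_c


# Weights as listed in the text for the five ablation configurations.
ABLATIONS = {
    "Inputs Only": GeomWeights(0, 0, 0),
    "Length Only": GeomWeights(0, 15, 0),
    "Angle Only": GeomWeights(15, 0, 0),
    "Angle + Length": GeomWeights(15, 15, 0),
    "Full": GeomWeights(15, 15, 8),
}


def total_loss(denoise, angle, length, consistency, w: GeomWeights) -> float:
    """ASSUMPTION: objective = denoising loss + weighted geometric terms."""
    return denoise + w.angle * angle + w.length * length + w.consistency * consistency


# With all weights zero, the geometric heads receive no direct supervision.
print(total_loss(1.0, 0.1, 0.2, 0.5, ABLATIONS["Inputs Only"]))  # 1.0
```

Under this form, Inputs Only keeps the full architecture but optimizes the denoising term alone, which matches the supervision-removed control described in the text.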

Table 10: Aggregate performance comparison of geometry loss-weight ablations on the MLP backbone. Values are averaged across 3 training seeds, 20 sampling seeds, and all 10 datasets. Gap shows % distance from Real baseline ($\downarrow$ better).

We note an interesting tradeoff between fidelity (Shape and Trend) and utility (MLE). Length-only appears to provide the greatest aggregate utility without much fidelity improvement, while the angle loss alone provides nearly all of the fidelity improvement but forgoes the utility gain of Length-only: the jump from Inputs Only to $\theta$ Loss accounts for 97% of the Shape improvement and 103% of the Trend improvement. Adding the length loss further improves Shape (0.873 $\to$ 0.861) but degrades MLE (5.2% $\to$ 6.0%); the consistency loss recovers MLE (6.0% $\to$ 4.8%) while maintaining Shape gains. Figure [3](https://arxiv.org/html/2606.02607#S4.F3) visualizes the per-dataset trajectory across the five configurations; the MLE-1 panel reveals News as the dominant outlier carrying most of the consistency-loss benefit, a single-dataset effect the aggregate column would otherwise hide. The Inputs Only row complements the architecture-matched analysis in Section [4.3](https://arxiv.org/html/2606.02607#S4.SS3): even with the geometric heads instantiated and the augmentation path active, gradient descent against the denoising loss alone fails to discover any useful representation of the geometric features.
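The 97% Shape share can be checked from aggregate values quoted in the text, assuming the share is defined as the Inputs Only to $\theta$ Loss drop divided by the Inputs Only to Full drop (the denominator is not stated explicitly). Using the aggregate Shape errors 1.279 (Inputs Only) and 0.862 (Full) from Section 4.3, and 0.873 for $\theta$ Loss:

$$
\frac{E_{\text{Inputs Only}} - E_{\theta}}{E_{\text{Inputs Only}} - E_{\text{Full}}}
= \frac{1.279 - 0.873}{1.279 - 0.862}
= \frac{0.406}{0.417}
\approx 0.97
$$

Under the same definition, a share above 100%, as reported for Trend, means that the angle-only configuration's aggregate error is lower than Full's on that metric.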

Table 11: Loss ablation across all 10 datasets, MLP backbone, 3 training seeds $\times$ 20 generation seeds per cell. NoGeom = no geometric architecture or supervision; Inputs Only = geometric architecture (input concatenation, prediction heads, augmentation path) with all loss weights zeroed ($\lambda_\theta=\lambda_\ell=\lambda_c=0$); Angle Only = $\lambda_\theta=15$, $\lambda_\ell=\lambda_c=0$; Length Only = $\lambda_\ell=15$, $\lambda_\theta=\lambda_c=0$; Angle + Length = $\lambda_\theta=\lambda_\ell=15$, $\lambda_c=0$; Full = $\lambda_\theta=\lambda_\ell=15$, $\lambda_c=8$. Bold marks the best cell per column at displayed precision (ties retained). For MLE-1 the per-dataset metric is AUROC for classification and $R^2$ for regression; for MLE-2 it is F1 and RMSE. The Avg. column is the mean error for Shape/Trend and the mean % gap to Real for MLE. Per-dataset relative improvements vs. NoGeom are visualized in Figure [3](https://arxiv.org/html/2606.02607#S4.F3).
### A.7 Per-Dataset Downstream Utility Across Architectures

Tables [12](https://arxiv.org/html/2606.02607#A1.T12) and [13](https://arxiv.org/html/2606.02607#A1.T13) report per-dataset MLE-1 (AUROC for classification, $R^2$ for regression) and MLE-2 (F1 for classification, RMSE for regression) for all three diffusion backbone pairs on all 10 datasets. Aggregated cross-architecture statistical tests appear in Section [4.2](https://arxiv.org/html/2606.02607#S4.SS2); the supervision-vs-capacity attribution on the MLP track appears in Section [4.3](https://arxiv.org/html/2606.02607#S4.SS3).

Table 12: Per-dataset MLE-1 (AUROC for classification, $R^2$ for regression) for all three backbone pairs. Bold marks the better cell within each pair. The rightmost column reports mean % gap to Real ($\downarrow$ better).

Table 13: Per-dataset MLE-2 (F1 for classification, RMSE for regression) for all three backbone pairs. Bold marks the better cell within each pair. The rightmost column reports mean % gap to Real ($\downarrow$ better).
### A.8 Training Dynamics

Figure [8](https://arxiv.org/html/2606.02607#A1.F8) shows training curves comparing Full Geometry vs. Inputs Only (geometric inputs without supervision).

![Refer to caption](https://arxiv.org/html/2606.02607v1/x8.png)

Figure 8: Training Dynamics (Default). Full Geometry (solid) vs. Inputs Only (dashed). *Total Loss* includes weighted geometric terms, with $(\lambda_\theta,\lambda_\ell,\lambda_c)=(15,15,8)$, for Full Geometry but only diffusion losses ($\mathcal{L}_{\text{cont}}+\mathcal{L}_{\text{cat}}$) for Inputs Only. Inputs Only converges to a lower total loss (0.60 vs. 2.08) yet fails to learn geometry: Angle 2.63 vs. 0.07 (37$\times$), Length 0.11 vs. 0.01 (8$\times$), Consistency 0.88 vs. 0.003 (339$\times$). The weighted geometric loss is 37.4$\times$ higher, showing geometric *inputs* alone provide no learning signal; explicit *supervision* is essential.
### A.9 Detailed Comparison with TabDiff

Table [14](https://arxiv.org/html/2606.02607#A1.T14) compares the shared and differing components of our method and TabDiff.

The shared diffusion framework ensures that performance differences arise from our architectural and supervisory contributions, not from diffusion modifications. The 20$\times$ increase in sampling steps and 2.5$\times$ increase in training epochs represent additional compute; our 3.5$\times$ average parameter reduction partially offsets inference cost. Notably, extended training does not benefit TabDiff: Table [15](https://arxiv.org/html/2606.02607#A1.T15) shows that TabDiff at 8,000 epochs outperforms or ties TabDiff at 20,000 epochs on 16/24 metrics across 6 datasets, validating our use of TabDiff’s published 8,000-epoch protocol.

Table 14: Detailed comparison with TabDiff. We deliberately share the core diffusion framework (left) to isolate the contribution of geometric features. Key differences (right) are architectural (MLP vs. transformer), supervisory (geometric losses), and procedural (more sampling steps and training epochs).
### A.10 TabDiff Extended Training

Table [15](https://arxiv.org/html/2606.02607#A1.T15) compares TabDiff trained for 8,000 epochs (published protocol) versus 20,000 epochs on 6 datasets. Extended training does not improve TabDiff performance: the 8,000-epoch model outperforms or ties the 20,000-epoch model on 16/24 metrics.

Table 15: TabDiff at 8,000 vs. 20,000 training epochs. Bold = better cell within each (dataset, metric) pair; higher is better for Shape, Trend, AUROC/$R^2$, and F1, while lower is better for RMSE. AUROC/$R^2$ row: AUROC for classification (Adult, Default, Magic, Shoppers), $R^2$ for regression (Beijing, News). F1/RMSE row: F1 for classification, RMSE for regression. Aggregate across the 24 (dataset, metric) cells: 8k wins 16, 20k wins 8. Extended training does not benefit TabDiff, validating our use of their published 8,000-epoch protocol. Single training seed, 20 sampling seeds.
### A.11 Computational Cost

Table [16](https://arxiv.org/html/2606.02607#A1.T16) compares per-dataset training time, sampling time, and parameter counts between GATD-MLP and TabDiff across all ten benchmark datasets. All measurements are taken on a single NVIDIA T4 GPU under identical batching and I/O conditions.

##### Analysis\.

GATD-MLP achieves substantial parameter reduction (average 0.29$\times$, range 0.04–0.55$\times$) but incurs higher single-dataset sampling cost (average 2.71$\times$) due to 20$\times$ more sampling steps (1000 vs. 50). The trade-off varies by dataset: Magic achieves both a smaller model *and* faster sampling (0.69$\times$), while regression datasets with $n_{\text{blocks}}=8$ (Beijing, News, Bikesharing, California, Powerplant) have larger models and slower sampling. Classification datasets with $n_{\text{blocks}}=0$ (Adult, Default, Diabetes, Magic, Shoppers) maintain the strongest parameter advantage. Despite 2.5$\times$ more training epochs, total end-to-end training time averages 1.7$\times$ faster than TabDiff due to smaller models and simpler architectures; this is a complete-run wall-clock comparison, not a per-epoch throughput claim.

Table 16: Computational cost comparison. Top 7 datasets are from the main evaluation; bottom 3 are additional benchmarks. *Params*: model parameters. *Train*: total end-to-end training time in hours. *Sample*: time to generate one synthetic dataset in seconds. GATD uses 3.5$\times$ fewer parameters on average and requires 2.7$\times$ longer single-dataset sampling time due to 1000 steps (vs. TabDiff’s 50). Total training runtime is 1.7$\times$ faster on average despite 2.5$\times$ more epochs. All timings are wall-clock measurements on a single NVIDIA T4 GPU.

### A.12 Hyperparameters

Table [17](https://arxiv.org/html/2606.02607#A1.T17) lists all hyperparameters used in our experiments. The configuration is identical across all 10 datasets and all three diffusion backbones, with the single exception of the architecture-depth choice: $n_{\text{blocks}}=0$ for classification tasks and $n_{\text{blocks}}=8$ for regression tasks (motivated in Section [4.5](https://arxiv.org/html/2606.02607#S4.SS5) and analyzed empirically in Appendix [A.3](https://arxiv.org/html/2606.02607#A1.SS3)). Classification additionally uses a higher learning rate ($5\times 10^{-3}$ vs. $1\times 10^{-3}$) given the smaller backbone; the linear-decay schedule, batch size of 4,096, and AdamW optimizer with EMA (decay 0.9995, warmup 200 steps) are shared. The geometric loss weights $(\lambda_\theta,\lambda_\ell,\lambda_c)=(15,15,8)$ are likewise dataset-agnostic; per-dataset tuning unlocks additional gains on specific datasets (Appendix [A.16](https://arxiv.org/html/2606.02607#A1.SS16), Section [A.17](https://arxiv.org/html/2606.02607#A1.SS17)) but is not required for the headline cross-architecture results.
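The settings stated in this paragraph can be collected into a small configuration object. This is only a sketch: the class `GATDConfig`, the helper `default_config`, and all field names are hypothetical, and only values quoted in the paragraph above are filled in (Table 17 remains the authoritative list).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GATDConfig:
    """Hypothetical container for the hyperparameters quoted in the text."""
    task: str                     # "classification" or "regression"
    n_blocks: int                 # architecture depth: 0 or 8
    learning_rate: float          # 5e-3 or 1e-3
    batch_size: int = 4096
    optimizer: str = "AdamW"
    lr_schedule: str = "linear_decay"
    ema_decay: float = 0.9995
    ema_warmup_steps: int = 200   # "warmup 200 steps", as listed with the EMA settings
    lambda_theta: float = 15.0    # angle loss weight
    lambda_ell: float = 15.0      # length loss weight
    lambda_c: float = 8.0         # consistency loss weight


def default_config(task: str) -> GATDConfig:
    """Return the task-dependent defaults; everything else is shared."""
    if task == "classification":
        return GATDConfig(task=task, n_blocks=0, learning_rate=5e-3)
    if task == "regression":
        return GATDConfig(task=task, n_blocks=8, learning_rate=1e-3)
    raise ValueError(f"unknown task: {task!r}")


print(default_config("classification"))
print(default_config("regression"))
```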

Table 17:Full hyperparameter settings\.
### A.13 Sampling-Step Ablation

We evaluated MLP+Geom against TabDiff at sampling-step budgets of 50, 100, 250, 500, 1000, and 2000 steps to characterize the sensitivity of GATD’s reported gains to the sampling protocol. Per-(dataset, step-count) cells use a *single training seed and 20 generation seeds*, with the exception of the 1000-step row, which is reported from the main-results $3\times 20$ protocol. The full $3\times 20$ protocol was prohibitive across all six step-count cells. We compare each GATD step budget against TabDiff at its published 50-step protocol.

Table 18: GATD-MLP at varying sampling-step budgets vs. TabDiff at its published 50-step protocol. Most rows use a single training seed and 20 generation seeds per cell ($1\times 20$ protocol); the full $3\times 20$ protocol was prohibitive across this many step-count cells. The 1000-step row, marked with $*$, uses values from the main-results protocol ($3\times 20$, 60 obs per cell). Aggregate values are means across all 10 datasets; MLE-1 and MLE-2 are reported as average percentage gap to real-data performance ($\downarrow$ better). Win counts (Shape/Trend/MLE-1/MLE-2 out of 10) compare GATD at each step budget against TabDiff at 50 steps. Performance improves monotonically up to 2000 steps with no clear saturation point in our evaluated range.

![Refer to caption](https://arxiv.org/html/2606.02607v1/figures/figure_step_ablation_summary.png)

Figure 9: Sampling-step ablation summary. Aggregate metrics across all 10 datasets as a function of GATD-MLP sampling steps, against TabDiff at its published 50-step protocol (red dashed lines). From left: Shape error ($\downarrow$), Trend error ($\downarrow$), MLE-1 gap to real ($\downarrow$), MLE-2 gap to real ($\downarrow$). GATD wins 3 of 4 aggregate metrics at 50 steps and all 4 by 100 steps; performance continues to improve up to 2000 steps with no clear saturation in our evaluated range. Single training seed and 20 generation seeds per (dataset, step-count) cell, except the 1000-step row, which uses the main-results $3\times 20$ protocol. Wall-clock costs are given in Section [A.13](https://arxiv.org/html/2606.02607#A1.SS13), paragraph “Wall-clock cost.”

##### Aggregate findings.

Table [18](https://arxiv.org/html/2606.02607#A1.T18) (visualized in Figure [9](https://arxiv.org/html/2606.02607#A1.F9)) reports per-step-budget performance. At 50 sampling steps, GATD already wins 3 of 4 metrics in aggregate against TabDiff (Trend 1.96 vs. 2.05; MLE-1 4.6% vs. 11.4% gap; MLE-2 7.2% vs. 9.7% gap), losing only on Shape (1.55 vs. 1.19). By 100 sampling steps, GATD wins all 4 aggregate metrics. Fidelity keeps improving as step count increases: aggregate Shape error decreases monotonically from 1.55 (50 steps) to 0.85 (2000 steps), and aggregate Trend error decreases monotonically from 1.96 to 1.58. The MLE downstream-utility gaps remain in a narrow band (4–7%) across all step counts, indicating that step count primarily affects distributional fidelity rather than predictive utility.

##### Per\-dataset wins are mixed at low step counts\.

While the aggregate at 50 steps shows GATD winning 3 of 4 metrics, the per\-dataset story is mixed: Shape 3/10, Trend 4/10, MLE\-1 5/10, MLE\-2 8/10\. By 100 steps, per\-dataset wins improve to 7/6/8/9; by 250 steps and beyond, the pattern stabilizes around 8/7/5–7/8–9\. The aggregate\-vs\-per\-dataset divergence at low step counts reflects a few large\-effect datasets pulling aggregate means below TabDiff’s, while the median dataset still favors TabDiff at very low step budgets\.

##### No saturation point in our evaluated range\.

Our data does not show a clear saturation point: aggregate Shape continues to improve from 0\.86 at 1000 steps to 0\.85 at 2000 steps, and aggregate Trend from 1\.65 to 1\.58\. The 1000\-step setting we report as default in the main text is a reasonable operating point for accuracy\-vs\-cost trade\-off, not a saturation point—practitioners willing to spend additional inference cost can extract modest additional gains\. Conversely, at 50–100 steps, GATD is already beating TabDiff on most aggregate metrics, suggesting that practitioners constrained by inference budget can also use GATD without giving up most of its advantage\.

##### Wall\-clock cost\.

A representative example from the Adult dataset: TabDiff requires 35 seconds to generate a single synthetic dataset at 50 steps, whereas GATD\-MLP requires 2\.5 seconds at 50 steps and 5 seconds at 100 steps—an order of magnitude faster per\-step due to the substantially smaller MLP backbone\. At 1000 steps GATD\-MLP requires roughly 50 seconds per dataset, comparable to TabDiff at 50 steps; at 2000 steps the cost is roughly 100 seconds\.
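The timings quoted above are consistent with sampling cost that grows linearly in the number of steps. The short calculation below works through that arithmetic for the Adult example; the linear-scaling model and the helper name `projected_seconds` are assumptions made for illustration, and the inputs are the rounded figures from the text.

```python
# Timings quoted in the text for the Adult dataset (seconds per synthetic dataset).
GATD_SECONDS_AT_50 = 2.5
TABDIFF_SECONDS_AT_50 = 35.0

# Per-step cost, assuming cost scales linearly with the number of sampling steps.
GATD_SEC_PER_STEP = GATD_SECONDS_AT_50 / 50
TABDIFF_SEC_PER_STEP = TABDIFF_SECONDS_AT_50 / 50
SPEEDUP = TABDIFF_SEC_PER_STEP / GATD_SEC_PER_STEP


def projected_seconds(steps: int) -> float:
    """Projected GATD-MLP sampling time under the linear-scaling assumption."""
    return GATD_SEC_PER_STEP * steps


for steps in (50, 100, 1000, 2000):
    print(f"{steps:>5} steps: ~{projected_seconds(steps):.1f} s")
print(f"per-step speedup over TabDiff: ~{SPEEDUP:.0f}x")
```

The projection reproduces the reported 5, 50, and 100 seconds at 100, 1000, and 2000 steps, and the per-step ratio of roughly 14 matches the "order of magnitude" description.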

### A.14 Wide-Table Scalability: APS Failure at Scania Trucks

To empirically characterize GATD’s behavior on wider tables than those in the main 10-dataset benchmark, we evaluated against TabDiff on the APS Failure at Scania Trucks dataset ($d=171$ columns, $n=60{,}000$ rows, 1 categorical column). This dataset sits well above the $d\leq 48$ regime of our main evaluation and tests the practical implications of $O(d^2)$ pairwise feature scaling. Both methods use default configurations and train on a single NVIDIA T4 GPU.
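To make the quadratic growth concrete, the sketch below counts column pairs at the widths mentioned in this appendix. It assumes one feature per unordered pair $i<j$; whether GATD uses ordered or unordered pairs is not stated in this section, and ordered pairs would double each count. The function name `num_pairs` is hypothetical.

```python
def num_pairs(d: int) -> int:
    """Number of unordered column pairs (i < j) among d columns."""
    return d * (d - 1) // 2


# d = 5 (Powerplant), d = 48 (upper end of the main benchmark), d = 171 (APS).
for d in (5, 48, 171):
    print(f"d = {d:>3}: {num_pairs(d):>6} pairs")

growth = num_pairs(171) / num_pairs(48)
print(f"pairs at d=171 relative to d=48: {growth:.1f}x")
```

Moving from 48 to 171 columns multiplies the width by about 3.6 but the pair count by about 12.9 (1,128 to 14,535 pairs), which is the scaling pressure this experiment is designed to probe.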

##### Default configuration\.

With the same loss weights $(\lambda_\theta,\lambda_\ell,\lambda_c)=(15,15,8)$ used across the main 10-dataset benchmark, GATD already substantially outperforms TabDiff on distributional fidelity at $d=171$ (Table [19](https://arxiv.org/html/2606.02607#A1.T19)): Shape error drops from 9.48% to 0.51% (95% relative reduction), and Trend is essentially tied. TabDiff retains the AUC advantage on this single-categorical-column dataset; the small geometric signal available from a single discrete column appears insufficient to recover downstream classification utility under the default loss weighting, although tuned weights (below) substantially close this gap. The $O(d^2)$ pairwise feature cost is empirically observable: at $d=171$, GATD’s parameter count (17M vs. 12M) and memory footprint (12.3 GB vs. 6.5 GB) exceed TabDiff’s, but both models fit on a single T4 and GATD is still slightly faster per epoch (5.5s vs. 7.5s).

Table 19: APS Failure at Scania Trucks ($d=171$), default-hyperparameter comparison. With default $(\lambda_\theta,\lambda_\ell,\lambda_c)=(15,15,8)$, GATD substantially outperforms TabDiff on distributional fidelity (Shape 95% relative reduction; Trend essentially tied), while TabDiff wins downstream utility (AUC), consistent with APS having only 1 categorical column among 171. The $O(d^2)$ scaling cost is empirically observable at this dimensionality: GATD’s parameter count and memory usage exceed TabDiff’s, though both fit on a single T4 and GATD remains slightly faster per epoch.
##### With minimal tuning\.

Following the practitioner guidance in Section [A.17](https://arxiv.org/html/2606.02607#A1.SS17), we performed a small grid search over $(\lambda_\theta,\lambda_\ell,\lambda_c)\in\{0,60\}^3$ (8 trials total), motivated by the observation that APS has only 1 categorical column and should benefit from emphasizing length supervision over angle supervision. The grid identifies $(0,60,0)$ as the optimal configuration. Table [20](https://arxiv.org/html/2606.02607#A1.T20) reports the tuned-vs-TabDiff comparison: GATD now wins 3 of 4 metrics, including a substantial F1 margin (73.2% vs. 44.8%) where the default configuration had lost AUC. The tuning cost (8 trials) is modest, and demonstrates that the practitioner-tunable $\lambda$ values address failure modes that TabDiff’s fixed-architecture protocol cannot reach.
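The 8-trial grid is simply the Cartesian product of $\{0, 60\}$ over the three weights. A minimal sketch of enumerating and searching it follows; `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in constructed to prefer length-only supervision, not a real validation metric from the experiments.

```python
from itertools import product

GRID_VALUES = (0, 60)
# Each trial is a (lambda_theta, lambda_ell, lambda_c) triple.
TRIALS = list(product(GRID_VALUES, repeat=3))


def grid_search(evaluate, trials):
    """Return the trial with the lowest score under a user-supplied evaluate()."""
    return min(trials, key=evaluate)


def toy_score(weights):
    # Placeholder for "train with these weights and measure validation error".
    # Constructed for illustration only; it is NOT a measured quantity.
    lam_theta, lam_ell, lam_c = weights
    return lam_theta + lam_c - lam_ell


print(len(TRIALS))                    # number of trials in the grid
print(grid_search(toy_score, TRIALS))
```

In practice `evaluate` would wrap a full training and evaluation run, so each of the 8 trials costs one training job.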

Table 20:APS with 8\-trial loss\-weight grid search\. GATD wins 3 of 4 metrics, including a substantial F1 margin \(73\.2% vs 44\.8%\)\. The tuning cost \(8 trials\) is modest relative to the gains, and demonstrates that the practitioner\-tunable lambdas address failure modes that TabDiff’s fixed\-architecture protocol cannot\.
##### Interpretation\.

The $O(d^2)$ pairwise feature scaling is real and observable at this dimensionality: GATD’s memory consumption and parameter count both exceed TabDiff’s at $d=171$. However, the practical implications are not the catastrophic failure suggested by an asymptotic complexity argument: training time per epoch remains favorable, both methods fit comfortably on a single T4, and tuned GATD wins 3 of 4 metrics. This is also the same order as column-wise self-attention, so the relevant question is not whether quadratic pair structure exists, but whether the pairwise signal is useful. Our results and failed pair-selection experiments support supervising all pairs, including weak or noisy relationships, because those non-relationships are part of the relational signature.

### A.15 GATD at TabDiff’s 8,000-Epoch Training Protocol

Reviewers raised a reasonable concern that GATD’s 20,000-epoch training schedule (vs. TabDiff’s published 8,000) might be a confounding source of GATD’s reported gains rather than the geometry itself. To isolate geometric supervision from training-protocol differences, we additionally trained GATD-MLP at 8,000 epochs (identical to TabDiff’s protocol) across 3 training seeds and 20 generation seeds per dataset, holding other hyperparameters at defaults. Table [21](https://arxiv.org/html/2606.02607#A1.T21) reports the result.

##### Interpretation\.

Extended training contributes at most a small fraction of GATD’s reported improvement over TabDiff. At identical 8,000-epoch budgets, GATD-MLP still achieves 27 of 40 wins (with 3 ties) against TabDiff, compared to 29 wins and 2 ties at 20,000 epochs. The 20,000-epoch protocol we report by default in the main text is a moderately favorable operating point for GATD but is not the source of its gains over TabDiff: the gap between GATD-8k (0.854 Shape) and TabDiff-8k (1.187 Shape) is a 28% relative reduction, very close to the 27% reduction reported at 20k. The marginal benefit of 2.5$\times$ more training is small (Shape 0.854 $\to$ 0.862, a 0.9% change; MLE-2 gap 6.1% $\to$ 5.6%). This isolates geometric supervision (and architectural simplification to an MLP) as the primary mechanisms, not the extended training budget.
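The two percentages quoted in this paragraph follow directly from the reported Shape values; the short check below reproduces them. The helper name `relative_change` is hypothetical, and the inputs are the rounded figures from the text.

```python
def relative_change(new: float, old: float) -> float:
    """Signed relative change of `new` with respect to `old`."""
    return (new - old) / old


# GATD-8k vs. TabDiff-8k Shape error: relative reduction at matched budgets.
matched_budget = -relative_change(0.854, 1.187)

# GATD Shape error at 8,000 vs. 20,000 epochs: effect of 2.5x more training.
extended_training = relative_change(0.862, 0.854)

print(f"reduction vs. TabDiff at 8k epochs: {matched_budget:.1%}")
print(f"Shape change from 8k to 20k epochs: {extended_training:.1%}")
```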

Table 21: GATD-MLP at TabDiff’s 8,000-epoch training budget. At identical training budgets, GATD wins 27 of 40 metric-dataset cells with 3 ties, compared to 29 wins and 2 ties at 20,000 epochs. The marginal benefit of 2.5$\times$ more training is small: Shape error changes only from 0.854 to 0.862, while MLE-2 gap decreases from 6.1% to 5.6%. Geometric supervision, not extended training, is the primary driver of GATD’s improvement over TabDiff.

### A.16 Loss-Weight Sensitivity Analysis

The default geometric loss weights $(\lambda_\theta,\lambda_\ell,\lambda_c)=(15,15,8)$ are not finely tuned: they were chosen at the outset of development based on rough alignment between angle/length and diffusion loss magnitudes, and held fixed across all 10 datasets and all three diffusion backbones. To characterize whether this default sits in a stable basin or at a precarious peak, we conducted single-variable sensitivity sweeps, holding the other two weights at default values:

- Angle sweep (primary): $\lambda_\theta\in\{0,5,10,15,20,30\}$ with $\lambda_\ell=15$, $\lambda_c=8$ fixed.
- Length sweep: $\lambda_\ell\in\{0,5,10,15,20,30\}$ with $\lambda_\theta=15$, $\lambda_c=8$ fixed.
- Consistency sweep: $\lambda_c\in\{0,5,10,15,20,30\}$ with $\lambda_\theta=15$, $\lambda_\ell=15$ fixed.

Sweeps were run on 4 representative datasets, each cell at a *single training seed and a single generation seed* ($1\times 1$ protocol). The full $3\times 20$ protocol used for the main results was prohibitive at the scale of this sensitivity grid: $6\text{ values}\times 3\text{ lambdas}\times 3\text{ train seeds}\times 20\text{ gen seeds}\times 10\text{ datasets}=10{,}800$ generations, well outside the available compute budget.
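The sweep design and the compute arithmetic above can be sketched as follows. The function `single_variable_sweeps` and the dictionary keys are hypothetical names; the reduced-protocol count is derived here by counting the same way as the text, and is not a figure reported in the paper.

```python
DEFAULT = {"theta": 15, "ell": 15, "c": 8}
SWEEP_VALUES = (0, 5, 10, 15, 20, 30)


def single_variable_sweeps(default, values):
    """Vary one weight at a time, holding the other two at their defaults."""
    configs = []
    for name in default:
        for value in values:
            cfg = dict(default)
            cfg[name] = value
            configs.append((name, (cfg["theta"], cfg["ell"], cfg["c"])))
    return configs


sweeps = single_variable_sweeps(DEFAULT, SWEEP_VALUES)
unique_triples = {triple for _, triple in sweeps}

# Generation counts: values x lambdas x train seeds x gen seeds x datasets.
full_protocol = len(SWEEP_VALUES) * len(DEFAULT) * 3 * 20 * 10
reduced_protocol = len(SWEEP_VALUES) * len(DEFAULT) * 1 * 1 * 4

print(len(sweeps), len(unique_triples))
print(full_protocol, reduced_protocol)
```

The 18 sweep cells contain 17 distinct weight triples: the default $(15,15,8)$ appears in both the angle and length sweeps, but not in the consistency sweep, because 8 is not among the swept values.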

![Refer to caption](https://arxiv.org/html/2606.02607v1/x9.png)

![Refer to caption](https://arxiv.org/html/2606.02607v1/x10.png)

![Refer to caption](https://arxiv.org/html/2606.02607v1/x11.png)

Figure 10: Loss-weight sensitivity sweeps for the three geometric loss terms on four representative datasets. Each panel varies one weight while holding the other two at the default setting: angle sweep $(\lambda_\theta,\lambda_\ell,\lambda_c)=(X,15,8)$, length sweep $(15,X,8)$, and consistency sweep $(15,15,X)$, with $X\in\{0,5,10,15,20,30\}$ under the $1\times 1$ protocol. Aggregate Shape and Trend errors remain within $\pm 5\%$ of the default $(15,15,8)$ across the sweeps. Per-dataset behavior is heterogeneous: continuous-heavy datasets such as Powerplant can improve when angle supervision is reduced, whereas categorical-heavy datasets such as Adult generally prefer the default or higher angle weight.

Across angle, length, and consistency sweeps on these 4 datasets (Figure [10](https://arxiv.org/html/2606.02607#A1.F10)), single-seed Shape and Trend errors remain close to those of the default $(15,15,8)$ configuration over the full swept range from 0 to 30. The default sits in a stable basin rather than at a precarious peak, indicating that GATD does not require precise hyperparameter calibration to achieve its reported performance.

##### Per\-dataset tuning unlocks additional gains\.

Although the default is stable in aggregate, per-dataset tuning of the geometric weights unlocks substantial additional improvements. The most striking example is Powerplant (0 categorical columns, $d=5$): at default $(15,15,8)$, Powerplant achieves 0.973 Shape error and 0.387 Trend error. Setting $\lambda_\theta=0$, which disables the angle loss, yields 0.908 Shape and 0.108 Trend, halving TabDiff’s best Trend result on this dataset (0.219). A small grid search over $\lambda_\theta,\lambda_\ell\in\{0,30,60\}$ (9 configurations) identifies $(\lambda_\theta,\lambda_\ell,\lambda_c)=(0,60,60)$ as optimal: 0.764 Shape, 0.111 Trend on Powerplant. These per-dataset tunable knobs have no equivalent in TabDiff or other transformer-based tabular generators, where inter-column relationships are learned implicitly without explicit user-tunable weights.

The pattern parallels the per-dataset improvements observed on the MLP backbone (Section [4.5](https://arxiv.org/html/2606.02607#S4.SS5), Appendix [A.4](https://arxiv.org/html/2606.02607#A1.SS4)): on low-categorical datasets, disabling the angle loss and emphasizing length supervision can improve aggregate metrics. We provide concrete tuning recommendations in Section [A.17](https://arxiv.org/html/2606.02607#A1.SS17).

### A.17 Practitioner Guidance for Loss-Weight Tuning

The geometric loss weights $(\lambda_\theta,\lambda_\ell,\lambda_c)$ provide tunable practitioner levers absent from transformer-based tabular generators. Our default configuration $(15,15,8)$ achieves competitive performance across all 10 datasets without per-dataset tuning, but per-dataset tuning unlocks substantial additional gains in some regimes. Based on our sensitivity analysis (Appendix [A.16](https://arxiv.org/html/2606.02607#A1.SS16)) and the categorical-anchor analysis (Section [4.5](https://arxiv.org/html/2606.02607#S4.SS5)), we offer the following practitioner guidance:

- Start with the default $(15,15,8)$. The default sits in a stable performance basin: aggregate Shape and Trend errors remain within $\pm 5\%$ of the default across the full swept range from 0 to 30 when sweeping any single weight (Appendix [A.16](https://arxiv.org/html/2606.02607#A1.SS16)).
- Categorical-heavy datasets (high categorical-column fraction): the angle loss is empirically the primary driver on the MLP backbone (Section [4.5](https://arxiv.org/html/2606.02607#S4.SS5)). Increasing $\lambda_\theta$ to 20–30 may improve Shape on datasets with rich categorical structure.
- Continuous-heavy datasets (low categorical-column fraction): the length loss is the primary driver. Increasing $\lambda_\ell$ and reducing $\lambda_\theta$ (potentially to 0) may unlock substantial gains. On Powerplant (0 categorical columns), reducing $\lambda_\theta$ from 15 to 0 cuts Trend error by 72% (0.387 to 0.108), beating TabDiff’s best Trend on this dataset (0.219) by a factor of two.
- Wide tables with few categorical columns: prefer $(0,60,0)$ as a starting point. On APS Failure at Scania Trucks ($d=171$, 1 categorical column; Appendix [A.14](https://arxiv.org/html/2606.02607#A1.SS14)), this configuration achieves 3 of 4 wins against TabDiff after only an 8-trial grid search.

These knobs are not available in TabDiff or other transformer-based tabular generators, where inter-column relationships are learned implicitly without explicit user-tunable weights. The geometric loss weights thus provide an additional dimension of practitioner control that complements the architecture-portability claim of Section [4.2](https://arxiv.org/html/2606.02607#S4.SS2).
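The guidance above could be encoded as a simple starting-point heuristic, sketched below. The function `suggest_starting_weights` and all three numeric thresholds are hypothetical: the text gives qualitative regimes only, not cutoffs. The returned triples are taken from configurations mentioned in this appendix and are meant as initial values for a small grid search, not as tuned optima.

```python
DEFAULT_WEIGHTS = (15, 15, 8)


def suggest_starting_weights(n_columns, n_categorical,
                             wide_threshold=100,
                             continuous_heavy_max_frac=0.1,
                             categorical_heavy_min_frac=0.5):
    """Return a starting (lambda_theta, lambda_ell, lambda_c) triple.

    All three thresholds are illustrative assumptions, not values from the paper.
    """
    frac = n_categorical / n_columns
    if n_columns >= wide_threshold and frac <= continuous_heavy_max_frac:
        return (0, 60, 0)      # wide table, few categorical columns (APS setting)
    if frac <= continuous_heavy_max_frac:
        return (0, 15, 8)      # continuous-heavy: angle loss disabled (Powerplant setting)
    if frac >= categorical_heavy_min_frac:
        return (20, 15, 8)     # categorical-heavy: low end of the 20-30 angle range
    return DEFAULT_WEIGHTS


print(suggest_starting_weights(n_columns=171, n_categorical=1))  # APS-like table
print(suggest_starting_weights(n_columns=5, n_categorical=0))    # Powerplant-like table
print(suggest_starting_weights(n_columns=10, n_categorical=3))   # mixed table
```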

### A.18 Geometric Feature Form: Arctan vs Raw Differences

To validate the arctan transformation as the geometric feature choice, we compared GATD-MLP against an otherwise-identical configuration using raw pairwise differences ($v_j-v_i$) as the angle feature, with the length feature unchanged. All other architecture, training, and supervision details are held identical. Per-dataset cells use a *single training seed and 20 generation seeds* ($1\times 20$ protocol); the main-text $3\times 20$ protocol was not run for this ablation.

Table 22: Arctan vs raw pairwise differences as the angle feature, evaluated on all 10 datasets. Arctan = the default GATD configuration. RawDiffs = identical architecture, training, and supervision with $\theta_{ij}=\arctan(v_j-v_i)$ replaced by $\theta_{ij}=v_j-v_i$. Single training seed, 20 generation seeds per cell ($1\times 20$ protocol). Both feature forms produce competitive results; arctan provides a consistent edge, particularly on downstream-utility metrics.

##### Findings.

Both feature forms produce competitive results (Table [22](https://arxiv.org/html/2606.02607#A1.T22)), confirming that pairwise auxiliary supervision is the primary mechanism rather than the specific transformation choice. Arctan, however, provides a consistent edge on Shape (7/10 datasets, aggregate 0.86 vs 0.91), MLE-1 (8/10, aggregate 4.8% vs 5.1% gap), and MLE-2 (8/10, aggregate 5.6% vs 6.1% gap). On Trend the two forms are essentially tied at the per-dataset level (5/5), with arctan winning slightly in aggregate (1.65 vs 1.66).

##### Why arctan helps\.

Two properties of arctan provide consistent benefits over raw differences. First, arctan’s bounded output ($\theta_{ij}\in(-\pi/2,\pi/2)$) provides stable supervision targets regardless of input scale, whereas raw differences inherit the scale of the underlying column values. Second, arctan’s nonlinear compression (large differences map to similar angles near $\pm\pi/2$) emphasizes the directional structure of column relationships over their absolute magnitudes, which is precisely the inductive bias the geometric supervision is designed to provide. These properties matter most for distributional fidelity (Shape) and downstream utility (MLE-1, MLE-2), where the bounded representation appears to provide more consistent gradients during training.
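The boundedness argument can be illustrated with a small sketch that computes both feature forms for a single row. The function `pairwise_angle_features` is a hypothetical name; the sketch assumes one feature per unordered pair $i<j$ and already-numeric column values, and it leaves out the length feature and the encoding of categorical columns, which this section does not specify.

```python
import math


def pairwise_angle_features(row, use_arctan=True):
    """Map each column pair (i, j), i < j, to arctan(v_j - v_i) or the raw v_j - v_i."""
    features = {}
    for i in range(len(row)):
        for j in range(i + 1, len(row)):
            diff = row[j] - row[i]
            features[(i, j)] = math.atan(diff) if use_arctan else diff
    return features


# A toy row whose third column lives on a much larger scale than the others.
row = [0.1, 2.0, 1500.0]
arctan_feats = pairwise_angle_features(row, use_arctan=True)
raw_feats = pairwise_angle_features(row, use_arctan=False)

for pair in arctan_feats:
    print(pair, f"arctan={arctan_feats[pair]:.4f}", f"raw={raw_feats[pair]:.1f}")
```

The arctan targets stay strictly inside $(-\pi/2, \pi/2)$ even for the large-scale column, while the raw differences grow with the column's scale.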

The arctan formulation also opens a natural extension to higher-order geometric features such as triangle-closure losses on column triples (Section [5](https://arxiv.org/html/2606.02607#S5)), which raw differences would not support without additional normalization.