GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data

arXiv cs.LG Papers

Summary

This paper introduces GOTabPFN, a method that combines Graph-guided Ordering with Local Refinement (GO-LR) and Neuro-Inspired Subunit Compression (NSC) to make small tabular foundation models effective for high-dimensional, low-sample-size prediction without retraining large backbones.

arXiv:2606.05441v1 Announce Type: new Abstract: We investigate how to make small tabular foundation models effective for High-Dimensional, Low-Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph-guided Ordering with Local Refinement (GO-LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP-path-style surrogate. We propose GOTabPFN, which builds on GO-LR, and a Neuro-Inspired Subunit Compression (NSC) unit to pool locally adjacent ordered features into meta-features, yielding a compact representation that makes TabPFN-style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.

Cached at: 06/05/26, 08:11 AM

# From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data
Source: [https://arxiv.org/html/2606.05441](https://arxiv.org/html/2606.05441)
Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, Donald A. Adjeroh

###### Abstract

We investigate how to make small tabular foundation models effective for High-Dimensional, Low-Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph-guided Ordering with Local Refinement (GO-LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP-path-style surrogate. We propose GOTabPFN, which builds on GO-LR, and a Neuro-Inspired Subunit Compression (NSC) unit to pool locally adjacent ordered features into meta-features, yielding a compact representation that makes TabPFN-style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.

Tabular foundation models, high\-dimensional data, feature ordering, compact tokenization, TabPFN

## 1 Introduction

High-Dimensional, Low-Sample Size (HDLSS) tabular prediction remains a challenge: when $m \gg n$ (with $m$ the number of features and $n$ the number of samples), both learning and representation become costly. Tabular foundation models such as TabPFN and its variants are strong general-purpose baselines, but popular versions (e.g., TabPFN-2.5 (Grinsztajn et al., [2025](https://arxiv.org/html/2606.05441#bib.bib115))) are designed and benchmarked for inputs with up to roughly $2{,}000$ features, leaving many HDLSS domains (e.g., gene expression with $m \gg 2{,}000$) outside their intended operating range without prior feature selection or compression. This motivates representation strategies that reduce dimensionality under tight sample budgets while preserving predictive structure, so that TabPFN-style learners remain effective in truly high-dimensional regimes.

Permutation learning seeks an ordering of a finite set that improves a downstream objective, typically via differentiable relaxations that approximate discrete permutations in end-to-end neural training (Barthel et al., [2025](https://arxiv.org/html/2606.05441#bib.bib119); Jurewicz and Derczynski, [2022](https://arxiv.org/html/2606.05441#bib.bib120)). For tabular data, the lack of inherent spatial or temporal structure weakens inductive bias relative to vision or language, especially in HDLSS settings. Although tree-based methods remain strong baselines, learning cross-feature dependencies without overfitting is difficult; even simple models (e.g., MLPs or Lasso) can outperform advanced tabular approaches in $n \ll m$ regimes (ProtoGate (Jiang et al., [2024](https://arxiv.org/html/2606.05441#bib.bib98))). This suggests that feature selection alone is often insufficient; we also need a learnable feature ordering that organizes correlated features into neighborhoods amenable to structured compression. We therefore formulate the Column Permutation Problem (CPP) (Fogel et al., [2013](https://arxiv.org/html/2606.05441#bib.bib14); Lima et al., [2024](https://arxiv.org/html/2606.05441#bib.bib121); Tegze and Vlach, [1986](https://arxiv.org/html/2606.05441#bib.bib123); Liiv, [2010](https://arxiv.org/html/2606.05441#bib.bib126); Behrisch et al., [2016](https://arxiv.org/html/2606.05441#bib.bib127)): learn a data-driven column order that reduces redundancy, reveals long-range dependencies, and induces a useful sequential structure for downstream modules. In practice, CPP can be tackled via attention-based pointer mechanisms and graph-aware variants that generate permutations while encoding relational structure (Vinyals et al., [2015](https://arxiv.org/html/2606.05441#bib.bib122); Yang et al., [2022b](https://arxiv.org/html/2606.05441#bib.bib124); Veličković et al., [2020](https://arxiv.org/html/2606.05441#bib.bib125)).

Feature ordering has a long history in pattern recognition and is central to Incremental Attribute Learning (IAL), where features arrive sequentially and must be ranked before training (Wang and Guan, [2013](https://arxiv.org/html/2606.05441#bib.bib133)). Unlike set-based models that assume order invariance (Zaheer et al., [2017](https://arxiv.org/html/2606.05441#bib.bib143)), column order can expose redundancy and shape how models capture dependencies; even simple Fisher/correlation/entropy rankings reduce interference and error over unordered baselines (Wang et al., [2015c](https://arxiv.org/html/2606.05441#bib.bib134), [b](https://arxiv.org/html/2606.05441#bib.bib136)), motivating learned, task-aware ordering (Wang et al., [2015a](https://arxiv.org/html/2606.05441#bib.bib135), [2014](https://arxiv.org/html/2606.05441#bib.bib139)). In deep tabular learning, Mambular (Thielmann et al., [2024](https://arxiv.org/html/2606.05441#bib.bib141)) underscored the impact of ordering, and Habib et al. ([2024](https://arxiv.org/html/2606.05441#bib.bib142), [2026b](https://arxiv.org/html/2606.05441#bib.bib157)) introduced explicit ordering algorithms in TabSeq and DynaTab, respectively. Other related efforts show brittleness to column permutations, prompting permutation-invariant architectures (Eremeev et al., [2025](https://arxiv.org/html/2606.05441#bib.bib131); Brahmavar et al., [2025](https://arxiv.org/html/2606.05441#bib.bib132)) and TabICL (Jingang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib107)), which ensembles across permutations. Beyond supervised prediction, COPER (Eisenberg et al., [2025](https://arxiv.org/html/2606.05441#bib.bib140)) uses a permutation-based correlation objective for multi-view (image-table) clustering, and ROTATOR-LLM (Wang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib128)) studies feature ordering for LLM-based tabular inference.

While ordering can expose local structure, HDLSS tables introduce a second bottleneck: even a “good” permutation still leaves $m$ raw features to process, which is prohibitive when $m \gg n$. To make TabPFN-style predictors practical in this regime without changing the backbone, we introduce Neuro-Inspired Subunit Compression (NSC), motivated by subunit-style integration in cortical dendrites (Poirazi et al., [2003](https://arxiv.org/html/2606.05441#bib.bib16); Schiller et al., [2000](https://arxiv.org/html/2606.05441#bib.bib17); Major et al., [2013](https://arxiv.org/html/2606.05441#bib.bib18); Kastellakis et al., [2015](https://arxiv.org/html/2606.05441#bib.bib19); Kirchner and Gjorgjieva, [2021](https://arxiv.org/html/2606.05441#bib.bib20); Ujfalussy and Makara, [2020](https://arxiv.org/html/2606.05441#bib.bib24); Wu et al., [2018](https://arxiv.org/html/2606.05441#bib.bib25)). NSC groups adjacent features along the GO-LR (Graph-guided Ordering with Local Refinement) axis into contiguous subunits and pools each into a meta-feature, reducing dimensionality from $m$ to $M$ ($M \ll m$), with $M$ tied to intrinsic dimension estimates from the covariance spectrum (Roy and Vetterli, [2007](https://arxiv.org/html/2606.05441#bib.bib21); Halko et al., [2011](https://arxiv.org/html/2606.05441#bib.bib22); Levina and Bickel, [2004](https://arxiv.org/html/2606.05441#bib.bib23)). Naïve compression often produces latent components without a stable coordinate system, yielding run- and subsample-dependent representations that are not effective for TabPFN-style models, which assume a fixed, consistently parameterized input space (Hollmann et al., [2023](https://arxiv.org/html/2606.05441#bib.bib114), [2025](https://arxiv.org/html/2606.05441#bib.bib113)). We therefore design a structure-constrained compression interface that yields reproducible latent features within the feature budgets targeted by recent TabPFN variants (Grinsztajn et al., [2025](https://arxiv.org/html/2606.05441#bib.bib115); Liu and Ye, [2025](https://arxiv.org/html/2606.05441#bib.bib116); Kolberg et al., [2025](https://arxiv.org/html/2606.05441#bib.bib117)).

Our contributions:

- We cast feature ordering as a combinatorial optimization problem, prove its NP-hardness, and propose MinLA-grounded ordering via GO-LR.
- We introduce scalable HDLSS compression via NSC, a neuro-inspired subunit-style pooling that is controlled by intrinsic-dimension estimates.
- Building on the above, we propose GOTabPFN for analyzing HDLSS tabular data. Across HDLSS benchmarks, GOTabPFN improves accuracy and stability under tight feature budgets in high dimensions.

![Refer to caption](https://arxiv.org/html/2606.05441v1/FO.png)

Figure 1: Graph-based feature ordering. GO-LR linearizes a weighted feature graph to keep related features nearby for local segmentation and compression. It uses NNPath for local initialization, then refines the order with a global MinLA-style objective over pairwise placements. See Appendix [T](https://arxiv.org/html/2606.05441#A20) for more clarifications.

## 2 Related Work

In Appendix [A](https://arxiv.org/html/2606.05441#A1), we provide more details on related work, including tabular foundation models, the TabPFN family, HDLSS-specific models, and LLM-based tabular models.

Existing approaches often struggle in HDLSS settings with $m \gg n$, since they either assume moderate feature counts or rely primarily on feature selection and task-specific tuning to cope with very high dimensionality. GOTabPFN bridges this gap by coupling MinLA-grounded ordering (GO-LR) with subunit-style compression (NSC), yielding stable, low-dimensional representations that enable TabPFN-style predictors to operate effectively in truly high-dimensional regimes without modifying the TabPFN backbone.

![Refer to caption](https://arxiv.org/html/2606.05441v1/meta.png)

Figure 2: Meta-feature construction. GO-LR first orders features globally; NSC then segments the ordered axis into contiguous neighborhoods and compresses each segment by PCA into a scalar meta-feature. The final vector $Z(x) = (z_1, \ldots, z_M)$ is passed to the frozen TabPFN-2.5 head. See Appendix [T](https://arxiv.org/html/2606.05441#A20) for additional clarifications.

## 3 Methodology

Problem formulation. Let $X \in \mathbb{R}^{n \times m}$ be the input matrix with $n$ samples and $m$ features. We define the sample partition $\{I_c\}_{c=1}^{k}$ obtained by clustering the samples, and the cluster-restricted matrices in Eq. [1](https://arxiv.org/html/2606.05441#S3.E1).

$$X^{(c)} = X[I_c, :] \in \mathbb{R}^{n_c \times m}, \qquad n_c = |I_c| \tag{1}$$

For each $X^{(c)}$, we construct the corresponding cluster-wise feature graph $G_c = (V, E, w^{(c)})$, where $V = \{1, \dots, m\}$ is the shared feature set and $w^{(c)}_{ij}$ measures feature dissimilarity within cluster $c$. The local permutation $\pi_c$ is obtained by minimizing a MinLA-style dispersion objective on $G_c$, and the final global permutation $\Pi^{\ast}$ is obtained by aggregating local ranks across clusters. All permutations are over features, and GO-LR outputs a single global feature ordering $\Pi^{\ast}$, not separate feature spaces that must later be rearranged across clusters. $\Pi^{\ast}$ is then used for NSC segmentation and compression. Figs. [1](https://arxiv.org/html/2606.05441#S1.F1), [2](https://arxiv.org/html/2606.05441#S2.F2), and [3](https://arxiv.org/html/2606.05441#S3.F3) summarize the pipeline: GO-LR linearizes feature graphs, NSC segments and compresses contiguous ordered neighborhoods into meta-features, and the resulting tokens are passed to a frozen TabPFN-2.5 head within GOTabPFN.

### 3.1 Feature Ordering as a Combinatorial Optimization Problem

Problem Setup: Feature Ordering by Graph Dispersion. In this section, we show that GO-LR-based feature ordering corresponds to the Minimum Linear Arrangement (MinLA) problem, is NP-hard, and strictly generalizes TSP-path. Here, TSP-path refers to the Traveling Salesman (TSP) path problem: given a complete weighted graph, find a Hamiltonian path $\sigma$ that minimizes $\mathrm{PathCost}(\sigma) = \sum_{t=1}^{m-1} d_{\sigma_t, \sigma_{t+1}}$. We further show that the practical GO-LR algorithm provides a TSP-path-style initialization, which is then locally refined under the dispersion objective. We connect GO-LR-based feature ordering to classical combinatorial optimization, including linear arrangement and seriation problems (Díaz et al., [2002](https://arxiv.org/html/2606.05441#bib.bib11); Seminaroti, [2016](https://arxiv.org/html/2606.05441#bib.bib13); Fogel et al., [2013](https://arxiv.org/html/2606.05441#bib.bib14)). In short, the GO-LR ordering problem is an instance of MinLA (and hence NP-hard), admits a TSP-path heuristic implementation, and strictly generalizes TSP-path via an exact embedding.

###### Theorem 3.1 (Theoretical Characterization of GO-LR).

GO-LR-based feature ordering corresponds to a weighted MinLA problem, is NP-hard in the number of features, and strictly generalizes the TSP-path problem.

###### Proof sketch.

The theorem follows from Lemma [3.8](https://arxiv.org/html/2606.05441#S3.Thmtheorem8), Lemma [3.9](https://arxiv.org/html/2606.05441#S3.Thmtheorem9), and Theorem [3.12](https://arxiv.org/html/2606.05441#S3.Thmtheorem12), as described below. ∎

Moreover, the practical GO-LR algorithm uses a nearest-neighbor TSP-path heuristic for initialization and then applies a local refinement step (direction selection and adjacent swaps) that monotonically decreases the MinLA dispersion objective. The remainder of this section establishes this characterization through a sequence of equivalence and reduction results.

###### Definition 3.2 (Local Feature Graph).

Given cluster $c$ with samples $X^{(c)} \in \mathbb{R}^{n_c \times m}$, we define a weighted feature graph $G_c = (V, E, w)$, where $V = \{1, \dots, m\}$ indexes features and $w_{ij} \geq 0$ quantifies dissimilarity between features $i$ and $j$ computed from $X^{(c)}$ (e.g., $1 - |\mathrm{corr}|$, see App. [T](https://arxiv.org/html/2606.05441#A20); Jensen-Shannon (JS) or Kullback-Leibler (KL) divergence; or cosine, Euclidean, or Manhattan distance). We write $(i, j) \in E$ whenever a pair is included (typically $E = V \times V \setminus \{(i, i)\}$ for a complete graph, or a sparse neighborhood graph).
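As a rough illustration of Definition 3.2, the sketch below builds the weight matrix of a complete feature graph using the $1 - |\mathrm{corr}|$ option. The function name `feature_dissimilarity`, the `eps` guard, and the treatment of constant columns are assumptions made here for illustration; the paper's own implementation may differ, and features are 0-indexed in the code.

```python
import numpy as np

def feature_dissimilarity(Xc, eps=1e-12):
    """Return an m x m matrix W with W[i, j] = 1 - |corr(feature i, feature j)|.

    Xc is the n_c x m matrix of one sample cluster. Constant columns get
    zero correlation with every other column (an assumption of this sketch).
    """
    Xc = np.asarray(Xc, dtype=float)
    centered = Xc - Xc.mean(axis=0)
    std = centered.std(axis=0)
    std = np.where(std < eps, 1.0, std)      # avoid division by zero
    Z = centered / std
    corr = (Z.T @ Z) / Xc.shape[0]           # Pearson correlation between columns
    W = 1.0 - np.abs(np.clip(corr, -1.0, 1.0))
    W = 0.5 * (W + W.T)                      # enforce exact symmetry (undirected graph)
    np.fill_diagonal(W, 0.0)                 # no self-edges
    return W

# Toy cluster: 10 samples, 4 features, feature 3 is a rescaled copy of feature 0.
rng = np.random.default_rng(0)
Xc = rng.normal(size=(10, 4))
Xc[:, 3] = 2.0 * Xc[:, 0]
W = feature_dissimilarity(Xc)
print(np.round(W, 3))
```

Perfectly correlated features receive weight close to zero, and the zero diagonal matches the choice $E = V \times V \setminus \{(i, i)\}$.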

###### Definition 3.3 (Dispersion Objective (GO-LR Local Ordering)).

A local ordering is a bijection $\pi : V \to \{0, \dots, m-1\}$ assigning each feature to a position. The cluster-wise dispersion of $\pi$ is given in Eq. [2](https://arxiv.org/html/2606.05441#S3.E2). The GO-LR local ordering problem is to compute the local order with minimum dispersion, as shown in Eq. [3](https://arxiv.org/html/2606.05441#S3.E3).

$$D_{G_c}(\pi) = \sum_{(i,j) \in E} w_{ij}\, |\pi(i) - \pi(j)| \tag{2}$$

$$\pi_c^{\ast} \in \arg\min_{\pi} D_{G_c}(\pi) \tag{3}$$

###### Definition 3.4 (GO-LR Local Refinement Operator).

Let $\pi^{(0)} \leftarrow \mathrm{NNPath}(G_c)$ be the nearest-neighbor initialization (a permutation of $V$). We define $\mathrm{rev}(\pi)$ as the reversed permutation and let $\mathcal{N}(\pi)$ denote the set of permutations obtained by one adjacent transposition (Eq. [4](https://arxiv.org/html/2606.05441#S3.E4)). GO-LR first performs direction selection (Eq. [5](https://arxiv.org/html/2606.05441#S3.E5)) and then applies $P$ passes of adjacent-swap descent (Eq. [6](https://arxiv.org/html/2606.05441#S3.E6)), with early stopping if $\pi^{(p+1)} = \pi^{(p)}$. The refined local ordering is $\pi_c \leftarrow \pi^{(P)}$. In Eq. [4](https://arxiv.org/html/2606.05441#S3.E4), $\mathrm{swap}_t(\pi)$ is the adjacent-transposition operator that returns the permutation obtained by swapping the entries at positions $t$ and $t+1$ in $\pi$.

$$\mathcal{N}(\pi) = \{\mathrm{swap}_t(\pi) : t = 0, \dots, m-2\} \tag{4}$$

$$\pi^{(0)} \leftarrow \arg\min \Big\{ D_{G_c}(\pi^{(0)}),\; D_{G_c}(\mathrm{rev}(\pi^{(0)})) \Big\} \tag{5}$$

$$\pi^{(p+1)} \leftarrow \mathrm{SweepRefine}\!\left(\pi^{(p)}, G_c\right), \qquad p = 0, \dots, P-1 \tag{6}$$

SweepRefine. Initialize $\tilde{\pi} \leftarrow \pi^{(p)}$ and scan $t = 0, \dots, m-2$. Compute the swap gain $\Delta_t := D_{G_c}(\mathrm{swap}_t(\tilde{\pi})) - D_{G_c}(\tilde{\pi})$ via an incremental update (no full recomputation). If $\Delta_t < 0$, set $\tilde{\pi} \leftarrow \mathrm{swap}_t(\tilde{\pi})$ immediately. Return $\pi^{(p+1)} \leftarrow \tilde{\pi}$, and stop early if a full sweep makes no changes.
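The local ordering steps of Definitions 3.3 and 3.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the graph is complete and given as a symmetric matrix `W`, the nearest-neighbor path starts at feature 0, ties go to the lowest index, and the swap gain is obtained by recomputing Eq. 2 in full rather than by the incremental update the paper describes. An ordering is stored as the sequence of features by position (so `order[t]` is the feature at position $t$); all function names are hypothetical.

```python
import numpy as np

def dispersion(order, W):
    """Eq. 2 on a complete graph: sum over pairs i<j of W[i, j] * |pos(i) - pos(j)|."""
    m = len(order)
    pos = np.empty(m, dtype=int)
    pos[order] = np.arange(m)                 # pos[j] = position of feature j
    gap = np.abs(pos[:, None] - pos[None, :])
    return float(np.triu(W * gap, k=1).sum())

def nn_path(W, start=0):
    """Greedy nearest-neighbor path (start node is an assumption of this sketch)."""
    order, unvisited = [start], set(range(W.shape[0])) - {start}
    while unvisited:
        last = order[-1]
        nxt = min(unvisited, key=lambda j: (W[last, j], j))   # ties -> lowest index
        order.append(nxt)
        unvisited.remove(nxt)
    return np.array(order)

def sweep_refine(order, W):
    """One left-to-right pass; accept an adjacent swap as soon as it lowers Eq. 2."""
    order = order.copy()
    best = dispersion(order, W)
    changed = False
    for t in range(len(order) - 1):
        cand = order.copy()
        cand[t], cand[t + 1] = cand[t + 1], cand[t]
        cost = dispersion(cand, W)            # full recomputation, for clarity only
        if cost < best:                       # swap gain Delta_t < 0
            order, best, changed = cand, cost, True
    return order, changed

def go_lr_local(W, passes=3):
    """NNPath initialization, direction selection (Eq. 5), then up to `passes` sweeps (Eq. 6)."""
    order = nn_path(W)
    rev = order[::-1].copy()
    if dispersion(rev, W) < dispersion(order, W):   # forward order is kept on ties
        order = rev
    for _ in range(passes):
        order, changed = sweep_refine(order, W)
        if not changed:                       # early stop: a full sweep made no swap
            break
    return order

# Toy usage on a random 1 - |corr| matrix.
rng = np.random.default_rng(0)
W = 1.0 - np.abs(np.corrcoef(rng.normal(size=(20, 6)), rowvar=False))
np.fill_diagonal(W, 0.0)
order = go_lr_local(W, passes=5)
print(order, dispersion(order, W))
```

Because each accepted swap strictly lowers the objective, the refined order never has larger dispersion than the initialization. An incremental $\Delta_t$ update would replace the full recomputation in `sweep_refine` when $m$ is large.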

Complexity: Equivalence to Minimum Linear Arrangement. The MinLA problem is a classical, well-studied graph layout problem (Shiloach, [1979](https://arxiv.org/html/2606.05441#bib.bib10)).

###### Definition 3.7 (Weighted Minimum Linear Arrangement (MinLA)).

Given a weighted graph $G = (V, E, w)$, the weighted MinLA problem is

$$\min_{\pi : V \to \{0, \dots, |V|-1\}\ \text{bijective}} \; \sum_{(i,j) \in E} w_{ij}\, |\pi(i) - \pi(j)| \tag{7}$$

###### Lemma 3.8 (GO-LR Local Ordering is MinLA).

For each cluster $c$, the GO-LR local ordering objective in Eq. ([3](https://arxiv.org/html/2606.05441#S3.E3)) is exactly the weighted MinLA objective on $G_c$.

###### Proof sketch.

Both problems optimize over bijections $\pi : V \to \{0, \dots, m-1\}$ and share the identical objective $\sum_{(i,j) \in E} w_{ij} |\pi(i) - \pi(j)|$ (Eq. ([2](https://arxiv.org/html/2606.05441#S3.E2)) and Eq. ([7](https://arxiv.org/html/2606.05441#S3.E7))). Hence they are the same optimization problem. ∎

###### Lemma 3.9 (NP-hardness).

The GO-LR local feature ordering problem (Eq. ([3](https://arxiv.org/html/2606.05441#S3.E3))) is NP-hard in $m$.

###### Proof sketch.

Weighted MinLA is NP-hard (Garey et al., [1976](https://arxiv.org/html/2606.05441#bib.bib9)); since GO-LR local ordering is exactly MinLA, it is NP-hard. ∎

An Exact Equivalence Case: TSP-path as a Special Case of Feature Ordering.

###### Definition 3.10 (TSP-path Objective on a Complete Graph).

Given a complete weighted graph $\mathcal{K} = (V, \binom{V}{2}, d)$ with edge weights $d_{ij} \geq 0$, define the path cost of a permutation $\sigma = (\sigma_1, \dots, \sigma_m)$ by

$$\mathrm{PathCost}(\sigma) = \sum_{t=1}^{m-1} d_{\sigma_t, \sigma_{t+1}} \tag{8}$$

First, we establish GO-LR as a TSP-path heuristic that outputs a Hamiltonian path (see Appendix [B](https://arxiv.org/html/2606.05441#A2)). We then show the connection between our feature ordering and TSP-path.

The GO-LR objective in Eq. ([2](https://arxiv.org/html/2606.05441#S3.E2)) is a general seriation / linear arrangement criterion. We now exhibit a non-circular special case in which Feature Ordering becomes exactly the TSP-path problem, implying that Feature Ordering strictly generalizes TSP-path (Carmona et al., [2023](https://arxiv.org/html/2606.05441#bib.bib15)).

###### Definition 3.11 (Path-Edge Feature Ordering (Adjacency-by-Position)).

Fix $m$ and define the path-edge set on positions

$$E_{\text{path}} = \{(t, t+1) : t = 1, \dots, m-1\} \tag{9}$$

Given a complete weighted graph $\mathcal{K} = (V, \binom{V}{2}, d)$ with $|V| = m$, a permutation $\sigma = (\sigma_1, \dots, \sigma_m)$ induces an ordering map $\pi_{\sigma} : V \to \{1, \dots, m\}$ via $\pi_{\sigma}(\sigma_t) = t$. We define the path-edge feature ordering objective:

$$D_{\text{path}}(\pi_{\sigma}) = \sum_{t=1}^{m-1} d_{\sigma_t, \sigma_{t+1}} \tag{10}$$

This is equivalent to restricting Eq. ([2](https://arxiv.org/html/2606.05441#S3.E2)) to adjacency-by-position interactions.

###### Theorem 3.12 (Exact Equivalence to TSP-path).

Minimizing the path-edge feature ordering objective in Eq. ([10](https://arxiv.org/html/2606.05441#S3.E10)) over all permutations $\sigma$ is exactly the TSP-path problem on $\mathcal{K}$ with path cost given by $\mathrm{PathCost}(\sigma)$ in Eq. ([8](https://arxiv.org/html/2606.05441#S3.E8)).

###### Proof sketch.

For any permutation $\sigma$, Eq. ([10](https://arxiv.org/html/2606.05441#S3.E10)) can be rewritten as $\sum_{t=1}^{m-1} d_{\sigma_t, \sigma_{t+1}} = \mathrm{PathCost}(\sigma)$ by definition. Thus, the minimizers coincide. ∎

###### Corollary 3.13 (TSP-path embeds into Feature Ordering).

TSP-path is a special case of feature ordering. Consequently, feature ordering (strictly) generalizes TSP-path.
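One way to read "restricting Eq. (2) to adjacency-by-position interactions" is to keep only those feature pairs whose positions differ by exactly one. The short check below is a sketch under that reading, with hypothetical helper names `path_cost` and `adjacent_only_dispersion` and 0-indexed features; it compares the restricted sum with the path cost of Eq. (8) for every permutation of a small random instance.

```python
import itertools
import numpy as np

def path_cost(sigma, d):
    """Eq. 8: sum of edge weights between consecutive entries of sigma."""
    return float(sum(d[sigma[t], sigma[t + 1]] for t in range(len(sigma) - 1)))

def adjacent_only_dispersion(sigma, d):
    """Eq. 2 restricted to pairs (i, j) placed at neighboring positions."""
    pos = {v: t for t, v in enumerate(sigma)}     # pi_sigma(sigma_t) = t
    m = len(sigma)
    total = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            if abs(pos[i] - pos[j]) == 1:         # |pi(i) - pi(j)| = 1
                total += d[i, j]
    return total

# Random symmetric distances on 5 features.
rng = np.random.default_rng(0)
A = rng.random((5, 5))
d = 0.5 * (A + A.T)
np.fill_diagonal(d, 0.0)

perms = list(itertools.permutations(range(5)))
agree = all(np.isclose(path_cost(s, d), adjacent_only_dispersion(s, d)) for s in perms)
best = min(perms, key=lambda s: path_cost(s, d))
print(agree, best)
```

Since the two objectives agree on every permutation, they share the same minimizers, which is the content of Theorem 3.12 on this toy instance.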

![Refer to caption](https://arxiv.org/html/2606.05441v1/GOTabPFN.png)

Figure 3: End-to-end architecture of GOTabPFN. The feature clustering block denotes the discovery of local feature-dependence groups, implemented by estimating cluster-wise feature graphs $G_c$ from local sample contexts; GO-LR then obtains a global order $\Pi^{\ast}$, and NSC compresses contiguous ordered segments into meta-features $Z(x)$, which are passed to a frozen TabPFN-2.5 head.

Global Aggregation (Mean-Rank Integration). Let $\pi_c$ be a local ordering for cluster $c$ and let $r_c(j)$ be the rank (position) of feature $j$ in $\pi_c$. With cluster weights $\alpha_c \geq 0$ and $\sum_{c=1}^{k} \alpha_c = 1$, GO-LR forms a global order by Eq. [11](https://arxiv.org/html/2606.05441#S3.E11). This aggregation produces a single global permutation consistent with the set of local cluster-wise permutations.

$$\bar{r}(j) = \sum_{c=1}^{k} \alpha_c\, r_c(j), \quad \Pi^{\ast} = \operatorname*{argsort}_{j=1}^{m} \bar{r}(j) \tag{11}$$

Algorithm [1](https://arxiv.org/html/2606.05441#alg1) summarizes the steps of GO-LR.

**Algorithm 1** Graph-guided Ordering with Local Refinement (GO-LR)

**Input:** $X \in \mathbb{R}^{n \times m}$, clusters $k$, metric $\phi$, passes $P$

**Output:** Global order $\Pi^{\ast}$ and local orders $\{\pi_c\}_{c=1}^{k}$

1. $\{X^{(c)}\}_{c=1}^{k} \leftarrow \mathrm{Cluster}(X, k)$; $\mu^{(c)} \leftarrow \mathrm{mean}(X^{(c)})$
2. **for** $c = 1$ **to** $k$ **do**
3. &emsp;$G_c \leftarrow \mathrm{Sym}(\mathrm{FeatureDissimilarity}(X^{(c)}, \phi)) \in \mathbb{R}^{m \times m}$ // undirected
4. &emsp;$\pi_c \leftarrow \mathrm{NNPath}(G_c)$
5. &emsp;$\pi_c \leftarrow \mathrm{Refine}(\pi_c, G_c, P)$ // direction-select + $P$ passes
6. &emsp;$r_c(j) \leftarrow \mathrm{rank}(j\ \mathrm{in}\ \pi_c)$ for $j = 1, \dots, m$
7. **end for**
8. $\tilde{\alpha}_c \leftarrow \Big(\varepsilon + \mathrm{mean}_{c'} \lVert \mu^{(c)} - \mu^{(c')} \rVert_2\Big)^{-1}$; $\alpha_c \leftarrow \tilde{\alpha}_c / \sum_{c'} \tilde{\alpha}_{c'}$
9. $\bar{r}(j) \leftarrow \sum_{c=1}^{k} \alpha_c\, r_c(j)$; $\Pi^{\ast} \leftarrow \operatorname*{argsort}_{j} \bar{r}(j)$
10. **return** Global feature order $\Pi^{\ast}$ and local orders $\{\pi_c\}_{c=1}^{k}$
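Steps 8 and 9 of Algorithm 1 (cluster weighting and mean-rank integration, Eq. 11) can be sketched as below. This is an illustrative reading rather than the authors' implementation: the function names are hypothetical, the mean over $c'$ is taken over all clusters including $c$ itself, ties in $\bar{r}$ are broken by feature index through a stable sort, and features are 0-indexed.

```python
import numpy as np

def cluster_weights(mus, eps=1e-8):
    """Step 8: inverse mean centroid distance, normalized to sum to one."""
    mus = np.asarray(mus, dtype=float)                     # k x m cluster means
    dists = np.linalg.norm(mus[:, None, :] - mus[None, :, :], axis=2)
    raw = 1.0 / (eps + dists.mean(axis=1))                 # mean over all c' (assumption)
    return raw / raw.sum()

def aggregate_orders(local_orders, alphas):
    """Step 9 / Eq. 11: weighted mean rank per feature, then argsort."""
    local_orders = np.asarray(local_orders)                # k x m, row c lists features by position
    k, m = local_orders.shape
    ranks = np.empty((k, m))
    for c in range(k):
        ranks[c, local_orders[c]] = np.arange(m)           # r_c(j) = position of feature j
    r_bar = np.asarray(alphas, dtype=float) @ ranks
    return np.argsort(r_bar, kind="stable"), r_bar

# Two clusters that disagree only on the order of features 0 and 1.
orders = [[0, 1, 2, 3],
          [1, 0, 2, 3]]
global_order, r_bar = aggregate_orders(orders, alphas=[0.25, 0.75])
print(global_order, r_bar)
```

With weights $0.25$ and $0.75$, the mean ranks are $(0.75, 0.25, 2, 3)$, so the more heavily weighted cluster decides that feature 1 precedes feature 0 in the global order.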

**Algorithm 2** Neuro-Inspired Subunit Compression (NSC)

**Input:** Training matrix $X_{\text{train}} \in \mathbb{R}^{n \times m}$, sample $x \in \mathbb{R}^{m}$, global order $\Pi^{*}$, ID threshold $\tau \in (0, 1)$, bypass threshold $m_0$, $M$-rule hyperparameters $(\gamma, M_{\min}, M_{\max})$, segmentation rule $\mathrm{Seg}(\cdot)$ with $\ell_{\min}$, and (if transition-aware) dissimilarities $\Delta \in \mathbb{R}^{m-1}_{+}$.

**Output:** Compressed tokens $Z(x) \in \mathbb{R}^{M}$.

1. Reorder: $X^{\Pi} \leftarrow X_{\text{train}}[:, \Pi^{*}]$, $x^{\Pi} \leftarrow x[\Pi^{*}]$.
2. PCA-ID: compute $G = \frac{1}{n-1} X^{\Pi} (X^{\Pi})^{\top}$; let $\{\lambda_i\}$ be the eigenvalues of $G$ (descending).
3. $\hat{d} \leftarrow \min\{k : \sum_{i=1}^{k} \lambda_i \geq \tau \sum_{i} \lambda_i\}$; $\mathrm{IDF} \leftarrow \hat{d} / m$
4. **if** $m \leq m_0$ **then**
5. &emsp;$M \leftarrow m$
6. **else**
7. &emsp;$M \leftarrow \mathrm{clip}(\lceil 2\hat{d} \rceil,\; M_{\min},\; \min(M_{\max}, m))$ // or $\lceil \gamma \hat{d} \rceil$ / IDF-rule
8. **end if**
9. Segment: $\{\mathcal{S}_t\}_{t=1}^{M} \leftarrow \mathrm{Seg}(m, M, \Delta, \ell_{\min})$ // uniform / equal-mass / largest-jump
10. **for** $t = 1$ **to** $M$ **do**
11. &emsp;Fit (once): center $X^{\Pi}_{[:, \mathcal{S}_t]}$ to get mean $\mu_t$; compute first PC direction $v_t$ (CPU SVD), fix sign deterministically.
12. &emsp;Tokenize: $z_t(x) \leftarrow \left(x^{\Pi}_{\mathcal{S}_t} - \mu_t\right)^{\top} v_t$ // scalar
13. **end for**
14. **return** $Z(x) \leftarrow (z_1(x), \ldots, z_M(x))$
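A compact sketch of Algorithm 2 with uniform segmentation is given below. Several choices are assumptions made for illustration and are not taken from the paper: the data are centered before forming the Gram matrix, the default values of `m0`, `M_min`, and `M_max` are placeholders, empty trailing segments produced by the uniform rule are dropped, and the "deterministic sign" is fixed by making the largest-magnitude loading positive. Function names are hypothetical and features are 0-indexed.

```python
import math
import numpy as np

def intrinsic_dim(X, tau=0.9):
    """Step 2-3: smallest k whose top-k eigenvalues reach a tau fraction of the spectrum."""
    Xc = X - X.mean(axis=0)                                 # centering is an assumption
    G = (Xc @ Xc.T) / (X.shape[0] - 1)                      # n x n Gram matrix (cheap when n << m)
    lam = np.clip(np.linalg.eigvalsh(G)[::-1], 0.0, None)   # eigenvalues, descending
    cum = np.cumsum(lam)
    return int(np.searchsorted(cum, tau * cum[-1]) + 1)

def choose_M(m, d_hat, m0=500, gamma=2.0, M_min=8, M_max=2000):
    """Steps 4-8: bypass for small m, otherwise clip ceil(gamma * d_hat)."""
    if m <= m0:
        return m
    return int(min(max(math.ceil(gamma * d_hat), M_min), min(M_max, m)))

def uniform_segments(m, M):
    """Uniform rule: contiguous blocks of length ceil(m / M); empty blocks are dropped."""
    s = math.ceil(m / M)
    segs = [np.arange(t * s, min((t + 1) * s, m)) for t in range(M)]
    return [seg for seg in segs if seg.size > 0]

def fit_nsc(X_train, order, tau=0.9, **m_rule):
    """Steps 1-11: reorder, pick M, and fit one (mean, first PC) pair per segment."""
    Xp = X_train[:, order]
    m = Xp.shape[1]
    M = choose_M(m, intrinsic_dim(Xp, tau), **m_rule)
    units = []
    for seg in uniform_segments(m, M):
        block = Xp[:, seg]
        mu = block.mean(axis=0)
        _, _, Vt = np.linalg.svd(block - mu, full_matrices=False)
        v = Vt[0]                                           # first principal direction
        if v[np.argmax(np.abs(v))] < 0:                     # sign convention (assumption)
            v = -v
        units.append((seg, mu, v))
    return units

def tokenize(x, order, units):
    """Step 12-14: project each reordered segment of x onto its fitted direction."""
    xp = np.asarray(x, dtype=float)[order]
    return np.array([(xp[seg] - mu) @ v for seg, mu, v in units])

# Toy HDLSS-like usage: 12 samples, 40 features, a random stand-in for the GO-LR order.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 40))
order = rng.permutation(40)
units = fit_nsc(X, order, tau=0.9, m0=10, M_min=4, M_max=16)
Z = np.vstack([tokenize(x, order, units) for x in X])
print(Z.shape)
```

The fitted `(segment, mean, direction)` triples are computed once on the training matrix and reused for every sample, which is what keeps the token coordinates fixed across training and inference.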

### 3.2 Neuro-Inspired Subunit Compression (NSC)

Motivation. We design a representation interface that allows TabPFN to scale to HDLSS tabular data without retraining or architectural modification. Cortical pyramidal neurons receive on the order of $20{,}000$ to $30{,}000$ synaptic inputs (Poirazi et al., [2003](https://arxiv.org/html/2606.05441#bib.bib16)), yet these inputs are not integrated as a single linear sum (Major et al., [2013](https://arxiv.org/html/2606.05441#bib.bib18)). Instead, inputs are organized into multiple dendritic subunits (Kastellakis et al., [2015](https://arxiv.org/html/2606.05441#bib.bib19)), each acting as a nonlinear integration compartment. Here, correlated synapses may exhibit local clustering (Ujfalussy and Makara, [2020](https://arxiv.org/html/2606.05441#bib.bib24)) and trigger N-methyl-D-aspartate (NMDA)-mediated plateau potentials (Schiller et al., [2000](https://arxiv.org/html/2606.05441#bib.bib17)) that pool dozens to hundreds of inputs into a single subunit-level signal (Kirchner and Gjorgjieva, [2021](https://arxiv.org/html/2606.05441#bib.bib20)). This subunit-based organization provides a canonical biological mechanism (Beniaguev et al., [2021](https://arxiv.org/html/2606.05441#bib.bib44)) for compressing extremely high-dimensional inputs into a compact set of functional representations (Wu et al., [2018](https://arxiv.org/html/2606.05441#bib.bib25)). This locality-driven compression view is also consistent with prior signal-compression work, where edge-aware prediction has been used to exploit local structure in high-dimensional hyperspectral imagery (Jain and Adjeroh, [2007](https://arxiv.org/html/2606.05441#bib.bib171)). We adopt this principle as an algorithmic inductive bias for HDLSS tabular data (Balın et al., [2019](https://arxiv.org/html/2606.05441#bib.bib45)). See Alg. [2](https://arxiv.org/html/2606.05441#alg2) for the steps in NSC.

Ordered-Axis Segmentation. Let $x \in \mathbb{R}^{m}$ denote a tabular sample and let $\Pi^{*}$ be the global feature permutation produced by GO-LR (Section [3](https://arxiv.org/html/2606.05441#S3.F3)). We define the reordered feature vector in Eq. [12](https://arxiv.org/html/2606.05441#S3.E12). Given a target number of meta-features $M$, we set the segment length $s = \lceil m/M \rceil$ (Eq. [13](https://arxiv.org/html/2606.05441#S3.E13)) and define contiguous segments $\{\mathcal{S}_t\}_{t=1}^{M}$ by Eq. [14](https://arxiv.org/html/2606.05441#S3.E14), which partition $\{1, \dots, m\}$ into ordered neighborhoods (subunits).

xΠ=\(xΠ∗​\(1\),xΠ∗​\(2\),…,xΠ∗​\(m\)\)x^\{\\Pi\}=\\big\(x\_\{\\Pi^\{\*\}\(1\)\},\\,x\_\{\\Pi^\{\*\}\(2\)\},\\,\\dots,\\,x\_\{\\Pi^\{\*\}\(m\)\}\\big\)\(12\)s=⌈mM⌉s=\\left\\lceil\\frac\{m\}\{M\}\\right\\rceil\(13\)𝒮t=\{\(t−1\)​s\+1,…,min⁡\(t​s,m\)\},t=1,…,M\\mathcal\{S\}\_\{t\}=\\\{\(t\-1\)s\+1,\\dots,\\min\(ts,m\)\\\},\\qquad t=1,\\dots,M\(14\)Adaptive segmentation\.To let segment boundaries follow “transitions” in the ordered feature axis, we first summarize pairwise dissimilarities into a 1D signal\. We reuse the global feature dissimilarity matrixW¯∈ℝm×m\\bar\{W\}\\in\\mathbb\{R\}^\{m\\times m\}already computed for GO\-LR on the dataset \(and keep it fixed at inference\), e\.g\.,W¯:=FeatureDissimilarity​\(X,ϕ\)\\bar\{W\}:=\\mathrm\{FeatureDissimilarity\}\(X,\\phi\)orW¯:=∑c=1kαc​Wc\\bar\{W\}:=\\sum\_\{c=1\}^\{k\}\\alpha\_\{c\}W\_\{c\}, so NSC itself does not introduce any additionalO​\(m2\)O\(m^\{2\}\)cost\. For adjacent positions along the GO\-LR order, we define the transition dissimilarityδt:=W¯Π∗​\(t\),Π∗​\(t\+1\)\\delta\_\{t\}:=\\bar\{W\}\_\{\\Pi^\{\\ast\}\(t\),\\,\\Pi^\{\\ast\}\(t\+1\)\}fort=1,…,m−1t=1,\\dots,m\-1; largeδt\\delta\_\{t\}indicates a sharp change between neighboring features\. We then form the cumulative transition massct:=∑i=1t−1δic\_\{t\}:=\\sum\_\{i=1\}^\{t\-1\}\\delta\_\{i\}fort=1,…,mt=1,\\dots,mwith totalC:=cm=∑i=1m−1δiC:=c\_\{m\}=\\sum\_\{i=1\}^\{m\-1\}\\delta\_\{i\}\. Given a desired number of segmentsMM, we place cutpoints1≤τ1<⋯<τM−1<m1\\leq\\tau\_\{1\}<\\cdots<\\tau\_\{M\-1\}<malong this 1D signal in two ways: \(i\) a largest\-jump rule that selects the indices of theM−1M\{\-\}1largestδt\\delta\_\{t\}values \(subject to a minimum segment lengthℓmin\\ell\_\{\\min\}\), and \(ii\) an equal\-mass rule that treatsctc\_\{t\}as a discrete CDF and chooses cutpoints by Eq\.[15](https://arxiv.org/html/2606.05441#S3.E15)\.

$$\tau_{\ell}:=\min\Big\{t\in\{2,\dots,m-1\}:c_{t}\geq(\ell/M)\,C\Big\},\qquad \ell=1,\dots,M-1 \tag{15}$$

The equal-mass rule again enforces $\ell_{\min}$, with a uniform fallback if needed. Finally, we materialize segments as $\mathcal{S}_{1}=\{1,\dots,\tau_{1}\}$, $\mathcal{S}_{t}=\{\tau_{t-1}+1,\dots,\tau_{t}\}$ for $t=2,\dots,M-1$, and $\mathcal{S}_{M}=\{\tau_{M-1}+1,\dots,m\}$, so that each subunit is an ordered neighborhood bounded by large transitions in the feature axis.

Subunit Pooling and Meta-Feature Construction.
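Pooling operates on the segments defined above. As a reference point, the uniform rule (Eqs. 13-14) and the equal-mass rule (Eq. 15) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the minimum-length constraint $\ell_{\min}$ and the largest-jump rule are omitted, trailing uniform segments that would be empty (possible when $(M-1)s\geq m$) are dropped by assumption, and the fallback to $m-1$ when no index satisfies Eq. 15 is likewise an assumption.

```python
import math


def uniform_segments(m, M):
    """Eqs. 13-14: contiguous 1-based (lo, hi) ranges of length s = ceil(m / M)."""
    s = math.ceil(m / M)
    segments = []
    for t in range(1, M + 1):
        lo, hi = (t - 1) * s + 1, min(t * s, m)
        if lo <= hi:  # assumption: drop trailing segments that would be empty
            segments.append((lo, hi))
    return segments


def equal_mass_cutpoints(delta, M):
    """Eq. 15: first t in {2..m-1} whose cumulative mass c_t reaches (l / M) * C.

    delta[i] stores delta_{i+1}, the dissimilarity between ordered positions
    i+1 and i+2, so len(delta) == m - 1. The l_min constraint is not enforced.
    """
    m = len(delta) + 1
    c = [0.0, 0.0]  # c[0] is padding; c[1] = 0 is the empty sum
    for d in delta:
        c.append(c[-1] + d)  # c[t] = delta_1 + ... + delta_{t-1}
    C = c[m]
    cuts = []
    for l in range(1, M):
        target = (l / M) * C
        # assumption: fall back to m - 1 if no admissible t reaches the target
        cuts.append(next((t for t in range(2, m) if c[t] >= target), m - 1))
    return cuts


def segments_from_cutpoints(cuts, m):
    """Materialize S_1..S_M from cutpoints tau_1 < ... < tau_{M-1}."""
    bounds = [0] + list(cuts) + [m]
    return [(bounds[i] + 1, bounds[i + 1]) for i in range(len(bounds) - 1)]
```

For example, `segments_from_cutpoints(equal_mass_cutpoints(delta, M), len(delta) + 1)` turns a transition signal into 1-based inclusive index ranges over the GO-LR order.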

Segment descriptors. Beyond mean and variance, we optionally summarize each segment $u_{t}$ using a richer descriptor $\psi(u_{t})$ that includes higher-order and robust statistics (e.g., skewness, kurtosis, median, and interquartile range), enabling NSC to capture distributional shape within each ordered region at negligible extra cost.

Let $\psi:\mathbb{R}^{|\mathcal{S}_{t}|}\rightarrow\mathbb{R}^{q}$ denote a (possibly learn-free) segment descriptor of dimension $q$, and let $g_{\theta}:\mathbb{R}^{q}\rightarrow\mathbb{R}^{d}$ be a shared lightweight pooling network that maps each descriptor to a $d$-dimensional meta-feature. In practice, $g_{\theta}$ may be a shallow MLP (or linear map) applied to $\psi(u_{t})$; the same $g_{\theta}$ is reused across segments to enforce parameter sharing and stability. The $t$-th meta-feature is defined in Eq. [16](https://arxiv.org/html/2606.05441#S3.E16). NSC outputs the compressed meta-feature sequence $Z(x)=(z_{1},\dots,z_{M})$, which is subsequently provided to the TabPFN predictor head.
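To make the composition $z_{t}=g_{\theta}(\psi(u_{t}))$ concrete, the sketch below uses a four-statistic descriptor (mean, variance, median, interquartile range) and a linear $g_{\theta}$. The specific statistics, the use of population variance, the "inclusive" quantile method, and the weight values are all assumptions made for illustration; the paper's learned pooling network is not reproduced here.

```python
import statistics


def psi(u):
    """Learning-free descriptor of one segment: [mean, variance, median, IQR].

    Assumes len(u) >= 2, which statistics.quantiles requires on older Pythons.
    """
    q1, _, q3 = statistics.quantiles(u, n=4, method="inclusive")
    return [statistics.fmean(u), statistics.pvariance(u), statistics.median(u), q3 - q1]


def g_theta(desc, weights, bias):
    """Shared linear map R^q -> R^d; weights and bias are hypothetical values."""
    return [sum(w * x for w, x in zip(row, desc)) + b for row, b in zip(weights, bias)]


def nsc_tokens(x_ordered, segments, weights, bias):
    """Eq. 16 for every 1-based inclusive segment (lo, hi) of the reordered sample."""
    return [g_theta(psi(x_ordered[lo - 1:hi]), weights, bias) for lo, hi in segments]
```

Because the same `weights` and `bias` are applied to every segment, the number of pooling parameters is independent of both $m$ and $M$.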

$$z_{t}=g_{\theta}\big(\psi(u_{t})\big),\qquad u_{t}:=x^{\Pi}_{\mathcal{S}_{t}} \tag{16}$$

$$Z(x)=(z_{1},\dots,z_{M})\in\mathbb{R}^{M\times d} \tag{17}$$

NSC variants. We instantiate NSC in four variants (details in Appendix [D](https://arxiv.org/html/2606.05441#A4)): (i) NSC: uniform segments + learned pooling, (ii) NSC-P: the same with a PCA-based intrinsic-dimension rule for $M$, (iii) NSC-SP: PCA-based segment (SegPCA) pooling with a fixed $M$, and (iv) NSC-pSP: a PCA-based intrinsic-dimension rule for $M$ combined with SegPCA pooling. GOTabPFN uses NSC-pSP in the experiments.

Choosing the Number of Meta-Features (NSC-pSP). To adapt the compression level to dataset complexity, we tie the meta-feature budget $M$ to an estimate of the intrinsic dimensionality $\hat{d}$ of the training data. Let $\tilde{X}\in\mathbb{R}^{n\times m}$ denote the standardized training matrix (zero mean, unit variance per feature), and let $\Sigma=\frac{1}{n-1}\tilde{X}^{\top}\tilde{X}$ be its empirical covariance (or correlation) matrix with nonzero eigenvalues $\{\lambda_{i}\}_{i=1}^{r}$, $r\leq\min(n,m)$.

For the NSC-pSP variant used in our main experiments, we estimate $\hat{d}$ via a PCA cumulative-variance rule (Hotelling, [1933](https://arxiv.org/html/2606.05441#bib.bib46)). We define the explained-variance ratio and its cumulative sum in Eqs. [18](https://arxiv.org/html/2606.05441#S3.E18) and [19](https://arxiv.org/html/2606.05441#S3.E19). Given a target variance-retention level $\tau\in(0,1)$ (e.g., $\tau\in\{0.90,0.95,0.99,0.9975\}$), the PCA-based intrinsic dimension is defined by Eq. [20](https://arxiv.org/html/2606.05441#S3.E20) and NSC-pSP sets $\hat{d}=\hat{d}_{\mathrm{PCA}}(\tau)$. We then choose the meta-feature budget via Eq. [21](https://arxiv.org/html/2606.05441#S3.E21), where $\mathrm{clip}(x,a,b)=\min(\max(x,a),b)$. For non-HDLSS regimes (e.g., $m\leq 400$), we bypass compression by setting $M=m$. This rule ensures that the number of meta-features scales with intrinsic, rather than ambient, dimensionality, yielding aggressive compression in highly redundant HDLSS settings while avoiding unnecessary bottlenecks when features are already compact. Implementation details and alternative intrinsic-dimension rules used by the other NSC variants are given in Appendix [C](https://arxiv.org/html/2606.05441#A3).
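The budget rule of Eqs. 18-21 reduces to a cumulative sum over eigenvalues followed by a clip. The sketch below takes the eigenvalues as given (computing them is outside its scope). The function name, the default $\tau=0.95$, and the use of 400 as the bypass threshold (taken from the "e.g." in the text) are illustrative assumptions.

```python
import math


def choose_meta_feature_budget(eigenvalues, m, tau=0.95, bypass_threshold=400):
    """Eqs. 18-21: PCA cumulative-variance rule, then M = clip(ceil(2 d), 32, min(512, m))."""
    if m <= bypass_threshold:  # non-HDLSS regime: no compression, M = m
        return m
    lam = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(lam)
    cum, d_hat = 0.0, len(lam)  # default covers round-off when tau is close to 1
    for k, v in enumerate(lam, start=1):
        cum += v / total  # CUM(k) = EVR_1 + ... + EVR_k
        if cum >= tau:
            d_hat = k  # smallest k with CUM(k) >= tau (Eq. 20)
            break
    lower, upper = 32, min(512, m)
    return min(max(math.ceil(2 * d_hat), lower), upper)
```

Since $r\leq\min(n,m)$, in HDLSS data $\hat{d}$ is bounded by the sample count, so the lower clip at 32 is often the active constraint for very small $n$.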

$$\mathrm{EVR}_{i}=\frac{\lambda_{i}}{\sum_{j=1}^{r}\lambda_{j}},\qquad i=1,\dots,r \tag{18}$$

$$\mathrm{CUM}(k)=\sum_{i=1}^{k}\mathrm{EVR}_{i},\qquad k=1,\dots,r \tag{19}$$

$$\hat{d}_{\mathrm{PCA}}(\tau)=\min\Big\{k\in\{1,\dots,r\}:\ \mathrm{CUM}(k)\geq\tau\Big\} \tag{20}$$

$$M=\mathrm{clip}\big(\lceil 2\hat{d}\rceil,\;32,\;\min(512,m)\big) \tag{21}$$

PCA-centric post-segmentation pooling (SegPCA). For the PCA-centric NSC variants (NSC-SP and NSC-pSP), once the meta-feature budget $M$ is determined (Sec. [17](https://arxiv.org/html/2606.05441#S3.E17)) and the ordered segments $\{\mathcal{S}_{t}\}_{t=1}^{M}$ are formed (Eqs. [13](https://arxiv.org/html/2606.05441#S3.E13)-[14](https://arxiv.org/html/2606.05441#S3.E14)), we construct one scalar token per segment by projecting onto a segment-specific first principal direction learned on the training set. Let $u_{t}(x):=x^{\Pi}_{\mathcal{S}_{t}}\in\mathbb{R}^{|\mathcal{S}_{t}|}$ and let $X^{\Pi}\in\mathbb{R}^{n\times m}$ denote the standardized training matrix after applying $\Pi^{\ast}$. We define the training submatrix for segment $t$ as $X_{t}:=X^{\Pi}_{:,\mathcal{S}_{t}}\in\mathbb{R}^{n\times|\mathcal{S}_{t}|}$. We compute the segment mean and covariance by Eq. [22](https://arxiv.org/html/2606.05441#S3.E22) and take the first principal direction using Eq. [23](https://arxiv.org/html/2606.05441#S3.E23). The $t$-th meta-feature is then the centered projection (Eq. [24](https://arxiv.org/html/2606.05441#S3.E24)), yielding a $d=1$ token sequence $Z_{\mathrm{SegPCA}}(x)$ (Eq. [25](https://arxiv.org/html/2606.05441#S3.E25)).
Optionally, we apply a deterministic sign convention to $v_{t}$ (e.g., flipping $v_{t}$ so that segment scores positively correlate with a fixed reference such as the within-segment sample mean), which leaves the subspace unchanged but improves reproducibility.
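The SegPCA token of Eqs. 22-24 for a single segment can be sketched as follows. This is an illustrative stand-in, not the paper's code: it finds the first principal direction by power iteration (a swapped-in technique chosen to stay dependency-free; an SVD or eigensolver would be the usual choice, and power iteration can stall if the start vector is orthogonal to the leading direction), and it resolves the sign ambiguity against the per-sample mean of the centered segment, which is one reading of the optional convention above. All function names are hypothetical.

```python
import math


def seg_pca_fit(X_t, iters=200):
    """Return (mu_t, v_t) for one segment; X_t is a list of n >= 2 rows of length p."""
    n, p = len(X_t), len(X_t[0])
    mu = [sum(row[j] for row in X_t) / n for j in range(p)]  # Eq. 22, mean
    centered = [[row[j] - mu[j] for j in range(p)] for row in X_t]
    # Power iteration on Sigma_t = centered^T centered / (n - 1), applied implicitly.
    norm0 = math.sqrt(sum((j + 1) ** 2 for j in range(p)))
    v = [(j + 1) / norm0 for j in range(p)]  # deterministic start vector
    for _ in range(iters):
        scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) / (n - 1) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero-variance segment: keep the current direction
            break
        v = [x / norm for x in w]
    # Sign convention (assumption): scores covary positively with per-sample means.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    row_means = [sum(r) / p for r in centered]
    if sum(s * a for s, a in zip(scores, row_means)) < 0:
        v = [-x for x in v]
    return mu, v


def seg_pca_token(u_t, mu, v):
    """Eq. 24: centered projection of one sample's segment onto v_t."""
    return sum((a - b) * c for a, b, c in zip(u_t, mu, v))
```

At inference, `mu` and `v` are the training-set quantities; only `seg_pca_token` is evaluated per sample, which is where the $\mathcal{O}(nm)$ tokenization cost quoted later comes from.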

$$\begin{aligned}\mu_{t}&=\frac{1}{n}\sum_{i=1}^{n}X_{t,i:}\in\mathbb{R}^{|\mathcal{S}_{t}|},\\ \Sigma_{t}&=\frac{1}{n-1}\big(X_{t}-\mathbf{1}\mu_{t}^{\top}\big)^{\top}\big(X_{t}-\mathbf{1}\mu_{t}^{\top}\big)\end{aligned} \tag{22}$$

$$v_{t}=\arg\max_{\|v\|_{2}=1}v^{\top}\Sigma_{t}v\;\in\;\mathbb{R}^{|\mathcal{S}_{t}|} \tag{23}$$

$$z_{t}(x)=\big(u_{t}(x)-\mu_{t}\big)^{\top}v_{t}\;\in\;\mathbb{R},\qquad t=1,\dots,M \tag{24}$$

$$Z_{\mathrm{SegPCA}}(x)=(z_{1}(x),\dots,z_{M}(x))\in\mathbb{R}^{M\times 1} \tag{25}$$

Summary. NSC acts as a shared piecewise pooling operator, defined by Prop. [C.1](https://arxiv.org/html/2606.05441#A3.Thmtheorem1) (App. [C](https://arxiv.org/html/2606.05441#A3)). NSC transforms GO-LR-ordered high-dimensional tabular inputs into a compact sequence of structured meta-features through contiguous segmentation and shared pooling, introducing an HDLSS-friendly inductive bias inspired by subunit-based cortical computation while remaining purely statistical and computationally efficient. By compressing $m$ raw features into $M\ll m$ meta-features, NSC reduces the effective sequence length presented to TabPFN-style backbones (e.g., TabPFN-2.5 (Grinsztajn et al., [2025](https://arxiv.org/html/2606.05441#bib.bib115)) or other variants), yielding lower compute and memory cost while preserving order-induced locality, which makes original TabPFN versions usable in the HDLSS regime.
Per sample, NSC is $O(m)$ when $\psi$ uses linear-time statistics (e.g., moments); robust summaries such as quantiles can be computed approximately in linear time, or exactly with a mild $O(|\mathcal{S}_{t}|\log|\mathcal{S}_{t}|)$ overhead if sorting is used.

TabPFN-2.5 Head (non-differentiable). The NSC module compresses each sample into a fixed-dimensional representation $Z(x)\in\mathbb{R}^{M}$ (or $Z(x)\in\mathbb{R}^{M\times d}$, flattened to $\mathbb{R}^{Md}$). We then use TabPFN-2.5 as the predictor head: for each train/validation split, we fit TabPFN-2.5 on $\{(Z(x_{i}),y_{i})\}_{i\in\mathcal{I}_{\text{train}}}$ and evaluate on $Z(x_{j})$ for $j\in\mathcal{I}_{\text{val}}$ without backpropagation through the head. This design treats NSC as a compression interface that maps HDLSS inputs into a feature budget compatible with TabPFN variants, while retaining the strong tabular foundation models of Hollmann et al. ([2023](https://arxiv.org/html/2606.05441#bib.bib114), [2025](https://arxiv.org/html/2606.05441#bib.bib113)); Grinsztajn et al. ([2025](https://arxiv.org/html/2606.05441#bib.bib115)).

Table 1: Top-10 performance on 8 HDLSS datasets (mean accuracy with subscripted standard deviation over $5\times 5$ CV). Bold denotes the best result per dataset and underline denotes the second-best. Rank is the average rank across datasets (lower is better), computed with standard tie-breaking. Dataset abbreviations: COL = Colon, LNG = Lung, GLI = GLI-85, SMK = SMK_CAN_187, AML = ALLAML, PRS = Prostate-GE, ARC = Arcene, TOX = TOX-171. Model abbreviations: \*GOTabPFN = our method, TabPFN-W = TabPFN Wide, TTables = TuneTables, BETA = TabPFN Unleashed. See Table [G.1](https://arxiv.org/html/2606.05441#A7.T1) in Appendix [G](https://arxiv.org/html/2606.05441#A7) for full results against 55 baselines.

Table 2: Performance on 8 cross-domain datasets (mean accuracy with subscripted standard deviation over $5\times 5$ CV). Bold denotes the best result per dataset and underline denotes the second-best. Rank is the average rank across datasets (lower is better), computed with standard tie-breaking. Some datasets use only 50-60 Optuna trials due to compute limits; "-" denotes OOM/unsupported runs, ranked last. ProtoGate targets very small sample sizes and scales less favorably with larger $n$. Dataset abbreviations: ORL = orlraws10P, BAS = BASEHOCK, REL = RELATHE, PCM = PCMAC, CCY = Cell Cycle, CIF = CIFAR-10, DF-R = DrivFace-Regression, DF-C = DrivFace-Classification, REG = Regression ($R^{2}$). Model abbreviations: \*GOTabPFN = our method, TabPFN-W = TabPFN Wide, TTables = TuneTables.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/golr_nsc_ablation_accuracies.png)

Figure 4: HDLSS ablation accuracy. GOTabPFN vs. tabular foundation models on 8 HDLSS datasets.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/golr_nsc_ablation_gains.png)

Figure 5: Ablation gains. Absolute and relative gains of GOTabPFN over the best original foundation-model head.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_perf_resource.png)

Figure 6: Accuracy-resource profile on Colon. Wall-clock time, peak GPU memory, and CPU RSS for GOTabPFN.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Dolan-More.png)

Figure 7: Dolan-Moré profiles. Performance profiles over 8 HDLSS datasets against the top-10 baselines.

Table 3: Colon ablation. Accuracy over $5\times 5$ CV. All NSC variants use NSC-pSP unless noted.

## 4 Experimental Results

Algorithms and datasets used. We use eight biomedical HDLSS datasets (e.g., Arcene, Colon, GLI-85, Lung) from the repository of Li et al. ([2018](https://arxiv.org/html/2606.05441#bib.bib93)), also used by ProtoGate (Jiang et al., [2024](https://arxiv.org/html/2606.05441#bib.bib98)). We follow the standard HDLSS setting where features far exceed samples ($m\gg n$). To study when feature ordering helps, we propose an empirical locality-based criterion; see Appendix [F](https://arxiv.org/html/2606.05441#A6) for the criterion and usage guidance. We compare against 55 baselines spanning classical ML/GBDT, HDLSS feature selection, deep tabular models, and small tabular foundation models. Modern methods including TANDEM (Naor and Lindenbaum, [2025](https://arxiv.org/html/2606.05441#bib.bib111)), TabPFN Wide (Kolberg et al., [2025](https://arxiv.org/html/2606.05441#bib.bib117)), TabDPT (Ma et al., [2025](https://arxiv.org/html/2606.05441#bib.bib97)), TabICL (Jingang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib107)), BETA (Liu and Ye, [2025](https://arxiv.org/html/2606.05441#bib.bib116)), TuneTables (Feuer et al., [2024](https://arxiv.org/html/2606.05441#bib.bib112)), and ProtoGate perform strongly on HDLSS classification, while classical baselines (MLP, Lasso) remain competitive, consistent with ProtoGate. Full baseline details are in Appendix [G](https://arxiv.org/html/2606.05441#A7).

Experimental setup. We use $5\times 5$ nested cross-validation (25 repeats) on the HDLSS datasets, matching ProtoGate's protocol, for all baselines. Experiments run on the TITAN cluster (x86_64 CPU, 188 GB RAM, TITAN RTX 24 GB) with PyTorch 2.4.1+cu121. We disable AMP for GOTabPFN, but some transformer baselines require AMP and/or DP/DDP to fit GPU memory.
We tune only GO-LR and NSC in GOTabPFN via Optuna (Akiba et al., [2019](https://arxiv.org/html/2606.05441#bib.bib43)) (150 trials/dataset), following standard tabular tuning practice (Gorishniy et al., [2025](https://arxiv.org/html/2606.05441#bib.bib104), [2024](https://arxiv.org/html/2606.05441#bib.bib108), [2022](https://arxiv.org/html/2606.05441#bib.bib90), [2021](https://arxiv.org/html/2606.05441#bib.bib79)); the TabPFN-2.5 head remains frozen. For baselines, we use the authors' recommended settings when tuning is unnecessary; when tuning is recommended, we also run Optuna (150 trials) for fairness. Among TabPFN-based models, TuneTables likewise uses lightweight Optuna tuning. See Fig. [A.1](https://arxiv.org/html/2606.05441#A1.F1).

HDLSS classification performance. Table [1](https://arxiv.org/html/2606.05441#S3.T1) reports mean accuracy ($\pm$ std) over $5\times 5$ repeated CV on eight HDLSS benchmarks. GOTabPFN is best on all datasets (8/8) with the lowest average rank ($1.00\pm 0.00$), indicating consistent dominance. Relative to the strongest TabPFN variants (TabPFN-Wide, TuneTables, BETA), gains are largest on harder/noisier datasets (e.g., SMK, TOX) and smaller on near-saturated ones (e.g., ALLAML, Prostate), where headroom is limited. GOTabPFN also shows comparable or lower split-to-split variance, suggesting improved robustness in the low-sample regime. Full results against 55 baselines are in Table [G.1](https://arxiv.org/html/2606.05441#A7.T1) (Appendix [G](https://arxiv.org/html/2606.05441#A7)).

Across the 8 cross-domain (App. [T](https://arxiv.org/html/2606.05441#A20)) high-dimensional datasets (Table [2](https://arxiv.org/html/2606.05441#S3.T2)), GOTabPFN achieves the best average rank ($1.25\pm 0.66$) and obtains the top result on 7/8 tasks, including image-derived, biological, text-like, and camera-sensor datasets.
The strongest competing methods are TabPFN-W, TANDEM, TabDPT, and MLP, but their gains are less consistent across domains. These results suggest that GO-LR+NSC provides a robust ordering-aware compression interface for diverse HDLSS and related high-dimensional regimes.

Statistical significance. Across 8 HDLSS datasets, GOTabPFN achieves the best average rank and is separated from competing baselines by Friedman (Friedman, [1937](https://arxiv.org/html/2606.05441#bib.bib96)) / Nemenyi (Nemenyi, [1963](https://arxiv.org/html/2606.05441#bib.bib92)) critical-difference analysis (Fig. [I.1](https://arxiv.org/html/2606.05441#A9.F1)). While paired Wilcoxon signed-rank tests (Demšar, [2006](https://arxiv.org/html/2606.05441#bib.bib95)) yield consistent directional improvements (all $p_{\text{raw}}=0.00781$), significance does not always survive Holm correction due to the small number of datasets and the resulting conservativeness of multiple-comparison control (Table [I.1](https://arxiv.org/html/2606.05441#A9.T1)). Additional statistical details are in Appendix [I](https://arxiv.org/html/2606.05441#A9).

Runtime and computational complexity. For $n$ samples, $m$ features, $k$ sample-clusters, and $M$ NSC tokens, GO-LR costs $\mathcal{O}(nmkI+m^{2}n)$, plus $\mathcal{O}(km^{2}b)$ for KL graph construction; refinement/integration adds $\mathcal{O}(kPm^{2}+k^{2}m)$, where $P$ is the number of Sweep Refine passes. In HDLSS ($n\ll m$), NSC fits per-segment PC1 directions in $\mathcal{O}(n^{2}m)$ and tokenizes in $\mathcal{O}(nm)$. The TabPFN-2.5 head on $M$ tokens is dominated by attention over $n$ context points, scaling as $\tilde{\mathcal{O}}(n^{2}M)$ per split.
On Colon, GOTabPFN achieves $88.2\%$ accuracy in $31.4$ s with modest peak GPU use ($115.6$ MB; Fig. [6](https://arxiv.org/html/2606.05441#S3.F6)); memory is primarily CPU-side (RSS $\approx 2202.5$ MB), consistent with GO-LR graph construction/refinement.

Ablations. Figs. [4](https://arxiv.org/html/2606.05441#S3.F4)-[5](https://arxiv.org/html/2606.05441#S3.F5) and Table [4](https://arxiv.org/html/2606.05441#S4.T4) quantify the effect of adding GO-LR ordering and NSC compression before a frozen TabPFN-2.5 head: GOTabPFN matches or exceeds the best original tabular foundation model on all 8 HDLSS datasets, with the largest gains on GLI-85 (+4.16 pp), Arcene (+2.60 pp), and SMK (+2.24 pp), and near-saturated improvements on ALLAML/Prostate. On Colon (Table [3](https://arxiv.org/html/2606.05441#S3.T3)), removing NSC lowers accuracy, and replacing GO-LR with identity/random orders lowers it further, indicating that NSC benefits from structure-revealing orderings; transition-aware segmentation and PCA-based token embeddings outperform uniform segmentation or mean-pooling, while swapping TabPFN-2.5 for logistic regression substantially degrades performance, suggesting that both the ordering+compression pipeline and a strong TabPFN-style predictor are needed. The Dolan-Moré profile (Dolan and Moré, [2002](https://arxiv.org/html/2606.05441#bib.bib94)) (Fig. [7](https://arxiv.org/html/2606.05441#S3.F7), following TANDEM (Naor and Lindenbaum, [2025](https://arxiv.org/html/2606.05441#bib.bib111))) shows stronger cross-dataset consistency: GOTabPFN stays closest to the per-dataset best (curve at 1.0), whereas others need a larger tolerance $\theta$. Additional ablations appear in App. [J](https://arxiv.org/html/2606.05441#A10).

Limitations. GOTabPFN inherits constraints from its frozen TabPFN-2.5 backbone, including its limit of up to 10 classes and sample-size limit of 50K samples.
GO\-LR\+NSC adds ordering and compression before TabPFN inference; runtime can increase for larger sample sizes\. Thus, GOTabPFN is most suitable for HDLSS and related low\-sample, high\-dimensional regimes, rather than high\-sample regimes\.
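A short arithmetic note on the significance paragraph of this section: the reported $p_{\text{raw}}=0.00781$ coincides with $2/2^{8}=0.0078125$, the smallest two-sided p-value an exact Wilcoxon signed-rank test can return on 8 datasets (all paired differences of one sign, no ties or zeros). The sketch below enumerates the null distribution to confirm this and shows how many equal p-values of that size Holm's procedure can still reject at $\alpha=0.05$. Whether the paper used the exact test is an assumption here, and the number of simultaneous comparisons passed to `holm_rejects_any` is hypothetical.

```python
from itertools import product


def exact_wilcoxon_two_sided(n, w_plus):
    """Exact two-sided p-value for W+ with n untied, nonzero paired differences."""
    ranks = range(1, n + 1)
    total = n * (n + 1) // 2
    extreme = min(w_plus, total - w_plus)
    hits = 0
    # Under H0, every rank independently carries a + or - sign with probability 1/2.
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if w <= extreme or w >= total - extreme:
            hits += 1
    return hits / 2 ** n


def holm_rejects_any(p_values, alpha=0.05):
    """Holm's step-down procedure rejects at least one hypothesis iff min(p) <= alpha / k."""
    return min(p_values) <= alpha / len(p_values)


p_min = exact_wilcoxon_two_sided(8, 36)  # W+ = 36 means the method wins on all 8 datasets
```

Under these assumptions, with 8 datasets even a clean 8/8 sweep cannot pass Holm correction once seven or more baselines are compared at once, since $0.0078125>0.05/7$.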

Table 4: GOTabPFN gains over foundation-model heads. Accuracy on 8 HDLSS datasets under $5\times 5$ CV. "Best orig" is the best among TabDPT, TabPFN-Wide, BETA, TuneTables, and TabICL.

## 5 Conclusion

We present GOTabPFN, which makes TabPFN-style small tabular foundation models effective in HDLSS regimes. GOTabPFN couples MinLA-grounded feature ordering (GO-LR) with a neuro-inspired, stable, locality-preserving compression interface (NSC) that converts high-dimensional tables into compact token sequences. Without retraining or modifying the TabPFN-2.5 backbone, this ordering-to-tokenization pipeline improves accuracy and robustness under tight token budgets across diverse HDLSS benchmarks. GOTabPFN provides a theory-grounded, practical route to scalable in-context tabular prediction when $m\gg n$.

## Acknowledgements

This work was supported in part by the US National Science Foundation under Awards \#1920920, \#2125872, and \#2223793\. We thank the anonymous ICML reviewers for their valuable feedback and suggestions\.

## Impact Statement

GOTabPFN aims to make tabular foundation models more usable in HDLSS settings by introducing an ordering\-aware compression interface that reduces feature dimensionality while preserving local structure\. This can benefit scientific and biomedical domains where data are scarce but feature spaces are large\. However, as with any predictive model, deployment in sensitive domains should include careful validation, bias assessment, and domain\-expert oversight\.

## Software and Data

The project webpage is available at [https://www.zadidhabib.com/gotabpfn.html](https://www.zadidhabib.com/gotabpfn.html). Code, notebooks, and installation instructions are available at [https://github.com/zadid6pretam/GOTabPFN](https://github.com/zadid6pretam/GOTabPFN); the package can also be installed with `pip install gotabpfn`. Our experiments use TabPFN-2.5 as the frozen backbone with `tabpfn==6.3.1`; newer default `tabpfn` installations may install TabPFN-3 or later, so reproducing our results requires `pip install tabpfn==6.3.1` and may require Prior Labs/Hugging Face checkpoint access. Most HDLSS datasets are from the scikit-feature dataset repository (Li et al., [2018](https://arxiv.org/html/2606.05441#bib.bib93)) ([https://jundongl.github.io/scikit-feature/datasets.html](https://jundongl.github.io/scikit-feature/datasets.html)); DrivFace is from UCI (Hernández-Sabat et al., [2016](https://arxiv.org/html/2606.05441#bib.bib160)); CIFAR-10 embeddings are derived from the Kaggle CIFAR-10 dataset (Cukierski, [2013](https://arxiv.org/html/2606.05441#bib.bib161)); and Cell Cycle is from Mahdessian et al. ([2021](https://arxiv.org/html/2606.05441#bib.bib162)) via GEO (NCBI, [2021](https://arxiv.org/html/2606.05441#bib.bib163)).

## References

- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2303.08774). Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3).
- M. A. Ahamed and Q. Cheng (2024) MambaTab: A Plug-and-Play Model for Learning Tabular Data. In 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 369–375. External Links: [Document](https://dx.doi.org/10.1109/MIPR62202.2024.00065). Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3).
- T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019) Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631. External Links: [Document](https://dx.doi.org/10.1145/3292500.3330701). Cited by: [Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1), [§D.3](https://arxiv.org/html/2606.05441#A4.SS3.p1.1), [§4](https://arxiv.org/html/2606.05441#S4.p1.26).
- M. Aoshima, D. Shen, H. Shen, K. Yata, Y. Zhou, and J. S. Marron (2018) A Survey of High Dimension Low Sample Size Asymptotics. Australian & New Zealand Journal of Statistics 60(1), pp. 4–19. External Links: [Document](https://dx.doi.org/10.1111/anzs.12212). Cited by: [Appendix F](https://arxiv.org/html/2606.05441#A6.SS0.SSS0.Px2.p1.10).
- P. Arabie and L. J. Hubert (1992) Combinatorial Data Analysis. Annual Review of Psychology 43, pp. 169–203. External Links: [Document](https://dx.doi.org/10.1146/annurev.ps.43.020192.001125). Cited by: [§E.3](https://arxiv.org/html/2606.05441#A5.SS3.p1.1), [Appendix E](https://arxiv.org/html/2606.05441#A5.p1.1).
- S. Ö. Arik and T. Pfister (2021) TabNet: Attentive Interpretable Tabular Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 6679–6687. External Links: [Document](https://dx.doi.org/10.1609/aaai.v35i8.16826). Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3).
- J. E. Atkins, E. G. Boman, and B. Hendrickson (1998) A Spectral Algorithm for Seriation and the Consecutive Ones Problem. SIAM Journal on Computing 28(1), pp. 297–310. External Links: [Document](https://dx.doi.org/10.1137/S0097539795285771). Cited by: [§E.1](https://arxiv.org/html/2606.05441#A5.SS1.p1.1), [§E.3](https://arxiv.org/html/2606.05441#A5.SS3.p1.1), [Appendix E](https://arxiv.org/html/2606.05441#A5.p1.1).
- M. F. Balın, A. Abid, and J. Zou (2019) Concrete Autoencoders: Differentiable Feature Selection and Reconstruction. In International Conference on Machine Learning, pp. 444–453. Cited by: [§3.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8).
- K. U. Barthel, F. T. Barthel, and P. Eisert (2025) Permutation Learning with Only N Parameters: From SoftSort to Self-Organizing Gaussians. In 2025 33rd European Signal Processing Conference (EUSIPCO), pp. 1892–1896. External Links: [Document](https://dx.doi.org/10.23919/EUSIPCO63237.2025.11226796). Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p2.1).
- M. Behrisch, B. Bach, N. Henry Riche, T. Schreck, and J. Fekete (2016) Matrix Reordering Methods for Table and Network Visualization. In Computer Graphics Forum, Vol. 35, pp. 693–716. External Links: [Document](https://dx.doi.org/10.1111/cgf.12935). Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p2.1).
- I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2004.05150). Cited by: [§E.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px1.p1.2).
- D. Beniaguev, I. Segev, and M. London (2021) Single Cortical Neurons as Deep Artificial Neural Networks. Neuron 109(17), pp. 2727–2739. External Links: [Document](https://dx.doi.org/10.1016/j.neuron.2021.07.002). Cited by: [§3.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8).
- X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto, N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, S. E. Kahou, V. Michalski, T. Arbel, C. Pal, G. Varoquaux, and P. Vincent (2021) Accounting for Variance in Machine Learning Benchmarks. Proceedings of Machine Learning and Systems 3, pp. 747–769. Cited by: [Appendix S](https://arxiv.org/html/2606.05441#A19.SS0.SSS0.Px1.p1.2).
- S. B. Brahmavar, Y. Li, and J. Oliva (2025) Towards Universal Neural Inference. arXiv preprint arXiv:2508.09100. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2508.09100). Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p3.1).
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language Models Are Few-Shot Learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3).
- M. Carmona, V. Chepoi, G. Naves, and P. Préa (2023) A Simple and Optimal Algorithm for Strict Circular Seriation. SIAM Journal on Mathematics of Data Science 5(1), pp. 201–221. External Links: [Document](https://dx.doi.org/10.1137/22M1495342). Cited by: [§3.1](https://arxiv.org/html/2606.05441#S3.SS1.p6.1).
- G. Casella and R. Berger (2024) Statistical Inference. Chapman and Hall/CRC. External Links: [Document](https://dx.doi.org/10.1201/9781003456285). Cited by: [Proposition D.2](https://arxiv.org/html/2606.05441#A4.Thmtheorem2.p1.10.10).
- J. Chen, L. Song, M. Wainwright, and M. Jordan (2018) Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. In International Conference on Machine Learning, pp. 883–892. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3).
- J. Chen, K. Liao, Y. Wan, D. Z. Chen, and J. Wu (2022) DANets: Deep Abstract Networks for Tabular Data Classification and Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, pp. 3930–3938. External Links: [Document](https://dx.doi.org/10.1609/aaai.v36i4.20309). Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3), [§F.1](https://arxiv.org/html/2606.05441#A6.SS1.p1.4).
- K. Chen, P. Chiang, H. Chou, T. Chen, and T. Chang (2023) Trompt: Towards A Better Deep Neural Network for Tabular Data. In International Conference on Machine Learning, pp. 4392–4434. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3).
- T. Chen and C. Guestrin (2016) XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. External Links: [Document](https://dx.doi.org/10.1145/2939672.2939785). Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3), [Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1).
- R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509. External Links: [Document](https://dx.doi.org/10.48550/arXiv.1904.10509). Cited by: [§E.6](https://arxiv.org/html/2606.05441#A5.SS6.p1.1).
- N. Christofides (2022) Worst-Case Analysis of A New Heuristic for the Travelling Salesman Problem. In Operations Research Forum, Vol. 3, pp. 20. External Links: [Document](https://dx.doi.org/10.1007/s43069-021-00101-z). Cited by: [§B.1](https://arxiv.org/html/2606.05441#A2.SS1.SSS0.Px1.p1.11).
- T. Cover and P. Hart (1967) Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory 13(1), pp. 21–27. External Links: [Document](https://dx.doi.org/10.1109/TIT.1967.1053964). Cited by: [2nd item](https://arxiv.org/html/2606.05441#A4.I2.i2.p1.1).
- T. M. Cover and J. A. Thomas (2006) Elements of Information Theory. 2nd edition, John Wiley & Sons, Hoboken, NJ, USA. External Links: [Document](https://dx.doi.org/10.1002/047174882X). Cited by: [Proposition D.6](https://arxiv.org/html/2606.05441#A4.Thmtheorem6.p1.11.11).
- D. R. Cox (1958) The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society Series B: Statistical Methodology 20(2), pp. 215–232. External Links: [Document](https://dx.doi.org/10.1111/j.2517-6161.1958.tb00292.x). Cited by: [1st item](https://arxiv.org/html/2606.05441#A4.I2.i1.p1.1).
- W. Cukierski (2013) CIFAR-10 - Object Recognition in Images. Note: [https://kaggle.com/competitions/cifar-10](https://kaggle.com/competitions/cifar-10), Kaggle. Cited by: [Table T.1](https://arxiv.org/html/2606.05441#A20.T1.9.3.10.7.2), [Software and Data](https://arxiv.org/html/2606.05441#Sx3.p1.1).
- F. Dangond (2000) Chips around the World: Proceedings from the Nature Genetics Microarray Meeting. Physiological Genomics 2(2), pp. 53–58. External Links: [Document](https://dx.doi.org/10.1152/physiolgenomics.2000.2.2.53). Cited by: [§E.4](https://arxiv.org/html/2606.05441#A5.SS4.p1.8).
- S. Dasgupta and A. Gupta (2003) An Elementary Proof of A Theorem of Johnson and Lindenstrauss. Random Structures & Algorithms 22(1), pp. 60–65. External Links: [Document](https://dx.doi.org/10.1002/rsa.10073). Cited by: [Lemma D.4](https://arxiv.org/html/2606.05441#A4.Thmtheorem4.p1.4.4).
- D. L. Davies and D. W. Bouldin (1979) A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence (2), pp. 224–227. External Links: [Document](https://dx.doi.org/10.1109/TPAMI.1979.4766909). Cited by: [3rd item](https://arxiv.org/html/2606.05441#A4.I2.i3.p1.1).
- J. Demšar (2006) Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7, pp. 1–30. Cited by: [Appendix I](https://arxiv.org/html/2606.05441#A9.SS0.SSS0.Px1.p1.6), [§4](https://arxiv.org/html/2606.05441#S4.p1.26).
- R. Deng, Z. Li, and M. Wang (2025) GeoAggregator: An Efficient Transformer Model for Geo-Spatial Tabular data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 11572–11580. External Links: [Document](https://dx.doi.org/10.1609/aaai.v39i11.33259). Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3).
- E. Devijver and M. Gallopin (2018) Block-Diagonal Covariance Selection for High-Dimensional Gaussian Graphical Models. Journal of the American Statistical Association 113(521), pp. 306–314. External Links: [Document](https://dx.doi.org/10.1080/01621459.2016.1247002). Cited by: [§D.4](https://arxiv.org/html/2606.05441#A4.SS4.p1.2).
- L. Devroye, L. Györfi, and G. Lugosi (1996) A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability, Vol. 31, Springer New York, NY. External Links: [Document](https://dx.doi.org/10.1007/978-1-4612-0711-5). Cited by: [Proposition D.2](https://arxiv.org/html/2606.05441#A4.Thmtheorem2.p1.10.10), [Lemma D.4](https://arxiv.org/html/2606.05441#A4.Thmtheorem4.p1.4.4).
- J. Díaz, J. Petit, and M. Serna (2002) A Survey of Graph Layout Problems. ACM Computing Surveys (CSUR) 34(3), pp. 313–356. External Links: [Document](https://dx.doi.org/10.1145/568522.568523). Cited by: [§E.1](https://arxiv.org/html/2606.05441#A5.SS1.p1.1), [§E.2](https://arxiv.org/html/2606.05441#A5.SS2.p1.5), [§E.3](https://arxiv.org/html/2606.05441#A5.SS3.p1.1), [Appendix E](https://arxiv.org/html/2606.05441#A5.p1.1), [§3.1](https://arxiv.org/html/2606.05441#S3.SS1.p1.2), [Remark 3.6](https://arxiv.org/html/2606.05441#S3.Thmtheorem6.p1.1).
- T\. Dinh, Y\. Zeng, R\. Zhang, Z\. Lin, M\. Gira, S\. Rajput, J\. Sohn, D\. Papailiopoulos, and K\. Lee \(2022\)LIFT: Language\-Interfaced Fine\-Tuning for Non\-Language Machine Learning Tasks\.Advances in Neural Information Processing Systems35,pp\. 11763–11784\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- E\. D\. Dolan and J\. J\. Moré \(2002\)Benchmarking Optimization Software with Performance Profiles\.Mathematical Programming91\(2\),pp\. 201–213\.External Links:[Document](https://dx.doi.org/10.1007/s101070100263)Cited by:[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- M\. Dorigo and L\. M\. Gambardella \(1997\)Ant Colony System: A Cooperative Learning Approach to the Traveling Salesman Problem\.IEEE Transactions on Evolutionary Computation1\(1\),pp\. 53–66\.External Links:[Document](https://dx.doi.org/10.1109/4235.585892)Cited by:[§B\.1](https://arxiv.org/html/2606.05441#A2.SS1.SSS0.Px1.p1.11)\.
- R\. Eisenberg, J\. Svirsky, and O\. Lindenbaum \(2025\)COPER: Correlation\-based Permutations for Multi\-View Clustering\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- D\. Eremeev, G\. Bazhenov, O\. Platonov, A\. Babenko, and L\. Prokhorenkova \(2025\)Turning Tabular Foundation Models into Graph Foundation Models\.InNeurIPS 2025 New Perspectives in Graph Machine Learning Workshop,Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- B\. Feuer, R\. T\. Schirrmeister, V\. Cherepanova, C\. Hegde, F\. Hutter, M\. Goldblum, N\. Cohen, and C\. White \(2024\)TuneTables: Context Optimization for Scalable Prior\-Data Fitted Networks\.Advances in Neural Information Processing Systems37,pp\. 83430–83464\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix S](https://arxiv.org/html/2606.05441#A19.SS0.SSS0.Px2.p1.11),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px4.p1.1),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- F\. Fogel, R\. Jenatton, F\. Bach, and A\. d’Aspremont \(2013\)Convex Relaxations for Permutation Problems\.Advances in Neural Information Processing Systems26\.Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.05441#S3.SS1.p1.2),[Remark 3\.6](https://arxiv.org/html/2606.05441#S3.Thmtheorem6.p1.1)\.
- M\. Friedman \(1937\)The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance\.Journal of the American Statistical Association32\(200\),pp\. 675–701\.External Links:[Document](https://dx.doi.org/10.1080/01621459.1937.10503522)Cited by:[item 4](https://arxiv.org/html/2606.05441#A4.I6.i4.p1.1),[Appendix I](https://arxiv.org/html/2606.05441#A9.SS0.SSS0.Px1.p1.6),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- M\. R\. Garey, D\. S\. Johnson, and L\. Stockmeyer \(1974\)Some Simplified NP\-Complete Problems\.InProceedings of the Sixth Annual ACM Symposium on Theory of Computing,pp\. 47–63\.External Links:[Document](https://dx.doi.org/10.1145/800119.803884)Cited by:[§E\.1](https://arxiv.org/html/2606.05441#A5.SS1.p1.1)\.
- M\. Garey, D\. Johnson, and L\. Stockmeyer \(1976\)Some Simplified NP\-Complete Graph Problems\.Theoretical Computer Science1\(3\),pp\. 237–267\.External Links:[Document](https://dx.doi.org/10.1016/0304-3975%2876%2990059-1)Cited by:[§3\.1](https://arxiv.org/html/2606.05441#S3.SS1.3.p1.1)\.
- Y\. Gorishniy, A\. Kotelnikov, and A\. Babenko \(2025\)TabM: Advancing Tabular Deep Learning with Parameter\-Efficient Ensembling\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- Y\. Gorishniy, I\. Rubachev, and A\. Babenko \(2022\)On Embeddings for Numerical Features in Tabular Deep Learning\.Advances in Neural Information Processing Systems35,pp\. 24991–25004\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- Y\. Gorishniy, I\. Rubachev, N\. Kartashev, D\. Shlenskii, A\. Kotelnikov, and A\. Babenko \(2024\)TabR: Tabular Deep Learning Meets Nearest Neighbors\.InThe Twelfth International Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- Y\. Gorishniy, I\. Rubachev, V\. Khrulkov, and A\. Babenko \(2021\)Revisiting Deep Learning Models for Tabular Data\.Advances in Neural Information Processing Systems34,pp\. 18932–18943\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.p1.1),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- L\. Grinsztajn, K\. Flöge, O\. Key, F\. Birkel, P\. Jund, B\. Roof, B\. Jäger, D\. Safaric, S\. Alessi, A\. Hayler,et al\.\(2025\)TabPFN\-2\.5: Advancing the State of the Art in Tabular Foundation Models\.arXiv preprint arXiv:2511\.08667\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2511.08667)Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix R](https://arxiv.org/html/2606.05441#A18.p1.1),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1),[§D\.5](https://arxiv.org/html/2606.05441#A4.SS5.SSS0.Px7.p1.1),[§D\.5](https://arxiv.org/html/2606.05441#A4.SS5.SSS0.Px8.p1.3),[§1](https://arxiv.org/html/2606.05441#S1.p1.5),[§1](https://arxiv.org/html/2606.05441#S1.p4.6),[§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p4.30)\.
- L\. Grinsztajn, E\. Oyallon, and G\. Varoquaux \(2022\)Why Do Tree\-Based Models Still Outperform Deep Learning on Typical Tabular Data?\.Advances in Neural Information Processing Systems35,pp\. 507–520\.Cited by:[Appendix S](https://arxiv.org/html/2606.05441#A19.SS0.SSS0.Px1.p1.2)\.
- H\. Guo, R\. Tang, Y\. Ye, Z\. Li, and X\. He \(2017\)DeepFM: A Factorization\-Machine based Neural Network for CTR Prediction\.InProceedings of the Twenty\-Sixth International Joint Conference on Artificial Intelligence,External Links:[Document](https://dx.doi.org/10.24963/ijcai.2017/239)Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- S\. Guo, C\. Deng, Y\. Wen, H\. Chen, Y\. Chang, and J\. Wang \(2024\)DS\-Agent: Automated Data Science by Empowering Large Language Models with Case\-Based Reasoning\.InInternational Conference on Machine Learning,pp\. 16813–16848\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- S\. Guo, H\. Liu, X\. Chen, Y\. Xie, L\. Zhang, T\. Han, H\. Chen, Y\. Chang, and J\. Wang \(2025\)Optimizing Case\-Based Reasoning System for Functional Test Script Generation with Large Language Models\.InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining \(KDD ’25\), Volume 2,pp\. 4487–4498\.External Links:[Document](https://dx.doi.org/10.1145/3711896.3737254)Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- A\. Z\. S\. B\. Habib, M\. Y\. Ahamed, P\. K\. Gyawali, G\. Doretto, and D\. A\. Adjeroh \(2026a\)BSTabDiff: Block\-Subunit Diffusion Priors for High\-Dimensional Tabular Data Generation\.InICLR 2026 2nd Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy,Cited by:[§D\.4\.1](https://arxiv.org/html/2606.05441#A4.SS4.SSS1.p1.1),[§D\.4](https://arxiv.org/html/2606.05441#A4.SS4.p1.2)\.
- A\. Z\. S\. B\. Habib, G\. Doretto, and D\. A\. Adjeroh \(2026b\)DynaTab: Dynamic Feature Ordering as Neural Rewiring for High\-Dimensional Tabular Data\.InProceedings of the First Workshop on NeuroAI Multimodal Intelligence @ AAAI 2026,Proceedings of Machine Learning Research, Vol\.308,pp\. 27–57\.External Links:[Link](https://proceedings.mlr.press/v308/habib26a.html)Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px4.p1.1),[Table T\.1](https://arxiv.org/html/2606.05441#A20.T1),[Table T\.1](https://arxiv.org/html/2606.05441#A20.T1.6.3.3),[§C\.1](https://arxiv.org/html/2606.05441#A3.SS1.SSS0.Px2.p1.3),[Appendix F](https://arxiv.org/html/2606.05441#A6.SS0.SSS0.Px1.p1.1),[Appendix F](https://arxiv.org/html/2606.05441#A6.SS0.SSS0.Px2.p1.10),[§F\.1](https://arxiv.org/html/2606.05441#A6.SS1.SSS0.Px1.p1.3),[§F\.1](https://arxiv.org/html/2606.05441#A6.SS1.p1.4),[§F\.2](https://arxiv.org/html/2606.05441#A6.SS2.p1.9),[§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- A\. Z\. S\. B\. Habib, T\. Tasnim, M\. E\. Islam, and M\. Tabasum \(2026c\)ZAYAN: Disentangled Contrastive Transformer for Tabular Remote Sensing Data\.arXiv preprint arXiv:2604\.27606\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2604.27606)Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- A\. Z\. S\. B\. Habib, K\. Wang, M\. Hartley, G\. Doretto, and D\. A\. Adjeroh \(2024\)TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering\.InInternational Conference on Pattern Recognition,pp\. 418–434\.External Links:[Document](https://dx.doi.org/10.1007/978-3-031-78128-5%5F27)Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px1.p1.2),[§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- A\. A\. Hagberg, D\. A\. Schult, and P\. J\. Swart \(2008\)Exploring Network Structure, Dynamics, and Function Using NetworkX\.InProceedings of the Python in Science Conference,pp\. 11–15\.External Links:[Document](https://dx.doi.org/10.25080/tcwv9851)Cited by:[Table B\.1](https://arxiv.org/html/2606.05441#A2.T1),[Table B\.1](https://arxiv.org/html/2606.05441#A2.T1.2.1.1)\.
- M\. Hahsler, K\. Hornik, and C\. Buchta \(2008\)Getting Things in Order: An Introduction to the R Package Seriation\.Journal of Statistical Software25,pp\. 1–34\.External Links:[Document](https://dx.doi.org/10.18637/jss.v025.i03)Cited by:[§E\.3](https://arxiv.org/html/2606.05441#A5.SS3.p1.1),[§E\.4](https://arxiv.org/html/2606.05441#A5.SS4.SSS0.Px1.p1.4),[Appendix E](https://arxiv.org/html/2606.05441#A5.p1.1)\.
- N\. Halko, P\. Martinsson, and J\. A\. Tropp \(2011\)Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions\.SIAM Review53\(2\),pp\. 217–288\.External Links:[Document](https://dx.doi.org/10.1137/090771806)Cited by:[§C\.1](https://arxiv.org/html/2606.05441#A3.SS1.SSS0.Px1.p1.3),[§C\.1](https://arxiv.org/html/2606.05441#A3.SS1.SSS0.Px4.p1.5),[§1](https://arxiv.org/html/2606.05441#S1.p4.6)\.
- P\. Hall, J\. S\. Marron, and A\. Neeman \(2005\)Geometric Representation of High Dimension, Low Sample Size Data\.Journal of the Royal Statistical Society Series B: Statistical Methodology67\(3\),pp\. 427–444\.External Links:[Document](https://dx.doi.org/10.1111/j.1467-9868.2005.00510.x)Cited by:[Appendix F](https://arxiv.org/html/2606.05441#A6.SS0.SSS0.Px2.p1.10)\.
- S\. Han, J\. Yoon, S\. O\. Arik, and T\. Pfister \(2024\)Large Language Models Can Automatically Engineer Features for Few\-Shot Tabular Learning\.InInternational Conference on Machine Learning,pp\. 17454–17479\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- J\. A\. Hanley and B\. J\. McNeil \(1982\)The Meaning and Use of the Area under a Receiver Operating Characteristic \(ROC\) Curve\.Radiology143\(1\),pp\. 29–36\.External Links:[Document](https://dx.doi.org/10.1148/radiology.143.1.7063747)Cited by:[§F\.2](https://arxiv.org/html/2606.05441#A6.SS2.p1.9)\.
- S\. Hegselmann, A\. Buendia, H\. Lang, M\. Agrawal, X\. Jiang, and D\. Sontag \(2023\)TabLLM: Few\-Shot Classification of Tabular Data with Large Language Models\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 5549–5581\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- A\. Hernández\-Sabat, A\. M\. López, and K\. Diaz\-Chito \(2016\)DrivFace\.Note:UCI Machine Learning Repository, doi: 10\.24432/C5XC7Q\.Cited by:[Table T\.1](https://arxiv.org/html/2606.05441#A20.T1.9.3.8.5.2),[Table T\.1](https://arxiv.org/html/2606.05441#A20.T1.9.3.9.6.2),[Software and Data](https://arxiv.org/html/2606.05441#Sx3.p1.1)\.
- G\. E\. Hinton and R\. R\. Salakhutdinov \(2006\)Reducing the Dimensionality of Data with Neural Networks\.Science313\(5786\),pp\. 504–507\.External Links:[Document](https://dx.doi.org/10.1126/science.1127647)Cited by:[§D\.2](https://arxiv.org/html/2606.05441#A4.SS2.SSS0.Px2.p1.4),[Appendix D](https://arxiv.org/html/2606.05441#A4.p1.1),[§F\.2](https://arxiv.org/html/2606.05441#A6.SS2.p1.9)\.
- N\. Hollmann, S\. Müller, K\. Eggensperger, and F\. Hutter \(2023\)TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second\.InThe Eleventh International Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[§1](https://arxiv.org/html/2606.05441#S1.p4.6),[§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p4.30)\.
- N\. Hollmann, S\. Müller, L\. Purucker, A\. Krishnakumar, M\. Körfer, S\. B\. Hoo, R\. T\. Schirrmeister, and F\. Hutter \(2025\)Accurate Predictions on Small Data with a Tabular Foundation Model\.Nature637\(8045\),pp\. 319–326\.External Links:[Document](https://dx.doi.org/10.1038/s41586-024-08328-6)Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[§1](https://arxiv.org/html/2606.05441#S1.p4.6),[§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p4.30)\.
- D\. Holzmüller, L\. Grinsztajn, and I\. Steinwart \(2024\)Better by Default: Strong Pre\-Tuned MLPs and Boosted Trees on Tabular Data\.Advances in Neural Information Processing Systems37,pp\. 26577–26658\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1)\.
- H\. Hotelling \(1933\)Analysis of A Complex of Statistical Variables into Principal Components\.Journal of Educational Psychology24\(6\),pp\. 417\.External Links:[Document](https://dx.doi.org/10.1037/h0071325)Cited by:[§D\.1](https://arxiv.org/html/2606.05441#A4.SS1.SSS0.Px1.p1.5),[§D\.2](https://arxiv.org/html/2606.05441#A4.SS2.SSS0.Px1.p1.4),[§D\.5](https://arxiv.org/html/2606.05441#A4.SS5.SSS0.Px7.p1.1),[Appendix D](https://arxiv.org/html/2606.05441#A4.p1.1),[§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p4.7)\.
- X\. Huang, A\. Khetan, M\. Cvitkovic, and Z\. Karnin \(2020\)TabTransformer: Tabular Data Modeling Using Contextual Embeddings\.arXiv preprint arXiv:2012\.06678\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2012.06678)Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.p1.1)\.
- S\. K\. Jain and D\. A\. Adjeroh \(2007\)Edge\-based Prediction for Lossless Compression of Hyperspectral Images\.In2007 Data Compression Conference \(DCC’07\),pp\. 153–162\.External Links:[Document](https://dx.doi.org/10.1109/DCC.2007.36)Cited by:[§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8)\.
- A\. Jeffares, T\. Liu, J\. Crabbé, F\. Imrie, and M\. van der Schaar \(2023\)TANGOS: Regularizing Tabular Neural Networks through Gradient Orthogonalization and Specialization\.InThe Eleventh International Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- N\. Jethani, M\. Sudarshan, Y\. Aphinyanaphongs, and R\. Ranganath \(2021\)Have We Learned to Explain?: How Interpretability Methods Can Learn to Encode Predictions in Their Interpretations\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 1459–1467\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- X\. Jiang, A\. Margeloiu, N\. Simidjievski, and M\. Jamnik \(2024\)ProtoGate: Prototype\-based Neural Networks with Global\-to\-local Feature Selection for Tabular Biomedical Data\.InInternational Conference on Machine Learning,pp\. 21844–21878\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix S](https://arxiv.org/html/2606.05441#A19.SS0.SSS0.Px1.p1.2),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px4.p1.1),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px7.p1.1),[§1](https://arxiv.org/html/2606.05441#S1.p2.1),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- Q\. Jingang, D\. Holzmüller, G\. Varoquaux, and M\. Le Morvan \(2025\)TabICL: A Tabular Foundation Model for In\-Context Learning on Large Data\.InInternational Conference on Machine Learning,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix R](https://arxiv.org/html/2606.05441#A18.p1.1),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1),[§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px1.p1.2),[§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px2.p1.1),[Appendix G](https://arxiv.org/html/2606.05441#A7.p5.1),[§1](https://arxiv.org/html/2606.05441#S1.p3.1),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- W\. B\. Johnson and J\. Lindenstrauss \(1984\)Extensions of Lipschitz Mappings into A Hilbert Space\.Contemporary Mathematics26,pp\. 189–206\.Cited by:[§D\.2](https://arxiv.org/html/2606.05441#A4.SS2.SSS0.Px1.p1.4),[Lemma D\.4](https://arxiv.org/html/2606.05441#A4.Thmtheorem4.p1.4.4),[Appendix D](https://arxiv.org/html/2606.05441#A4.p1.1)\.
- K\. Jordan \(2024\)On the Variance of Neural Network Training with respect to Test Sets and Distributions\.InThe Twelfth International Conference on Learning Representations,Cited by:[Appendix S](https://arxiv.org/html/2606.05441#A19.SS0.SSS0.Px1.p1.2)\.
- S\. Jung and J\. Marron \(2009\)PCA Consistency in High Dimension, Low Sample Size Context\.The Annals of Statistics37\(6B\),pp\. 4104–4130\.External Links:[Document](https://dx.doi.org/10.1214/09-AOS709)Cited by:[Proposition D\.6](https://arxiv.org/html/2606.05441#A4.Thmtheorem6.p1.11.11),[Appendix F](https://arxiv.org/html/2606.05441#A6.SS0.SSS0.Px2.p1.10)\.
- M\. Jurewicz and L\. Derczynski \(2022\)Set Interdependence Transformer: Set\-to\-Sequence Neural Networks for Permutation Learning and Structure Prediction\.In Proceedings on the International Joint Conferences on Artificial Intelligence \(IJCAI\-22\)\.External Links:[Document](https://dx.doi.org/10.24963/ijcai.2022/434)Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p2.1)\.
- G\. Kastellakis, D\. J\. Cai, S\. C\. Mednick, A\. J\. Silva, and P\. Poirazi \(2015\)Synaptic Clustering within Dendrites: An Emerging Theory of Memory Formation\.Progress in Neurobiology126,pp\. 19–35\.External Links:[Document](https://dx.doi.org/10.1016/j.pneurobio.2014.12.002)Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p4.6),[§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8)\.
- G\. Ke, Q\. Meng, T\. Finley, T\. Wang, W\. Chen, W\. Ma, Q\. Ye, and T\. Liu \(2017\)LightGBM: A Highly Efficient Gradient Boosting Decision Tree\.Advances in Neural Information Processing Systems30\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1)\.
- J\. H\. Kirchner and J\. Gjorgjieva \(2021\)Emergence of Local and Global Synaptic Organization on Cortical Dendrites\.Nature Communications12\(1\),pp\. 4005\.External Links:[Document](https://dx.doi.org/10.1038/s41467-021-23557-3)Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p4.6),[§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8)\.
- S\. Kirkpatrick, C\. D\. Gelatt Jr, and M\. P\. Vecchi \(1983\)Optimization by Simulated Annealing\.Science220\(4598\),pp\. 671–680\.External Links:[Document](https://dx.doi.org/10.1126/science.220.4598.671)Cited by:[§B\.1](https://arxiv.org/html/2606.05441#A2.SS1.SSS0.Px1.p1.11)\.
- C\. Kolberg, K\. Eggensperger, and N\. Pfeifer \(2025\)TabPFN\-Wide: Continued Pre\-Training for Extreme Feature Counts\.InEurIPS 2025 Workshop: AI for Tabular Data,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix S](https://arxiv.org/html/2606.05441#A19.SS0.SSS0.Px2.p1.11),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2606.05441#S1.p4.6),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- P\. Kontschieder, M\. Fiterau, A\. Criminisi, and S\. R\. Bulo \(2015\)Deep Neural Decision Forests\.InProceedings of the IEEE International Conference on Computer Vision,pp\. 1467–1475\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- P\. Larranaga, C\. M\. H\. Kuijpers, R\. H\. Murga, I\. Inza, and S\. Dizdarevic \(1999\)Genetic Algorithms for the Travelling Salesman Problem: A Review of Representations and Operators\.Artificial Intelligence Review13\(2\),pp\. 129–170\.External Links:[Document](https://dx.doi.org/10.1023/A%3A1006529012972)Cited by:[§B\.1](https://arxiv.org/html/2606.05441#A2.SS1.SSS0.Px1.p1.11)\.
- E\. Levina and P\. Bickel \(2004\)Maximum Likelihood Estimation of Intrinsic Dimension\.Advances in Neural Information Processing Systems17\.Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p4.6)\.
- J\. Li, K\. Cheng, S\. Wang, F\. Morstatter, R\. P\. Trevino, J\. Tang, and H\. Liu \(2018\)Feature Selection: A Data Perspective\.ACM Computing Surveys \(CSUR\)50\(6\),pp\. 94\.External Links:[Document](https://dx.doi.org/10.1145/3136625)Cited by:[Table T\.1](https://arxiv.org/html/2606.05441#A20.T1.9.3.4.1.2),[Table T\.1](https://arxiv.org/html/2606.05441#A20.T1.9.3.5.2.2),[Table T\.1](https://arxiv.org/html/2606.05441#A20.T1.9.3.6.3.2),[Table T\.1](https://arxiv.org/html/2606.05441#A20.T1.9.3.7.4.2),[§4](https://arxiv.org/html/2606.05441#S4.p1.26),[Software and Data](https://arxiv.org/html/2606.05441#Sx3.p1.1)\.
- S\. Li, E\. J\. Harner, and D\. A\. Adjeroh \(2011\)Random KNN Feature Selection\-A Fast and Stable Alternative to Random Forests\.BMC Bioinformatics12\(1\),pp\. 450\.External Links:[Document](https://dx.doi.org/10.1186/1471-2105-12-450)Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- I\. Liiv \(2010\)Seriation and Matrix Reordering Methods: An Historical Overview\.Statistical Analysis and Data Mining: The ASA Data Science Journal3\(2\),pp\. 70–91\.External Links:[Document](https://dx.doi.org/10.1002/sam.10071)Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p2.1)\.
- J\. R\. Lima, V\. G\. M\. Santos, and M\. A\. M\. Carvalho \(2024\)A Δ\-Evaluation Function for Column Permutation Problems\.arXiv preprint arXiv:2409\.04926\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2409.04926)Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p2.1)\.
- S\. Liu and H\. Ye \(2025\)TabPFN Unleashed: A Scalable and Effective Solution to Tabular Classification Problems\.InForty\-second International Conference on Machine Learning,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix S](https://arxiv.org/html/2606.05441#A19.SS0.SSS0.Px2.p1.11),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px4.p1.1),[§1](https://arxiv.org/html/2606.05441#S1.p4.6),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- J\. Ma, V\. Thomas, R\. Hosseinzadeh, A\. Labach, J\. C\. Cresswell, K\. Golestan, G\. Yu, A\. L\. Caterini, and M\. Volkovs \(2025\)TabDPT: Scaling Tabular Foundation Models on Real Data\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1),[Appendix G](https://arxiv.org/html/2606.05441#A7.p5.1),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- L\. v\. d\. Maaten and G\. Hinton \(2008\)Visualizing Data Using t\-SNE\.Journal of Machine Learning Research9\(Nov\),pp\. 2579–2605\.Cited by:[Appendix K](https://arxiv.org/html/2606.05441#A11.p1.1)\.
- L\. H\. Maguire, S\. K\. Handelman, X\. Du, Y\. Chen, T\. H\. Pers, and E\. K\. Speliotes \(2018\)Genome\-wide Association Analyses Identify 39 New Susceptibility Loci for Diverticular Disease\.Nature Genetics50\(10\),pp\. 1359–1365\.External Links:[Document](https://dx.doi.org/10.1038/s41588-018-0203-z)Cited by:[§E\.4](https://arxiv.org/html/2606.05441#A5.SS4.p1.8)\.
- D\. Mahdessian, A\. J\. Cesnik, C\. Gnann, F\. Danielsson, L\. Stenström, M\. Arif, C\. Zhang, T\. Le, F\. Johansson, R\. Schutten,et al\.\(2021\)Spatiotemporal Dissection of the Cell Cycle with Single\-Cell Proteogenomics\.Nature590\(7847\),pp\. 649–654\.External Links:[Document](https://dx.doi.org/10.1038/s41586-021-03232-9)Cited by:[§B\.1](https://arxiv.org/html/2606.05441#A2.SS1.SSS0.Px1.p1.11),[Table T\.1](https://arxiv.org/html/2606.05441#A20.T1.9.3.11.8.2),[Software and Data](https://arxiv.org/html/2606.05441#Sx3.p1.1)\.
- G\. Major, M\. E\. Larkum, and J\. Schiller \(2013\)Active Properties of Neocortical Pyramidal Neuron Dendrites\.Annual Review of Neuroscience36\(1\),pp\. 1–24\.External Links:[Document](https://dx.doi.org/10.1146/annurev-neuro-062111-150343)Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p4.6),[§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8)\.
- H\. Manikandan, Y\. Jiang, and J\. Z\. Kolter \(2023\)Language Models Are Weak Learners\.Advances in Neural Information Processing Systems36,pp\. 50907–50931\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- D\. McElfresh, S\. Khandagale, J\. Valverde, V\. Prasad C, G\. Ramakrishnan, M\. Goldblum, and C\. White \(2023\)When Do Neural Nets Outperform Boosted Trees on Tabular Data?\.Advances in Neural Information Processing Systems36,pp\. 76336–76369\.Cited by:[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px4.p1.1)\.
- L\. McInnes, J\. Healy, and J\. Melville \(2018\)UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction\.arXiv preprint arXiv:1802\.03426\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.1802.03426)Cited by:[§D\.2](https://arxiv.org/html/2606.05441#A4.SS2.SSS0.Px3.p1.1),[Appendix D](https://arxiv.org/html/2606.05441#A4.p1.1)\.
- J\. Nam, K\. Kim, S\. Oh, J\. Tack, J\. Kim, and J\. Shin \(2024a\)Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning\.Advances in Neural Information Processing Systems37,pp\. 92352–92380\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- J\. Nam, W\. Song, S\. H\. Park, J\. Tack, S\. Yun, J\. Kim, K\. H\. Oh, and J\. Shin \(2024b\)Tabular Transfer Learning via Prompting LLMs\.InFirst Conference on Language Modeling,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- E\. Naor and O\. Lindenbaum \(2025\)Hybrid Autoencoders for Tabular Data: Leveraging Model\-Based Augmentation in Low\-Label Settings\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- NCBI \(2021\)GEO Accession Viewer: GSE146773, National Center for Biotechnology Information\.Note:[https://www\.ncbi\.nlm\.nih\.gov/geo/query/acc\.cgi?acc=GSE146773](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE146773)Accessed: 2026\-04\-22Cited by:[§B\.1](https://arxiv.org/html/2606.05441#A2.SS1.SSS0.Px1.p1.11),[Software and Data](https://arxiv.org/html/2606.05441#Sx3.p1.1)\.
- P\. B\. Nemenyi \(1963\)Distribution\-Free Multiple Comparisons\.Ph\.D\. Thesis,Princeton University,Princeton, NJ, USA\.Cited by:[Appendix I](https://arxiv.org/html/2606.05441#A9.SS0.SSS0.Px1.p1.6),[§4](https://arxiv.org/html/2606.05441#S4.p1.26)\.
- J\. Neyman and E\. S\. Pearson \(1933\)IX\. On the Problem of the Most Efficient Tests of Statistical Hypotheses\.Philosophical Transactions of the Royal Society of London\. Series A, Containing Papers of a Mathematical or Physical Character231\(694\-706\),pp\. 289–337\.External Links:[Document](https://dx.doi.org/10.1098/rsta.1933.0009)Cited by:[Proposition D\.2](https://arxiv.org/html/2606.05441#A4.Thmtheorem2.p1.10.10)\.
- M\. Ohlsson, T\. Hellmark, A\. A\. Bengtsson, E\. Theander, C\. Turesson, C\. Klint, C\. Wingren, and A\. I\. Ekstrand \(2020\)Proteomic Data Analysis for Differential Profiling of the Autoimmune Diseases SLE, RA, SS, and ANCA\-Associated Vasculitis\.Journal of Proteome Research20\(2\),pp\. 1252–1260\.External Links:[Document](https://dx.doi.org/10.1021/acs.jproteome.0c00657)Cited by:[§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px1.p1.2)\.
- OpenTabular Contributors \(2025\)deeptab: Tabular Deep Learning Made Simple\.Note:[https://github\.com/OpenTabular/DeepTab](https://github.com/OpenTabular/DeepTab)\[Online; accessed 2025\-07\-05\]Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- R\. C\. Petersen, P\. S\. Aisen, L\. A\. Beckett, M\. C\. Donohue, A\. C\. Gamst, D\. J\. Harvey, C\. R\. Jack Jr, W\. J\. Jagust, L\. M\. Shaw, A\. W\. Toga, J\. Q\. Trojanowski, and M\. W\. Weiner \(2010\)Alzheimer’s Disease Neuroimaging Initiative \(ADNI\): Clinical Characterization\.Neurology74\(3\),pp\. 201–209\.External Links:[Document](https://dx.doi.org/10.1212/WNL.0b013e3181cb3e25)Cited by:[§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px1.p1.2)\.
- P\. Poirazi, T\. Brannon, and B\. W\. Mel \(2003\)Pyramidal Neuron as Two\-Layer Neural Network\.Neuron37\(6\),pp\. 989–999\.External Links:[Document](https://dx.doi.org/10.1016/S0896-6273%2803%2900149-1)Cited by:[§1](https://arxiv.org/html/2606.05441#S1.p4.6),[§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8)\.
- S\. Popov, S\. Morozov, and A\. Babenko \(2020\)Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data\.InInternational Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- L\. Prokhorenkova, G\. Gusev, A\. Vorobev, A\. V\. Dorogush, and A\. Gulin \(2018\)CatBoost: Unbiased Boosting with Categorical Features\.Advances in Neural Information Processing Systems31\.Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3),[Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px6.p1.1)\.
- D\. J\. Rosenkrantz, R\. E\. Stearns, and P\. M\. Lewis \(1977\)An Analysis of Several Heuristics for the Traveling Salesman Problem\.SIAM Journal on Computing6\(3\),pp\. 563–581\.External Links:[Document](https://dx.doi.org/10.1137/0206041)Cited by:[Lemma B\.2](https://arxiv.org/html/2606.05441#A2.Thmtheorem2.p1.2.2)\.
- P\. J\. Rousseeuw \(1987\)Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis\.Journal of Computational and Applied Mathematics20,pp\. 53–65\.External Links:[Document](https://dx.doi.org/10.1016/0377-0427%2887%2990125-7)Cited by:[3rd item](https://arxiv.org/html/2606.05441#A4.I2.i3.p1.1)\.
- O\. Roy and M\. Vetterli \(2007\)The Effective Rank: A Measure of Effective Dimensionality\.In2007 15th European Signal Processing Conference,pp\. 606–610\.Cited by:[§C\.1](https://arxiv.org/html/2606.05441#A3.SS1.SSS0.Px1.p1.3),[§1](https://arxiv.org/html/2606.05441#S1.p4.6)\.
- I\. Rubachev, N\. Kartashev, Y\. Gorishniy, and A\. Babenko \(2025\)TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 35166–35202\.Cited by:[Appendix S](https://arxiv.org/html/2606.05441#A19.SS0.SSS0.Px1.p1.2)\.
- V\. Sanh, A\. Webson, C\. Raffel, S\. Bach, L\. Sutawika, Z\. Alyafeai, A\. Chaffin, A\. Stiegler, A\. Raja, M\. Dey,et al\.\(2022\)Multitask Prompted Training Enables Zero\-Shot Task Generalization\.InInternational Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- J\. Schiller, G\. Major, H\. J\. Koester, and Y\. Schiller \(2000\) NMDA Spikes in Basal Dendrites of Cortical Pyramidal Neurons\. Nature 404\(6775\), pp\. 285–289\. External Links: [Document](https://dx.doi.org/10.1038/35005094) Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p4.6), [§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8)\.
- M\. Seminaroti \(2016\) Combinatorial Algorithms for the Seriation Problem\. Ph\.D\. Thesis, Tilburg University\. Cited by: [§3\.1](https://arxiv.org/html/2606.05441#S3.SS1.p1.2)\.
- J\. Shi and J\. Malik \(2000\) Normalized Cuts and Image Segmentation\. IEEE Transactions on Pattern Analysis and Machine Intelligence 22\(8\), pp\. 888–905\. External Links: [Document](https://dx.doi.org/10.1109/34.868688) Cited by: [§E\.2](https://arxiv.org/html/2606.05441#A5.SS2.p1.5), [§E\.4](https://arxiv.org/html/2606.05441#A5.SS4.SSS0.Px3.p1.6)\.
- R\. Shi, H\. Gu, H\. Ye, Y\. Dai, X\. Shen, and X\. Wang \(2025\) Latte: Transfering LLMs’ Latent\-level Knowledge for Few\-shot Tabular Learning\. In Proceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence \(IJCAI\-25\), pp\. 6173–6181\. External Links: [Document](https://dx.doi.org/10.24963/ijcai.2025/687) Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- Y\. Shiloach \(1979\) A Minimum Linear Arrangement Algorithm for Undirected Trees\. SIAM Journal on Computing 8\(1\), pp\. 15–32\. External Links: [Document](https://dx.doi.org/10.1137/0208002) Cited by: [§3\.1](https://arxiv.org/html/2606.05441#S3.SS1.p3.1)\.
- S\. Si, C\. Hsieh, and I\. S\. Dhillon \(2017\) Memory Efficient Kernel Approximation\. Journal of Machine Learning Research 18\(20\), pp\. 1–32\. Cited by: [§E\.4](https://arxiv.org/html/2606.05441#A5.SS4.p1.8)\.
- N\. Simon and R\. Tibshirani \(2012\) A Permutation Approach to Testing Interactions in Many Dimensions\. arXiv preprint arXiv:1206\.6519\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.1206.6519) Cited by: [§D\.4](https://arxiv.org/html/2606.05441#A4.SS4.p1.2)\.
- G\. Somepalli, A\. Schwarzschild, M\. Goldblum, C\. B\. Bruss, and T\. Goldstein \(2022\) SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre\-Training\. In NeurIPS 2022 First Table Representation Workshop\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3), [§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.p1.1)\.
- W\. Song, C\. Shi, Z\. Xiao, Z\. Duan, Y\. Xu, M\. Zhang, and J\. Tang \(2019\) AutoInt: Automatic Feature Interaction Learning via Self\-Attentive Neural Networks\. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp\. 1161–1170\. External Links: [Document](https://dx.doi.org/10.1145/3357384.3357925) Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- W\. E\. Strawderman \(2014\) Sufficient Statistic: Theoretical Background\. Wiley StatsRef: Statistics Reference Online\. External Links: [Document](https://dx.doi.org/10.1002/9781118445112.stat05976) Cited by: [Proposition D\.6](https://arxiv.org/html/2606.05441#A4.Thmtheorem6.p1.11.11)\.
- M\. Tegze and M\. Vlach \(1986\) On the Matrix Permutation Problem\. Zeitschrift für Operations Research 30, pp\. A155–A159\. External Links: [Document](https://dx.doi.org/10.1007/BF01919177) Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p2.1)\.
- A\. F\. Thielmann and S\. Samiee \(2024\) On the Efficiency of NLP\-Inspired Methods for Tabular Deep Learning\. In NeurIPS Efficient Natural Language and Speech Processing Workshop, pp\. 532–539\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- A\. F\. Thielmann, M\. Kumar, C\. Weisser, A\. Reuter, B\. Säfken, and S\. Samiee \(2024\) Mambular: A Sequential Model for Tabular Deep Learning\. arXiv preprint arXiv:2408\.06291\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2408.06291) Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3), [§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- V\. Thomas, J\. Ma, R\. Hosseinzadeh, K\. Golestan, G\. Yu, M\. Volkovs, and A\. Caterini \(2024\) Retrieval & fine\-tuning for in\-context tabular models\. Advances in Neural Information Processing Systems 37, pp\. 108439–108467\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- B\. B\. Ujfalussy and J\. K\. Makara \(2020\) Impact of Functional Synapse Clusters on Neuronal Response Selectivity\. Nature Communications 11\(1\), pp\. 1413\. External Links: [Document](https://dx.doi.org/10.1038/s41467-020-15147-6) Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p4.6), [§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention Is All You Need\. Advances in Neural Information Processing Systems 30\. Cited by: [§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.p1.1)\.
- P\. Veličković, L\. Buesing, M\. Overlan, R\. Pascanu, O\. Vinyals, and C\. Blundell \(2020\) Pointer Graph Networks\. Advances in Neural Information Processing Systems 33, pp\. 2232–2244\. Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p2.1)\.
- J\. Venna and S\. Kaski \(2001\) Neighborhood Preservation in Nonlinear Projection Methods: An Experimental Study\. In International Conference on Artificial Neural Networks, pp\. 485–491\. External Links: [Document](https://dx.doi.org/10.1007/3-540-44668-0%5F68) Cited by: [§E\.4](https://arxiv.org/html/2606.05441#A5.SS4.SSS0.Px2.p1.13)\.
- G\. Ver Steeg, H\. Harutyunyan, D\. Moyer, and A\. Galstyan \(2019\) Fast Structure Learning with Modular Regularization\. Advances in Neural Information Processing Systems 32\. Cited by: [§D\.4](https://arxiv.org/html/2606.05441#A4.SS4.p1.2)\.
- O\. Vinyals, M\. Fortunato, and N\. Jaitly \(2015\) Pointer Networks\. Advances in Neural Information Processing Systems 28\. Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p2.1)\.
- G\. Wang, Y\. Chen, H\. Chen, X\. Fan, J\. Wang, X\. Li, M\. Hu, C\. Chang, and X\. Hu \(2025\) Advancing Table Understanding of Large Language Models via Feature Re\-ordering\. ACM SIGKDD Explorations Newsletter 27\(1\), pp\. 112–123\. External Links: [Document](https://dx.doi.org/10.1145/3748239.3748248) Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3), [§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px1.p1.2), [§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px4.p1.1), [§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- R\. Wang, B\. Fu, G\. Fu, and M\. Wang \(2017\) Deep & Cross Network for Ad Click Predictions\. In Proceedings of the ADKDD’17, pp\. 1–7\. External Links: [Document](https://dx.doi.org/10.1145/3124749.3124754) Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- T\. Wang, S\. Guan, J\. Ma, and F\. Liu \(2015a\) Linear Feature Sensibility for Output Partitioning in Ordered Neural Incremental Attribute Learning\. In International Conference on Intelligent Science and Big Data Engineering, pp\. 373–383\. External Links: [Document](https://dx.doi.org/10.1007/978-3-319-23862-3%5F37) Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- T\. Wang, S\. Guan, K\. L\. Man, J\. H\. Park, and H\. Hsu \(2015b\) Output Effect Evaluation Based on Input Features in Neural Incremental Attribute Learning for Better Classification Performance\. Symmetry 7\(1\), pp\. 53–66\. External Links: [Document](https://dx.doi.org/10.3390/sym7010053) Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- T\. Wang, S\. Guan, K\. L\. Man, and T\. Ting \(2014\) EEG Eye State Identification Using Incremental Attribute Learning with Time\-Series Classification\. Mathematical Problems in Engineering 2014\(1\), pp\. 365101\. External Links: [Document](https://dx.doi.org/10.1155/2014/365101) Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- T\. Wang and S\. Guan \(2013\) Feature Ordering for Neural Incremental Attribute Learning Based on Fisher’s Linear Discriminant\. In 2013 5th International Conference on Intelligent Human\-Machine Systems and Cybernetics, Vol\. 2, pp\. 507–510\. External Links: [Document](https://dx.doi.org/10.1109/IHMSC.2013.268) Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- T\. Wang, X\. Zhu, S\. Guan, K\. L\. Man, and T\. Ting \(2015c\) Regression Based on Neural Incremental Attribute Learning with Correlation\-based Feature Ordering\. In 2015 IEEE 7th International Conference on Cybernetics and Intelligent Systems \(CIS\) and IEEE Conference on Robotics, Automation and Mechatronics \(RAM\), pp\. 109–113\. External Links: [Document](https://dx.doi.org/10.1109/ICCIS.2015.7274557) Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- Y\. Wang, H\. Huang, C\. Rudin, and Y\. Shaposhnik \(2021\) Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t\-SNE, UMAP, TriMAP, and PaCMAP for Data Visualization\. Journal of Machine Learning Research 22\(201\), pp\. 1–73\. Cited by: [§D\.2](https://arxiv.org/html/2606.05441#A4.SS2.SSS0.Px4.p1.1), [Appendix D](https://arxiv.org/html/2606.05441#A4.p1.1)\.
- X\. Wen, H\. Zhang, S\. Zheng, W\. Xu, and J\. Bian \(2024\) From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 3323–3333\. External Links: [Document](https://dx.doi.org/10.1145/3637528.3671975) Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- X\. Wu, X\. Liu, W\. Li, and Q\. Wu \(2018\) Improved Expressivity through Dendritic Neural Networks\. Advances in Neural Information Processing Systems 31\. Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p4.6), [§3\.2](https://arxiv.org/html/2606.05441#S3.SS2.p1.8)\.
- Y\. Yamada, O\. Lindenbaum, S\. Negahban, and Y\. Kluger \(2020\) Feature Selection Using Stochastic Gates\. In International Conference on Machine Learning, pp\. 10648–10659\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- J\. Yan, J\. Chen, C\. Hu, B\. Zheng, Y\. Hu, J\. Sun, and J\. Wu \(2025\) Small Models are LLM Knowledge Triggers for Medical Tabular Prediction\. In The Thirteenth International Conference on Learning Representations\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- J\. Yan, B\. Zheng, H\. Xu, Y\. Zhu, D\. Chen, J\. Sun, J\. Wu, and J\. Chen \(2024\) Making Pre\-Trained Language Models Great on Tabular Prediction\. In The Twelfth International Conference on Learning Representations\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- J\. Yang, O\. Lindenbaum, and Y\. Kluger \(2022a\) Locally Sparse Neural Networks for Tabular Biomedical Data\. In International Conference on Machine Learning, pp\. 25123–25153\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3), [Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px4.p1.1), [Appendix T](https://arxiv.org/html/2606.05441#A20.SS0.SSS0.Px7.p1.1)\.
- T\. Yang, Y\. Wang, Z\. Yue, Y\. Yang, Y\. Tong, and J\. Bai \(2022b\) Graph Pointer Neural Networks\. In Proceedings of the AAAI conference on Artificial Intelligence, Vol\. 36, pp\. 8832–8839\. External Links: [Document](https://dx.doi.org/10.1609/aaai.v36i8.20864) Cited by: [§1](https://arxiv.org/html/2606.05441#S1.p2.1)\.
- K\. Yata and M\. Aoshima \(2009\) PCA Consistency for Non\-Gaussian Data in High Dimension, Low Sample Size Context\. Communications in Statistics\-Theory and Methods 38\(16\-17\), pp\. 2634–2652\. External Links: [Document](https://dx.doi.org/10.1080/03610910902936083) Cited by: [Proposition D\.6](https://arxiv.org/html/2606.05441#A4.Thmtheorem6.p1.11.11)\.
- H\. Ye, S\. Liu, and W\. H\. Chao \(2025a\) A Closer Look at TabPFN v2: Understanding Its Strengths and Extending Its Capabilities\. Advances in Neural Information Processing Systems 38, pp\. 135605–135637\. Cited by: [Appendix S](https://arxiv.org/html/2606.05441#A19.SS0.SSS0.Px1.p1.2)\.
- H\. Ye, H\. Yin, D\. Zhan, and W\. Chao \(2025b\) Revisiting Nearest Neighbor for Tabular Data: A Deep Tabular Baseline Two Decades Later\. In The Thirteenth International Conference on Learning Representations\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- H\. Ye, J\. Li, H\. Zhao, D\. Guo, and Y\. Chang \(2025c\) LLM Meeting Decision Trees on Tabular Data\. In Advances in Neural Information Processing Systems\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- J\. Yoon, J\. Jordon, and M\. Van der Schaar \(2019\) INVASE: Instance\-Wise Variable Selection Using Neural Networks\. In International Conference on Learning Representations\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.
- M\. Zaheer, S\. Kottur, S\. Ravanbakhsh, B\. Poczos, R\. R\. Salakhutdinov, and A\. J\. Smola \(2017\) Deep Sets\. Advances in Neural Information Processing Systems 30\. Cited by: [§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px2.p1.1), [§E\.6](https://arxiv.org/html/2606.05441#A5.SS6.SSS0.Px3.p1.1), [Appendix F](https://arxiv.org/html/2606.05441#A6.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.05441#S1.p3.1)\.
- T\. Zhang, Z\. A\. Zhang, Z\. Fan, H\. Luo, F\. Liu, Q\. Liu, W\. Cao, and L\. Jian \(2023\) OpenFE: Automated Feature Generation with Expert\-Level Performance\. In International Conference on Machine Learning, pp\. 41880–41901\. Cited by: [Appendix A](https://arxiv.org/html/2606.05441#A1.p1.3)\.

This supplementary document supports our main paper, GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High\-Dimensional Data \(submitted to the Forty\-Third International Conference on Machine Learning \(ICML\) 2026\)\. Specifically, it includes:

- Detailed Related Work in Sec\. [A](https://arxiv.org/html/2606.05441#A1)
- Theoretical Characterization of GO\-LR: Complexity and TSP Connections in Sec\. [B](https://arxiv.org/html/2606.05441#A2)
- NSC Variants and NSC as a Shared Piecewise Pooling Operator in Sec\. [C](https://arxiv.org/html/2606.05441#A3)
- NSC as a Structured Dimensionality Reduction Layer in Sec\. [D](https://arxiv.org/html/2606.05441#A4)
- Why Feature Ordering? Local Neighborhoods Enable Structure\-Aware Compression in Sec\. [E](https://arxiv.org/html/2606.05441#A5)
- Feature Ordering \- When to Use? Through the Lens of Locality in Sec\. [F](https://arxiv.org/html/2606.05441#A6)
- Detailed Comparative Results in Sec\. [G](https://arxiv.org/html/2606.05441#A7)
- GOTabPFN Hyperparameters in Sec\. [H](https://arxiv.org/html/2606.05441#A8)
- Statistical Significance Analysis in Sec\. [I](https://arxiv.org/html/2606.05441#A9)
- Additional Ablation Analysis in Sec\. [J](https://arxiv.org/html/2606.05441#A10)
- Representation Quality via t\-SNE in Sec\. [K](https://arxiv.org/html/2606.05441#A11)
- Inference Level Ablation on Calibration and Robustness in Sec\. [L](https://arxiv.org/html/2606.05441#A12)
- Sanity and Stress Diagnostics in Sec\. [M](https://arxiv.org/html/2606.05441#A13)
- Additional Reliability and Interpretability Diagnostics in Sec\. [N](https://arxiv.org/html/2606.05441#A14)
- Theory\-Inspired Representation Diagnostics in Sec\. [O](https://arxiv.org/html/2606.05441#A15)
- OOD and Local Sensitivity Diagnostics in Sec\. [P](https://arxiv.org/html/2606.05441#A16)
- Deployment\-Oriented Triage Diagnostics in Sec\. [Q](https://arxiv.org/html/2606.05441#A17)
- Extension beyond TabPFN in Sec\. [R](https://arxiv.org/html/2606.05441#A18)
- TabPFN Seed Sensitivity in Sec\. [S](https://arxiv.org/html/2606.05441#A19)
- Additional Clarifications in Sec\. [T](https://arxiv.org/html/2606.05441#A20)

## Appendix A Detailed Related Work

Tabular Models\. Recent tabular deep learning has advanced rapidly beyond early attention/MLP baselines\. In\-context learning\-based models such as TabICL \(Jingang et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib107)\) aim to provide strong out\-of\-the\-box performance \(especially in low\-data regimes\), while methods like TabR \(Gorishniy et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib108)\) and TabM \(Gorishniy et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib104)\) improve classical deep tabular pipelines via nearest\-neighbor augmentation or parameter\-efficient ensembling\. Hybrid and low\-label settings are explored by TANDEM \(Naor and Lindenbaum, [2025](https://arxiv.org/html/2606.05441#bib.bib111)\), and context optimization for scalable prior\-fitted networks is studied in TuneTables \(Feuer et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib112)\)\. TabDPT pre\-trains a row\-token tabular foundation model by combining retrieval\-based in\-context learning with self\-supervised column\-masking on large\-scale real datasets, enabling strong generalization to unseen tasks without per\-dataset tuning \(Ma et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib97)\)\. At the same time, strong “simple” baselines remain highly competitive: carefully pre\-tuned MLPs such as Real\-MLP \(Holzmüller et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib110)\) can be surprisingly hard to beat, and gradient\-boosted decision trees, especially XGBoost \(Chen and Guestrin, [2016](https://arxiv.org/html/2606.05441#bib.bib109)\), LightGBM \(Ke et al\., [2017](https://arxiv.org/html/2606.05441#bib.bib106)\), and CatBoost \(Prokhorenkova et al\., [2018](https://arxiv.org/html/2606.05441#bib.bib105)\), continue to thrive as robust, high\-performing workhorses on many tabular benchmarks\.

Feature Selection\-based HDLSS Specific Models\. Feature\-selection methods are particularly relevant for HDLSS tabular learning, where selecting a compact, informative subset of features is often more critical than scaling model capacity\. ProtoGate \(Jiang et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib98)\) addresses this regime with prototype\-guided gating that selects features by aligning samples with class prototypes, improving robustness when samples are scarce\. Several works perform instance\-wise or differentiable subset selection: INVASE \(Yoon et al\., [2019](https://arxiv.org/html/2606.05441#bib.bib100)\) learns a sample\-dependent feature selector trained with prediction and selection regularization, while STG \(Yamada et al\., [2020](https://arxiv.org/html/2606.05441#bib.bib102)\) uses stochastic gates to enable end\-to\-end feature selection with sparsity control\. LSPIN/LLSPIN \(Yang et al\., [2022a](https://arxiv.org/html/2606.05441#bib.bib99)\) further emphasizes interpretable, per\-instance gating for tabular inputs\. Complementarily, L2X \(Chen et al\., [2018](https://arxiv.org/html/2606.05441#bib.bib101)\) and REAL\-X \(Jethani et al\., [2021](https://arxiv.org/html/2606.05441#bib.bib103)\) learn differentiable selection/explanation mechanisms that identify a small set of features sufficient for prediction, offering a principled way to trade accuracy for sparsity and interpretability in low\-sample settings\. RKNN\-FS \(Li et al\., [2011](https://arxiv.org/html/2606.05441#bib.bib172)\) addresses HDLSS learning through feature selection, using a random $k$\-nearest neighbor \(RKNN\) strategy to identify informative features before prediction\.

LLMs for Tabular Data\. Motivated by general\-purpose LLMs \(Brown et al\., [2020](https://arxiv.org/html/2606.05441#bib.bib26); Achiam et al\., [2023](https://arxiv.org/html/2606.05441#bib.bib27); Guo et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib28), [2025](https://arxiv.org/html/2606.05441#bib.bib29)\), recent work adapts them to tabular prediction via fine\-tuning \(e\.g\., LIFT \(Dinh et al\., [2022](https://arxiv.org/html/2606.05441#bib.bib30)\) on GPT\-3 \(Brown et al\., [2020](https://arxiv.org/html/2606.05441#bib.bib26)\), TabLLM \(Hegselmann et al\., [2023](https://arxiv.org/html/2606.05441#bib.bib31)\) on T0 \(Sanh et al\., [2022](https://arxiv.org/html/2606.05441#bib.bib37)\)\) and tabular\-centric pretraining for transfer/instruction following \(e\.g\., TP\-BERTa \(Yan et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib32)\), GTL \(Wen et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib33)\)\)\. Other lines use prompting or hybrid pipelines, including correlated\-text semi\-supervision \(P2T \(Nam et al\., [2024b](https://arxiv.org/html/2606.05441#bib.bib34)\)\), weak\-learner boosting \(Summary \(Manikandan et al\., [2023](https://arxiv.org/html/2606.05441#bib.bib38)\)\), feature synthesis \(FeatLLM \(Han et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib35)\)\), synergy learning with tabular backbones \(SERSAL \(Yan et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib36)\)\), LLM\-guided rule/feature generation \(Nam et al\., [2024a](https://arxiv.org/html/2606.05441#bib.bib39); Zhang et al\., [2023](https://arxiv.org/html/2606.05441#bib.bib40)\), rule refinement without LLM fine\-tuning \(Ye et al\., [2025c](https://arxiv.org/html/2606.05441#bib.bib41)\), metadata\-driven distillation \(Shi et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib42)\), and order\-bias mitigation \(ROTATOR\-LLM \(Wang et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib128)\)\)\. Despite encouraging efforts, current tabular LLMs remain poorly suited to HDLSS and very high\-dimensional tables, since attention\-based architectures incur quadratic cost in the token/feature dimension\.

TabPFN and Variants\. TabPFN introduced the idea of a tabular foundation model that performs in\-context learning for small tabular classification problems \(Hollmann et al\., [2023](https://arxiv.org/html/2606.05441#bib.bib114)\)\. This was later substantially extended and validated at scale in TabPFN v2 \(Hollmann et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib113)\)\. Recent follow\-ups push accuracy and scalability along multiple axes: TabPFN\-2\.5 advances the state of the art in tabular foundation models \(Grinsztajn et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib115)\), TabPFN Unleashed \(BETA\) targets practical scalability and effectiveness on broader settings \(Liu and Ye, [2025](https://arxiv.org/html/2606.05441#bib.bib116)\), and TabPFN\-Wide explores continued pre\-training for extreme feature counts \(Kolberg et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib117)\)\. Orthogonally, LoCalPFN investigates retrieval and fine\-tuning mechanisms for in\-context tabular models \(Thomas et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib118)\)\. Our work is complementary to these efforts: rather than modifying TabPFN itself, we develop a stable, structure\-aware compression interface that enables TabPFN\-style predictors to operate reliably in HDLSS regimes with $m \gg n$ and very large feature counts\.

Other Models\. MLP\-PLR is a strong MLP baseline for tabular data that replaces raw continuous inputs with learnable piecewise\-linear \(PLR\) feature embeddings, improving expressivity and performance over standard MLPs \(Gorishniy et al\., [2022](https://arxiv.org/html/2606.05441#bib.bib90)\)\. Other tabular models such as TabNet \(Arik and Pfister, [2021](https://arxiv.org/html/2606.05441#bib.bib144)\), TabTransformer \(Huang et al\., [2020](https://arxiv.org/html/2606.05441#bib.bib78)\)/FT\-Transformer \(Gorishniy et al\., [2021](https://arxiv.org/html/2606.05441#bib.bib79)\), SAINT \(Somepalli et al\., [2022](https://arxiv.org/html/2606.05441#bib.bib80)\), AutoInt \(Song et al\., [2019](https://arxiv.org/html/2606.05441#bib.bib145)\), and interaction\-centric networks \(DeepFM \(Guo et al\., [2017](https://arxiv.org/html/2606.05441#bib.bib146)\), DCN \(Wang et al\., [2017](https://arxiv.org/html/2606.05441#bib.bib147)\)\) differ primarily in how they represent columns and capture cross\-feature dependencies: TabNet performs step\-wise attentive feature selection via sparse masks for interpretability, while transformer families tokenize columns \(especially categorical features\) and use self\-attention to model contextual interactions, with FT\-Transformer providing a lighter mixed\-type variant; SAINT further strengthens tabular transformers via augmentation and contrastive/self\-supervised pretraining, and AutoInt targets high\-order interactions directly through attention\. A complementary line treats tabular inputs as feature sequences, including ordering\-based approaches \(TabSeq \(Habib et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib142)\)\) and recurrent processing \(TabulaRNN \(Thielmann and Samiee, [2024](https://arxiv.org/html/2606.05441#bib.bib148)\)\), which impose position\-aware inductive bias but may depend on the quality of the chosen order\. More recent sequence backbones replace attention with state\-space modeling, e\.g\., Mambular \(Thielmann et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib141)\), MambaTab \(Ahamed and Cheng, [2024](https://arxiv.org/html/2606.05441#bib.bib149)\), and hybrids like MambAttention \(Thielmann and Samiee, [2024](https://arxiv.org/html/2606.05441#bib.bib148)\), leveraging Mamba\-style selective state\-space layers for efficient long\-range dependency modeling\. Tree\-inspired inductive biases remain competitive through differentiable ensembles \(NODE \(Popov et al\., [2020](https://arxiv.org/html/2606.05441#bib.bib151)\), ENODE \(OpenTabular Contributors, [2025](https://arxiv.org/html/2606.05441#bib.bib155)\), NDTF \(Kontschieder et al\., [2015](https://arxiv.org/html/2606.05441#bib.bib150)\)\) that emulate decision trees with end\-to\-end training, alongside classical ensembles \(Random Forest, AdaBoost, GBM\) that provide strong, stable baselines\. Finally, representation and regularization advances such as DANets \(Chen et al\., [2022](https://arxiv.org/html/2606.05441#bib.bib87)\), ResNetTabular \(OpenTabular Contributors, [2025](https://arxiv.org/html/2606.05441#bib.bib155)\), CategoryEmbedding \(Gorishniy et al\., [2022](https://arxiv.org/html/2606.05441#bib.bib90)\), TANGOS \(Jeffares et al\., [2023](https://arxiv.org/html/2606.05441#bib.bib152)\), and metric/contrastive methods \(ModernNCA \(Ye et al\., [2025b](https://arxiv.org/html/2606.05441#bib.bib154)\), Trompt \(Chen et al\., [2023](https://arxiv.org/html/2606.05441#bib.bib153)\)\) improve robustness and optimization on noisy, small\-sample settings, while standard baselines \(Naive Bayes, KNN, SVM, Decision Tree, Lasso, MLP, 1\-D CNN\) remain important reference points due to their interpretability and well\-characterized trade\-offs\. GeoAggregator \(Deng et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib159)\) and ZAYAN \(Habib et al\., [2026c](https://arxiv.org/html/2606.05441#bib.bib158)\) represent recent domain\-specific geospatial tabular deep learning models, with GeoAggregator targeting spatially aware geospatial regression and ZAYAN focusing on feature\-level contrastive learning for tabular remote sensing and environmental data\.

Feature Ordering and Permutation Sensitivity\. Mambular \(Thielmann et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib141)\) first highlighted that random permutations of tabular feature order can induce brittleness in predictive performance, even under fixed seeds\. Concurrently, TabSeq \(Habib et al\., [2024](https://arxiv.org/html/2606.05441#bib.bib142)\) explicitly introduced a feature ordering algorithm for tabular deep learning, establishing column permutation as a learnable design choice rather than a nuisance factor\. Later, ROTATOR\-LLM \(Wang et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib128)\) extended this direction by studying feature ordering for LLM\-based tabular inference\. DynaTab \(Habib et al\., [2026b](https://arxiv.org/html/2606.05441#bib.bib157)\) systematically studied when feature ordering matters in high\-dimensional tabular learning, introducing an Intrinsic Dimensionality Factor \(IDF\) and a feature\-to\-sample ratio $\rho = m/n$ based categorization of dataset regimes\. It proposed a neuroscience\-inspired Dynamic Feature Ordering \(DFO\) algorithm and showed that sequence\-sensitive backbones such as Transformers, LSTMs, denoising autoencoders, and SSM\-style models can benefit from adaptive ordering\. However, its use of vanilla sequence backbones incurs substantial memory and runtime costs, motivating more compact ordering\-aware tokenization and compression strategies\.

We use Fig\. [A\.1](https://arxiv.org/html/2606.05441#A1.F1) to position GOTabPFN within this landscape according to hyperparameter tuning requirements for tabular models: unlike PFN/ICL\-style models that require no dataset\-specific tuning and conventional tabular learners that are fully tuned, GOTabPFN occupies the middle regime by tuning only the GO\-LR\+NSC front\-end while retaining a frozen TabPFN\-2\.5 backbone\.

![Refer to caption](https://arxiv.org/html/2606.05441v1/gotabpfn_55baselines_venn.png) Figure A\.1: GOTabPFN tuning regime\. GOTabPFN sits between frozen PFN/ICL\-style tabular foundation models and fully tuned tabular learners: only the GO\-LR\+NSC front\-end is tuned, while TabPFN\-2\.5 remains frozen\. See Appendix [T](https://arxiv.org/html/2606.05441#A20) for more clarifications\.
## Appendix B Theoretical Characterization of GO\-LR: Complexity and TSP Connections

### B\.1 GO\-LR as a TSP\-Style Initialization with Local Refinement

GO\-LR does not solve MinLA exactly; instead, our implementation constructs a permutation by greedy seriation over a pairwise dissimilarity matrix\. The initialization coincides with a nearest\-neighbor heuristic for a TSP\-path objective defined on a complete graph, and GO\-LR then locally refines the resulting permutation under the dispersion objective\.

###### Definition B\.1 \(TSP\-path Objective on a Complete Graph\)\.

Given a complete weighted graph $\mathcal{K} = (V, \binom{V}{2}, d)$ with edge weights $d_{ij} \geq 0$, we define the path cost of a permutation $\sigma = (\sigma_1, \dots, \sigma_m)$ by Eq\. [8](https://arxiv.org/html/2606.05441#S3.E8), repeated below as Eq\. [26](https://arxiv.org/html/2606.05441#A2.E26) for convenience\.

$$\mathrm{PathCost}(\sigma) \;=\; \sum_{t=1}^{m-1} d_{\sigma_t,\sigma_{t+1}} \tag{26}$$

###### Lemma B\.2 \(Nearest\-Neighbor Heuristic\)\.

The greedy procedure “start at $\arg\min_{i} \sum_{j} d_{ij}$ and repeatedly append the nearest unvisited node” returns a Hamiltonian path in $\mathcal{K}$ and is a standard nearest\-neighbor heuristic for minimizing Eq\. \([8](https://arxiv.org/html/2606.05441#S3.E8)\) \(Rosenkrantz et al\., [1977](https://arxiv.org/html/2606.05441#bib.bib12)\)\.

###### Proof sketch\.

At each step exactly one new unvisited node is appended; thus the resulting sequence visits each node exactly once and forms a Hamiltonian path\. The choice “nearest unvisited” is precisely the nearest\-neighbor rule for minimizing Eq\. \([8](https://arxiv.org/html/2606.05441#S3.E8)\)\. ∎
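The greedy construction in Lemma B\.2 can be sketched in a few lines of Python\. This is a minimal illustration, not the authors' implementation: the function names `nearest_neighbor_path` and `path_cost`, the toy distance matrix, and the tie\-breaking rule \(first index wins, as `numpy.argmin` does\) are all assumptions made for the example\.

```python
import numpy as np

def nearest_neighbor_path(d):
    """Greedy nearest-neighbor Hamiltonian path on a complete graph.

    d is a symmetric (m, m) matrix of nonnegative dissimilarities.
    Ties are broken by lowest index (an assumption of this sketch).
    """
    d = np.asarray(d, dtype=float)
    m = d.shape[0]
    start = int(np.argmin(d.sum(axis=1)))      # arg min_i sum_j d_ij
    path = [start]
    visited = np.zeros(m, dtype=bool)
    visited[start] = True
    for _ in range(m - 1):
        # Mask visited nodes so only unvisited ones can be selected.
        dist = np.where(visited, np.inf, d[path[-1]])
        nxt = int(np.argmin(dist))
        path.append(nxt)
        visited[nxt] = True
    return path

def path_cost(d, sigma):
    """Path cost of Eq. (26): sum of d between consecutive items."""
    d = np.asarray(d, dtype=float)
    return float(sum(d[a, b] for a, b in zip(sigma[:-1], sigma[1:])))

# Hypothetical toy data: five "features" placed on a line.
x = np.array([0.0, 4.0, 1.0, 3.0, 2.0])
d = np.abs(x[:, None] - x[None, :])

sigma = nearest_neighbor_path(d)
print(sigma, path_cost(d, sigma))              # [4, 2, 0, 3, 1] 6.0
print(path_cost(d, [0, 2, 4, 3, 1]))           # 4.0 (sorted order)
```

On this toy input the greedy path visits every node exactly once, as the lemma states, but its cost \(6\) exceeds that of the sorted order \(4\)\. The nearest\-neighbor rule is therefore a heuristic rather than an exact solver, which is consistent with using it only as an initialization\.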

###### Theorem B\.3 \(The Initialization Step of GO\-LR Can Be Used as a TSP\-path Heuristic\)\.

For any complete weighted graph $\mathcal{K}$, the GO\-LR local seriation rule \(nearest\-neighbor\) can be used as a TSP\-path heuristic: it outputs a Hamiltonian path $\sigma$ for $\mathcal{K}$\.

###### Proof sketch\.

Given $\mathcal{K}$, run the greedy nearest\-neighbor construction on its distance matrix $[d_{ij}]$\. By Lemma [B\.2](https://arxiv.org/html/2606.05441#A2.Thmtheorem2), the output is a Hamiltonian path and corresponds to a permutation $\sigma$\. ∎
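As an illustration of Lemma B\.2, the following minimal sketch implements the greedy nearest\-neighbor construction and the path cost of Eq\. 26 on a dense dissimilarity matrix\. The names `nn_path` and `path_cost` and the toy points are illustrative assumptions, not the authors' implementation, and the local\-refinement stage of GO\-LR is omitted\.

```python
import numpy as np

def nn_path(d):
    """Greedy nearest-neighbor Hamiltonian path on a dense distance matrix."""
    d = np.asarray(d, dtype=float)
    m = d.shape[0]
    start = int(np.argmin(d.sum(axis=1)))  # start at argmin_i sum_j d_ij
    order = [start]
    visited = np.zeros(m, dtype=bool)
    visited[start] = True
    for _ in range(m - 1):
        # Mask visited nodes, then pick the node nearest to the current endpoint.
        row = np.where(visited, np.inf, d[order[-1]])
        nxt = int(np.argmin(row))
        order.append(nxt)
        visited[nxt] = True
    return order

def path_cost(d, order):
    """Sum of consecutive dissimilarities along the path (Eq. 26)."""
    d = np.asarray(d, dtype=float)
    return float(sum(d[a, b] for a, b in zip(order[:-1], order[1:])))

# Toy example (hypothetical data): four points on a line.
pts = np.array([0.0, 1.0, 3.0, 7.0])
d = np.abs(pts[:, None] - pts[None, :])
order = nn_path(d)
cost = path_cost(d, order)
print(order, cost)  # [1, 0, 2, 3] 8.0
```

On this toy input the greedy path costs 8, whereas the sorted path 0, 1, 2, 3 costs 7, which illustrates that the rule is a heuristic rather than an exact solver\.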

##### GO\-LR vs\. classic metaheuristics\.

To assess whether GO\-LR is overly limited by its greedy construction, we replace GO\-LR with stronger stochastic/metaheuristic orderings while keeping the downstream NSC \+ TabPFN\-2\.5 pipeline fixed\. On Colon, although Simulated Annealing \(Kirkpatrick et al\., [1983](https://arxiv.org/html/2606.05441#bib.bib165)\), Genetic Algorithm \(Larranaga et al\., [1999](https://arxiv.org/html/2606.05441#bib.bib166)\), Ant Colony Optimization \(Dorigo and Gambardella, [2002](https://arxiv.org/html/2606.05441#bib.bib167)\), and Christofides\-based ordering \(Christofides, [2022](https://arxiv.org/html/2606.05441#bib.bib168)\) often achieve lower TSP\-style surrogate cost, they do not improve downstream accuracy; GO\-LR obtains the best $5\times 5$ CV accuracy \($0.8818\pm 0.1005$\), the lowest runtime \(10\.07 s\), and the best MinLA\-style dispersion objective\. We observe the same trend in accuracy and macro\-F1 on the larger Cell Cycle \(Mahdessian et al\., [2021](https://arxiv.org/html/2606.05441#bib.bib162); NCBI, [2021](https://arxiv.org/html/2606.05441#bib.bib163)\) RNA\-seq dataset \($n=1067$, $m=42728$\): GO\-LR\+NSC\+TabPFN\-2\.5 achieves $79.94\pm 2.53$ accuracy, $92.36\pm 1.36$ AUC, and $79.95\pm 2.51$ macro\-F1, compared to $76.45\pm 2.29$, $92.76\pm 1.10$, and $76.42\pm 2.29$ for Simulated Annealing\+NSC\+TabPFN\-2\.5; AUC is comparable between the two\. Consistent with our theory, GO\-LR attains the lower MinLA cost on Cell Cycle \($8.14\times 10^{11}$ vs\. $8.51\times 10^{11}$\), while Simulated Annealing attains the lower TSP\-path cost; this is expected, since GO\-LR is designed to optimize the MinLA\-style dispersion objective rather than a TSP surrogate alone\. These results suggest that optimizing a TSP surrogate more aggressively does not necessarily yield better feature orderings for NSC\+TabPFN\-2\.5; GO\-LR is better aligned with the MinLA\-style objective and provides a stronger accuracy\-efficiency tradeoff in practice\.

Table B\.1: GO\-LR vs\. classic metaheuristics on Colon\. Lower runtime, TSP cost, and MinLA are better; higher accuracy is better\. Accuracy is reported over $5\times 5$ CV\. Christofides and Simulated Annealing are implemented using NetworkX \(Hagberg et al\., [2008](https://arxiv.org/html/2606.05441#bib.bib169)\)\.

##### On global optimality\.

GO\-LR does not guarantee a globally optimal ordering, since the underlying MinLA\-style feature ordering problem is combinatorial and NP\-hard\. Instead, it is a structured approximation strategy: clustering induces local feature graphs, NNPath provides an efficient initialization, and local refinement explicitly reduces the MinLA\-style dispersion objective\. Empirically, replacing GO\-LR with stronger metaheuristics such as Simulated Annealing, Genetic Algorithm, Ant Colony Optimization, and Christofides\-based ordering does not improve downstream NSC\+TabPFN\-2\.5 performance\. On both Colon and the larger Cell Cycle transcriptomic dataset, GO\-LR achieves better downstream accuracy and lower MinLA cost despite some alternatives attaining lower TSP\-style surrogate cost, suggesting that GO\-LR is better aligned with the objective that matters for ordering\-aware compression and prediction\.

##### GO\-LR cluster\-size sensitivity\.

We ablate the GO\-LR cluster size $k$ on Colon while fixing all other hyperparameters to the best tuned configuration\. As shown in Table [B\.2](https://arxiv.org/html/2606.05441#A2.T2), performance peaks at $k=10$, reproducing the best Colon result \($88.18_{\pm 10.05}$\) in a fresh run\. The trend is non\-monotonic: small $k$ values appear too coarse to capture local sample heterogeneity, while overly large $k$ values fragment the data into noisier local graphs, weakening the aggregated global ordering\. This supports using a moderate cluster size as the best trade\-off\.

Table B\.2: GO\-LR cluster\-size sensitivity on Colon\. All settings except $k$ are fixed to the best tuned configuration\. Values are mean accuracy with subscripted standard deviation over $5\times 5$ CV\.

## Appendix C NSC Variants and NSC as a Shared Piecewise Pooling Operator

### C\.1 Intrinsic\-Dimension Rules and Budgets for NSC Variants

For completeness, we describe the intrinsic\-dimension estimators and budget rules used by the remaining NSC variants\. Recall that $\tilde{X}\in\mathbb{R}^{n\times m}$ is the standardized training matrix\. Eq\. [27](https://arxiv.org/html/2606.05441#A3.E27) defines its covariance \(or correlation\) matrix $\Sigma$, whose nonzero eigenvalues are $\{\lambda_{i}\}_{i=1}^{r}$ with $r\leq\min(n,m)$\.

$$\Sigma\;=\;\frac{1}{n-1}\tilde{X}^{\top}\tilde{X}\in\mathbb{R}^{m\times m} \tag{27}$$

##### Effective\-rank intrinsic dimension\.

Besides the PCA cumulative\-variance rule in Eqs\. \([18](https://arxiv.org/html/2606.05441#S3.E18)\)\-\([20](https://arxiv.org/html/2606.05441#S3.E20)\), we also consider an effective\-rank estimate \(Roy and Vetterli, [2007](https://arxiv.org/html/2606.05441#bib.bib21); Halko et al\., [2011](https://arxiv.org/html/2606.05441#bib.bib22)\)\. We normalize the eigenvalues into a distribution $p_{i}$ \(Eq\. [28](https://arxiv.org/html/2606.05441#A3.E28)\) and define the effective rank $d_{\mathrm{eff}}$ as the exponential of its Shannon entropy \(Eq\. [29](https://arxiv.org/html/2606.05441#A3.E29)\), with a small constant $\epsilon>0$ for numerical stability\. Some variants set $\hat{d}=d_{\mathrm{eff}}$ instead of $\hat{d}_{\mathrm{PCA}}(\tau)$\.

$$p_{i}\;=\;\frac{\lambda_{i}}{\sum_{j=1}^{r}\lambda_{j}} \tag{28}$$

$$d_{\mathrm{eff}}\;=\;\exp\!\left(-\sum_{i=1}^{r}p_{i}\log\!\big(p_{i}+\epsilon\big)\right) \tag{29}$$

##### IDF\-based budget modulation\.

We define the IDF \(Habib et al\., [2026b](https://arxiv.org/html/2606.05441#bib.bib157)\) in Eq\. [30](https://arxiv.org/html/2606.05441#A3.E30), which compares intrinsic and ambient dimensionality\. An IDF\-based budget rule optionally used in NSC and NSC\-P is defined by Eq\. [31](https://arxiv.org/html/2606.05441#A3.E31), where $\beta$ controls the redundancy allowance and $M_{\min}$ and $M_{\max}$ bound the token budget\.

$$\mathrm{IDF}\;=\;\frac{\hat{d}}{m},\qquad\hat{d}\in\big\{d_{\mathrm{eff}},\ \hat{d}_{\mathrm{PCA}}(\tau)\big\} \tag{30}$$

$$M=\mathrm{clip}\Big(\big\lceil(1+\beta(1-\mathrm{IDF}))\,\hat{d}\big\rceil,\;M_{\min},\;\min(M_{\max},m)\Big),\qquad\beta\in[0,1] \tag{31}$$

##### Variant summary\.

- NSC: uses a fixed, user\-chosen $M$ \(no intrinsic\-dimension estimate\)\.
- NSC\-P: chooses $\hat{d}$ via either $d_{\mathrm{eff}}$ or $\hat{d}_{\mathrm{PCA}}(\tau)$ and then applies an IDF\- or $\gamma$\-based rule such as Eq\. [32](https://arxiv.org/html/2606.05441#A3.E32):

  $$M=\mathrm{clip}\big(\lceil\gamma\hat{d}\rceil,\;M_{\min},\;\min(M_{\max},m)\big) \tag{32}$$
- NSC\-SP: uses a fixed $M$ but applies SegPCA pooling \(Eq\. [17](https://arxiv.org/html/2606.05441#S3.E17)\)\.
- NSC\-pSP: uses the PCA\-based rule in Eqs\. \([20](https://arxiv.org/html/2606.05441#S3.E20)\)\-\([21](https://arxiv.org/html/2606.05441#S3.E21)\) \(described in the main text\)\.
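The two budget rules \(Eqs\. 31 and 32\) can be sketched as follows\. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the defaults $\gamma=2$, $M_{\min}=32$, and $M_{\max}=512$ are borrowed from the default rule quoted in Appendix D, and $\beta=0.5$ is an arbitrary assumption\. The sketch also assumes $m\geq M_{\min}$\.

```python
import math

def clip(v, lo, hi):
    """Clamp v to the interval [lo, hi] (assumes lo <= hi)."""
    return max(lo, min(v, hi))

def budget_gamma(d_hat, m, gamma=2.0, m_min=32, m_max=512):
    """gamma-based rule: M = clip(ceil(gamma * d_hat), M_min, min(M_max, m))."""
    return int(clip(math.ceil(gamma * d_hat), m_min, min(m_max, m)))

def budget_idf(d_hat, m, beta=0.5, m_min=32, m_max=512):
    """IDF-based rule: the multiplier grows as IDF = d_hat / m shrinks."""
    idf = d_hat / m
    raw = math.ceil((1.0 + beta * (1.0 - idf)) * d_hat)
    return int(clip(raw, m_min, min(m_max, m)))

# Hypothetical inputs: intrinsic-dimension estimate 40, ambient dimension 2000.
print(budget_gamma(40, 2000))  # 80
print(budget_idf(40, 2000))    # 60
print(budget_gamma(5, 2000))   # 32 (lower bound active)
```

With $\beta\in[0,1]$ the IDF rule multiplies $\hat{d}$ by a factor between 1 and 2, so it never exceeds the $\gamma=2$ rule before clipping\.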

##### Computational note\.

In HDLSS regimes \($m\gg n$\), we avoid forming a full eigen\-decomposition of the $m\times m$ matrix $\Sigma$ in Eq\. \([27](https://arxiv.org/html/2606.05441#A3.E27)\) by computing the nonzero spectrum via the $n\times n$ Gram matrix $\tilde{X}\tilde{X}^{\top}$ or using randomized methods \(Halko et al\., [2011](https://arxiv.org/html/2606.05441#bib.bib22)\)\. The resulting eigenvalues are then re\-used in the intrinsic\-dimension rules above\.
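The Gram\-matrix shortcut and the effective\-rank estimate of Eqs\. 28\-29 can be sketched together\. The helper names, the relative eigenvalue threshold, and the random test matrix are illustrative assumptions; the sketch relies only on the fact that $\tilde{X}^{\top}\tilde{X}$ and $\tilde{X}\tilde{X}^{\top}$ share their nonzero eigenvalues\.

```python
import numpy as np

def nonzero_spectrum(X_std, rel_tol=1e-10):
    """Nonzero eigenvalues of Sigma = X^T X / (n - 1), via the n x n Gram matrix."""
    n = X_std.shape[0]
    gram = X_std @ X_std.T / (n - 1)   # n x n instead of m x m
    lam = np.linalg.eigvalsh(gram)     # ascending order
    lam = lam[lam > rel_tol * lam.max()]
    return lam[::-1]                   # descending order

def effective_rank(lam, eps=1e-12):
    """Exponential of the Shannon entropy of the normalized spectrum."""
    p = lam / lam.sum()
    return float(np.exp(-np.sum(p * np.log(p + eps))))

# Hypothetical HDLSS matrix: n = 20 samples, m = 200 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize columns
lam = nonzero_spectrum(X)
d_eff = effective_rank(lam)
print(len(lam), d_eff)
```

Because centering removes one degree of freedom, at most $n-1$ eigenvalues are nonzero here, so $d_{\mathrm{eff}}\leq n-1$ regardless of $m$\.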

###### Proposition C\.1 \(NSC as a Shared Piecewise Pooling Operator\)\.

Let $\Pi^{*}$ be a fixed global feature permutation and let $\{\mathcal{S}_{t}\}_{t=1}^{M}$ be the contiguous segments defined by NSC\. The Neuro\-Inspired Subunit Compression defines the mapping in Eq\. [33](https://arxiv.org/html/2606.05441#A3.E33), given explicitly by Eq\. [34](https://arxiv.org/html/2606.05441#A3.E34), where the same pooling function $g_{\theta}$ is shared across all segments\.

$$F_{\text{NSC}}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{M\times d} \tag{33}$$

$$F_{\text{NSC}}(x)=\Big(g_{\theta}\!\big(\psi(x^{\Pi}_{\mathcal{S}_{1}})\big),\,\dots,\,g_{\theta}\!\big(\psi(x^{\Pi}_{\mathcal{S}_{M}})\big)\Big) \tag{34}$$

Then $F_{\text{NSC}}$ is a piecewise pooling operator with the following properties: \(1\) Locality\. Each output token depends only on a contiguous subset of the ordered feature axis; \(2\) Parameter sharing\. The number of trainable parameters is independent of $m$ and depends only on $g_{\theta}$; and \(3\) Linear complexity\. For fixed pooling depth, $F_{\text{NSC}}$ can be evaluated in $O(m)$ time and $O(Md)$ space per sample\.

Proposition[C\.1](https://arxiv.org/html/2606.05441#A3.Thmtheorem1)shows that NSC induces a structured, order\-aware compression operator that preserves locality while remaining scalable to extreme HDLSS regimes\.

###### Proof sketch\.

By construction, the ordered feature vector $x^{\Pi}$ is partitioned into $M$ disjoint contiguous segments whose union covers $\{1,\dots,m\}$\. Each segment is processed independently by the same pooling function $g_{\theta}$, establishing locality and parameter sharing\. Since each feature appears in exactly one segment and $g_{\theta}$ has constant depth, the total number of operations scales linearly with $m$, yielding $O(m)$ time complexity\. The output representation stores $M$ vectors of dimension $d$, giving $O(Md)$ space complexity\. ∎
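The operator in Proposition C\.1 can be sketched for the scalar case $d=1$\. The sketch assumes near\-equal contiguous segments, a mean descriptor for $\psi$, and identity $g_{\theta}$; these are illustrative simplifications \(the paper tunes segmentation, descriptor, and pooling\), and the function names are hypothetical\.

```python
import numpy as np

def contiguous_segments(m, M):
    """Split ordered positions 0..m-1 into M contiguous, near-equal blocks."""
    return np.array_split(np.arange(m), M)

def nsc_pool(x, perm, M, psi=np.mean):
    """Reorder x by perm, then summarize each contiguous segment with psi."""
    x_ord = np.asarray(x, dtype=float)[perm]  # x^Pi
    segments = contiguous_segments(len(perm), M)
    return np.array([psi(x_ord[seg]) for seg in segments])

# Hypothetical sample whose correlated features are interleaved in raw order.
x = [10.0, 1.0, 11.0, 2.0, 12.0, 3.0]
perm = np.array([1, 3, 5, 0, 2, 4])  # ordering that groups similar features
z = nsc_pool(x, perm, M=2)
print(z)  # [ 2. 11.]
```

Each feature is read once and each output token depends on one segment only, which matches the locality and $O(m)$ claims of the proposition for this simple choice of $\psi$\.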

## Appendix D NSC as a Structured Dimensionality Reduction Layer

In this section we view the NSC \(Section[3\.2](https://arxiv.org/html/2606.05441#S3.SS2)\) as an explicit Dimensionality Reduction \(DR\) layer that maps high\-dimensional GO\-LR\-ordered features into a low\-dimensional latent space\. We first formalize NSC as a structured DR map, then relate it to classical DR methods, and finally outline and instantiate an empirical protocol comparing NSC to Principal Component Analysis \(PCA\)\(Hotelling,[1933](https://arxiv.org/html/2606.05441#bib.bib46)\), Random Projections \(RP\)\(Johnson and Lindenstrauss,[1984](https://arxiv.org/html/2606.05441#bib.bib48)\), Uniform Manifold Approximation and Projection \(UMAP\)\(McInneset al\.,[2018](https://arxiv.org/html/2606.05441#bib.bib50)\), Pairwise Controlled Manifold Approximation \(PaCMAP\)\(Wanget al\.,[2021](https://arxiv.org/html/2606.05441#bib.bib164)\), and Autoencoders \(AE\)\(Hinton and Salakhutdinov,[2006](https://arxiv.org/html/2606.05441#bib.bib47)\)quantitatively\.

### D\.1 NSC as a Structured DR Map

Recall that GO\-LR produces a global feature permutation $\Pi^{\ast}$ that approximately solves a weighted MinLA problem \(Section [3\.1](https://arxiv.org/html/2606.05441#S3.SS1)\), bringing highly correlated or low\-dissimilarity features into local neighborhoods along the ordered axis\. NSC then segments this ordered axis into $M$ contiguous subunits $\{\mathcal{S}_{t}\}_{t=1}^{M}$ and applies a shared pooling operator $g_{\theta}\circ\psi$ to each segment \(Eqs\. [12](https://arxiv.org/html/2606.05441#S3.E12)\-[17](https://arxiv.org/html/2606.05441#S3.E17)\)\. For a sample $x\in\mathbb{R}^{m}$, we write the ordered feature vector as in Eq\. [35](https://arxiv.org/html/2606.05441#A4.E35) and define segments $\mathcal{S}_{t}\subset\{1,\dots,m\}$ that partition the ordered axis\. The $t$\-th meta\-feature is defined by Eq\. [36](https://arxiv.org/html/2606.05441#A4.E36), yielding a meta\-feature sequence $Z(x)=(z_{1},\dots,z_{M})\in\mathbb{R}^{M\times d}$ \(Eqs\. [16](https://arxiv.org/html/2606.05441#S3.E16)\-[17](https://arxiv.org/html/2606.05441#S3.E17)\)\. In our implementation we often set $d=1$ \(scalar tokens\) and flatten $Z(x)$ into a vector in $\mathbb{R}^{M}$\. Formally, NSC defines the mapping in Eq\. [37](https://arxiv.org/html/2606.05441#A4.E37), as summarized in Proposition [C\.1](https://arxiv.org/html/2606.05441#A3.Thmtheorem1)\. When $d=1$ and we flatten across segments, this reduces to the DR map in Eq\. [38](https://arxiv.org/html/2606.05441#A4.E38)\. Thus NSC acts as a structured DR layer whose output dimensionality $M\ll m$ is explicitly controlled by the meta\-feature budget\.

$$x^{\Pi}=\big(x_{\Pi^{*}(1)},\dots,x_{\Pi^{*}(m)}\big) \tag{35}$$

$$z_{t}=g_{\theta}\!\big(\psi(u_{t})\big),\qquad u_{t}:=x^{\Pi}_{\mathcal{S}_{t}} \tag{36}$$

$$F_{\text{NSC}}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{M\times d},\qquad F_{\text{NSC}}(x)=\Big(g_{\theta}\!\big(\psi(x^{\Pi}_{\mathcal{S}_{1}})\big),\dots,g_{\theta}\!\big(\psi(x^{\Pi}_{\mathcal{S}_{M}})\big)\Big) \tag{37}$$

$$\Phi_{\text{NSC}}:\mathbb{R}^{m}\to\mathbb{R}^{M},\qquad\Phi_{\text{NSC}}(x)=\mathrm{flatten}\big(F_{\text{NSC}}(x)\big) \tag{38}$$

##### Structured locality\.

Unlike global DR methods such as PCA \(Hotelling, [1933](https://arxiv.org/html/2606.05441#bib.bib46)\), $\Phi_{\text{NSC}}$ has built\-in locality: each coordinate of the compressed representation depends only on a contiguous block of GO\-LR\-ordered features\. In particular, the $t$\-th coordinate of $\Phi_{\text{NSC}}(x)$ is a pooled summary of the subunit $u_{t}=x^{\Pi}_{\mathcal{S}_{t}}$, where $\mathcal{S}_{t}$ corresponds to a locally redundant neighborhood in the GO\-LR ordering\. This induces a piecewise pooling structure: different coordinates of the latent vector correspond to different ordered blocks rather than global linear mixtures of all features\.

##### Linear and nonlinear regimes\.

If both $\psi$ and $g_{\theta}$ are linear maps, NSC reduces to a structured linear DR operator\. Writing $\psi(u_{t})=A_{t}u_{t}$ and $g_{\theta}(v)=w^{\top}v$, we obtain Eq\. [39](https://arxiv.org/html/2606.05441#A4.E39), where $P_{\Pi}$ is the permutation matrix for $\Pi^{\ast}$ and $P_{\mathcal{S}_{t}}$ selects segment indices\. Stacking across $t$ yields a linear map $W_{\text{NSC}}x$ with a structured block\-sparse pattern aligned with the ordered segments\. When $\psi$ or $g_{\theta}$ include nonlinear statistics \(e\.g\., quantiles, skewness, kurtosis, shallow MLP\), NSC becomes a local nonlinear DR layer with the same segmentation structure\.

$$z_{t}=w^{\top}A_{t}u_{t}=w^{\top}A_{t}\,x^{\Pi}_{\mathcal{S}_{t}}=w^{\top}A_{t}\,P_{\mathcal{S}_{t}}P_{\Pi}x \tag{39}$$
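To make the block\-sparse pattern of $W_{\text{NSC}}$ concrete, the sketch below builds the matrix for the simplest linear case, where each row averages one contiguous segment of the ordered axis\. Mean pooling, near\-equal segments, and the name `nsc_linear_matrix` are assumptions made for illustration only\.

```python
import numpy as np

def nsc_linear_matrix(perm, M):
    """Row t holds weights 1/|S_t| on the original indices of segment t."""
    perm = np.asarray(perm)
    m = len(perm)
    W = np.zeros((M, m))
    for t, seg in enumerate(np.array_split(np.arange(m), M)):
        # perm[seg] maps ordered positions back to original feature indices,
        # which is the combined effect of P_{S_t} P_Pi in Eq. 39.
        W[t, perm[seg]] = 1.0 / len(seg)
    return W

# Hypothetical permutation and sample.
perm = np.array([1, 3, 5, 0, 2, 4])
x = np.array([10.0, 1.0, 11.0, 2.0, 12.0, 3.0])
W = nsc_linear_matrix(perm, M=2)
z_matrix = W @ x
# Direct computation: reorder, split into contiguous segments, average.
z_direct = np.array([seg.mean() for seg in np.array_split(x[perm], 2)])
print(z_matrix, z_direct)  # [ 2. 11.] [ 2. 11.]
```

Every column of `W` has exactly one nonzero entry, so each original feature contributes to a single meta\-feature, in contrast to the dense projection matrices of PCA or random projection\.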

##### Intrinsic dimension\-aware budget\.

Section [3\.2](https://arxiv.org/html/2606.05441#S3.SS2) ties the meta\-feature budget $M$ to an estimate of intrinsic dimensionality $\hat{d}$ via effective rank \(Eqs\. [27](https://arxiv.org/html/2606.05441#A3.E27)\-[29](https://arxiv.org/html/2606.05441#A3.E29)\)\. Our default rule \(Eq\. [21](https://arxiv.org/html/2606.05441#S3.E21)\) sets $M$ as in Eq\. [40](https://arxiv.org/html/2606.05441#A4.E40); optionally, the IDF\-based budget \(Eq\. [31](https://arxiv.org/html/2606.05441#A3.E31)\) modulates $M$ by redundancy\. In HDLSS regimes with $m\gg n$, we estimate $\hat{d}$ via the $n\times n$ Gram matrix or randomized methods to avoid forming a full $m\times m$ covariance\. In all cases, $M$ scales with intrinsic rather than ambient dimension, so NSC compresses more aggressively when the effective rank is low\.

$$M=\mathrm{clip}\big(\lceil 2\hat{d}\rceil,\;32,\;\min(512,m)\big) \tag{40}$$

### D\.2 Relation to Classical Dimensionality Reduction

NSC is conceptually related to, but distinct from, standard DR techniques\.

##### PCA and RP\.

PCA \(Hotelling, [1933](https://arxiv.org/html/2606.05441#bib.bib46)\) computes a global linear projection $x\mapsto U_{k}^{\top}x$ that optimizes variance preservation in $\mathbb{R}^{k}$\. RP maps $x\mapsto Rx$ with a dense random matrix $R$ and approximately preserves pairwise distances by Johnson–Lindenstrauss guarantees \(Johnson and Lindenstrauss, [1984](https://arxiv.org/html/2606.05441#bib.bib48)\)\. Both treat coordinates symmetrically and mix all features into each latent dimension\. By contrast, NSC first reorders features via GO\-LR so that correlated variables are neighbors, then pools each contiguous neighborhood into a meta\-feature\. This yields:

- Local support: Each latent coordinate depends on a small, interpretable block of features rather than all $m$\.
- Order\-aware pooling: GO\-LR ensures that each block tends to group features with small dispersion under the MinLA objective, so pooled statistics are computed over structurally coherent neighborhoods\.
- Flexible statistics: NSC variants can incorporate robust or higher\-order statistics via $\psi$ without changing the dimensionality $M$\. NSC variants can also use PCA for IDF estimation or for post\-segmentation summarization\.

In the special case of linear $\psi$ and $g_{\theta}$, NSC is a constrained linear DR method whose projection matrix has a block structure aligned with the GO\-LR ordering, whereas PCA uses a dense orthogonal mixing and RP uses an unstructured dense matrix\.

##### Autoencoders\.

Autoencoders \(AE\) \(Hinton and Salakhutdinov, [2006](https://arxiv.org/html/2606.05441#bib.bib47)\) learn a parametric encoder\-decoder pair $(f_{\phi},h_{\psi})$ with a low\-dimensional bottleneck $z\in\mathbb{R}^{k}$ optimized to minimize reconstruction error\. While AEs can capture nonlinear structure, they require training, are sensitive to sample size, and their bottleneck dimensions often entangle information from all features\. NSC can be seen as a deterministic, data\-dependent encoder with no decoder: it compresses features into $M$ meta\-features using shared, shallow pooling and no reconstruction objective\. In HDLSS regimes, NSC has two advantages: \(i\) it does not require fitting a heavy parametric model on small $n$, and \(ii\) its pooling structure is constrained by GO\-LR ordering, which reduces overfitting and enforces a biologically motivated inductive bias\.

##### UMAP\.

UMAP\(McInneset al\.,[2018](https://arxiv.org/html/2606.05441#bib.bib50)\)is a nonlinear manifold\-learning method that constructs a neighborhood graph and optimizes a low\-dimensional embedding to preserve local topological structure\. Although this makes UMAP effective for visualization and neighborhood preservation, the learned coordinates are global embedding dimensions whose relation to the original features is indirect\. In contrast, NSC preserves an explicit feature\-axis interpretation: each meta\-feature is computed from a contiguous GO\-LR\-induced neighborhood, making the compression more directly tied to feature locality and block structure in HDLSS settings\.

##### PaCMAP\.

PaCMAP\(Wanget al\.,[2021](https://arxiv.org/html/2606.05441#bib.bib164)\)is another nonlinear DR method that balances nearby, mid\-near, and far pair relationships to preserve both local and global geometry in the embedded sample space\. Like UMAP, it learns low\-dimensional coordinates over samples rather than constructing interpretable feature\-block summaries\. NSC instead operates along the feature dimension: after GO\-LR induces a locality\-aware ordering, NSC pools contiguous feature segments into meta\-features, which is better aligned with our goal of preserving order\-induced feature neighborhoods for downstream TabPFN\-style prediction\.

### D\.3 Empirical Behavior on HDLSS Benchmarks

We instantiate the protocol below on block\-structured synthetic HDLSS datasets, using $M=32$ as an aggressive DR setting\. For each dataset we compare NSC \(GO\-LR ordering \+ Optuna \(Akiba et al\., [2019](https://arxiv.org/html/2606.05441#bib.bib43)\) tuned segmentation, descriptor, and pooling\) with PCA, AE, UMAP, PaCMAP, and RP using:

- Linear\-probe accuracy \(Logistic Regression\) \(Cox, [1958](https://arxiv.org/html/2606.05441#bib.bib51)\),
- $k$NN accuracy in latent space \(Cover and Hart, [1967](https://arxiv.org/html/2606.05441#bib.bib52)\),
- Silhouette \(Rousseeuw, [1987](https://arxiv.org/html/2606.05441#bib.bib53)\) and Davies–Bouldin \(Davies and Bouldin, [1979](https://arxiv.org/html/2606.05441#bib.bib54)\) scores for class separability\.
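The four metrics listed above can be computed on any embedding with scikit\-learn, as in the minimal sketch below\. The cross\-validation scheme, $k=5$ neighbors, and the toy embedding are assumptions of ours; the source does not specify these settings, and `evaluate_embedding` is a hypothetical helper name\.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_embedding(Z, y, k=5, cv=5):
    """Linear-probe and kNN accuracy plus two class-separability scores."""
    lin = cross_val_score(LogisticRegression(max_iter=1000), Z, y, cv=cv)
    knn = cross_val_score(KNeighborsClassifier(n_neighbors=k), Z, y, cv=cv)
    return {
        "linear_acc": float(lin.mean()),
        "knn_acc": float(knn.mean()),
        # Class labels play the role of cluster assignments here.
        "silhouette": float(silhouette_score(Z, y)),          # higher is better
        "davies_bouldin": float(davies_bouldin_score(Z, y)),  # lower is better
    }

# Hypothetical 4-dimensional embedding with two well-separated classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 40)
Z = rng.normal(size=(80, 4)) + 5.0 * y[:, None]
scores = evaluate_embedding(Z, y)
print(scores)
```

In a real comparison the embedding itself \(PCA, AE, NSC, and so on\) would need to be fit on training folds only, otherwise the probe accuracies are optimistic\.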

### D\.4 Block\-Structured Synthetic HDLSS Model

To better understand when NSC should dominate classical DR, we also study a synthetic family of HDLSS distributions with explicit block structure \(Ver Steeg et al\., [2019](https://arxiv.org/html/2606.05441#bib.bib56)\)\. This controlled ablation is inspired by the block\-subunit HDLSS modeling philosophy of BSTabDiff \(Habib et al\., [2026a](https://arxiv.org/html/2606.05441#bib.bib156)\), which views high\-dimensional tabular data as arising from latent feature blocks governed by shared subunit factors\. In our setting, we adapt this idea only as a diagnostic synthetic benchmark: features are generated from block\-correlated latent factors, then randomly permuted, allowing us to test whether GO\-LR \+ NSC can recover useful local neighborhoods for compression and downstream prediction\. Concretely, we construct datasets where $m$ features are partitioned into $B$ contiguous blocks, each governed by a shared latent factor plus Gaussian noise \(Devijver and Gallopin, [2018](https://arxiv.org/html/2606.05441#bib.bib57)\)\. Class information is injected via mean shifts on a small subset of blocks, while the remaining blocks are pure noise\. After generation, we randomly permute feature indices so that class\-informative blocks are no longer contiguous in the raw feature space \(Simon and Tibshirani, [2012](https://arxiv.org/html/2606.05441#bib.bib55)\)\. Under this model, GO\-LR with a correlation\-based metric tends to recover an ordering that approximately re\-groups highly correlated features into contiguous neighborhoods, effectively reconstructing the underlying blocks\. NSC then segments along this ordering and pools each block into one or a few meta\-features\. As a result, the NSC embedding approximates block\-level sufficient statistics \(block means and low\-order summaries\), whereas PCA, AE, RP, UMAP, and PaCMAP operate through global mixing, parametric encoding, random projection, or sample\-space manifold embedding, and have no explicit bias toward recovering feature\-block boundaries\.
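A generator in the spirit of this description can be sketched as follows\. All sizes, the shift and noise scales, and the function name are hypothetical choices, and non\-informative blocks are modeled here as carrying a shared latent factor with no class signal; the paper's exact settings are not given in this section\.

```python
import numpy as np

def make_block_hdlss(n=60, B=20, s=25, n_informative=3,
                     shift=1.0, noise=0.5, seed=0):
    """Block-correlated features with class mean shifts, then a random permutation."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    h = rng.normal(size=(n, B))                  # one latent factor per block
    h[:, :n_informative] += shift * y[:, None]   # class signal in a few blocks
    X = np.repeat(h, s, axis=1) + noise * rng.normal(size=(n, B * s))
    block_id = np.repeat(np.arange(B), s)
    perm = rng.permutation(B * s)                # hide the block contiguity
    return X[:, perm], y, block_id[perm]

X, y, block_id = make_block_hdlss()
C = np.corrcoef(X, rowvar=False)
same = block_id[:, None] == block_id[None, :]
off_diag = ~np.eye(X.shape[1], dtype=bool)
within = np.abs(C[same & off_diag]).mean()
between = np.abs(C[~same]).mean()
print(X.shape, within, between)
```

Within\-block correlations are much larger than between\-block ones, which is the signal a correlation\-based ordering would need in order to regroup the permuted features\.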

We instantiate this block\-structured synthetic model in the HDLSS regime \($n\ll m$\) and compare four NSC variants \(NSC, NSC\-P, NSC\-SP, NSC\-pSP\) against PCA, AE, RP, UMAP, and PaCMAP using the same evaluation suite as above \(linear and $k$NN probe accuracy, silhouette, and Davies\-Bouldin; see Section [D\.3](https://arxiv.org/html/2606.05441#A4.SS3)\)\. Over $10$ independent repetitions \(Table [D\.1](https://arxiv.org/html/2606.05441#A4.T1)\), the NSC family achieves the best mean result on all four metrics: NSC\-pSP obtains the highest mean linear accuracy \($0.8488\pm 0.0406$\), best silhouette \($0.0460\pm 0.0257$\), and lowest DB \($4.1198\pm 0.9657$\), while NSC\-SP achieves the highest mean $k$NN accuracy \($0.7313\pm 0.0722$\)\. Thus, NSC\-pSP is the strongest overall variant, with NSC\-SP providing the best neighborhood\-probe performance \(Table [D\.1](https://arxiv.org/html/2606.05441#A4.T1), Panel B\)\. Consistently, Friedman tests detect significant differences across the nine methods for all metrics \(all $p\leq 9.79\times 10^{-5}$; Table [D\.2](https://arxiv.org/html/2606.05441#A4.T2)\)\. Using one\-sided Wilcoxon tests with NSC\-pSP as the reference, NSC\-pSP shows a clear linear\-probe advantage over all baselines and the other NSC variants, including PCA \($p=9.77\times 10^{-4}$\), AE \($p=0.003906$\), RP \($p=9.77\times 10^{-4}$\), UMAP \($p=0.004883$\), PaCMAP \($p=0.002930$\), NSC \($p=0.001953$\), NSC\-P \($p=0.003906$\), and NSC\-SP \($p=0.03711$\)\.

For neighborhood\-based structure, NSC\-pSP significantly improves $k$NN accuracy over PCA and RP \(both $p=0.001953$\), NSC \($p=0.04492$\), and NSC\-P \($p=0.004883$\), while differences versus AE \($p=0.3467$\), NSC\-SP \($p=0.9561$\), UMAP \($p=0.6152$\), and PaCMAP \($p=0.5771$\) are not significant at $\alpha=0.05$\. For clustering separability, NSC\-pSP yields higher silhouette than PCA, RP, and NSC\-P \(all $p=9.77\times 10^{-4}$\), AE \($p=0.002930$\), and NSC \($p=0.01953$\), whereas gains versus NSC\-SP, UMAP, and PaCMAP are not significant\. Finally, NSC\-pSP achieves significantly lower DB than PCA, RP, and NSC\-P \(all $p=9.77\times 10^{-4}$\), AE \($p=0.002930$\), NSC \($p=0.009766$\), and PaCMAP \($p=0.01367$\), with no significant difference versus NSC\-SP or UMAP\. Overall, this controlled experiment highlights that the PCA\-IDF \+ segment\-wise PCA tokenization in NSC\-pSP better matches the block\-correlated generative structure, yielding improved predictive separability and strong local/clustering structure relative to both classical and nonlinear DR baselines, as well as earlier NSC variants\. Table [D\.3](https://arxiv.org/html/2606.05441#A4.T3) further shows that NSC\-pSP consistently outperforms PCA, AE, UMAP, and PaCMAP on linear accuracy, with positive $\Delta$ CIs and favorable effect sizes; it also clearly improves DB over PCA, AE, and PaCMAP, while $k$NN accuracy and silhouette are essentially tied against UMAP/PaCMAP\.
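Several of the reported $p$\-values equal $9.77\times 10^{-4}$, and a short exact computation shows where that number comes from\. With 10 paired repetitions there are $2^{10}=1024$ equally likely sign patterns under the null, so the smallest attainable one\-sided $p$\-value is $1/1024\approx 9.77\times 10^{-4}$\. The sketch below is our own illustration \(hypothetical helper name and toy differences, assuming no ties and no zero differences\), not the authors' test code\.

```python
from itertools import product

def wilcoxon_exact_greater(diffs):
    """Exact one-sided p-value P(W+ >= observed) for the signed-rank test.

    Assumes no zero differences and no ties among the absolute differences.
    """
    by_abs = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for r, i in enumerate(by_abs, start=1):
        ranks[i] = r
    w_obs = sum(r for r, d in zip(ranks, diffs) if d > 0)
    hits = 0
    total = 0
    # Under the null, every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(diffs)):
        total += 1
        if sum(r for r, s in zip(ranks, signs) if s) >= w_obs:
            hits += 1
    return hits / total

# Hypothetical paired differences over 10 repetitions.
all_wins = [0.1 * k for k in range(1, 11)]  # reference wins every repetition
one_loss = [-0.1] + all_wins[1:]            # smallest difference flips sign
p_all = wilcoxon_exact_greater(all_wins)
p_one = wilcoxon_exact_greater(one_loss)
print(p_all, p_one)  # 0.0009765625 0.001953125
```

Read this way, a reported $p=9.77\times 10^{-4}$ would be consistent with NSC\-pSP winning all 10 paired repetitions against that baseline; this is an inference from the arithmetic, not a statement made in the source\.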

Table D\.1: NSC variants vs\. PCA/AE/RP/UMAP/PaCMAP on the block\-structured synthetic HDLSS model with $M=32$ \(10 independent reps\)\. *Overall results across repetitions \(Mean $\pm$ Std\.; higher is better for Acc\./Sil\., lower is better for DB\)\.*

Table D\.2: Nonparametric comparisons on the block\-structured synthetic HDLSS model \($M=32$, 10 reps\)\. Friedman tests compare all nine methods; Wilcoxon signed\-rank tests are one\-sided with NSC\-pSP as the reference \(higher is better for accuracies/silhouette; lower is better for DB\)\. Here, DB = Davies\-Bouldin, sil\. = silhouette\.

Table D\.3: Additional competitiveness analyses of NSC\-pSP vs\. PCA/AE/UMAP/PaCMAP on the block\-structured synthetic HDLSS model \($M=32$, 10 reps\)\. $\Delta$ denotes paired improvement of NSC\-pSP over the baseline \(for Acc\./Sil\.: $\Delta=\text{NSC-pSP}-\text{base}$; for DB: $\Delta=\text{base}-\text{NSC-pSP}$, so that $\Delta>0$ favors NSC\-pSP\)\. 95% CIs are paired bootstrap percentile intervals of the mean $\Delta$ \(over reps\)\. W/T/L counts wins/ties/losses across reps\. Wilcoxon $p$ is paired two\-sided; $r_{\mathrm{rb}}$ is the matched\-pairs rank\-biserial effect size\. Avg\. ranks are computed over all 9 methods per rep \(1 = best\)\.

#### D\.4\.1 A Stylized Block–Subunit HDLSS Model

Inspired by BSTabDiff\(Habibet al\.,[2026a](https://arxiv.org/html/2606.05441#bib.bib156)\), we formalize a simple generative model that matches the inductive bias of GO\-LR \+ NSC\.

###### Definition D\.1 \(Block–subunit HDLSS model\)\.

Let $m$ features be partitioned into $M$ disjoint blocks $\{\mathcal{S}_{t}\}_{t=1}^{M}$ of equal size $s$, so that $m=Ms$ and $\mathcal{S}_{t}\subset\{1,\dots,m\}$, $\mathcal{S}_{t}\cap\mathcal{S}_{t'}=\emptyset$ for $t\neq t'$\. For each block $t$, we define a latent block signal $h_{t}\in\mathbb{R}$ and independent noise variables $\{\epsilon_{j}\}_{j\in\mathcal{S}_{t}}$ distributed as in Eq\. [41](https://arxiv.org/html/2606.05441#A4.E41)\. The observed features are defined by Eq\. [42](https://arxiv.org/html/2606.05441#A4.E42)\. The label $Y$ is conditionally independent of $X$ given the block signals, as stated in Eq\. [43](https://arxiv.org/html/2606.05441#A4.E43)\.

$$\epsilon_{j}\sim\mathcal{N}(0,\sigma^{2}),\quad\text{i.i.d. across }j \tag{41}$$

$$X_{j}=h_{t}+\epsilon_{j},\qquad j\in\mathcal{S}_{t},\;t=1,\dots,M \tag{42}$$

$$P(Y\mid X)=P(Y\mid h_{1},\dots,h_{M}) \tag{43}$$

We consider a setting where GO\-LR, applied with a correlation\-based metric, recovers an ordering $\Pi^{\ast}$ that makes the blocks contiguous \(up to permutation of the blocks\)\. NSC then segments the GO\-LR axis so that each segment $\mathcal{S}_{t}$ corresponds to one block\.

##### NSC configuration in the block model\.

In this setting, NSC with \(i\) segmentation aligned with the blocks $\{\mathcal{S}_{t}\}_{t=1}^{M}$, and \(ii\) a simple mean descriptor $\psi(u_{t})=\frac{1}{|\mathcal{S}_{t}|}\sum_{j\in\mathcal{S}_{t}}u_{t}[j]$ with identity pooling $g_{\theta}(v)=v$, produces meta\-features by Eq\. [44](https://arxiv.org/html/2606.05441#A4.E44)\. Stacking across $t$ yields the NSC embedding $\Phi_{\mathrm{NSC}}(X)=(z_{1},\dots,z_{M})\in\mathbb{R}^{M}$\.

$$z_{t}=\frac{1}{|\mathcal{S}_{t}|}\sum_{j\in\mathcal{S}_{t}}X_{j},\qquad t=1,\dots,M \tag{44}$$

###### Proposition D\.2 \(Block means are sufficient for Bayes classification\)\.

Consider the block model in Definition [D\.1](https://arxiv.org/html/2606.05441#A4.Thmtheorem1) in a two\-class mean\-shift setting where the features are distributed as in Eq\. [45](https://arxiv.org/html/2606.05441#A4.E45), with $\sigma^{2}$ known and $\mu_{t,0},\mu_{t,1}$ possibly varying across blocks $t$\. Then for each block $\mathcal{S}_{t}$ the block sum \(Casella and Berger, [2024](https://arxiv.org/html/2606.05441#bib.bib60)\) in Eq\. [46](https://arxiv.org/html/2606.05441#A4.E46) is a sufficient statistic for $(\mu_{t,0},\mu_{t,1})$, and the joint log\-likelihood ratio \(Neyman and Pearson, [1933](https://arxiv.org/html/2606.05441#bib.bib58)\) for $Y$ based on all features $X$ depends only on the collection $\{S_{t}\}_{t=1}^{M}$\. Consequently, any Bayes\-optimal classifier \(Devroye et al\., [1996](https://arxiv.org/html/2606.05441#bib.bib59)\) based on $X$ can be written as a function of the NSC block means $z_{t}=S_{t}/s$\.

$$X_{j}\mid Y=y\;\sim\;\mathcal{N}(\mu_{t,y},\sigma^{2}),\qquad j\in\mathcal{S}_{t},\;t=1,\dots,M,\;y\in\{0,1\} \tag{45}$$

$$S_{t}:=\sum_{j\in\mathcal{S}_{t}}X_{j} \tag{46}$$

###### Proof sketch\.

Within each block $t$, conditional on $Y=y$, the observations $\{X_j\}_{j\in\mathcal{S}_t}$ are i\.i\.d\. Gaussian with mean $\mu_{t,y}$ and variance $\sigma^{2}$\. The joint likelihood within block $t$ is given by Eq\. [47](https://arxiv.org/html/2606.05441#A4.E47)\. Rewriting the exponent shows that the part of this likelihood that involves $\mu_{t,y}$ depends on the data only through the block sum $S_t=\sum_{j\in\mathcal{S}_t}X_j$ \(equivalently the block mean $z_t$\)\. Thus $S_t$ \(or $z_t$\) is a sufficient statistic for $\mu_{t,y}$ in the exponential\-family sense\. Across blocks, conditional independence implies Eq\. [48](https://arxiv.org/html/2606.05441#A4.E48), so the global log\-likelihood ratio $\log p(X\mid Y=1)-\log p(X\mid Y=0)$ depends on $X$ only through $\{S_t\}_{t=1}^{M}$, i\.e\., only through $\{z_t\}_{t=1}^{M}$\. Therefore any Bayes\-optimal decision rule $\mathrm{sign}(\log p(Y=1\mid X)-\log p(Y=0\mid X))$ can be written as a function of $(z_1,\dots,z_M)$\.

$$p(X_{\mathcal{S}_t}\mid Y=y)=\prod_{j\in\mathcal{S}_t}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\Big(-\frac{(X_j-\mu_{t,y})^{2}}{2\sigma^{2}}\Big) \tag{47}$$

$$p(X\mid Y=y)=\prod_{t=1}^{M}p(X_{\mathcal{S}_t}\mid Y=y) \tag{48}$$

∎
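The "rewriting the exponent" step can be spelled out as a short worked derivation\. It assumes every block has size $s=|\mathcal{S}_t|$; the symbol $\Lambda_t$ for the per\-block log\-likelihood ratio is introduced here for illustration only\.

```latex
\begin{aligned}
\log p(X_{\mathcal{S}_t}\mid Y=y)
 &= -\frac{s}{2}\log(2\pi\sigma^{2})
    -\frac{1}{2\sigma^{2}}\sum_{j\in\mathcal{S}_t}(X_j-\mu_{t,y})^{2} \\
 &= -\frac{s}{2}\log(2\pi\sigma^{2})
    -\frac{1}{2\sigma^{2}}\sum_{j\in\mathcal{S}_t}X_j^{2}
    +\frac{\mu_{t,y}}{\sigma^{2}}\,S_t
    -\frac{s\,\mu_{t,y}^{2}}{2\sigma^{2}}, \\
\Lambda_t
 &:= \log p(X_{\mathcal{S}_t}\mid Y=1)-\log p(X_{\mathcal{S}_t}\mid Y=0)
  = \frac{\mu_{t,1}-\mu_{t,0}}{\sigma^{2}}\,S_t
    -\frac{s\,(\mu_{t,1}^{2}-\mu_{t,0}^{2})}{2\sigma^{2}} .
\end{aligned}
```

The $\sum_j X_j^{2}$ term does not depend on $y$ and cancels in $\Lambda_t$\. Summing $\Lambda_t$ over blocks \(the product structure of Eq\. 48\) gives a global log\-likelihood ratio that is affine in $(S_1,\dots,S_M)$\.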

###### Lemma D\.4\(SNR gain of NSC vs Random Projection in a block\)\.

Consider one block $\mathcal{S}$ of size $s$ under a simple two\-class Gaussian mean\-shift model \(Eq\. [49](https://arxiv.org/html/2606.05441#A4.E49)\) with independent coordinates and fixed $\Delta\neq 0$, $\sigma^{2}>0$ \(Dasgupta and Gupta, [2003](https://arxiv.org/html/2606.05441#bib.bib62); Devroye et al\., [1996](https://arxiv.org/html/2606.05441#bib.bib59); Johnson and Lindenstrauss, [1984](https://arxiv.org/html/2606.05441#bib.bib48)\)\.

$$X_j\mid Y=0\sim\mathcal{N}(0,\sigma^{2}),\qquad X_j\mid Y=1\sim\mathcal{N}(\Delta,\sigma^{2}),\qquad j\in\mathcal{S} \tag{49}$$

1. The NSC block mean \(Eq\. [50](https://arxiv.org/html/2606.05441#A4.E50)\) has class\-conditional mean difference $\mathbb{E}[z_{\mathrm{NSC}}\mid Y=1]-\mathbb{E}[z_{\mathrm{NSC}}\mid Y=0]=\Delta$ and variance $\mathrm{Var}(z_{\mathrm{NSC}}\mid Y)=\sigma^{2}/s$, which gives the SNR in Eq\. [51](https://arxiv.org/html/2606.05441#A4.E51)\.

   $$z_{\mathrm{NSC}}=\frac{1}{s}\sum_{j\in\mathcal{S}}X_j \tag{50}$$

   $$\mathrm{SNR}_{\mathrm{NSC}}:=\frac{\big(\mathbb{E}[z_{\mathrm{NSC}}\mid Y=1]-\mathbb{E}[z_{\mathrm{NSC}}\mid Y=0]\big)^{2}}{\mathrm{Var}(z_{\mathrm{NSC}}\mid Y)}=\frac{\Delta^{2}s}{\sigma^{2}} \tag{51}$$

2. Let $u=(u_j)_{j\in\mathcal{S}}$ be a random projection direction with $\sum_{j\in\mathcal{S}}u_j^{2}=1$ and entries of order $1/\sqrt{s}$, and define $z_{\mathrm{RP}}$ by Eq\. [52](https://arxiv.org/html/2606.05441#A4.E52)\. Its class\-conditional mean difference and variance are given by Eq\. [53](https://arxiv.org/html/2606.05441#A4.E53)\. For typical random $u$, $\mathbb{E}\big[(\sum_j u_j)^{2}\big]=1$, so the typical SNR of $z_{\mathrm{RP}}$ is given by Eq\. [54](https://arxiv.org/html/2606.05441#A4.E54)\.

   $$z_{\mathrm{RP}}=\sum_{j\in\mathcal{S}}u_j X_j \tag{52}$$

   $$\mathbb{E}[z_{\mathrm{RP}}\mid Y=1]-\mathbb{E}[z_{\mathrm{RP}}\mid Y=0]=\Delta\sum_{j\in\mathcal{S}}u_j,\qquad \mathrm{Var}(z_{\mathrm{RP}}\mid Y)=\sigma^{2} \tag{53}$$

   $$\mathrm{SNR}_{\mathrm{RP}}:=\frac{\big(\mathbb{E}[z_{\mathrm{RP}}\mid Y=1]-\mathbb{E}[z_{\mathrm{RP}}\mid Y=0]\big)^{2}}{\mathrm{Var}(z_{\mathrm{RP}}\mid Y)}\approx\frac{\Delta^{2}}{\sigma^{2}} \tag{54}$$

Consequently, in this block model, NSC’s block mean enjoys an SNR gain of a factor $s$ over a typical random projection coordinate \(Eq\. [55](https://arxiv.org/html/2606.05441#A4.E55)\)\.

$$\frac{\mathrm{SNR}_{\mathrm{NSC}}}{\mathrm{SNR}_{\mathrm{RP}}}\approx s \tag{55}$$

###### Proof sketch\.

Part \(1\) follows by linearity of expectation and the variance of the average of $s$ i\.i\.d\. Gaussians\. For part \(2\), the mean and variance of $z_{\mathrm{RP}}$ are obtained by linearity and the constraint $\sum_j u_j^{2}=1$\. For random $u$ with roughly i\.i\.d\. zero\-mean components of variance $1/s$, the cross terms vanish in expectation and we have $\mathbb{E}[(\sum_j u_j)^{2}]=s\cdot\mathbb{E}[u_j^{2}]=1$, so the typical squared signal is $\Delta^{2}$, leading to the stated SNR\. The ratio then simplifies to $s$\. ∎
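The factor\-$s$ gain can be checked numerically\. The sketch below is illustrative only: it assumes random directions drawn as normalized Gaussian vectors, and the function names \(`snr_nsc`, `snr_rp`, `typical_snr_rp`\) and the values of `delta`, `sigma`, and `s` are hypothetical choices, not taken from the paper\.

```python
import numpy as np

def snr_nsc(delta, sigma, s):
    """Closed-form SNR of the block mean (Eq. 51): delta^2 * s / sigma^2."""
    return delta**2 * s / sigma**2

def snr_rp(delta, sigma, u):
    """SNR of z_RP = sum_j u_j X_j for one fixed direction u (Eq. 53)."""
    u = np.asarray(u, dtype=float)
    return (delta * u.sum())**2 / (sigma**2 * np.sum(u**2))

def typical_snr_rp(delta, sigma, s, n_draws=20000, seed=0):
    """Average SNR over random unit directions u = g / ||g||, g ~ N(0, I_s)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_draws, s))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)   # each row has unit norm
    return float(np.mean((delta * u.sum(axis=1))**2) / sigma**2)

delta, sigma, s = 0.5, 1.0, 16
ratio = snr_nsc(delta, sigma, s) / typical_snr_rp(delta, sigma, s)
print(f"SNR_NSC / typical SNR_RP = {ratio:.2f} (block size s = {s})")
```

The printed ratio should be close to $s$, up to Monte Carlo error\. Note that the uniform direction $u_j=1/\sqrt{s}$ recovers the NSC value exactly, so the gain comes from aligning the projection with the shared mean shift rather than from averaging as such\.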

###### Proposition D\.6\(HDLSS stability of NSC block statistics\)\.

In the block model of Definition [D\.1](https://arxiv.org/html/2606.05441#A4.Thmtheorem1), suppose the block size $s$ is fixed while the ambient dimension $m=Ms$ may grow\. For each block $\mathcal{S}_t$, the NSC block mean \(Eq\. [56](https://arxiv.org/html/2606.05441#A4.E56)\) satisfies Eq\. [57](https://arxiv.org/html/2606.05441#A4.E57), and its empirical estimate based on $n$ samples converges to its population value at rate $O_p(n^{-1/2})$, independently of $m$\. By contrast, global linear DR methods such as PCA or dense linear encoders must estimate directions in $\mathbb{R}^{m}$ based on the empirical covariance matrix or large weight matrices of size $O(m)$ or $O(m^{2})$, which is known to be unstable in the HDLSS regime $m\gg n$ without strong structural assumptions \(Jung and Marron, [2009](https://arxiv.org/html/2606.05441#bib.bib65); Yata and Aoshima, [2009](https://arxiv.org/html/2606.05441#bib.bib66); Strawderman, [2014](https://arxiv.org/html/2606.05441#bib.bib63); Cover and Thomas, [2006](https://arxiv.org/html/2606.05441#bib.bib64)\)\. Thus, in this stylized setting, NSC’s local block statistics remain well\-conditioned as $m$ grows, while unconstrained global DR can become ill\-posed\.

$$z_t=\frac{1}{s}\sum_{j\in\mathcal{S}_t}X_j \tag{56}$$

$$z_t\;=\;h_t+\bar{\epsilon}_t,\qquad \bar{\epsilon}_t:=\frac{1}{s}\sum_{j\in\mathcal{S}_t}\epsilon_j\sim\mathcal{N}(0,\sigma^{2}/s) \tag{57}$$

### D\.5Comparison and Evaluation Protocol

To empirically position NSC as a structured DR layer, we adopt the following protocol for both real and synthetic experiments\.

##### Experimental setup\.

For each dataset \(real HDLSS or synthetic block\-structured\):

1. GO\-LR ordering\. Compute the GO\-LR global feature permutation $\Pi^{\ast}$ on the training set using a chosen metric \(e\.g\., correlation, cosine, euclidean, manhattan, or KL divergence\), with local refinement passes as in Algorithm [1](https://arxiv.org/html/2606.05441#alg1)\.
2. Intrinsic dimension\. Estimate $\hat{d}$ via effective rank \(Eqs\. [27](https://arxiv.org/html/2606.05441#A3.E27)\-[29](https://arxiv.org/html/2606.05441#A3.E29)\), and compute IDF $=\hat{d}/m$ \(Eq\. [30](https://arxiv.org/html/2606.05441#A3.E30)\)\.
3. NSC configuration\. Configure NSC with $\Pi^{\ast}$, using:
   - *Default compression:* $M$ chosen by Eq\. [21](https://arxiv.org/html/2606.05441#S3.E21) \(e\.g\., yielding $M\approx 2\hat{d}$ and resulting in tens to a few hundred meta\-features\)\.
   - *Aggressive compression:* fixed $M\in\{32,64\}$ to emulate strong dimensionality reduction\.

   Apply NSC to obtain embeddings $X_{\text{NSC}}\in\mathbb{R}^{n\times M}$\.
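The final step of the setup, pooling an ordered feature axis into $M$ meta\-features, can be sketched as follows\. This is a minimal illustration of mean pooling only: uniform segmentation via `np.array_split`, the function name `nsc_mean_pool`, and the random placeholder ordering are assumptions made here, and the sketch does not implement GO\-LR or the paper's segmentation rules\.

```python
import numpy as np

def nsc_mean_pool(X, order, M):
    """Pool ordered features into M contiguous mean meta-features.

    X     : (n, m) data matrix
    order : length-m permutation of feature indices (stand-in for Pi*)
    M     : number of segments / meta-features, with M <= m
    Uniform segmentation is an assumption made for this illustration.
    """
    X = np.asarray(X, dtype=float)
    order = np.asarray(order)
    if M > len(order):
        raise ValueError("M must not exceed the number of features")
    segments = np.array_split(order, M)      # contiguous along the ordered axis
    return np.column_stack([X[:, seg].mean(axis=1) for seg in segments])

rng = np.random.default_rng(0)
n, m, M = 20, 96, 32
X = rng.standard_normal((n, m))
order = rng.permutation(m)       # placeholder ordering, not a GO-LR solution
X_nsc = nsc_mean_pool(X, order, M)
print(X_nsc.shape)               # (20, 32)
```

Each output column is the mean of one contiguous run of the ordered axis, so the quality of the embedding depends entirely on whether `order` places related features next to each other\.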

##### Baselines\.

For each dataset and target dimension $M$ \(e\.g\., $M=32$\), we construct the following DR baselines:

- PCA: compute $k$ principal components with $k=M$, producing $X_{\text{PCA}}\in\mathbb{R}^{n\times M}$\.
- RP: apply a dense Gaussian projection $R\in\mathbb{R}^{M\times m}$, normalized to preserve variance, yielding $X_{\text{RP}}=XR^{\top}$\.
- AE: train a shallow autoencoder with bottleneck dimension $M$ on the training set, and use the bottleneck activations $X_{\text{AE}}\in\mathbb{R}^{n\times M}$ as the DR representation\.
- UMAP: learn a nonlinear manifold embedding with target dimension $M$, preserving local neighborhood structure in the sample space and producing $X_{\text{UMAP}}\in\mathbb{R}^{n\times M}$\.
- PaCMAP: learn a nonlinear embedding with target dimension $M$ by balancing nearby, mid\-near, and far pair constraints, yielding $X_{\text{PaCMAP}}\in\mathbb{R}^{n\times M}$\.
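The two linear baselines can be sketched in plain NumPy\. This is an illustration under stated assumptions, not the paper's implementation: PCA is fitted via an SVD of the centered training data, the random projection uses i\.i\.d\. $\mathcal{N}(0,1/M)$ entries \(one common scaling that preserves squared norms in expectation\), and the function names `pca_embed` and `rp_embed` are hypothetical\.

```python
import numpy as np

def pca_embed(X_train, X, M):
    """Project X onto the top-M principal directions fitted on X_train."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions of the centered training data.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return (X - mean) @ Vt[:M].T

def rp_embed(X, M, seed=0):
    """Dense Gaussian random projection X R^T with R in R^{M x m}.

    Entries are N(0, 1/M), so squared norms are preserved in expectation;
    this scaling is an assumption of the sketch.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((M, X.shape[1])) / np.sqrt(M)
    return X @ R.T

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 500))      # n = 40 samples, m = 500 features
X_pca = pca_embed(X, X, M=32)
X_rp = rp_embed(X, M=32)
print(X_pca.shape, X_rp.shape)          # (40, 32) (40, 32)
```

With $n$ training samples, centered PCA has at most $n-1$ informative directions, so in HDLSS settings a budget of $M=32$ is only fully usable when $n>32$\.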

##### NSC variants and naming conventions\.

We evaluate a family of NSC variants that share a common pipeline: \(i\) determine an IDF, the ratio of the intrinsic dimension \(ID\) to the actual dimension, to set the internal granularity; \(ii\) segment the feature sequence into contiguous blocks; and \(iii\) compress each block into an $M$\-dimensional representation\. The variants differ in how the IDF is estimated and how each segment is summarized\. We denote the default variant as NSC, which uses an effective\-rank \(data\-driven\) ID estimate and applies statistical descriptor\-based pooling within each segment\. To isolate the impact of a PCA\-inspired ID heuristic while keeping the same descriptor\-based segment summarization, we use NSC\-P \(also written as NSC\-PCA\)\. To study the effect of replacing descriptors with explicit linear projection inside segments, we define NSC\-SP \(also written as NSC\-SegPCA\), which retains the effective\-rank\-based IDF but applies PCA within each segment after segmentation\. Finally, our proposed NSC\-pSP \(also written as NSC\-PIDF\-SegPCA\) combines both modifications: a PCA\-inspired IDF estimate together with per\-segment PCA compression after segmentation\. In all cases, the suffixes indicate the modification relative to NSC: P denotes PCA\-inspired IDF, SP denotes segmented per\-block PCA, and pSP denotes the combination of both\.

##### Quantitative comparisons\.

To compare NSC variants against PCA/RP/AE/UMAP/PaCMAP, we use:

1. Linear\-probe accuracy\. Train a logistic regression classifier on each latent space $X_{\text{NSC}},X_{\text{PCA}},X_{\text{RP}},X_{\text{AE}},X_{\text{UMAP}},X_{\text{PaCMAP}}$ using the same stratified cross\-validation protocol\. This measures how well each DR method preserves label\-relevant structure in $M$ dimensions\.
2. $k$NN classification in latent space\. Evaluate $k$NN accuracy using the compressed embeddings under the same stratified cross\-validation protocol to assess neighborhood quality\.
3. Label\-based separability metrics\. Compute silhouette and Davies\-Bouldin scores on each latent representation using ground\-truth class labels as the partition\.
4. Statistical comparison across datasets\. Aggregate per\-dataset metrics and apply nonparametric tests, using a Friedman test \(Friedman, [1937](https://arxiv.org/html/2606.05441#bib.bib96)\) across methods followed by one\-sided Wilcoxon signed\-rank comparisons with NSC\-pSP as the reference \(see Tables [D\.1](https://arxiv.org/html/2606.05441#A4.T1), [D\.2](https://arxiv.org/html/2606.05441#A4.T2), [D\.3](https://arxiv.org/html/2606.05441#A4.T3)\)\.

##### Evaluation protocol\.

We report only quantitative DR\-style evaluations under a fixed aggressive budget of $M{=}32$\. For each method \(NSC variants, PCA, AE, RP, UMAP, PaCMAP\), we first compute a 32\-dimensional latent representation in $\mathbb{R}^{32}$, and then assess \(i\) linear\-probe accuracy via logistic regression, \(ii\) $k$NN accuracy in latent space, and \(iii\) label\-based separability via silhouette and Davies\-Bouldin scores, using the same stratified cross\-validation protocol across methods \(Tables [D\.1](https://arxiv.org/html/2606.05441#A4.T1)\-[D\.2](https://arxiv.org/html/2606.05441#A4.T2)\)\. Overall, these experiments treat NSC as a first\-class dimensionality reduction layer: it produces an explicit $\mathbb{R}^{M}$ representation that can be consumed directly by a TabPFN\-style predictor under strict token budgets, while still supporting standard DR comparisons to PCA/RP/AE/UMAP/PaCMAP via probe and clustering metrics\. In HDLSS settings, NSC couples GO\-LR’s MinLA\-motivated ordering with subunit\-wise pooling to achieve \(i\) aggressive compression \($M\ll m$\) and \(ii\) a locality\-preserving structured embedding\. Empirically, NSC is typically competitive with PCA, AE, UMAP, and PaCMAP and consistently outperforms RP on real HDLSS datasets; on the block\-structured synthetic model, the NSC family yields the best mean results across the evaluated metrics \(Table [D\.1](https://arxiv.org/html/2606.05441#A4.T1)\) and shows significantly stronger predictive, neighborhood, and separability structure in several comparisons \(Table [D\.2](https://arxiv.org/html/2606.05441#A4.T2)\), highlighting a regime where its locality bias matches the data\-generating process\.

##### Synthetic block\-model validation and takeaways\.

To connect the empirical trends to our theoretical picture, we evaluate NSC as an explicit DR layer on a synthetic HDLSS block model aligned with its inductive bias: $m$ features are generated in $B{=}40$ latent\-factor blocks \(shared block signal \+ noise\), class information is injected via mean shifts on a small subset of blocks, and then feature indices are randomly permuted to destroy contiguity in the raw space\. Under an aggressive budget \($M{=}32$\) and over 10 independent repetitions \(Table [D\.1](https://arxiv.org/html/2606.05441#A4.T1)\), NSC\-style embeddings remain strongly discriminative while preserving neighborhood/cluster structure competitively against unstructured and nonlinear DR: nonparametric tests confirm significant differences across methods and show consistent gains in latent\-space geometry \(e\.g\., higher $k$NN accuracy versus PCA/RP and improved separability via silhouette/DB; Table [D\.2](https://arxiv.org/html/2606.05441#A4.T2)\)\. Overall, these results reinforce the DR interpretation of NSC: GO\-LR constructs a locality\-revealing axis, and NSC compresses it via subunit\-wise pooling into interpretable meta\-features whose coordinates summarize coherent redundant neighborhoods, contrasting with the global mixing of PCA/RP, many AE encoders, and sample\-space manifold embeddings such as UMAP/PaCMAP\. Thus, its primary advantage under strong compression is retaining local geometry and cluster structure while staying competitive in discriminative accuracy; among variants, NSC\-pSP is the most consistently competitive on the block model \(Table [D\.3](https://arxiv.org/html/2606.05441#A4.T3)\), while on real HDLSS benchmarks at the same $M{=}32$ budget it is broadly comparable to PCA/AE/UMAP/PaCMAP with smaller, dataset\-dependent differences\.

##### GO\-LR\+NSC vs\. PCA\.

Global PCA \(Hotelling, [1933](https://arxiv.org/html/2606.05441#bib.bib46)\) $\rightarrow$ TabPFN\-2\.5 \(Grinsztajn et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib115)\) is a natural control for testing whether the gains of GOTabPFN come merely from dimensionality reduction\. We therefore compare GO\-LR\+NSC\+TabPFN\-2\.5 against Global PCA\+TabPFN\-2\.5 while keeping the same frozen TabPFN\-2\.5 predictor and changing only the front\-end compression interface\. As shown in Table [D\.4](https://arxiv.org/html/2606.05441#A4.T4), GO\-LR\+NSC outperforms Global PCA on all 8 HDLSS datasets, suggesting that the improvement is not explained by generic global compression alone, but by locality\-aware ordering and structured neighborhood compression\.

Table D\.4: GO\-LR\+NSC vs\. PCA\. Accuracy comparison using the same frozen TabPFN\-2\.5 predictor, where only the front\-end compression method differs\. Values are mean accuracy with subscripted standard deviation over $5\times 5$ CV\.
##### GO\-LR\+NSC vs\. Lasso\-selected features\.

We compare GO\-LR\+NSC against Lasso\-selected features under the same frozen TabPFN\-2\.5 \(Grinsztajn et al\., [2025](https://arxiv.org/html/2606.05441#bib.bib115)\) predictor and identical $5\times 5$ CV protocol\. Specifically, we evaluate Lasso\+TabPFN\-2\.5 \(LT\) across sparsity levels $C\in\{0.01,0.02,0.05\}$\. As shown in Table [D\.5](https://arxiv.org/html/2606.05441#A4.T5), GOTabPFN outperforms all LT variants on most HDLSS datasets, while LT only slightly exceeds GOTabPFN on SMK and AML under the weakest sparsity setting \($C=0.05$\)\. This suggests that locality\-aware feature ordering and structured neighborhood compression provide more robust gains than sparsity\-based feature selection alone in HDLSS settings\.

Table D\.5: GO\-LR\+NSC vs\. Lasso\-selected features\. Accuracy comparison using the same frozen TabPFN\-2\.5 predictor\. LT denotes Lasso\-selected features \+ TabPFN\-2\.5, evaluated at different sparsity levels $C$\. Values are mean accuracy with subscripted standard deviation over $5\times 5$ CV\.

## Appendix EWhy Feature Ordering? Local Neighborhoods Enable Structure\-Aware Compression

A common critique is that tabular features form an unordered set, so learning a column order may appear arbitrary\. In this work, ordering is not introduced to impose a fictional semantics \(e\.g\., time\), but to construct an algorithmic coordinate system: a 1D axis on which local neighborhoods become meaningful via a structure\-revealing layout objective from seriation / graph layout \(rather than positional meaning\) \(Arabie and Hubert, [1992](https://arxiv.org/html/2606.05441#bib.bib67); Hahsler et al\., [2008](https://arxiv.org/html/2606.05441#bib.bib68); Díaz et al\., [2002](https://arxiv.org/html/2606.05441#bib.bib11); Atkins et al\., [1998](https://arxiv.org/html/2606.05441#bib.bib69)\)\. This is essential for our pipeline because NSC is an explicitly local operator: it pools contiguous segments into meta\-features \(Sec\. [3\.2](https://arxiv.org/html/2606.05441#S3.SS2)\)\. Without an ordering that places related features nearby, contiguity\-based pooling becomes arbitrary aggregation and can destroy predictive signal\. Therefore, feature ordering is primarily justified as a neighborhood construction mechanism that enables structured compression and stable tokenization under tight budgets\.

### E\.1Ordering Constructs Coherent Neighborhoods That NSC Can Pool

NSC partitions the ordered feature axis into $M$ contiguous segments and pools each segment into a token \(Sec\. [3\.2](https://arxiv.org/html/2606.05441#S3.SS2)\)\. This implicitly assumes that adjacency along the axis corresponds to statistical relatedness\. GO\-LR is designed exactly to enforce this locality: it approximately minimizes a MinLA/seriation\-style dispersion objective that penalizes placing strongly related feature pairs far apart, aligning index locality with a similarity graph \(Díaz et al\., [2002](https://arxiv.org/html/2606.05441#bib.bib11); Atkins et al\., [1998](https://arxiv.org/html/2606.05441#bib.bib69); Garey et al\., [1974](https://arxiv.org/html/2606.05441#bib.bib70)\)\. As a result, features that are close under the global dissimilarity structure become near neighbors in index, so NSC pooling operates on coherent neighborhoods and yields compressed tokens that preserve informative local structure\. In contrast, if features are left in raw or arbitrary order, contiguous segments mix unrelated variables and pooling becomes a lossy averaging operation\. This is the failure mode that ordering is meant to prevent\. Conversely, ordering matters only if downstream modules use contiguity; since NSC explicitly does, ordering becomes a necessary part of the representation interface\.

### E\.2Ordering Improves Compression by Reducing Cross\-Segment Boundary Cuts

The role of ordering can be formalized through the segmentation\-induced boundary cut\. Let $\bar{W}\in\mathbb{R}^{m\times m}$ denote the global dissimilarity matrix used to define neighborhoods \(Sec\. [3\.2](https://arxiv.org/html/2606.05441#S3.SS2)\), and let $\Pi^{\ast}$ be the learned ordering\. Consider a segmentation into $M$ contiguous segments $\{\mathcal{S}_t\}_{t=1}^{M}$ along the ordered axis\. Define the cross\-segment boundary cost by Eq\. [58](https://arxiv.org/html/2606.05441#A5.E58)\. Intuitively, $\mathrm{Cut}$ measures how often highly related features are split across segment boundaries; the definition parallels cut objectives used in graph\-based segmentation/layout \(Shi and Malik, [2000](https://arxiv.org/html/2606.05441#bib.bib71); Díaz et al\., [2002](https://arxiv.org/html/2606.05441#bib.bib11)\)\. A smaller cut indicates that each segment captures a coherent neighborhood, so pooled tokens retain structure\. Since GO\-LR reduces dispersion, it typically also reduces boundary cuts for reasonable segmentations; random or raw orders inflate boundary cuts, making local pooling ineffective\.

$$\mathrm{Cut}(\Pi^{\ast},\{\mathcal{S}_t\})\;=\;\sum_{t=1}^{M-1}\sum_{i\in\mathcal{S}_t}\sum_{j\in\mathcal{S}_{t+1}}\bar{W}_{\Pi^{\ast}(i),\,\Pi^{\ast}(j)} \tag{58}$$
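The triple sum in Eq\. 58 can be evaluated directly\. The sketch below simply computes the sum for whatever pairwise matrix `W` is supplied; it assumes segments are given as arrays of positions along the ordered axis and that `order[p]` is the feature index at position `p`\. Both conventions and the function name `boundary_cut` are hypothetical choices for illustration\.

```python
import numpy as np

def boundary_cut(W, order, segments):
    """Eq. 58: sum of W over all feature pairs in consecutive segments.

    W        : (m, m) pairwise matrix
    order    : permutation; order[p] is the feature placed at position p
    segments : list of position arrays, contiguous along the ordered axis
    """
    order = np.asarray(order)
    total = 0.0
    for left, right in zip(segments[:-1], segments[1:]):
        # All pairs (i in S_t, j in S_{t+1}), mapped to feature indices.
        total += W[np.ix_(order[left], order[right])].sum()
    return float(total)

idx = np.arange(4)
W = np.abs(np.subtract.outer(idx, idx)).astype(float)   # toy matrix |i - j|
segments = [np.array([0, 1]), np.array([2, 3])]
print(boundary_cut(W, idx, segments))                    # 8.0
```

In the toy example the cross\-segment entries are 2, 3, 1, and 2, which sum to 8\. The cost of this direct evaluation grows with the product of adjacent segment sizes, which is one reason a cheaper boundary proxy is attractive at HDLSS scale\.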

### E\.3What Ordering Is Not: We Do Not Make Tabular Data “Sequential”

We do not claim that tabular columns possess an intrinsic order like words or pixels\. Rather, we learn an order as a layout that makes neighborhood\-based operators \(pooling, segmentation, local filters\) well\-defined\. This mirrors classic seriation/linear arrangement goals: the objective is not positional semantics, but an ordering that makes locality meaningful for downstream computation \(Arabie and Hubert, [1992](https://arxiv.org/html/2606.05441#bib.bib67); Hahsler et al\., [2008](https://arxiv.org/html/2606.05441#bib.bib68); Atkins et al\., [1998](https://arxiv.org/html/2606.05441#bib.bib69); Díaz et al\., [2002](https://arxiv.org/html/2606.05441#bib.bib11)\)\. In our case, ordering is valuable specifically because NSC is local along the constructed axis\.

### E\.4Empirical Diagnostics for Neighborhood Preservation \(Order\-Only\)

We report order\-only diagnostics that quantify whether an ordering induces meaningful local neighborhoods independently of any classifier or tokenizer\. All diagnostics are computed on the 8 HDLSS datasets using standardized features\. Explicitly materializing a dense pairwise dissimilarity/Gram matrix $\bar{W}\in\mathbb{R}^{m\times m}$ scales as $\mathcal{O}(m^{2})$ in memory \(and time\), which quickly becomes impractical at HDLSS dimensions \(e\.g\., gene\-expression arrays with $\sim 10^{4}$\-$10^{4.5}$ genes and GWAS with $\geq 10^{5}$ SNP markers\) \(Si et al\., [2017](https://arxiv.org/html/2606.05441#bib.bib72); Dangond, [2000](https://arxiv.org/html/2606.05441#bib.bib73); Maguire et al\., [2018](https://arxiv.org/html/2606.05441#bib.bib74)\)\. We therefore use a lightweight proxy $\bar{W}_{ij}\triangleq 1-|\mathrm{corr}(x_i,x_j)|$, estimated from the standardized data \(and, for adjacent deltas, from a small row subset for stability/speed\)\. Lower values are better, indicating that neighbors along the axis are more mutually similar\. Across the 8 datasets, $\Pi^{\ast}$ achieves significantly lower adjacency dissimilarity than random permutations \(paired by dataset; Wilcoxon signed\-rank $p=0.00390625$; Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)d,f\)\.
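The reported p\-value equals $1/2^{8}$, which is the smallest one\-sided value an exact Wilcoxon signed\-rank test can return with 8 untied paired differences\. The sketch below enumerates the null distribution to show this; it is an arithmetic illustration only, the text does not state which test variant was used, and the function name `exact_wilcoxon_p` is hypothetical\.

```python
from itertools import product

def exact_wilcoxon_p(n, w_plus):
    """One-sided exact p-value P(W+ >= w_plus) for n untied, nonzero differences.

    Under the null, each of the 2^n sign patterns over ranks 1..n is equally likely.
    """
    ranks = range(1, n + 1)
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return count / 2**n

n = 8
w_max = n * (n + 1) // 2               # every paired difference has the same sign
print(exact_wilcoxon_p(n, w_max))      # 0.00390625
```

A value of $1/256$ is therefore what one would expect if the learned ordering beat the random baseline on all 8 datasets, and with $n=8$ no exact one\-sided test can produce a smaller value\.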

##### Local adjacency coherence \(path\-length objective\)\.

Given a feature ordering $\Pi^{\ast}$ and a dissimilarity matrix $\bar{W}$, we define the adjacent dissimilarity $\delta_t=\bar{W}_{\Pi^{\ast}(t),\,\Pi^{\ast}(t+1)}$\. We define the adjacency coherence as the mean adjacent dissimilarity along the ordering, which is the Hamiltonian path \(TSP\-path\) length objective used in seriation, up to normalization \(Hahsler et al\., [2008](https://arxiv.org/html/2606.05441#bib.bib68)\)\. We measure $\mathrm{AdjCoh}(\Pi^{\ast})$ by Eq\. [59](https://arxiv.org/html/2606.05441#A5.E59)\.

$$\mathrm{AdjCoh}(\Pi^{\ast})=\frac{1}{m-1}\sum_{t=1}^{m-1}\delta_t=\frac{1}{m-1}\sum_{t=1}^{m-1}\bar{W}_{\Pi^{\ast}(t),\,\Pi^{\ast}(t+1)} \tag{59}$$
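Because Eq\. 59 only needs the $m-1$ adjacent pairs, it can be computed without forming the full $m\times m$ matrix\. The following sketch assumes the correlation proxy $1-|\mathrm{corr}(x_i,x_j)|$ and non\-constant feature columns; the function name `adj_coh_from_data` and the toy two\-block data are hypothetical\.

```python
import numpy as np

def adj_coh_from_data(X, order):
    """Eq. 59 with W_ij = 1 - |corr(x_i, x_j)|, using adjacent pairs only.

    Cost is O(n * m); columns of X are assumed to be non-constant.
    """
    Z = np.asarray(X, dtype=float)[:, np.asarray(order)]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)          # standardize each feature
    r = (Z[:, :-1] * Z[:, 1:]).mean(axis=0)           # Pearson r of neighbors
    return float(np.mean(1.0 - np.abs(r)))

# Toy data: two latent factors, with their features interleaved in raw order.
rng = np.random.default_rng(0)
n = 200
f1, f2 = rng.standard_normal(n), rng.standard_normal(n)
X = np.empty((n, 6))
X[:, 0::2] = f1[:, None] + 0.1 * rng.standard_normal((n, 3))
X[:, 1::2] = f2[:, None] + 0.1 * rng.standard_normal((n, 3))

raw = np.arange(6)
grouped = np.array([0, 2, 4, 1, 3, 5])                # blocks made contiguous
print(adj_coh_from_data(X, raw), adj_coh_from_data(X, grouped))
```

In this toy case the grouped ordering has four within\-block adjacencies and one cross\-block adjacency, so its adjacency coherence is much lower than that of the interleaved raw order\.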

##### Neighborhood hit\-rate for top\-$k$ neighbors\.

For each feature $i$, let $\mathcal{N}_k(i)$ be its top\-$k$ nearest neighbors under $\bar{W}$\. For an ordering $\Pi^{\ast}$, define the window neighborhood $\mathcal{W}_h(i)=\{j:\ |\mathrm{pos}_{\Pi^{\ast}}(j)-\mathrm{pos}_{\Pi^{\ast}}(i)|\leq h\}$\. We compute $\mathrm{HitRate}_{k,h}(\Pi^{\ast})$ by Eq\. [60](https://arxiv.org/html/2606.05441#A5.E60), where higher is better\. To avoid $O(m^{2})$ complexity, we compute $\mathcal{N}_k(i)$ on a feature subsample \(default 2048 features\) and average over features and \(when applicable\) multiple random seeds\. $\Pi^{\ast}$ yields consistently higher hit\-rate than random orderings \(Wilcoxon signed\-rank $p=0.0078125$ for a representative $(k,h)=(10,16)$; Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)c,e,f\), and the advantage persists over multiple $(k,h)$ choices \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)e\) \(Venna and Kaski, [2001](https://arxiv.org/html/2606.05441#bib.bib75)\)\.

$$\mathrm{HitRate}_{k,h}(\Pi^{\ast})=\frac{1}{m}\sum_{i=1}^{m}\frac{|\mathcal{N}_k(i)\cap\mathcal{W}_h(i)|}{k} \tag{60}$$
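Eq\. 60 can be sketched as below\. For clarity this version uses a dense dissimilarity matrix over all features, so it is the $O(m^{2})$ form rather than the subsampled variant described in the text; the function name `hit_rate` and the toy matrix are hypothetical\.

```python
import numpy as np

def hit_rate(W, order, k, h):
    """Eq. 60: share of each feature's top-k neighbors (under W) that lie
    within h positions of it along the ordering, averaged over features."""
    order = np.asarray(order)
    m = len(order)
    pos = np.empty(m, dtype=int)
    pos[order] = np.arange(m)                  # pos[f] = position of feature f
    D = np.array(W, dtype=float)               # copy, so W is left untouched
    np.fill_diagonal(D, np.inf)                # a feature is not its own neighbor
    nbrs = np.argsort(D, axis=1)[:, :k]        # k smallest dissimilarities per row
    hits = np.abs(pos[nbrs] - pos[:, None]) <= h
    return float(hits.sum(axis=1).mean() / k)

idx = np.arange(10)
W = np.abs(np.subtract.outer(idx, idx)).astype(float)   # toy chain: W_ij = |i - j|
print(hit_rate(W, idx, k=2, h=1))                        # 0.9
```

In the toy chain, each of the 8 interior features has both of its 2 nearest neighbors within one position, while each end feature has only one of two, giving $(8\cdot 1+2\cdot 0.5)/10=0.9$\.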

##### Segmentation boundary cut\.

To test alignment between the ordering and contiguity\-based pooling, we evaluate a boundary\-cut proxy under common segmentation rules \(uniform, equal\-mass, largest\-jump\) with $M{=}32$ and $l_{\min}{=}8$ \(matching our NSC configuration\)\. Given segments $\{\mathcal{S}_t\}$ along $\Pi^{\ast}$, we summarize the average dissimilarity at segment boundaries via the adjacent deltas \(Eq\. [61](https://arxiv.org/html/2606.05441#A5.E61)\), where $\mathcal{B}$ is the set of boundary indices between consecutive segments\. Lower cut indicates that segment boundaries fall on weaker connections, i\.e\., stronger within\-segment neighborhood coherence\. Across datasets, $\Pi^{\ast}$ yields favorable boundary alignment relative to random permutations \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)b\) \(Shi and Malik, [2000](https://arxiv.org/html/2606.05441#bib.bib71)\)\.

$$\mathrm{Cut}(\Pi^{\ast},\{\mathcal{S}_t\})\;\approx\;\frac{1}{|\mathcal{B}|}\sum_{b\in\mathcal{B}}\delta_{b-1} \tag{61}$$
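The proxy in Eq\. 61 only needs the vector of adjacent deltas\. The sketch below assumes that a boundary index $b$ is the position at which a new segment starts, so that $\delta_{b-1}$ is the delta between the last feature of one segment and the first feature of the next; this indexing convention, the uniform rule via `np.array_split`, and both function names are assumptions made for illustration\.

```python
import numpy as np

def uniform_starts(m, M):
    """0-based start positions of segments 2..M under uniform segmentation."""
    sizes = [len(chunk) for chunk in np.array_split(np.arange(m), M)]
    return np.cumsum(sizes)[:-1]

def boundary_cut_proxy(deltas, starts):
    """Eq. 61: mean adjacent delta taken at the segment boundaries.

    deltas[t] is the dissimilarity between positions t and t + 1;
    starts holds the first position of every segment except the first.
    """
    deltas = np.asarray(deltas, dtype=float)
    return float(deltas[np.asarray(starts) - 1].mean())

deltas = [0.2, 0.4, 0.6, 0.8, 1.0, 0.2, 0.4, 0.6]   # m = 9 ordered features
starts = uniform_starts(m=9, M=3)                    # segments start at 3 and 6
print(starts, boundary_cut_proxy(deltas, starts))
```

Here the boundaries sit between positions 2 and 3 and between positions 5 and 6, so the proxy averages `deltas[2]` and `deltas[5]`\.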

### E\.5Ablations That Isolate Why Ordering Helps NSC

To separate the effect of ordering from the effect of compression/tokenization, we evaluate NSC under controlled ordering perturbations while keeping the tokenizer/compressor fixed \(same $M$, segmentation rule, pooling, and tuned hyperparameters\)\. We use five random seeds for permutation\-based controls\.

##### Ordered vs\. un\-ordered\.

We compare the same NSC configuration under: \(i\) GO\-LR order $\Pi^{\ast}$, \(ii\) the raw/original column order, \(iii\) random permutations \(averaged over seeds\)\. This directly tests whether NSC benefits from neighborhood structure rather than from compression alone\. In parallel, the order\-only diagnostics \(AdjCoh/HitRate/Cut\) show that $\Pi^{\ast}$ is systematically more neighborhood\-preserving than random, providing a mechanistic explanation for the observed gains \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)c,d,f\)\.

##### Destroy global layout while partially preserving locality \(block shuffle\)\.

Starting from $\Pi^{\ast}$, we partition indices into contiguous blocks of size $b$ and randomly permute the blocks\. This preserves within\-block neighborhoods but disrupts long\-range arrangement\. If NSC relies primarily on local neighborhoods, performance/diagnostics should improve as $b$ increases\. Consistent with this, normalized hit\-rate increases monotonically with block size: $0.740\pm 0.184$ \($b{=}8$\), $0.889\pm 0.177$ \($b{=}16$\), $0.968\pm 0.097$ \($b{=}32$\), $1.011\pm 0.095$ \($b{=}64$\), relative to the corresponding $\Pi^{\ast}$ value per dataset \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)a,f\)\. This supports the locality hypothesis: preserving larger local neighborhoods recovers the $\Pi^{\ast}$ advantage\.
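The block\-shuffle control can be sketched in a few lines\. This is a minimal illustration of the perturbation described above; the function name `block_shuffle` and the handling of a shorter trailing block are assumptions\.

```python
import numpy as np

def block_shuffle(order, b, seed=0):
    """Split the ordering into contiguous blocks of size b and permute the blocks.

    Within-block order is preserved; the last block may be shorter than b.
    """
    order = np.asarray(order)
    blocks = [order[i:i + b] for i in range(0, len(order), b)]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(blocks))
    return np.concatenate([blocks[i] for i in perm])

order = np.arange(16)                 # stand-in for a learned ordering
shuffled = block_shuffle(order, b=4)
print(shuffled.reshape(4, 4))         # each row is an intact run of 4 neighbors
```

Larger `b` keeps more of each feature's original neighbors inside its block, which is why neighborhood diagnostics are expected to recover as the block size grows\.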

##### Keep order fixed, break contiguity \(round\-robin segments\)\.

We keep $\Pi^{\ast}$ but break contiguity by assigning features to segments in a round\-robin manner and then concatenating segments\. This retains the same set and global ordering statistics, but destroys the contiguous neighborhood pooling assumption\. The resulting drop in hit\-rate \(and corresponding degradation in order\-sensitive behavior\) isolates contiguity as the operative mechanism \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)c,d\)\.
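One way to read the round\-robin control is that the feature at position $p$ is assigned to segment $p \bmod M$, and the segments are then concatenated\. The sketch below implements that reading; the interpretation and the function name `round_robin_order` are assumptions, not details confirmed by the text\.

```python
import numpy as np

def round_robin_order(order, M):
    """Deal positions into M segments round-robin, then concatenate the segments."""
    order = np.asarray(order)
    # Segment r receives positions r, r + M, r + 2M, ...
    return np.concatenate([order[r::M] for r in range(M)])

order = np.arange(8)                       # stand-in for a learned ordering
print(round_robin_order(order, M=4))       # [0 4 1 5 2 6 3 7]
```

After this transformation, features that were adjacent under the original ordering end up in different segments, so contiguous pooling no longer groups original neighbors together\.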

##### Optional representation probes\.

When needed, we complement the order\-only metrics with lightweight probes on the produced tokens \(e\.g\., linear probe\) under the ablations above\. These probes are used only to verify that improved neighborhood preservation translates into higher\-quality representations, while the primary claim remains anchored in the classifier\-free diagnostics\.

##### Summary\.

Across datasets, the learned ordering $\Pi^{\ast}$ consistently preserves local neighborhoods better than baselines: the hit\-rate is highest under $\Pi^{\ast}$, typically followed by the raw order, while randomization and explicitly breaking contiguity degrade neighborhood agreement \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)\(c\)\)\. This is mirrored by adjacency coherence, where $\Pi^{\ast}$ attains the lowest \(best\) $\mathrm{AdjCoh}$ and random orderings are worst \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)\(d\)\)\. The robustness heatmap further shows that $\Pi^{\ast}$ yields uniformly positive gains over random across all tested $(k,h)$, with larger improvements for wider locality windows \(larger $h$\) and smaller $k$ \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)\(e\)\)\. For segmentation alignment, $\Pi^{\ast}$ tends to reduce boundary cut under “equal\_mass” \(and is roughly neutral under “largest\_jump”\), whereas “uniform” can be inconsistent and often flips the advantage \(negative median $\Delta\mathrm{Cut}$\), suggesting uniform boundaries may not match the induced contiguous structure \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)\(b\)\)\. Finally, the block\-shuffle experiment shows a clear scale effect: when the ordering is shuffled in small blocks, the normalized hit\-rate drops substantially, but it recovers toward $\Pi^{\ast}$ as block size increases \(e\.g\., rising from $\approx 0.67$ at $b{=}8$ to $\approx 0.90$ at $b{=}64$\), indicating that locality is largely preserved within coarse blocks and mainly disrupted by fine\-grained shuffles \(Fig\. [E\.1](https://arxiv.org/html/2606.05441#A5.F1)\(a\)\)\.

![Refer to caption](https://arxiv.org/html/2606.05441v1/blockshuffle_sensitivity.png) (a) Block-shuffle sensitivity. Normalized $\mathrm{HitRate}_{k,h}$ (relative to $\Pi^{\ast}$) vs. block size $b$.
![Refer to caption](https://arxiv.org/html/2606.05441v1/cut_delta_boxplot.png) (b) Boundary-cut advantage. $\Delta\mathrm{Cut}=\mathrm{Cut}(\text{random})-\mathrm{Cut}(\Pi^{\ast})$ across datasets (positive $\Rightarrow$ $\Pi^{\ast}$ lower cut).
![Refer to caption](https://arxiv.org/html/2606.05441v1/hitrate_across_families.png) (c) Hit-rate across order families. $\mathrm{HitRate}_{k,h}$ is highest under $\Pi^{\ast}$ and drops when contiguity is broken.
![Refer to caption](https://arxiv.org/html/2606.05441v1/adjcoh_across_families.png) (d) Adjacency coherence across families. $\mathrm{AdjCoh}$ is lowest (best) under $\Pi^{\ast}$ and worse under random.
![Refer to caption](https://arxiv.org/html/2606.05441v1/hitrate_delta_heatmap.png) (e) Robust neighborhood gains. Mean $\Delta\mathrm{HitRate}_{k,h}=\mathrm{HitRate}(\Pi^{\ast})-\mathrm{HitRate}(\text{random})$ over $(k,h)$.
(f) Aggregated stats. Wilcoxon tests ($n{=}8$ datasets) and block-shuffle summary (normalized to $\Pi^{\ast}$).

Figure E.1: Order-only neighborhood diagnostics and controls. GO-LR ordering $\Pi^{\ast}$ improves local coherence (AdjCoh), increases neighborhood recovery (HitRate), yields favorable segmentation alignment (Cut), and degrades predictably under block-shuffle and contiguity-breaking controls.

### E.6 Beyond NSC: When Ordering Can Improve Accuracy

While our primary motivation is NSC’s contiguity-based pooling, the same locality principle applies to any order-sensitive learner that introduces architectural locality over feature tokens (e.g., local attention windows, relative position bias, or convolutional mixing) (Vaswani et al., [2017](https://arxiv.org/html/2606.05441#bib.bib76); Child et al., [2019](https://arxiv.org/html/2606.05441#bib.bib77); Huang et al., [2020](https://arxiv.org/html/2606.05441#bib.bib78); Gorishniy et al., [2021](https://arxiv.org/html/2606.05441#bib.bib79); Somepalli et al., [2022](https://arxiv.org/html/2606.05441#bib.bib80)). In such models, the feature order acts as a computational layout: by placing statistically related features nearby, the model concentrates informative interactions into small neighborhoods, which can reduce sample complexity and improve generalization in low-sample regimes.

##### Experimental evidence \(accuracy changes only when the backbone uses locality\)\.

We run controlled ordering experiments on two biomedical $n<m$ datasets (AI-d_case5 (Ohlsson et al., [2020](https://arxiv.org/html/2606.05441#bib.bib81)) and ADNI_AD123 (Petersen et al., [2010](https://arxiv.org/html/2606.05441#bib.bib84))) using the same backbone and training protocol, changing only the column order, which is applied consistently to train/val/test. As an order-sensitive backbone, we use a local-window Transformer whose attention is restricted to a fixed neighborhood around each feature token (Beltagy et al., [2020](https://arxiv.org/html/2606.05441#bib.bib85)), making performance dependent on index-locality. We evaluate multiple ordering strategies: our GO-LR order $\Pi^{\ast}$, a TabSeq-style ordering (Habib et al., [2024](https://arxiv.org/html/2606.05441#bib.bib142)), the raw column order, random permutations (averaged over seeds, following TabICL (Jingang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib107))), a light version of ROTATOR (Wang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib128)), and controlled perturbations that either partially preserve locality (block-shuffle) or explicitly destroy contiguity while keeping the same global order statistics (round-robin “break contiguity”). Figure [E.2](https://arxiv.org/html/2606.05441#A5.F2) (top) and Table [E.1](https://arxiv.org/html/2606.05441#A5.T1) show that ordering yields non-trivial AUC changes for the local-window Transformer, and GO-LR produces the strongest gains among the tested ordering methods on these datasets. We use a fixed training protocol with a single train/val/test split and fixed hyperparameters (no dataset-specific tuning or HPO); all results are produced with a fixed seed (SEED=42), except the random-permutation baseline, which is averaged over 5 permutation seeds.
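The two controlled perturbations can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the exact constructions are assumptions. In particular, `block_shuffle` permutes contiguous blocks as units while keeping the order inside each block, which is one reading consistent with the reported trend that larger blocks preserve more locality; `break_contiguity` uses a round-robin interleave with an assumed `stride` parameter.

```python
import random

def block_shuffle(order, block_size, seed=0):
    """Assumed control: cut `order` into contiguous blocks and permute the blocks.

    The order inside each block is kept, so larger blocks keep more neighbors together.
    """
    rng = random.Random(seed)
    blocks = [list(order[s:s + block_size]) for s in range(0, len(order), block_size)]
    rng.shuffle(blocks)
    return [j for block in blocks for j in block]

def break_contiguity(order, stride):
    """Assumed round-robin control: read positions 0, stride, 2*stride, ..., then 1, 1+stride, ...

    Features that were adjacent in `order` end up far apart.
    """
    return [order[i] for offset in range(stride) for i in range(offset, len(order), stride)]

def apply_order(rows, order):
    """Apply the same column order to every row (use it for train, val, and test alike)."""
    return [[row[j] for j in order] for row in rows]

learned = list(range(6))              # stand-in for a learned ordering
print(break_contiguity(learned, 2))   # [0, 2, 4, 1, 3, 5]
print(block_shuffle(learned, 2, seed=1))
```

Applying one `order` through `apply_order` to every split mirrors the requirement that reordering be consistent across train/val/test.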

##### Permutation\-invariant sanity check\.

To verify that gains are not artifacts of reindexing, we repeat the same experiment using a permutation-invariant control model (set encoder / invariant pooling) where consistent reordering should not systematically affect performance (Zaheer et al., [2017](https://arxiv.org/html/2606.05441#bib.bib143)). As expected, Figure [E.2](https://arxiv.org/html/2606.05441#A5.F2) (middle) shows near-zero deltas across orderings, supporting the interpretation that improvements arise from locality-aware computation rather than from accidental leakage or inconsistent preprocessing. This perspective is also compatible with permutation-ensemble approaches such as TabICL, which averages predictions over multiple random feature permutations to approximate invariance (Jingang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib107)).
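The logic of this sanity check can be illustrated with two toy readouts. Both functions below are hypothetical stand-ins (not the models used in the paper): one pools a shared per-feature map, in the spirit of Deep Sets, and the other only mixes adjacent positions, as a crude proxy for a local window.

```python
import math

def set_encoder_score(row, weight=0.5, bias=0.1):
    """Toy invariant readout: mean over features of a shared map phi(x) = relu(w*x + b)."""
    phi = [max(0.0, weight * x + bias) for x in row]
    # math.fsum is exactly rounded, so the result does not depend on summation order.
    return math.fsum(phi) / len(phi)

def local_window_score(row):
    """Toy order-sensitive readout: sum of products of adjacent features (window of 1)."""
    return math.fsum(a * b for a, b in zip(row, row[1:]))

row = [1.0, 2.0, 3.0, 4.0]
order = [0, 2, 1, 3]                      # an arbitrary consistent reordering
reordered = [row[j] for j in order]

print(set_encoder_score(row) == set_encoder_score(reordered))   # True: invariant
print(local_window_score(row), local_window_score(reordered))   # 20.0 17.0
```

The pooled readout is unchanged by the reordering, while the adjacent-product readout changes, which is the qualitative pattern the control experiment is designed to detect.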

##### Mechanistic link: better neighborhood preservation $\Rightarrow$ better accuracy.

Finally, we connect accuracy changes to order-only locality diagnostics. Figure [E.2](https://arxiv.org/html/2606.05441#A5.F2) (bottom) shows that orderings with higher neighborhood hit-rate ($\mathrm{HitRate}_{k,h}$) tend to yield higher AUC under the local-window backbone, supporting the hypothesis that ordering helps by increasing the density of meaningful local interactions. In other words, ordering can improve accuracy whenever the downstream architecture uses locality; when the architecture is order-invariant, as with tree-based or set-based models (Zaheer et al., [2017](https://arxiv.org/html/2606.05441#bib.bib143)), ordering should not matter.

##### Relation to prior ordering methods\.

This finding aligns with prior work that explicitly learns or uses feature layouts to benefit order-sensitive tabular learners (e.g., TabSeq (Habib et al., [2024](https://arxiv.org/html/2606.05441#bib.bib142)) ordering heuristics and ordering strategies in recent LLM-based tabular pipelines such as ROTATOR-LLM (Wang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib128))).

##### Summary\.

Figure [E.2](https://arxiv.org/html/2606.05441#A5.F2) provides a causal/mechanistic check that ordering only matters when the backbone uses locality. In the order-sensitive local-window Transformer, GO-LR ($\Pi^{\ast}$) yields the most consistent positive $\Delta$AUC versus random across both datasets, while disrupting locality, whether by random permutations, breaking contiguity, or (to a lesser extent) block shuffling, reduces or erases these gains; in contrast, the permutation-invariant control shows $\Delta$AUC values clustered near zero with no systematic advantage for any ordering, indicating that improvements are not an artifact of reindexing features but arise from the model’s locality bias (Fig. [E.2](https://arxiv.org/html/2606.05441#A5.F2)a). The mechanism plot further supports this explanation: for the local model, downstream AUC increases with neighborhood preservation ($\mathrm{HitRate}_{k,h}$), with higher-performing orderings (e.g., GO-LR/$\Pi^{\ast}$) occupying the high-HitRate/high-AUC region, whereas random or contiguity-breaking variants sit at lower HitRate and correspondingly lower AUC (Fig. [E.2](https://arxiv.org/html/2606.05441#A5.F2)b). Together, these results suggest that learned orderings improve classification performance beyond NSC-based compression specifically by aligning informative feature neighborhoods with the local attention window, and that when locality is removed (invariant control), ordering ceases to provide systematic benefit (Fig. [E.2](https://arxiv.org/html/2606.05441#A5.F2)).

![Refer to caption](https://arxiv.org/html/2606.05441v1/fig1_ordering_improves_accuracy2.png) (a) Causal check. $\Delta$AUC vs. random permutations for an order-sensitive local-window Transformer (top) and a permutation-invariant control (bottom). Ordering changes performance only when the backbone uses locality.
![Refer to caption](https://arxiv.org/html/2606.05441v1/fig2_neighborhood_predicts_accuracy2.png) (b) Mechanism. Neighborhood preservation ($\mathrm{HitRate}_{k,h}$) correlates with downstream AUC for the local model, supporting the locality hypothesis.

Figure E.2: Ordering can improve accuracy beyond NSC. Learned orderings matter for architectures that introduce locality over feature tokens; invariant controls do not exhibit systematic gains.

Table E.1: Ordering improves AUC for an order-sensitive local-window Transformer, but not for a permutation-invariant control, on two $n<m$ datasets. Values are mean $\pm$ std where multiple runs exist (e.g., random permutations); parentheses show $\Delta$AUC relative to the random baseline within each dataset (local model).

## Appendix F Feature Ordering - When to Use? Through the Lens of Locality

##### When Feature Ordering Matters\.

Deep Sets (Zaheer et al., [2017](https://arxiv.org/html/2606.05441#bib.bib143)) is designed to be permutation-invariant for genuinely unordered inputs (e.g., point clouds, MIL, and chemoinformatics). In contrast, we study high-dimensional tabular settings where the chosen column layout can materially affect learning. A well-chosen permutation can reduce redundancy, expose latent dependencies among features, and ultimately improve predictive performance or help contiguity-based processes such as NSC, where locality matters. This is particularly pertinent for high-dimensional biological measurements (e.g., gene expression), EEG and other sensor data, remote sensing and climate datasets, and multimodal or heavily engineered feature tables: domains that often exhibit sparsity, redundancy, and hidden structure, and are thus natural targets for sequence-dependent models. DynaTab (Habib et al., [2026b](https://arxiv.org/html/2606.05441#bib.bib157)) initiated a systematic study of when feature ordering is useful in high-dimensional tabular learning, primarily from the perspective of ordering sensitivity and sequence-dependent modeling. We extend this view through the lens of locality: feature ordering is useful not only because sequence-sensitive backbones depend on token order, but also because a good permutation can create contiguous neighborhoods of statistically related features, enabling locality-based operators such as NSC to compress related features into informative meta-features.

##### Dataset Categorization Rules\.

There is no universally agreed-upon numerical cutoff that uniquely determines when a dataset should be labeled HDLSS. In much of the HDLSS literature, the term is used broadly for regimes where the ambient dimension (number of variables) is far larger than the sample size, and is often formalized through HDLSS asymptotics (Aoshima et al., [2018](https://arxiv.org/html/2606.05441#bib.bib89)) in which the dimension grows while the sample size is fixed (or grows much more slowly) (Hall et al., [2005](https://arxiv.org/html/2606.05441#bib.bib88); Jung and Marron, [2009](https://arxiv.org/html/2606.05441#bib.bib65)). Consequently, the simple rule “$m>n$” is a useful heuristic but too coarse to capture the practical spectrum of high dimensionality. To make this notion more operational, we follow DynaTab’s (Habib et al., [2026b](https://arxiv.org/html/2606.05441#bib.bib157)) empirical regime stratification and use the feature-to-sample ratio $\rho=m/n$, which helps distinguish qualitatively different regimes beyond a binary HDLSS vs. non-HDLSS split. Let $n$ denote the number of samples, $m$ the number of features, and $\rho=\frac{m}{n}$ the feature-to-sample ratio. We assign each dataset to one of five regimes using the following empirical $\rho$ thresholds:

### F.1 Intrinsic Dimensionality Factor as a Proxy for Locality Exploitability

Our primary justification for feature ordering in this paper is locality: ordering constructs an algorithmic 1D axis on which contiguity corresponds to statistical relatedness, making neighborhood-based operators (e.g., NSC’s contiguous pooling) well-defined (Sec. [3.2](https://arxiv.org/html/2606.05441#S3.SS2), Appendix [E](https://arxiv.org/html/2606.05441#A5)). This motivates a complementary question: when should we expect ordering to provide tangible benefit? We connect to this locality view via the Intrinsic Dimensionality Factor (IDF). Let $m$ be the ambient number of features and let $\hat{d}$ denote an estimate of the dataset’s intrinsic dimensionality (e.g., the number of principal components required to reach a fixed cumulative variance threshold, or an effective-rank estimate). The IDF is then the ratio of $\hat{d}$, the minimal number of features capturing core variability (Chen et al., [2022](https://arxiv.org/html/2606.05441#bib.bib87)), to the total feature count $m$. Following DynaTab (Habib et al., [2026b](https://arxiv.org/html/2606.05441#bib.bib157)), we use Eq. [62](https://arxiv.org/html/2606.05441#A6.E62) to compute $\mathrm{IDF}$. Intuitively, $\mathrm{IDF}$ measures how compact the data are relative to the ambient dimension. A small IDF indicates substantial redundancy/low effective rank, which we hypothesize corresponds to stronger, more compressible correlation structure and thus a greater ability to induce local neighborhoods via a 1D layout.

$$\mathrm{IDF} \;=\; \frac{\hat{d}}{m} \tag{62}$$

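As a concrete illustration of Eq. 62, the sketch below estimates $\hat{d}$ from a list of PCA explained-variance ratios and divides by $m$. The function names and the 0.95 threshold are assumptions made for the example; the text only specifies "a fixed cumulative variance threshold". Note that with $n<m$ there are at most $\min(n,m)$ components, so $m$ is passed separately.

```python
from itertools import accumulate

def estimate_intrinsic_dim(explained_variance_ratio, threshold=0.95):
    """Return (d_hat, cumulative variance at d_hat).

    d_hat is the smallest number of leading components whose cumulative
    explained variance reaches `threshold` (an assumed value, not from the paper).
    If the threshold is never reached, all available components are used.
    """
    cum = 0.0
    for k, cum in enumerate(accumulate(explained_variance_ratio), start=1):
        if cum >= threshold:
            return k, cum
    return len(explained_variance_ratio), cum

def idf(d_hat, m):
    """Eq. 62: intrinsic dimensionality factor, d_hat divided by the feature count m."""
    if m <= 0:
        raise ValueError("m must be positive")
    return d_hat / m

# Toy spectrum (made-up numbers): three components pass the threshold.
ratios = [0.70, 0.20, 0.06, 0.04]
d_hat, cum_var = estimate_intrinsic_dim(ratios)
print(d_hat, idf(d_hat, m=100))
```

In this toy case a 100-feature table whose variance is captured by three components has a small IDF, the regime the text associates with high redundancy.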
##### Complexity score \(IDF\-normalized compactness\)\.

Following DynaTab (Habib et al., [2026b](https://arxiv.org/html/2606.05441#bib.bib157)), we summarize dataset compactness and “ordering opportunity” by the IDF-normalized score in Eq. [63](https://arxiv.org/html/2606.05441#A6.E63), where $\mathrm{CumVar}(\hat{d})$ is the cumulative variance explained at $\hat{d}$ components and $p$ is a tunable sensitivity parameter. Larger values indicate that a small intrinsic subspace captures substantial variance, suggesting higher redundancy and greater potential for ordering-based locality.

$$\mathrm{ComplexityScore} \;=\; \frac{\mathrm{CumVar}(\hat{d})}{\mathrm{IDF}^{\,p}} \tag{63}$$

### F.2 Feature Ordering Effectiveness and Success Probability

Following DynaTab (Habib et al., [2026b](https://arxiv.org/html/2606.05441#bib.bib157)), we use the Feature Ordering Effectiveness (FOE) as a composite indicator of ordering benefit, given in Eq. [64](https://arxiv.org/html/2606.05441#A6.E64), where $\kappa$ is a dataset-specific scaling factor and $\mathrm{AUC}$ denotes the area under the IDF-variance curve, estimated with the trapezoidal rule (Hanley and McNeil, [1982](https://arxiv.org/html/2606.05441#bib.bib86)) over discrete IDF-variance pairs. We choose $\kappa$ by minimizing the deviation from a target value (set to $1$) via Eq. [65](https://arxiv.org/html/2606.05441#A6.E65). Setting $p{=}2$ introduces quadratic sensitivity (Hinton and Salakhutdinov, [2006](https://arxiv.org/html/2606.05441#bib.bib47)), amplifying penalties when variance grows slowly with intrinsic dimension. While $\kappa$ and $\mathrm{AUC}$ vary by dataset, FOE preserves an inverse dependence on IDF by Eq. [66](https://arxiv.org/html/2606.05441#A6.E66). We also report a simple success-probability proxy by Eq. [67](https://arxiv.org/html/2606.05441#A6.E67). As $\hat{d}\rightarrow m$, $p_{\mathrm{succ}}$ decreases toward zero, reflecting limited room for ordering to expose structure beyond what is already “fully spread” across features.

$$\mathrm{FOE} \;=\; \frac{\kappa}{\left(\mathrm{AUC}\cdot\mathrm{IDF}\right)^{p}} \tag{64}$$

$$\mathrm{Loss}(\kappa) \;=\; \left(\frac{\kappa}{(\mathrm{AUC})^{p}}-1\right)^{2} \tag{65}$$

$$\mathrm{FOE} \;\propto\; \frac{1}{\mathrm{IDF}} \quad \text{(for fixed $\kappa$ and $\mathrm{AUC}$)} \tag{66}$$

$$p_{\mathrm{succ}} \;=\; 1-\mathrm{IDF} \;=\; 1-\frac{\hat{d}}{m} \tag{67}$$
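Equations 64, 65, and 67 can be traced numerically with a short sketch. The helper names are hypothetical, and the use of the closed-form minimizer $\kappa^{\ast}=\mathrm{AUC}^{p}$ of Eq. 65 is an assumption; the source does not state how the minimization is carried out. One consequence worth noting: if $\kappa$ is set exactly to this minimizer, the AUC factor cancels in Eq. 64 and FOE reduces to $\mathrm{IDF}^{-p}$.

```python
def trapezoid_auc(xs, ys):
    """Trapezoidal area under the curve through the points (xs[i], ys[i])."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))

def optimal_kappa(auc, p=2):
    """Closed-form minimizer of Eq. 65: (kappa / AUC**p - 1)**2 is zero at kappa = AUC**p."""
    return auc ** p

def foe(idf, auc, kappa, p=2):
    """Eq. 64: kappa / (AUC * IDF)**p."""
    return kappa / (auc * idf) ** p

def p_success(idf):
    """Eq. 67: 1 - IDF."""
    return 1.0 - idf

# Made-up IDF-variance pairs, for illustration only.
idf_grid = [0.0, 0.5, 1.0]
cum_var = [0.0, 0.8, 1.0]
auc = trapezoid_auc(idf_grid, cum_var)          # 0.5*0.4 + 0.5*0.9 = 0.65
kappa = optimal_kappa(auc, p=2)
print(auc, foe(0.1, auc, kappa, p=2), p_success(0.1))
```

With the exact minimizer, an IDF of 0.1 and $p=2$ give an FOE of about 100 regardless of the AUC value, which makes the dependence on IDF explicit.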

### F.3 Linking IDF/FOE to Locality: Testable Predictions

The locality view yields a mechanistic interpretation of IDF/FOE: ordering helps when the dataset admits a linear layout that concentrates strong relations into short-range neighborhoods. We formalize this using order-only locality diagnostics (Appendix [E.4](https://arxiv.org/html/2606.05441#A5.SS4)). Let $\Pi^{\ast}$ be the learned GO-LR ordering and let $\Pi^{(r)}$ denote random permutations.

##### Locality gains relative to random orderings\.

We define three locality gains by Eqs\.[68](https://arxiv.org/html/2606.05441#A6.E68),[69](https://arxiv.org/html/2606.05441#A6.E69),[70](https://arxiv.org/html/2606.05441#A6.E70)\.

$$\Delta\mathrm{AdjCoh} \;=\; \mathbb{E}_{r}\!\left[\mathrm{AdjCoh}(\Pi^{(r)})\right]-\mathrm{AdjCoh}(\Pi^{\ast}) \tag{68}$$

$$\Delta\mathrm{HitRate}_{K,h} \;=\; \mathrm{HitRate}_{K,h}(\Pi^{\ast})-\mathbb{E}_{r}\!\left[\mathrm{HitRate}_{K,h}(\Pi^{(r)})\right] \tag{69}$$

$$\Delta\mathrm{Cut} \;=\; \mathbb{E}_{r}\!\left[\mathrm{Cut}(\Pi^{(r)})\right]-\mathrm{Cut}(\Pi^{\ast}) \tag{70}$$

Positive values indicate that $\Pi^{\ast}$ induces stronger locality than random orderings: lower adjacent dissimilarity (better $\mathrm{AdjCoh}$), higher neighborhood recovery (better $\mathrm{HitRate}$), and lower cross-segment boundary cost (better $\mathrm{Cut}$).
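A minimal sketch of how the gains in Eqs. 68 and 69 could be estimated by Monte Carlo over random permutations is given below. The metric definitions here are assumed stand-ins, since the formal definitions live in Appendix E.4: `adj_coh` is taken as the mean dissimilarity between adjacent features, and `hit_rate` as the fraction of each feature's `k` most similar features that lie within `h` positions. The same pattern would apply to $\Delta\mathrm{Cut}$ given a segmentation cost.

```python
import random
from statistics import mean

def adj_coh(order, dissim):
    """Assumed AdjCoh: mean dissimilarity between features adjacent in `order` (lower is better)."""
    return mean(dissim[a][b] for a, b in zip(order, order[1:]))

def hit_rate(order, dissim, k, h):
    """Assumed HitRate: share of each feature's k nearest features found within h positions."""
    pos = {f: i for i, f in enumerate(order)}
    per_feature = []
    for f in order:
        nearest = sorted((g for g in order if g != f), key=lambda g: dissim[f][g])[:k]
        per_feature.append(mean(1.0 if abs(pos[f] - pos[g]) <= h else 0.0 for g in nearest))
    return mean(per_feature)

def gains_vs_random(order, dissim, k=2, h=1, n_random=20, seed=0):
    """Monte Carlo estimates of Eq. 68 (AdjCoh gain) and Eq. 69 (HitRate gain)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n_random):
        perm = list(order)
        rng.shuffle(perm)
        perms.append(perm)
    d_adj = mean(adj_coh(p, dissim) for p in perms) - adj_coh(order, dissim)
    d_hit = hit_rate(order, dissim, k, h) - mean(hit_rate(p, dissim, k, h) for p in perms)
    return d_adj, d_hit

# Toy dissimilarity: features on a line, so the identity order is the ideal layout.
m = 8
dissim = [[abs(i - j) for j in range(m)] for i in range(m)]
print(gains_vs_random(list(range(m)), dissim))
```

On this toy chain both gains come out positive, matching the sign convention stated above.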

##### Locality Exploitability Score \(LES\)\.

Optionally, we aggregate these into a single dataset-level score by z-normalizing each gain across datasets and averaging, as in Eq. [71](https://arxiv.org/html/2606.05441#A6.E71), where $\mathrm{LES}$ measures how much local neighborhood structure a dataset allows ordering to unlock.

$$\mathrm{LES} \;=\; \frac{1}{3}\Big(\mathrm{zscore}(\Delta\mathrm{AdjCoh})+\mathrm{zscore}(\Delta\mathrm{HitRate}_{K,h})+\mathrm{zscore}(\Delta\mathrm{Cut})\Big) \tag{71}$$

For a benchmark suite with multiple datasets, the $\mathrm{zscore}(\cdot)$ terms in Eq. [71](https://arxiv.org/html/2606.05441#A6.E71) are computed across the evaluated dataset collection. For single-dataset diagnostics, cross-dataset z-normalization is not defined. We therefore report the raw finite-diagnostic aggregate

$$\mathrm{LES}_{\mathrm{single}} \;=\; \frac{1}{|\mathcal{D}_{\mathrm{fin}}|}\sum_{d\in\mathcal{D}_{\mathrm{fin}}} d, \qquad \mathcal{D}_{\mathrm{fin}}=\left\{\Delta\mathrm{AdjCoh},\,\Delta\mathrm{HitRate}_{K,h},\,\Delta\mathrm{Cut}\right\}\cap\mathbb{R}_{\mathrm{finite}} \tag{72}$$

Here, $\mathcal{D}_{\mathrm{fin}}$ contains only finite locality diagnostics, so unavailable quantities such as $\Delta\mathrm{HitRate}_{K,h}$ or $\Delta\mathrm{Cut}$ for very small feature dimensions are omitted from the average. Thus, Eq. [71](https://arxiv.org/html/2606.05441#A6.E71) is benchmark-relative, while Eq. [72](https://arxiv.org/html/2606.05441#A6.E72) provides a dataset-level locality summary when only one dataset is evaluated.
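The difference between the benchmark-relative score (Eq. 71) and the single-dataset aggregate (Eq. 72) is easy to see in code. The sketch below is illustrative only: the function names are hypothetical, and the use of the population standard deviation in the z-score is an assumption, since the source does not say which normalization is used.

```python
import math
from statistics import mean, pstdev

def zscores(values):
    """Z-normalize one gain across datasets (population std; zeros if the gain is constant)."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]

def les_benchmark(d_adj, d_hit, d_cut):
    """Eq. 71: one LES per dataset; each argument is a list of gains over datasets."""
    return [mean(triple) for triple in zip(zscores(d_adj), zscores(d_hit), zscores(d_cut))]

def les_single(d_adj, d_hit, d_cut):
    """Eq. 72: mean of the diagnostics that are available and finite for one dataset."""
    finite = [d for d in (d_adj, d_hit, d_cut) if d is not None and math.isfinite(d)]
    return mean(finite) if finite else float("nan")

# Made-up gains for three datasets, for illustration only.
print(les_benchmark([0.10, 0.30, 0.20], [0.05, 0.25, 0.15], [0.00, 0.40, 0.20]))
print(les_single(0.2, float("nan"), 0.4))   # the non-finite HitRate gain is skipped
```

Because each z-scored gain has zero mean across the suite, the benchmark LES values always average to zero; they rank datasets relative to one another rather than measuring an absolute gain.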

##### Predictions\.

Under the locality hypothesis, IDF/FOE provide opportunity indicators for locality gains, rather than deterministic guarantees\. We summarize this diagnostic expectation as in Eq\.[73](https://arxiv.org/html/2606.05441#A6.E73)\.

$$\mathrm{IDF}\downarrow,\;\mathrm{FOE}\uparrow,\;p_{\mathrm{succ}}\uparrow \quad\Longrightarrow\quad \text{higher expected opportunity for locality gains} \tag{73}$$

Importantly, Eq. [73](https://arxiv.org/html/2606.05441#A6.E73) is diagnostic rather than deterministic: IDF/FOE indicate when ordering is worth trying, while LES measures whether the learned GO-LR ordering actually realizes locality gains over random orderings.

For single-dataset diagnostics, the same opportunity-based interpretation can be applied to $\mathrm{LES}_{\mathrm{single}}$ from Eq. [72](https://arxiv.org/html/2606.05441#A6.E72), but it should be treated as an empirical diagnostic rather than a guaranteed monotonic relationship. Intuitively, small $\mathrm{IDF}$ suggests that variance concentrates in a low-dimensional subspace, which may co-occur with stronger feature redundancy or community structure in the similarity graph. When such structure is present, GO-LR can align related features into contiguous neighborhoods that NSC can pool effectively.

### F.4 Experimental Validation Protocol (Locality as the Bridge)

We validate the bridge “When $\Rightarrow$ Why” by testing whether IDF/FOE predict order-only locality gains on the same dataset suite used throughout the paper.

##### Correlation tests \(dataset\-level\)\.

Across datasets, we compute Spearman correlations between $\mathrm{IDF}$ (or $p_{\mathrm{succ}}$/$\mathrm{FOE}$) and each locality gain by Eq. [74](https://arxiv.org/html/2606.05441#A6.E74).

$$\rho\!\left(\mathrm{IDF},\Delta\mathrm{HitRate}_{K,h}\right),\;\rho\!\left(\mathrm{IDF},\Delta\mathrm{AdjCoh}\right),\;\rho\!\left(\mathrm{IDF},\Delta\mathrm{Cut}\right),\;\rho\!\left(\mathrm{FOE},\mathrm{LES}\right) \tag{74}$$

We expect negative correlations for $\mathrm{IDF}$ (smaller IDF $\Rightarrow$ larger gains) and positive correlations for $\mathrm{FOE}$ and $p_{\mathrm{succ}}$.
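For readers who want to reproduce the dataset-level test, the sketch below computes a Spearman correlation as the Pearson correlation of average ranks (ties share their mean rank). It is a dependency-free illustration with hypothetical function names; in practice `scipy.stats.spearmanr` computes the same statistic. The numbers in the usage lines are made up and are not results from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Made-up per-dataset values: smaller IDF paired with larger HitRate gain.
idf_values = [0.02, 0.10, 0.45, 0.90]
hit_gains = [0.30, 0.22, 0.08, 0.01]
print(spearman(idf_values, hit_gains))   # -1.0 for this perfectly monotone toy example
```

A strongly negative value on real data would be in line with the expectation stated above, though with only a handful of datasets the estimate is noisy and should be read alongside a significance test.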

##### Link to downstream NSC behavior\.

To directly connect locality gains to NSC’s contiguity-based pooling, we also test whether datasets with larger $\mathrm{LES}$ obtain larger ordering-induced improvements under NSC by Eq. [75](https://arxiv.org/html/2606.05441#A6.E75).

$$\rho\!\left(\mathrm{LES},\Delta\mathrm{Perf}_{\mathrm{NSC}}\right),\qquad \Delta\mathrm{Perf}_{\mathrm{NSC}}=\mathrm{Perf}(\text{NSC}+\Pi^{\ast})-\mathbb{E}_{r}\!\left[\mathrm{Perf}(\text{NSC}+\Pi^{(r)})\right] \tag{75}$$

This closes the chain:

$$\text{low IDF / high FOE}\;\leadsto\;\text{higher opportunity for locality gains}\;\xrightarrow{\text{validated by LES}}\;\text{effective contiguity-based pooling (NSC)}$$

##### Summary\.

We use locality as the operational criterion for deciding “when” feature ordering should help. Concretely, we first compute an opportunity proxy from intrinsic dimensionality: we estimate $\hat{d}$ from the PCA cumulative-variance curve at a fixed threshold, then define $\mathrm{IDF}=\hat{d}/m$, and form $\mathrm{FOE}$ by combining $\mathrm{IDF}$ with the IDF-variance curve area $\mathrm{AUC}$ (with $\kappa$ optimized via Eq. [65](https://arxiv.org/html/2606.05441#A6.E65)). Intuitively, HDLSS/HDHSS datasets typically exhibit very small $\mathrm{IDF}$ and hence large $\mathrm{FOE}$ (Table [F.1](https://arxiv.org/html/2606.05441#A6.T1), top ranks), indicating strong redundancy/low effective rank and thus substantial room for an ordering algorithm to expose coherent local neighborhoods. In contrast, low-dimensional or near-full-rank datasets tend to have $\mathrm{IDF}\approx 1$ (and $P_{\mathrm{success}}=1-\mathrm{IDF}\approx 0$), suggesting limited remaining structure for ordering to uncover; in such cases, ordering is expected to be less critical unless the locality diagnostics indicate otherwise. Additionally, Table [F.2](https://arxiv.org/html/2606.05441#A6.T2) shows that orlraws10P has the strongest expected benefit from ordering, with the lowest IDF and highest FOE among the additional cross-domain datasets. Cell Cycle also exhibits a relatively high FOE despite being MixedRegime, suggesting substantial compressible structure. In contrast, RELATHE, BASEHOCK, and PCMAC have lower FOE scores, indicating that feature ordering is expected to provide more limited gains for these datasets.

Second, beyond this screen, we quantify whether the opportunity is realized by the ordering algorithm: we learn a GO-LR ordering $\Pi^{\ast}$ and measure order-only locality gains against random permutations using $\Delta\mathrm{AdjCoh}$, $\Delta\mathrm{HitRate}_{K,h}$, and $\Delta\mathrm{Cut}$, whose z-normalized average defines $\mathrm{LES}$ (Eqs. [68](https://arxiv.org/html/2606.05441#A6.E68)-[71](https://arxiv.org/html/2606.05441#A6.E71)). The scatter plots (Fig. [F.1](https://arxiv.org/html/2606.05441#A6.F1)) show that $\mathrm{IDF}/\mathrm{FOE}$ are best interpreted as capacity measures: they separate regimes where ordering is plausibly useful (low $\mathrm{IDF}$/high $\mathrm{FOE}$, often HDLSS) from regimes where it is unlikely to matter (high $\mathrm{IDF}$, typically low-dimensional), while $\mathrm{LES}$ diagnoses whether GO-LR successfully linearizes the dataset’s similarity structure into short-range neighborhoods that contiguity-based operators (e.g., NSC segmentation/pooling) can exploit. In summary, ordering is most relevant in HDLSS-like regimes with low $\mathrm{IDF}$ or high $\mathrm{FOE}$, and it is most likely to help when GO-LR also produces positive locality improvements over random orderings, reflected by higher $\mathrm{LES}$; conversely, for low-dimensional/high-$\mathrm{IDF}$ datasets, ordering is generally not required.

Practically, we recommend a two-stage test: use low $\mathrm{IDF}$ / high $\mathrm{FOE}$ to flag datasets where ordering may help, and use positive locality gains (high $\mathrm{LES}$) to predict when ordering will actually benefit architectures that rely on contiguity or local neighborhoods along the input sequence. Examples include local-window attention/Transformer variants, state-space sequence models (e.g., Mamba-style SSMs), recurrent models (LSTM/GRU), and sequence-based LLM backbones, since these models implicitly assume that nearby tokens/features should interact more strongly than distant ones. In our pipeline, NSC is not a backbone but a compression / dimensionality-reduction operator whose contiguous segmentation and pooling explicitly depend on meaningful neighborhoods; thus, when GO-LR induces stronger locality than random orderings (positive $\Delta\mathrm{AdjCoh}$, $\Delta\mathrm{HitRate}_{K,h}$, $\Delta\mathrm{Cut}$ and higher $\mathrm{LES}$), NSC-style compression, and potentially other locality-sensitive sequence models, are expected to benefit.

Table F.1: When to use ordering through locality: FOE-sorted datasets with IDF/FOE/$P_{\mathrm{success}}$ and order-only locality gains/LES (not used for sorting). Here, AUC denotes the area under the cumulative explained-variance-IDF curve, computed via trapezoidal integration over discrete pairs $\big(\mathrm{IDF}_{k}=k/n_{\text{total}},\,\mathrm{CVar}(k)\big)$. Here, HDLSS = High-Dimensional Low-Sample Size, HDHSS = High-Dimensional High-Sample Size, LDLSS = Low-Dimensional Low-Sample Size, LDHSS = Low-Dimensional High-Sample Size.

Table F.2: Ordering-locality diagnostics for additional cross-domain datasets. Datasets are sorted by FOE score. Categories follow the empirical regime rule using $n$, $m$, and $\rho=m/n$. AUC is computed under the cumulative explained variance-IDF curve via trapezoidal integration. LES is standardized within this five-dataset subset.

![Refer to caption](https://arxiv.org/html/2606.05441v1/plot_idf_vs_les2.png) (a) IDF vs. LES.
![Refer to caption](https://arxiv.org/html/2606.05441v1/plot_foe_vs_les2.png) (b) FOE vs. LES.
![Refer to caption](https://arxiv.org/html/2606.05441v1/plot_psuccess_vs_les2.png) (c) $P_{\mathrm{success}}$ vs. LES.

Figure F.1: Locality Exploitability Score (LES) against intrinsic-dimension/compression proxies.

## Appendix G Detailed Comparative Results

Table [G.1](https://arxiv.org/html/2606.05441#A7.T1) reports the full 5$\times$5 cross-validation results (mean accuracy with subscripted standard deviation) for all $8$ HDLSS benchmarks and the complete set of $50+$ baselines. Our method, GOTabPFN, attains the highest mean accuracy on every dataset (Colon, Lung, GLI, SMK, ALLAML, Prostate, Arcene, TOX), leading to an average rank of $1.00_{\pm 0.00}$ in the rightmost column; that is, it is consistently ranked first across all tasks and all CV folds. The absolute accuracies are also strong: GOTabPFN achieves at least $90\%$ mean accuracy on $6$ of $8$ datasets (Lung, GLI, ALLAML, Prostate, Arcene, TOX), while maintaining competitive performance even on the most challenging HDLSS cases, such as SMK and ARC. On ARC, for instance, GOTabPFN reaches $90.60\%$ accuracy, whereas the best competing methods remain in the mid-$80\%$ range, and on SMK it still leads the next-best model by a non-trivial margin. Across all datasets, the standard deviations of GOTabPFN are comparable to or smaller than those of the strongest baselines, indicating that the gains are not the result of a few lucky splits but are stable across repeated 5$\times$5 CV.

The immediate competitors are other TabPFN-style models and modern HDLSS-focused baselines. TANDEM and TabPFN Wide form the closest group, with average ranks of $3.63_{\pm 1.32}$ and $3.75_{\pm 2.38}$, respectively. However, even these strong baselines lag behind GOTabPFN on every individual dataset: they never surpass our method on any of Colon, Lung, GLI, SMK, ALLAML, Prostate, Arcene, or TOX, and their average ranks remain strictly higher. A second tier of competitive models includes TabDPT, TabICL, BETA, TuneTables, and well-regularized neural and boosted-tree baselines such as RealMLP, LGBM, and CatBoost, with average ranks roughly in the $4$–$18$ range. These methods often perform reasonably on some datasets (e.g., LGBM and CatBoost on PRS and TOX, RealMLP on AML), but they either fall short on at least one particularly difficult HDLSS dataset (e.g., SMK or ARC) or show larger variability across splits, which results in clearly worse average ranks compared to GOTabPFN.

Classical shallow models and generic deep architectures occupy the middle of the table. Linear or margin-based methods (Lasso, SVM), tree ensembles (RF, XGBoost, GBM, AdaBoost), $k$-NN, and simple MLP variants (MLP, MLP-PLR, RealMLP) typically achieve moderate performance on most datasets, with average ranks in the low-to-mid $10$–$20$ range. They can be competitive on a subset of benchmarks (e.g., RF and GBM on some of the easier tasks, SVM on GLI), but they do not exhibit the uniformly strong behavior of GOTabPFN and often degrade substantially on the most extreme HDLSS settings (e.g., SMK and ARC). Models designed primarily with feature selection or explainability in mind (STG, L2X, INVASE, REAL-X, ENODE, ModernNCA) also tend to underperform in this regime: while they occasionally match classical baselines on certain datasets, their overall average ranks (typically $>18$ and often $>40$) indicate that their inductive biases are not sufficient on their own to close the gap to GOTabPFN.

The bottom portion of the table is dominated by recent transformer- and Mamba-based tabular architectures originally developed and tuned for larger, non-HDLSS datasets. Methods such as FT-Transformer, SAINT, TabM, AutoInt, Category Embedding, ResNet Tabular, TabNet, NODE, DeepFM, DCN, DANets, TANGOS, Tab-Transformer, NDTF, MambaTab, Mambular, MambAttention, 1D CNN, and TabSeq generally obtain average ranks in the $30$–$45+$ range and substantially lower accuracies on several HDLSS benchmarks. In many cases, these models struggle to surpass $70\%$ accuracy on the hardest datasets and can even drop close to chance levels on some splits, highlighting a clear mismatch between their inductive biases (e.g., heavy overparameterization and large-context attention) and the HDLSS setting. Finally, the rows with “N/A” entries correspond to TabPFN variants for which results were not available on all $8$ datasets (e.g., TabPFN-2.5 on COL only, and TabPFN v1/v2 and LoCalPFN without HDLSS evaluations); for completeness, we report their observed accuracies and overall average ranks in the rightmost column, but we exclude them from cross-dataset comparisons. Taken together, these detailed results show that GOTabPFN is not merely competitive but uniformly dominant across a broad and challenging suite of HDLSS benchmarks, outperforming both specialized TabPFN-style baselines and a wide spectrum of modern tabular architectures.

Evaluation beyond accuracy. To evaluate whether the gains of GOTabPFN extend beyond accuracy, we compare GOTabPFN with two strong tabular foundation model baselines, TabICL (Jingang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib107)) and TabDPT (Ma et al., [2025](https://arxiv.org/html/2606.05441#bib.bib97)), using ROC-AUC and macro-F1 on the same 8 HDLSS datasets. As shown in Table [G.2](https://arxiv.org/html/2606.05441#A7.T2), GOTabPFN obtains the best ROC-AUC on 6/8 datasets and the best macro-F1 on 6/8 datasets, indicating that the proposed GO-LR+NSC representation improves not only top-line accuracy but also ranking quality and class-balanced predictive performance.

Table G.1: Performance of the models on 8 HDLSS datasets (mean accuracy with subscripted standard deviation over $5\times 5$ CV). Dataset abbreviations: COL = Colon, LNG = Lung, GLI = GLI-85, SMK = SMK_CAN_187, AML = ALLAML, PRS = Prostate-GE, ARC = Arcene, TOX = TOX-171. Model abbreviations: GOTabPFN$_{\text{ours}}$ = our method, TWide = TabPFN Wide, TTables = TuneTables, BETA = TabPFN Unleashed, PGate = ProtoGate, TRNN = TabulaRNN, RF = Random Forest, NB = Naive Bayes, DT = Decision Tree, MambAtt = MambAttention, FT-T = FT-Transformer, CatEmbed = CategoryEmbedding, ResNetT = ResNetTabular, Tab-T = TabTransformer.

Table G.2: Evaluation beyond accuracy on 8 HDLSS datasets. ROC-AUC and macro-F1 comparisons under $5\times 5$ CV. Values are mean with subscripted standard deviation. Bold denotes the best result per dataset/metric.
## Appendix H GOTabPFN Hyperparameters

##### HDLSS datasets\.

Table [H.1](https://arxiv.org/html/2606.05441#A8.T1) reports the best-performing GOTabPFN configurations across the eight HDLSS benchmarks. The GO-LR stage adapts its distance metric (euclidean/manhattan/correlation/KL) and clustering granularity ($k = 4$–$12$) per dataset with only 1–3 refinement passes, while NSC consistently favors high retention thresholds ($\tau \approx 0.99$, except ALLAML at $0.95$) and typically uses the gamma-based rule (with Colon using IDF and Lung/SMK using the default rule). Segmentation is dataset-dependent: uniform for several datasets, but equal-mass for Colon/TOX and largest-jump for ALLAML/Prostate, indicating that both the tokenization strategy and the compression hyperparameters ($\gamma, \beta, M_{\min/\max}, l_{\min}$) must be tuned to match the underlying feature geometry. Only SMK employs feature subsampling and only SMK/Arcene enable `assume_standardized`, while TabPFN random seeds vary modestly across datasets.

##### Cross\-domain datasets\.

Table [H.2](https://arxiv.org/html/2606.05441#A8.T2) reports the best-performing GOTabPFN configurations on the 8 additional cross-domain datasets. Similar to the HDLSS setting, GO-LR adapts both the metric and clustering granularity to each dataset, using cosine, Manhattan, correlation, and KL-based dissimilarities with $k = 4$–$11$ clusters and only 1–2 refinement passes. Most cross-domain datasets favor uniform NSC segmentation and the default $M$-rule, while Cell Cycle and both DrivFace tasks use the IDF rule, and DrivFace additionally benefits from largest-jump segmentation. Feature subsampling is used for most high-dimensional cross-domain datasets, especially ORL, RELATHE, PCMAC, Cell Cycle, CIFAR-10, and DrivFace, reflecting the larger feature spaces in this evaluation. The selected configurations also show that standardization is useful for most cross-domain settings except ORL, while TabPFN seeds vary modestly across datasets. Overall, the table shows that GOTabPFN remains flexible across text, image-feature, camera sensor, and RNA-seq domains by adapting the feature graph, segmentation rule, and compression budget to the geometry of each dataset. Prior Labs notes that TabPFN inference is deterministic for a fixed seed in the same environment, while small differences may occur across different hardware configurations. (Footnote 1: Prior Labs FAQ, [https://docs.priorlabs.ai/faq](https://docs.priorlabs.ai/faq). See also the PriorLabs/TabPFN reproducibility discussion, issue #266, [https://github.com/PriorLabs/TabPFN/issues/266](https://github.com/PriorLabs/TabPFN/issues/266).)

Table H.1: Best GOTabPFN hyperparameters for 8 HDLSS datasets (COL = Colon, LNG = Lung, GLI = GLI-85, SMK = SMK_CAN_187, AML = ALLAML, PRS = Prostate-GE, ARC = Arcene, TOX = TOX-171). GO-LR metric: eucl.=euclidean, manh.=manhattan, corr.=correlation, KL=kl_divergence. NSC seg: unif.=uniform, eq-mass=equal_mass, lrg-jump=largest_jump. NSC rule: gam.=gamma, def.=default. Std? indicates `assume_standardized`.

Table H.2: Best GOTabPFN hyperparameters for 8 cross-domain datasets (ORL = orlraws10P, BAS = BASEHOCK, REL = RELATHE, PCM = PCMAC, CCY = Cell Cycle, CIF = CIFAR-10, DF-R = DrivFace-Regression, DF-C = DrivFace-Classification). GO-LR metric: cos.=cosine, manh.=manhattan, corr.=correlation, KL=kl_divergence. NSC seg: unif.=uniform, lrg-jump=largest_jump. NSC rule: def.=default. Std? indicates `assume_standardized`.

## Appendix I Statistical Significance Analysis

##### Significance analysis on HDLSS datasets\.

We evaluate statistical differences across methods using (i) a Friedman test (Friedman, [1937](https://arxiv.org/html/2606.05441#bib.bib96)) over per-dataset ranks, followed by a Nemenyi post-hoc critical-difference (CD) analysis (Nemenyi, [1963](https://arxiv.org/html/2606.05441#bib.bib92)) (Fig. [I.1](https://arxiv.org/html/2606.05441#A9.F1)), and (ii) pairwise Wilcoxon signed-rank tests (Demšar, [2006](https://arxiv.org/html/2606.05441#bib.bib95)) comparing GOTabPFN to each baseline across the same 8 datasets, with Holm correction to control family-wise error (Table [I.1](https://arxiv.org/html/2606.05441#A9.T1)). The Friedman test indicates a significant overall effect across methods, and the CD diagram visualizes the separation in average ranks, where GOTabPFN attains the lowest (best) average rank. For pairwise tests, the raw Wilcoxon $p$-values are identical across baselines ($p_{\text{raw}} = 0.00781$), reflecting that GOTabPFN improves over each comparator on all datasets with no sign reversals (a common outcome when $n = 8$ and per-dataset differences are consistently positive). After Holm correction, the adjusted $p$-values become more conservative ($p_{\text{Holm}} = 0.0703$), so we do not claim strict significance at $\alpha = 0.05$ under family-wise error control; nevertheless, the combination of uniform wins, rank dominance, and consistent positive paired differences supports the robustness of the observed improvements.
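
The arithmetic behind the two reported $p$-values can be reproduced with a minimal sketch. It assumes an exact two-sided Wilcoxon signed-rank test without zero or tied differences and a standard Holm step-down adjustment; the per-dataset differences are made-up placeholders, and the family size of nine is our inference from $0.0703 \approx 9 \times 0.00781$ rather than a number stated in the text.

```python
from itertools import product

def wilcoxon_exact_two_sided(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value (assumes no zeros, no tied |d|)."""
    n = len(diffs)
    # Rank absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null, each rank carries a + or - sign with probability 1/2:
    # enumerate all 2^n sign patterns and count rank sums at most w.
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, keep in zip(range(1, n + 1), signs) if keep) <= w
    )
    return min(1.0, 2.0 * count / 2 ** n)

def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical accuracy gains on 8 datasets, all positive (illustrative only).
toy_diffs = [1.2, 0.5, 3.1, 2.0, 0.8, 4.4, 1.7, 2.6]
p_raw = wilcoxon_exact_two_sided(toy_diffs)
p_holm = holm_adjust([p_raw] * 9)  # nine comparisons is an assumption
```

With eight uniformly positive differences, `p_raw` equals $2/2^8 = 0.0078125$, the smallest two-sided value attainable at $n = 8$, and nine such comparisons give a Holm-adjusted value of $0.0703125$, which is why uniform wins alone cannot reach $\alpha = 0.05$ here.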

##### Extended statistical significance analysis\.

We further evaluate statistical significance on an expanded 16-dataset benchmark formed by combining the 8 HDLSS datasets with the 8 cross-domain datasets, using the strongest common comparison set from the main HDLSS and cross-domain experiments. Table [I.2](https://arxiv.org/html/2606.05441#A9.T2) summarizes the average ranks: GOTabPFN achieves the best average rank by a clear margin (1.12), followed by TabPFN-Wide (3.62), TANDEM (3.69), and TabDPT (4.34). The same ranking trend is visualized in Fig. [I.2](https://arxiv.org/html/2606.05441#A9.F2), where GOTabPFN is separated from the nearest competing methods by more than two average-rank points. As summarized in Table [I.3](https://arxiv.org/html/2606.05441#A9.T3), the omnibus Friedman test over the 9-method comparison is strongly significant ($\chi^{2} = 52.55$, $p = 1.32\times 10^{-8}$; 11 complete rows used), rejecting the null hypothesis that all methods have equal rank distributions across the expanded benchmark. We then compare GOTabPFN against each baseline using pairwise Wilcoxon signed-rank tests with Holm correction. Table [I.4](https://arxiv.org/html/2606.05441#A9.T4) shows that GOTabPFN remains statistically significant against every baseline after correction, with Holm-adjusted $p$-values between $2.44\times 10^{-4}$ and $6.10\times 10^{-4}$. The win/tie/loss counts are uniformly favorable: 16/0/0 against Lasso, TabDPT, TabPFN-Wide, and TuneTables; 14/0/0 against TabICL on the common subset; 12/0/0 against ProtoGate on the common subset; and 15/0/1 against MLP and TANDEM. Table [I.5](https://arxiv.org/html/2606.05441#A9.T5) further reports the mean accuracy gain and Holm-corrected pairwise significance of GOTabPFN against each strong baseline on the expanded 16-dataset benchmark.
The Holm-adjusted $p$-values remain below $0.01$ for all pairwise comparisons, indicating that GOTabPFN is significantly better than each baseline after controlling for multiple comparisons. Together, these results complement the 55-baseline analysis on the original 8 HDLSS datasets by showing that GOTabPFN remains robust across a broader 16-dataset evaluation against the strongest repeated baselines.
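
As a consistency check on the omnibus test, the sketch below computes a Friedman statistic from a rank matrix and converts a chi-square statistic into a $p$-value. It assumes complete rows, no tie correction, and the chi-square approximation with $k - 1$ degrees of freedom; the function names are ours, and the closed-form tail used here holds only for even degrees of freedom (here $9 - 1 = 8$).

```python
import math

def friedman_statistic(ranks):
    """Friedman chi-square from a rank matrix (rows = datasets, columns = methods).

    Assumes every row is complete and applies no tie correction.
    """
    n = len(ranks)      # number of datasets with all methods present
    k = len(ranks[0])   # number of methods
    avg = [sum(row[j] for row in ranks) / n for j in range(k)]
    return 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4.0)

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires even degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Reported statistic from the text; df = 9 methods - 1 = 8.
p_value = chi2_sf_even_df(52.55, df=8)
```

Under these assumptions `p_value` comes out at roughly $1.32\times 10^{-8}$, in line with the reported figure. The rank matrix itself is not reproduced here, so `friedman_statistic` is shown only to make the definition concrete.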

Table I.1: Pairwise significance of GOTabPFN against the top baselines across the 8 HDLSS datasets. $\Delta$Acc denotes the mean accuracy improvement of GOTabPFN over each baseline (percentage points) averaged across datasets. We report Wilcoxon signed-rank $p$-values ($p_{\text{raw}}$) and Holm-corrected $p$-values ($p_{\text{Holm}}$) for multiple comparisons; Sig. indicates significance after Holm correction ($\alpha = 0.05$).

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/GOTabPFN_cd_others_top10_rankline.png) Figure I.1: Average-rank comparison on the 8 HDLSS datasets. Lower rank is better. Friedman/Nemenyi analysis shows GOTabPFN as the best-ranked method on the original HDLSS benchmark.

![Refer to caption](https://arxiv.org/html/2606.05441v1/expanded16_cd_rankline.png) Figure I.2: Average-rank comparison on the expanded 16-dataset benchmark. Lower rank is better. GOTabPFN achieves the best average rank (1.12), followed by TabPFN-Wide (3.62) and TANDEM (3.69), with a significant Friedman test ($\chi^{2} = 52.55$, $p = 1.32\times 10^{-8}$).

Table I.2: Average-rank summary on the expanded 16-dataset benchmark. Lower rank is better. The Friedman test over the 9-method comparison is significant ($\chi^{2} = 52.55$, $p = 1.32\times 10^{-8}$; 11 complete rows).

Table I.3: Omnibus Friedman test on the expanded 16-dataset benchmark. The test evaluates whether the 9 methods have equal rank distributions across datasets; 11 complete rows were used because Friedman requires all compared methods to be present.

Table I.4: Pairwise significance of GOTabPFN on the expanded 16-dataset benchmark. $\Delta$Acc denotes the mean accuracy improvement of GOTabPFN over each baseline in percentage points, averaged over the datasets used for that comparison. W/T/L counts wins/ties/losses in favor of GOTabPFN. We report raw Wilcoxon signed-rank $p$-values and Holm-corrected $p$-values across the 8 pairwise comparisons.

Table I.5: Pairwise significance on the expanded 16-dataset benchmark. $\Delta$Acc denotes the mean accuracy improvement of GOTabPFN over each baseline in percentage points. We report the number of datasets used ($n_{\rm ds}$), raw Wilcoxon signed-rank $p$-values, and Holm-corrected $p$-values across the 8 pairwise comparisons. All comparisons remain significant after Holm correction.

## Appendix J Additional Ablation Analysis

Fig. [J.7](https://arxiv.org/html/2606.05441#A10.F7) reports accuracy distributions across CV splits for the top-10 methods, showing that GOTabPFN achieves the strongest central tendency with competitive dispersion, i.e., high average performance without depending on a small number of favorable splits. Consistently, the average-rank vs. global-mean scatter in Fig. [J.4](https://arxiv.org/html/2606.05441#A10.F4) places GOTabPFN in the top-left regime (lowest average rank and highest global accuracy). The normalized per-dataset accuracies in Fig. [J.2](https://arxiv.org/html/2606.05441#A10.F2) and the dataset-wise rank breakdown as a function of sample size in Fig. [J.3](https://arxiv.org/html/2606.05441#A10.F3) further indicate that the gains persist across datasets of different sizes rather than being driven by a single benchmark. Fig. [J.6](https://arxiv.org/html/2606.05441#A10.F6) summarizes the best-second-best gaps per dataset, highlighting where the leading method separates more clearly from the runner-up (notably on the harder benchmarks), while Fig. [J.5](https://arxiv.org/html/2606.05441#A10.F5) localizes improvements by visualizing $\Delta$Acc (ours $-$ baseline) across datasets and competitors, revealing broadly positive deltas with the largest separations against simpler baselines on challenging tasks. Finally, Fig. [J.1](https://arxiv.org/html/2606.05441#A10.F1) characterizes distributional shape via skewness and kurtosis computed over per-dataset accuracies, where negative skewness reflects a small number of difficult datasets that pull performance downward and higher kurtosis for several baselines suggests heavier tails and greater instability compared to the most consistent top performers.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_top10_skew_kurtosis.png) Figure J.1: Skewness/kurtosis for the top-10 methods on the 8 HDLSS benchmarks.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_hdlss_top10_normalized_accuracy_bar.png) Figure J.2: Normalized accuracy for the top-10 methods on the 8 HDLSS benchmarks.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/model_rank_vs_sample_size_top10_penalizedNA_bar.png) Figure J.3: Model rank versus sample size for the top-10 methods on the 8 HDLSS benchmarks.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_rank_vs_accuracy_legend_top10.png) Figure J.4: Avg. rank vs. global accuracy.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/GOTabPFN_delta_heatmap.png) Figure J.5: $\Delta$Acc heatmap (ours $-$ baseline).

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_best_second_margin.png) Figure J.6: Best-second-best margin across the 8 HDLSS benchmarks.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/GOTabPFN_top10_boxplot.png) Figure J.7: Accuracy distributions across CV splits for the top-10 methods on the 8 HDLSS benchmarks.
## Appendix K Representation Quality via t-SNE

Figure [K.1](https://arxiv.org/html/2606.05441#A11.F1) visualizes the NSC token/latent representation learned by GOTabPFN on the Colon dataset using a 2D t-SNE (Maaten and Hinton, [2008](https://arxiv.org/html/2606.05441#bib.bib49)) embedding (points colored by class). Each point corresponds to a sample after GO-LR ordering and NSC compression (PCA-based segmentation). The plot indicates that the compressed representation preserves class-discriminative structure in a low-dimensional manifold: samples from the two classes exhibit partially separated regions with limited overlap, suggesting that NSC produces a structured embedding that is more amenable to the downstream TabPFN-2.5 head under the HDLSS regime.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/colon_gotabpfn_tsne.png) Figure K.1: t-SNE visualization of the NSC latent space in GOTabPFN on Colon (colored by class).
## Appendix L Inference Level Ablation on Calibration and Robustness

On Colon, we complement the main accuracy results with inference-time diagnostics that probe robustness, selectivity, neighborhood structure, and calibration. First, robustness to feature perturbations (Table [L.1](https://arxiv.org/html/2606.05441#A12.T1) and Fig. [L.1(b)](https://arxiv.org/html/2606.05441#A12.F1.sf2)) shows a graceful degradation as the perturbed fraction increases: accuracy remains near-ceiling under mild corruption (e.g., $\leq 10\%$) but drops substantially under heavy perturbations, with shuffling generally more harmful than mean-imputation at moderate rates (e.g., at 50%: 74.19% vs. 98.39%). Second, selective prediction behaves as expected (Fig. [L.1(a)](https://arxiv.org/html/2606.05441#A12.F1.sf1)): increasing the confidence threshold $\tau$ raises accuracy on the retained “confident” subset while reducing coverage, indicating that model confidence meaningfully ranks predictions by correctness. Third, local consistency in the NSC latent space remains stable (Table [L.1](https://arxiv.org/html/2606.05441#A12.T1) and Fig. [L.1(d)](https://arxiv.org/html/2606.05441#A12.F1.sf4)), with kNN label agreement around 73.87% at $k = 5$ across this evaluation, suggesting a reasonably coherent neighborhood geometry after compression. Finally, the reliability diagram in Fig. [L.1(c)](https://arxiv.org/html/2606.05441#A12.F1.sf3) reports a low expected calibration error (ECE $\approx 0.033$), indicating that predicted probabilities are well-aligned with empirical accuracy on this dataset.
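
To make the calibration and selective-prediction diagnostics concrete, the following sketch computes ECE and an accuracy/coverage pair at a confidence threshold $\tau$. It assumes the common equal-width binning of the top-class confidence with 10 bins; the binning actually used for the reported ECE is not stated, and the function names and toy inputs are ours.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (|B|/N) * |accuracy(B) - mean confidence(B)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes to the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece

def selective_accuracy(confidences, correct, tau):
    """Accuracy on samples with confidence >= tau, and the fraction of samples retained."""
    kept = [ok for c, ok in zip(confidences, correct) if c >= tau]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, coverage

# Toy inputs (illustrative only): top-class confidence and 0/1 correctness.
toy_conf = [0.95, 0.92, 0.85, 0.65, 0.55]
toy_correct = [1, 1, 1, 0, 1]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
acc_at_08, cov_at_08 = selective_accuracy(toy_conf, toy_correct, tau=0.8)
```

Sweeping `tau` over a grid and plotting the two returned values against it gives a confidence curve of the kind described above: coverage can only fall as `tau` rises, and accuracy rises when confidence ranks predictions well.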

Table L.1: GOTabPFN (Colon) inference-time robustness to feature perturbations and latent-space neighborhood consistency. Tab Shuffle randomly permutes a fraction of feature columns across samples; Tab Drop replaces a fraction with the global feature mean. Higher is better for accuracy and kNN agreement.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_confidence_curve.png) (a) Confidence vs. accuracy & coverage.
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_robustness.png) (b) Robustness to perturbations.
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_reliability.png) (c) Reliability diagram (ECE).
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_knn_agreement.png) (d) $k$NN label agreement.

Figure L.1: GOTabPFN (Colon): confidence/coverage, calibration, robustness, and latent-space neighborhood agreement.
## Appendix M Sanity and Stress Diagnostics

Figure [M.1](https://arxiv.org/html/2606.05441#A13.F1) summarizes additional sanity and stress tests for GOTabPFN on Colon. As shown in Fig. [M.1(a)](https://arxiv.org/html/2606.05441#A13.F1.sf1), performance is near-ceiling on the full input (98.39%), but drops sharply under degenerate signals (all-zero or global-mean inputs both 64.52%), and further degrades when the feature rows are randomly permuted (54.84%), confirming that predictions depend on meaningful sample-specific structure rather than trivial priors. Notably, accuracy remains high under strong additive noise (95.16%), suggesting robustness to moderate distributional corruption in feature values. We also evaluate a simple tabular test-time augmentation (TTA) procedure (Fig. [M.1(b)](https://arxiv.org/html/2606.05441#A13.F1.sf2)): majority voting over $n_{\text{aug}} = 5$ noisy/dropout augmentations matches the base accuracy (both 98.39%), and only 3.23% of samples change their predicted label under any augmentation, indicating high prediction stability. Table [M.1](https://arxiv.org/html/2606.05441#A13.T1) reports these stress-test and stability numbers alongside per-class accuracy, showing uniformly strong performance across classes (97.5% on the majority class with support 40, and 100% on the minority class with support 22), consistent with the robustness patterns observed in Fig. [M.1](https://arxiv.org/html/2606.05441#A13.F1).
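
The TTA stability numbers can be illustrated with a small sketch. It assumes the vote is taken over the augmented predictions only and that a sample counts as changed if any augmented prediction differs from the base prediction; neither detail is specified in the text, and the toy predictions are placeholders.

```python
from collections import Counter

def majority_vote(aug_preds):
    """aug_preds[a][i] is the label predicted for sample i under augmentation a."""
    n_samples = len(aug_preds[0])
    return [Counter(run[i] for run in aug_preds).most_common(1)[0][0] for i in range(n_samples)]

def flip_rate(base_preds, aug_preds):
    """Fraction of samples whose label differs from the base prediction under at least one augmentation."""
    flipped = sum(any(run[i] != base for run in aug_preds) for i, base in enumerate(base_preds))
    return flipped / len(base_preds)

# Toy example: 4 samples, 5 augmentations (illustrative only).
toy_base = [0, 1, 1, 0]
toy_aug = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
]
toy_voted = majority_vote(toy_aug)
toy_flip = flip_rate(toy_base, toy_aug)
```

In the toy case two samples flip under a single augmentation each, yet the vote still recovers the base labels, which is the pattern the reported numbers describe: a small flip rate with unchanged voted accuracy.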

Table M.1: Sanity/stress and stability diagnostics for GOTabPFN on Colon dataset.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_signal_sanity_stress.png) (a) Signal sanity & stress tests.
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_tta_accuracy.png) (b) Base vs. TTA accuracy.

Figure M.1: Sanity and stress diagnostics on Colon.
## Appendix N Additional Reliability and Interpretability Diagnostics

On the Colon benchmark, GOTabPFN achieves a baseline Top-1 accuracy of $98.39\%$ and reaches $100\%$ Top-2 accuracy (Fig. [N.1(b)](https://arxiv.org/html/2606.05441#A14.F1.sf2)). The normalized confusion matrix indicates a single error case, with the most frequent confusion being true class $0 \rightarrow 1$ occurring once (Fig. [N.1(c)](https://arxiv.org/html/2606.05441#A14.F1.sf3)). Confidence is well-separated: the mean margin $p_{\text{top}1} - p_{\text{top}2}$ is high for correct predictions ($0.949$) and near-zero for the lone incorrect prediction ($0.025$), yielding a clear bimodal separation (Fig. [N.1(a)](https://arxiv.org/html/2606.05441#A14.F1.sf1)). Finally, per-feature permutation importance on this small evaluation set shows $0.0$ percentage-point accuracy drop for the top-ranked features, consistent with near-saturated accuracy and limited headroom for measurable single-feature perturbation effects under this diagnostic protocol.

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_margin_hist.png) (a) Margin distribution ($p_{\text{top}1} - p_{\text{top}2}$).
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_topk_accuracy.png) (b) Top-$k$ accuracy curve.
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_confusion_matrix_norm.png) (c) Normalized confusion matrix.

Figure N.1: Extra reliability diagnostics for GOTabPFN on Colon. We report (left) margin separation between correct vs. incorrect predictions, (middle) Top-$k$ accuracy, and (right) normalized confusion matrix.
## Appendix O Theory-Inspired Representation Diagnostics

We analyze the learned embedding geometry and confidence behavior of GOTabPFN on Colon, where the model attains $98.39\%$ top-1 accuracy. The embedding spectrum in Fig. [O.1(a)](https://arxiv.org/html/2606.05441#A15.F1.sf1) is strongly low-rank, with an effective dimension (participation ratio) of $3.5$, and the cumulative curve in Fig. [O.1(b)](https://arxiv.org/html/2606.05441#A15.F1.sf2) shows that only $2/4/6/11/29$ components capture $50/80/90/95/99\%$ of the variance, respectively, indicating a highly concentrated representation. Despite this compression, local-neighborhood classifiers are insufficient: leave-one-out kNN in embedding space peaks at $82.26\%$ (at $k = 5$) and degrades for larger $k$ (Fig. [O.1(c)](https://arxiv.org/html/2606.05441#A15.F1.sf3)), remaining well below the parametric TabPFN head, suggesting the decision rule leverages more than simple Euclidean locality. Finally, margin diagnostics confirm strong separation and calibrated confidence: the mean normalized margin is $0.9707$ for correct predictions versus $-0.0359$ for incorrect ones, and the conditional error stays near $0$ across a wide range of margin thresholds while maintaining near-full coverage (Fig. [O.1(d)](https://arxiv.org/html/2606.05441#A15.F1.sf4)–[(e)](https://arxiv.org/html/2606.05441#A15.F1.sf5)). Together, these results support that GOTabPFN forms a low-dimensional, sharply separated embedding while relying on a richer (non-kNN) parametric decision mechanism.
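
For readers unfamiliar with the two spectral summaries, the sketch below computes them from a list of covariance eigenvalues. It assumes the standard participation-ratio definition $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, which the text names but does not spell out; the toy spectrum is a placeholder, not the Colon embedding spectrum.

```python
def participation_ratio(eigenvalues):
    """Effective dimension: (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigenvalues)
    return total * total / sum(v * v for v in eigenvalues)

def components_for_variance(eigenvalues, fraction):
    """Smallest number of leading components whose cumulative variance share reaches `fraction`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for m, v in enumerate(ordered, start=1):
        running += v
        if running / total >= fraction:
            return m
    return len(ordered)

# Toy spectrum (illustrative only).
toy_spectrum = [4.0, 2.0, 1.0, 1.0]
toy_pr = participation_ratio(toy_spectrum)
toy_counts = [components_for_variance(toy_spectrum, f) for f in (0.5, 0.8, 0.99)]
```

The participation ratio equals the number of components when all eigenvalues are equal and approaches 1 when a single eigenvalue dominates, so a value near 3.5 alongside 29 components for 99% of the variance indicates a few dominant directions with a long, thin tail.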

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_embedding_spectrum.png) (a) Embedding spectrum (top PCs).
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_embedding_cumulative_variance.png) (b) Cumulative explained variance.
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_knn_vs_head.png) (c) kNN (LOO) vs. parametric head.
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_margin_conditional_error.png) (d) Margin-thresholded conditional error.
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_margin_coverage.png) (e) Coverage vs. margin threshold.

Figure O.1: Theory-inspired representation diagnostics for GOTabPFN (Colon). (a) Variance spectrum indicates a sharp concentration of energy in the leading PCs. (b) Cumulative explained variance shows rapid saturation. (c) Leave-one-out kNN in embedding space underperforms the parametric head, suggesting performance is not explained by simple local geometry alone. (d-e) Margin-based conditional error and coverage demonstrate high-confidence predictions over most samples.
## Appendix P OOD and Local Sensitivity Diagnostics

On Colon, GOTabPFN achieves $98.39\%$ ID accuracy and exhibits highly confident and low-entropy predictions on ID inputs (mean max-softmax confidence $= 0.967$, mean entropy $= 0.106$; Fig. [P.1(a)](https://arxiv.org/html/2606.05441#A16.F1.sf1)–[(b)](https://arxiv.org/html/2606.05441#A16.F1.sf2)). Under synthetic OOD-style tabular inputs, uncertainty increases: Gaussian noise and column-wise permutation both reduce confidence (mean $\approx 0.810$ and $0.795$) and raise entropy (mean $\approx 0.424$ and $0.446$), while constant “blank” features yield intermediate behavior (confidence $0.883$, entropy $0.360$), indicating that the model does not collapse to uniformly overconfident predictions off-manifold (Fig. [P.1(a)](https://arxiv.org/html/2606.05441#A16.F1.sf1)–[(b)](https://arxiv.org/html/2606.05441#A16.F1.sf2)). Finally, a local Lipschitz-like probe under small feature perturbations ($\epsilon = 0.10$) shows a distribution concentrated at low $\|\Delta\mathrm{probs}\|_{2} / \|\Delta x\|_{2}$ with a modest tail (mean $0.041$, median $0.015$; Fig. [P.1(c)](https://arxiv.org/html/2606.05441#A16.F1.sf3)), suggesting that the learned predictor is generally stable to small tabular noise while allowing occasional locally sensitive regions.
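
The three quantities in this paragraph can be sketched as follows. The sketch assumes entropy in nats and reads $\epsilon$ as the standard deviation of i.i.d. Gaussian feature noise; the text fixes neither choice, and `predict_proba` is a hypothetical callable standing in for the fitted model.

```python
import math
import random

def max_confidence(probs):
    """Max-softmax confidence: the largest predicted class probability."""
    return max(probs)

def predictive_entropy(probs):
    """Shannon entropy of the predicted distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def local_sensitivity(predict_proba, x, eps=0.10, rng=None):
    """Ratio ||p(x + delta) - p(x)||_2 / ||delta||_2 for one random Gaussian perturbation delta."""
    rng = rng or random.Random(0)
    delta = [rng.gauss(0.0, eps) for _ in x]
    perturbed = [a + d for a, d in zip(x, delta)]
    diff = [a - b for a, b in zip(predict_proba(perturbed), predict_proba(x))]
    return l2_norm(diff) / l2_norm(delta)

def toy_predict_proba(x):
    """Hypothetical two-class logistic model, used only to exercise the probe."""
    z = 0.8 * x[0] - 0.5 * x[1]
    p1 = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p1, p1]

toy_score = local_sensitivity(toy_predict_proba, [0.3, -1.2])
```

Repeating `local_sensitivity` over many samples and perturbation draws yields a distribution of ratios whose mean and median can then be summarized as in the text.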

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_conf_id_vs_noise_const.png) (a) Max-softmax confidence: ID vs. Noise/Const.
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_entropy_id_vs_noise_const.png) (b) Predictive entropy: ID vs. Noise/Const.
![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_local_sensitivity_tab.png) (c) Local sensitivity in feature space.

Figure P.1: OOD and sensitivity-style diagnostics for GOTabPFN on Colon. Confidence/entropy histograms compare in-distribution (ID) inputs with synthetic OOD tabular perturbations (Noise, Const), while the right panel reports a Lipschitz-like local sensitivity score $\|\Delta\mathrm{probs}\|_{2} / \|\Delta x\|_{2}$ under small feature-space noise.
## Appendix Q Deployment-Oriented Triage Diagnostics

To clarify deployment behavior beyond multiclass accuracy, we cast Colon as a triage task by treating class 1 as the positive (“high-risk”) class and sweeping a decision threshold over the predicted probability $p(y{=}1 \mid x)$. As shown in Fig. [Q.1](https://arxiv.org/html/2606.05441#A17.F1), the resulting ROC achieves $\mathrm{AUC} = 1.000$, indicating perfect separability between positives and negatives on this evaluation set ($N = 62$). Using a sensitivity-driven operating constraint (target sensitivity $0.95$), the selected threshold is $\text{th}^{\ast} = 0.7885$, which attains sensitivity $= 1.000$ and specificity $= 1.000$; the induced confusion matrix is $TN{=}40, FP{=}0, FN{=}0, TP{=}22$, yielding $100\%$ precision/recall and $100\%$ binary accuracy at this operating point. Finally, we report wall-clock latency over 20 random mini-batches (batch size 64) with mean $\approx 639$ ms per batch (p50 $\approx 638$ ms, p90 $\approx 644$ ms, p99 $\approx 649$ ms), providing a coarse throughput reference for deployment-oriented settings.
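
The operating-point rule described here (meet the target sensitivity, then take the best specificity among feasible thresholds) can be sketched as below. Candidate thresholds are assumed to be the observed scores, ties in specificity are broken toward the larger threshold, and a sample is called positive when its score is at least the threshold; these details and the toy scores are our assumptions.

```python
def select_threshold(scores, labels, target_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) with the best specificity among
    thresholds meeting the sensitivity target, or None if no threshold is feasible."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sensitivity = tp / positives
        specificity = tn / negatives
        if sensitivity >= target_sensitivity:
            candidate = (specificity, t, sensitivity)
            if best is None or candidate > best:
                best = candidate
    if best is None:
        return None
    specificity, t, sensitivity = best
    return t, sensitivity, specificity

# Toy scores p(y=1|x) and binary labels (illustrative only).
separable = select_threshold([0.1, 0.2, 0.3, 0.8, 0.9], [0, 0, 0, 1, 1])
overlapping = select_threshold([0.2, 0.6, 0.4, 0.7], [0, 0, 1, 1])
```

When the classes are perfectly separable by score, as the reported AUC of 1.000 implies, the rule reaches sensitivity and specificity of 1 at once; with overlapping scores it trades specificity away to keep the sensitivity constraint satisfied.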

![Refer to caption](https://arxiv.org/html/2606.05441v1/Figs/gotabpfn_colon_triage_roc.png) Figure Q.1: Deployment-style triage diagnostic on Colon (class 1 vs rest). ROC curve for treating class 1 as the “high-risk” positive class. The marked operating point (th*) corresponds to the selected threshold achieving the target sensitivity criterion, with the best specificity among feasible thresholds.
## Appendix R Extension beyond TabPFN

GO-LR + NSC is not tied to TabPFN; it is a model-agnostic representation interface that can be paired with other tabular foundation models. We use TabPFN-2.5 (Grinsztajn et al., [2025](https://arxiv.org/html/2606.05441#bib.bib115)) as the main backbone because its feature-dimensionality bottleneck is especially clear, but the same front-end also transfers to TabICL (Jingang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib107)), as shown in Table [R.1](https://arxiv.org/html/2606.05441#A18.T1). Across the same 8 HDLSS datasets, GO-LR + NSC + TabICL improves over vanilla TabICL on 5/8 datasets in accuracy, 7/8 in ROC-AUC, and 5/8 in macro-F1. Importantly, it also reduces runtime on all 8 datasets, with especially large gains on very high-dimensional cases such as GLI, SMK, and ARC. These results suggest that GO-LR + NSC acts as a broader HDLSS-oriented representation layer rather than a TabPFN-specific preprocessing trick.

Table R.1: GO-LR + NSC as a model-agnostic front-end for TabICL. We compare GO-LR + NSC + TabICL against vanilla TabICL on 8 HDLSS datasets under $5\times 5$ CV. Values are mean with subscripted standard deviation for accuracy, ROC-AUC, and macro-F1; runtime is total seconds. Bold denotes the better value between the two methods for each metric/dataset.
## Appendix S TabPFN Seed Sensitivity

##### Dataset\-seed vs\. TabPFN\-seed robustness\.

We distinguish two sources of randomness in our evaluation. A dataset seed controls the train/validation/test partition or CV folds, and therefore measures data-split robustness: whether conclusions persist across different sampled splits of the same small HDLSS dataset. This is the main source of evaluation variability in our setting, since small tabular datasets can be highly split-sensitive (Rubachev et al., [2025](https://arxiv.org/html/2606.05441#bib.bib173); Grinsztajn et al., [2022](https://arxiv.org/html/2606.05441#bib.bib174); Bouthillier et al., [2021](https://arxiv.org/html/2606.05441#bib.bib175)); accordingly, our main experiments use repeated $5\times 5$ CV following ProtoGate (Jiang et al., [2024](https://arxiv.org/html/2606.05441#bib.bib98)) to average over multiple dataset splits. In contrast, a TabPFN seed controls the `random_state` or equivalent stochastic components inside the TabPFN inference/configuration pipeline, and therefore measures model/inference stochasticity while holding the data split fixed; such run-level and distribution-level variance is also a known concern in neural-network evaluation (Jordan, [2024](https://arxiv.org/html/2606.05441#bib.bib176)). Prior Labs provides the official TabPFN classification interface (Prior Labs classification documentation: [https://docs.priorlabs.ai/capabilities/classification](https://docs.priorlabs.ai/capabilities/classification)), and maintainers discuss fixed-`random_state` reproducibility for TabPFN in the implementation repository (PriorLabs/TabPFN reproducibility discussion, issue #266: [https://github.com/PriorLabs/TabPFN/issues/266](https://github.com/PriorLabs/TabPFN/issues/266)). Recent TabPFN studies also report averages over multiple seeds (Ye et al., [2025a](https://arxiv.org/html/2606.05441#bib.bib177)). Thus, varying dataset seeds tests evaluation robustness, whereas varying TabPFN seeds isolates model-seed sensitivity.
In our main HDLSS experiments, we prioritize repeated $5\times 5$ CV for split robustness and use the Optuna-tuned best TabPFN seed per dataset to avoid underestimating TabPFN from an unfavorable inference seed; fixed-split multi-TabPFN-seed results are supplementary analyses of model-seed variance.
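
The difference between the two seeds can be illustrated with a minimal sketch. It assumes NumPy only; `repeated_stratified_folds`, `NoisyCentroidClassifier`, and `cv_accuracy` are hypothetical helpers written for this illustration, the data are synthetic, and the classifier is a toy stand-in for a seeded predictor rather than TabPFN or our actual pipeline.

```python
import numpy as np


def repeated_stratified_folds(y, n_splits=5, n_repeats=5, dataset_seed=0):
    """Yield (train_idx, test_idx) pairs for repeated stratified K-fold CV.

    The dataset seed is the only source of randomness in the splits.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(dataset_seed)
    for _ in range(n_repeats):
        fold_of = np.empty(len(y), dtype=int)
        for cls in np.unique(y):
            idx = rng.permutation(np.flatnonzero(y == cls))
            # Deal class members round-robin so every fold stays balanced.
            fold_of[idx] = np.arange(len(idx)) % n_splits
        for k in range(n_splits):
            yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)


class NoisyCentroidClassifier:
    """Hypothetical seeded predictor (NOT TabPFN).

    Nearest class centroid on a random feature subset, so predictions
    depend on `random_state` even when the data split is fixed.
    """

    def __init__(self, random_state=0, keep=0.5):
        self.random_state = random_state
        self.keep = keep

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        n_keep = max(1, int(self.keep * X.shape[1]))
        self.cols_ = rng.choice(X.shape[1], size=n_keep, replace=False)
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack(
            [X[y == c][:, self.cols_].mean(axis=0) for c in self.classes_]
        )
        return self

    def predict(self, X):
        diff = X[:, self.cols_][:, None, :] - self.centroids_[None]
        return self.classes_[(diff ** 2).sum(axis=-1).argmin(axis=1)]


def cv_accuracy(X, y, dataset_seed, model_seed):
    """Mean accuracy over 5x5 CV for one (dataset seed, model seed) pair."""
    scores = []
    for tr, te in repeated_stratified_folds(y, dataset_seed=dataset_seed):
        model = NoisyCentroidClassifier(random_state=model_seed).fit(X[tr], y[tr])
        scores.append(np.mean(model.predict(X[te]) == y[te]))
    return float(np.mean(scores))


# Synthetic HDLSS-like toy data: 60 samples, 200 features.
rng = np.random.default_rng(123)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 200)) + 0.3 * y[:, None]

# Split robustness: vary the dataset seed, hold the model seed fixed.
split_scores = [cv_accuracy(X, y, dataset_seed=s, model_seed=0) for s in range(3)]
# Model-seed sensitivity: hold the splits fixed, vary only the model seed.
seed_scores = [cv_accuracy(X, y, dataset_seed=0, model_seed=s) for s in range(3)]
print(split_scores, seed_scores)
```

The spread of `split_scores` corresponds to what the dataset seed measures, and the spread of `seed_scores` to what the TabPFN seed measures, in this toy setting.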

##### Findings from TabPFN\-seed robustness analysis\.

Tables [S.1](https://arxiv.org/html/2606.05441#A19.T1), [S.2](https://arxiv.org/html/2606.05441#A19.T2), [S.3](https://arxiv.org/html/2606.05441#A19.T3), [S.4](https://arxiv.org/html/2606.05441#A19.T4), and [S.5](https://arxiv.org/html/2606.05441#A19.T5) show that GOTabPFN remains consistently strong across TabPFN seeds, not only under one favorable random state. On the 8 HDLSS datasets, GOTabPFN achieves the best average accuracy for every tested seed: 90.57, 89.66, 89.93, 89.79, and 89.66, compared with the strongest competing averages of roughly 88–88.5 from TabPFN-Wide (Kolberg et al., [2025](https://arxiv.org/html/2606.05441#bib.bib117)) or TuneTables (Feuer et al., [2024](https://arxiv.org/html/2606.05441#bib.bib112)). The detailed per-dataset table further shows that GOTabPFN is especially effective on difficult high-dimensional datasets such as GLI, ARC, SMK, TOX, AML, and LNG, although some baselines occasionally win on individual datasets such as PRS or TOX for particular seeds. On the 8 cross-domain datasets, GOTabPFN is also the strongest complete method across all seeds, with averages around 86.6–86.9, while TabPFN-Wide and TuneTables vary more substantially across seeds. The only higher BETA (Liu and Ye, [2025](https://arxiv.org/html/2606.05441#bib.bib116)) average is reported for seed 42 over 6 datasets only, because Cell Cycle and DrivFace-Regression were omitted due to runtime; therefore, that number is not directly comparable to the complete 8-dataset averages. For ROC-AUC on the HDLSS benchmark, GOTabPFN again achieves the best average in 4 of 5 seeds and is nearly tied with TabPFN-Wide at seed 93 (93.71 vs. 93.74), indicating that the gains are not limited to accuracy but also largely persist in ranking quality.
Overall, these results distinguish GOTabPFN from other high\-dimensional TabPFN variants: TabPFN\-Wide, BETA, and TuneTables can process larger feature spaces, but they still rely primarily on the foundation\-model predictor or tuning strategy to absorb high\-dimensional structure\. GOTabPFN instead first reorganizes the feature space through GO\-LR and compresses locally coherent neighborhoods through NSC, yielding a compact and more stable representation before the frozen TabPFN\-2\.5 head\. This structured front\-end explains why GOTabPFN remains competitive or superior across seeds: it reduces the burden on the predictor, preserves locality among related features, and provides a more robust HDLSS\-specific interface than simply widening, tuning, or directly applying TabPFN\-style models to high\-dimensional inputs\.

##### Pareto frontier of runtime and performance\.

We further evaluate the accuracy\-efficiency trade\-off among high\-dimensionality compatible TabPFN\-family methods using mean runtime and mean performance under TabPFN seed 42\. As shown in Fig\.[S\.1](https://arxiv.org/html/2606.05441#A19.F1), GOTabPFN lies on the Pareto frontier on both benchmarks\. On the original 8 HDLSS datasets, GOTabPFN achieves the highest mean performance while remaining substantially faster than BETA and only moderately slower than TabPFN\-Wide and TuneTables, yielding the best overall trade\-off among the compared methods\. On the additional 8 cross\-domain datasets, where both sample size and dimensionality increase beyond the original HDLSS\-only benchmark, GOTabPFN again remains Pareto optimal: it achieves the best mean performance while requiring lower runtime than TabPFN\-Wide and TuneTables\. These results indicate that the GO\-LR\+NSC front\-end does not merely improve accuracy by adding excessive computation; rather, it provides an efficient structured compression interface that improves the performance\-runtime balance of TabPFN\-style prediction in high\-dimensional and larger\-sample regimes\.
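
For readers who want to reproduce this kind of analysis, Pareto optimality in the (runtime, performance) plane can be computed as in the following minimal sketch. The function `pareto_front` and the method names and numbers in `toy` are hypothetical and chosen for illustration only; they are not the measurements behind Fig. S.1.

```python
def pareto_front(points):
    """Return the names of methods that are not dominated.

    `points` maps a method name to (mean_runtime, mean_performance).
    Lower runtime and higher performance are better. A method is dominated
    if another method is at least as good on both axes and strictly better
    on at least one.
    """
    front = []
    for name, (t, p) in points.items():
        dominated = any(
            (t2 <= t and p2 >= p) and (t2 < t or p2 > p)
            for other, (t2, p2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Sort by runtime so the frontier can be drawn left to right.
    return sorted(front, key=lambda n: points[n][0])


# Made-up numbers for illustration only; NOT the paper's measurements.
toy = {
    "A": (10.0, 85.0),
    "B": (25.0, 90.0),
    "C": (30.0, 88.0),   # dominated by B: slower and less accurate
    "D": (120.0, 89.0),  # dominated by B: slower and less accurate
}
print(pareto_front(toy))  # ['A', 'B']
```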

##### Random seed sensitivity on Colon\.

We further isolate TabPFN inference-seed sensitivity on the Colon dataset by fixing the best GO-LR+NSC configuration from Table [H.1](https://arxiv.org/html/2606.05441#A8.T1) and varying only the TabPFN inference seed over $\{0,1,2,3,4,7,11,17,23,42\}$. Preprocessing, GO-LR, NSC, and the exact same $5\times 5$ CV splits are kept fixed, so any variation comes only from the TabPFN seed. As shown in Table [S.6](https://arxiv.org/html/2606.05441#A19.T6), the results are tightly clustered: the across-seed mean accuracy is 87.19, with an across-seed standard deviation of only 0.46 and a full range of 86.56–88.18 percentage points. Thus, while the TabPFN seed introduces mild variation, the Colon result is not driven by an exceptionally favorable stochastic realization. This is particularly relevant because the strongest Colon results all come from TabPFN-family methods: GOTabPFN (88.18), TabPFN-Wide (87.85), and TuneTables (86.80). Under this like-for-like comparison, GOTabPFN remains the best-performing TabPFN-family method, supporting that its gain comes from the GO-LR+NSC representation interface rather than seed luck alone.

Table S.1: Average accuracy (↑) across the 8 HDLSS datasets for different TabPFN seeds. Values are mean accuracy with subscripted standard deviation across datasets.

Table S.2: TabPFN-seed robustness on 8 cross-domain datasets. Average accuracy across the 8 additional cross-domain datasets for different TabPFN seeds. Values are mean with subscripted standard deviation across datasets. BETA is omitted because it was not run on all datasets and required substantially longer runtime.

![Figure S.1](https://arxiv.org/html/2606.05441v1/pareto_time_vs_performance_seed42.png)

Figure S.1: Pareto frontier of mean runtime vs. mean performance for high-dimensionality compatible TabPFN-family methods. Results use TabPFN seed 42. Lower runtime and higher performance are better. Dashed lines connect Pareto optimal methods. Left: original 8 HDLSS datasets. Right: additional 8 cross-domain datasets. BETA is excluded from the right panel because it was not run on Cell Cycle and DrivFace-Regression and required substantially longer runtime.

Table S.3: Accuracy comparison across 8 HDLSS datasets for different TabPFN seeds. Values are mean accuracy with subscripted standard deviation over $5\times 5$ CV. The Avg. column reports the mean across the 8 datasets, with the subscript giving the mean of the corresponding standard deviations.

Table S.4: Accuracy comparison across 8 cross-domain datasets for different TabPFN seeds. Values are mean accuracy with subscripted standard deviation over $5\times 5$ CV. The Avg. column reports the mean across datasets within each seed block, with the subscript giving the mean of the corresponding standard deviations.

† BETA is reported only for seed 42. Cell Cycle and DrivFace-Regression are unavailable for BETA because the runs were prohibitively time-consuming, so its average is computed over the remaining 6 datasets.

Table S.5: ROC-AUC comparison across 8 HDLSS datasets for different TabPFN seeds. Values are mean ROC-AUC with subscripted standard deviation over $5\times 5$ CV. The Avg. column reports the mean across the 8 datasets, with the subscript giving the mean of the corresponding standard deviations.

Table S.6: TabPFN inference-seed sensitivity of GOTabPFN on Colon. GO-LR, NSC, preprocessing, and the exact same $5\times 5$ CV splits are fixed; only the TabPFN inference seed is varied. Values are mean accuracy with subscripted standard deviation over $5\times 5$ CV.

## Appendix T Additional Clarifications

##### Clarity of graph\-based feature ordering\.

To make the graph-based ordering step easier to follow, the main paper introduces GO-LR with an intuitive figure before the formal MinLA-based development. Here, we provide additional clarification. The goal of GO-LR is to place statistically related features close to one another on a one-dimensional axis, so that subsequent NSC segmentation groups coherent neighborhoods rather than arbitrary columns. Equivalently, GO-LR treats features as nodes in a weighted feature graph, where stronger edges indicate stronger feature relationships, and seeks a linear arrangement in which strongly connected nodes remain nearby. For example, if $f_1$ is strongly related to both $f_3$ and $f_4$, while $f_5$ is comparatively independent, an ordering such as $[f_3, f_1, f_4, f_2, f_5]$ is preferable to $[f_1, f_2, f_3, f_4, f_5]$, because the related features become contiguous and can be compressed more meaningfully. This intuition is illustrated in Fig. [1](https://arxiv.org/html/2606.05441#S1.F1) in the main paper. Formally, this corresponds to a Minimum Linear Arrangement (MinLA)-style objective that penalizes placing strongly related features far apart. Since exact optimization is combinatorial and intractable at scale, GO-LR uses a TSP-style nearest-neighbor path as an efficient initialization and then applies local refinement under the MinLA-style dispersion objective.
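
The two stages described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the reference implementation: edge weights are taken to be $|\mathrm{corr}(i,j)|$, the local refinement is restricted to adjacent swaps, the cost is recomputed naively after each swap, and all function names (`minla_cost`, `nearest_neighbor_path`, `refine_adjacent_swaps`, `go_lr_sketch`) are hypothetical.

```python
import numpy as np


def minla_cost(order, W):
    """Weighted linear-arrangement cost: sum_{i<j} W[i, j] * |pos(i) - pos(j)|."""
    pos = np.empty(len(order), dtype=int)
    pos[order] = np.arange(len(order))
    gap = np.abs(pos[:, None] - pos[None, :])
    return float(np.triu(W * gap, k=1).sum())


def nearest_neighbor_path(D, start=0):
    """Greedy TSP-path-style ordering: step to the closest unvisited feature."""
    n = D.shape[0]
    order = [start]
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    for _ in range(n - 1):
        dist = np.where(visited, np.inf, D[order[-1]])
        nxt = int(np.argmin(dist))
        order.append(nxt)
        visited[nxt] = True
    return np.array(order)


def refine_adjacent_swaps(order, W, max_passes=10):
    """Local refinement: keep an adjacent swap only if it lowers the cost."""
    order = order.copy()
    best = minla_cost(order, W)
    for _ in range(max_passes):
        improved = False
        for k in range(len(order) - 1):
            order[[k, k + 1]] = order[[k + 1, k]]
            cost = minla_cost(order, W)
            if cost < best - 1e-12:
                best, improved = cost, True
            else:
                order[[k, k + 1]] = order[[k + 1, k]]  # undo the swap
        if not improved:
            break
    return order


def go_lr_sketch(X):
    """Return (ordering, weight matrix) for the columns of X."""
    W = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(W, 0.0)
    init = nearest_neighbor_path(1.0 - W)  # distance = 1 - |corr|
    return refine_adjacent_swaps(init, W), W


# Synthetic version of the f1..f5 example: columns 0, 2, 3 (f1, f3, f4)
# share a latent factor; columns 1 and 4 (f2, f5) are independent noise.
rng = np.random.default_rng(0)
z = rng.normal(size=500)


def noisy(signal):
    return signal + 0.1 * rng.normal(size=500)


X = np.column_stack(
    [noisy(z), rng.normal(size=500), noisy(z), noisy(-z), rng.normal(size=500)]
)
order, W = go_lr_sketch(X)
print(order, minla_cost(order, W), minla_cost(np.arange(5), W))
```

In this toy setting the three coupled columns end up next to each other in `order`, which is the property that NSC segmentation relies on.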

##### Clarifying meta\-feature construction\.

To make the NSC representation interface more self-contained, we explicitly define a meta-feature as the low-dimensional token obtained by compressing one contiguous segment of the GO-LR-ordered feature axis. Let $\Pi^{\ast}$ denote the global feature ordering, $\{\mathcal{S}_t\}_{t=1}^{M}$ the contiguous ordered segments, and $u_t = x^{\Pi}_{\mathcal{S}_t}$ the subvector of sample $x$ restricted to segment $\mathcal{S}_t$. The $t$-th meta-feature is then $z_t = g(u_t)$, where $g(\cdot)$ is a segment-level pooling or projection operator. In our main NSC-pSP instantiation, $g(\cdot)$ is implemented by segment-wise PCA projection, producing one scalar token per ordered segment. The final compressed representation is therefore $Z(x) = (z_1, \ldots, z_M)$, which is passed to the frozen TabPFN-2.5 head. Fig. [2](https://arxiv.org/html/2606.05441#S2.F2) in the main paper illustrates this process: GO-LR first reorders the original features, NSC partitions the ordered axis into contiguous neighborhoods, and each neighborhood is compressed into a meta-feature. Additional details on ordered segmentation, subunit pooling, and meta-feature construction are provided in Sec. [3.2](https://arxiv.org/html/2606.05441#S3.SS2) in the main paper.
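
A minimal NumPy sketch of this construction follows, assuming near-equal-length contiguous segments and taking $g(\cdot)$ to be the projection onto the first principal component of each segment, fitted on training data only. The helper names `fit_segment_pca` and `transform_segment_pca`, the equal-length split, and the random ordering used in the demo are assumptions for illustration; the actual segmentation rule is described in Sec. 3.2.

```python
import numpy as np


def fit_segment_pca(X_train, order, n_segments):
    """Fit one leading-principal-component projector per ordered segment."""
    segments = np.array_split(np.asarray(order), n_segments)
    projectors = []
    for seg in segments:
        block = X_train[:, seg]
        mean = block.mean(axis=0)
        # The first right-singular vector of the centered block is the
        # leading PCA direction for that segment.
        _, _, vt = np.linalg.svd(block - mean, full_matrices=False)
        projectors.append((seg, mean, vt[0]))
    return projectors


def transform_segment_pca(X, projectors):
    """Map each sample x to Z(x) = (z_1, ..., z_M), one scalar per segment."""
    return np.column_stack(
        [(X[:, seg] - mean) @ direction for seg, mean, direction in projectors]
    )


# Toy usage: 12 features compressed into M = 4 meta-features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 12))
X_test = rng.normal(size=(10, 12))
order = rng.permutation(12)  # placeholder for a GO-LR ordering

projectors = fit_segment_pca(X_train, order, n_segments=4)
Z_train = transform_segment_pca(X_train, projectors)
Z_test = transform_segment_pca(X_test, projectors)
print(Z_train.shape, Z_test.shape)  # (40, 4) (10, 4)
```

Fitting the projectors on the training fold and reusing them on the test fold keeps the compression step free of test-set leakage within each CV split.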

##### Similarity vs\. dissimilarity metric\.

We use $1-|\mathrm{corr}(i,j)|$ as a dependence-aware dissimilarity so that strongly coupled feature pairs have small distance regardless of sign. Concretely, both strongly positive and strongly negative correlations satisfy $|\mathrm{corr}(i,j)| \approx 1$, hence $d_{ij} = 1-|\mathrm{corr}(i,j)| \approx 0$. This is the intended behavior for GO-LR: the goal is not to preserve the sign of association, but to place strongly dependent or redundant features into the same local neighborhood before compression. This choice is also consistent with our neuro-inspired motivation. In Sec. [3.2](https://arxiv.org/html/2606.05441#S3.SS2), we discuss evidence that dendritic inputs are organized into local subunits rather than summed globally, and that correlated synapses may exhibit local clustering within such compartments. Our algorithmic analogue is therefore that GO-LR first brings strongly coupled features close along the ordered axis, and NSC then pools these local neighborhoods into subunit-level meta-features. In this sense, $1-|\mathrm{corr}(i,j)|$ should be read as a practical measure of lack of coupling, chosen to support local clustering and subunit-style aggregation, rather than as a broader semantic notion of dissimilarity.
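
The sign-invariance of this dissimilarity is easy to check numerically. The snippet below is a small illustration on synthetic data, assuming Pearson correlation computed with `numpy.corrcoef`; the three toy features are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.column_stack(
    [
        z + 0.05 * rng.normal(size=500),   # f0: tracks the latent factor z
        -z + 0.05 * rng.normal(size=500),  # f1: strongly anti-correlated with f0
        rng.normal(size=500),              # f2: independent of both
    ]
)

corr = np.corrcoef(X, rowvar=False)
D = 1.0 - np.abs(corr)  # d_ij = 1 - |corr(i, j)|

# f0 and f1 have corr close to -1, yet their distance is close to 0;
# f0 and f2 are nearly uncorrelated, so their distance is close to 1.
print(corr[0, 1], D[0, 1], D[0, 2])
```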

##### Cross\-domain evaluation\.

To evaluate whether GOTabPFN generalizes beyond biomedical HDLSS datasets, we extend the benchmark with 8 additional cross-domain datasets spanning text, face images, camera-sensor data, image features, and RNA-seq measurements. These datasets cover HDLSS, HDHSS, and mixed regimes under the empirical categorization rule adopted from DynaTab (Habib et al., [2026b](https://arxiv.org/html/2606.05441#bib.bib157)) and expanded in Appendix [F](https://arxiv.org/html/2606.05441#A6). This extension is important because recent HDLSS-specific tabular models are still evaluated largely on biomedical benchmarks, with only limited coverage of text, face, or sensor domains. For example, ProtoGate (Jiang et al., [2024](https://arxiv.org/html/2606.05441#bib.bib98)) evaluates primarily on 7 biomedical datasets, while LSPIN/LLSPIN (Yang et al., [2022a](https://arxiv.org/html/2606.05441#bib.bib99)) reports real-world experiments on 3 text and 3 biomedical datasets. Similarly, high-dimensional TabPFN-style extensions remain limited in domain coverage: TabPFN-Wide (Kolberg et al., [2025](https://arxiv.org/html/2606.05441#bib.bib117)) uses 4 biomedical datasets, BETA (Liu and Ye, [2025](https://arxiv.org/html/2606.05441#bib.bib116)) includes a mixture of biomedical, text, and face-image datasets, and TuneTables (Feuer et al., [2024](https://arxiv.org/html/2606.05441#bib.bib112)) is evaluated through the broader TabZilla benchmark (McElfresh et al., [2023](https://arxiv.org/html/2606.05441#bib.bib178)). We did not identify a clear real-world financial HDLSS dataset suitable for inclusion in this evaluation. The resulting cross-domain suite, summarized in Table [T.1](https://arxiv.org/html/2606.05441#A20.T1), therefore broadens the empirical scope of our study while retaining high-dimensional settings where feature ordering and locality-aware compression are relevant.

Table T.1: Additional cross-domain datasets. Datasets are categorized using the empirical regime rule in Appendix [F](https://arxiv.org/html/2606.05441#A6), following DynaTab (Habib et al., [2026b](https://arxiv.org/html/2606.05441#bib.bib157)). Here, $n$ is the number of instances, $m$ is the number of features, and $\rho = m/n$ is the feature-to-sample ratio. CIFAR-10 uses subsampled ResNet-50 image embeddings.

##### Cross\-domain dominance\.

Fig. [T.1](https://arxiv.org/html/2606.05441#A20.F1) shows that GOTabPFN generalizes beyond its primary HDLSS target regime to a broader cross-domain benchmark. Across the 8 additional datasets, GOTabPFN ranks first on 7/8 datasets, with positive margins over the strongest competing method on ORL (+0.80), BAS (+0.07), PCM (+0.52), Cell Cycle (+1.27), CIFAR-10 (+0.30), DrivFace-R (+0.0043), and DrivFace-C (+1.15). The only exception is RELATHE, where GOTabPFN remains competitive and trails the best baseline by only 0.55 points. These results indicate that, even with limited tuning, the GO-LR+NSC pipeline remains highly effective on HDLSS-style datasets while also preserving competitive performance in adjacent high-dimensional and mixed-regime settings. The larger margins on ORL, Cell Cycle, and DrivFace-C further suggest that locality-aware feature ordering and compression are particularly beneficial when the data retain strong high-dimensional structure.
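
The signed margin reported above is a simple per-dataset quantity, sketched below. The function `signed_margin`, the baseline names, and the scores in the toy dictionaries are hypothetical and serve only to show the sign convention; they are not values from our tables. The sketch also assumes a higher-is-better metric.

```python
def signed_margin(scores, ours="GOTabPFN"):
    """Signed margin of `ours` against the best other method on one dataset.

    Positive: `ours` ranks first, and the value is its lead over the runner-up.
    Negative: `ours` does not rank first, and the value is its gap to the best.
    Assumes higher scores are better.
    """
    best_other = max(v for k, v in scores.items() if k != ours)
    return scores[ours] - best_other


# Hypothetical scores for illustration only (not taken from the paper).
toy_win = {"GOTabPFN": 91.0, "baseline_a": 90.2, "baseline_b": 88.5}
toy_loss = {"GOTabPFN": 87.0, "baseline_a": 87.5, "baseline_b": 80.0}
print(round(signed_margin(toy_win), 2), round(signed_margin(toy_loss), 2))  # 0.8 -0.5
```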

![Figure T.1](https://arxiv.org/html/2606.05441v1/gotabpfn_signed_margin_clean2.png)

Figure T.1: Signed margin of GOTabPFN on cross-domain datasets. Positive bars show GOTabPFN’s margin over the runner-up when it ranks first; the negative bar shows how far it trails the best baseline when it does not rank first. GOTabPFN ranks first on 7/8 additional datasets, with especially strong margins on ORL, Cell Cycle, and DrivFace-C, and remains competitive on RELATHE.

##### Clarifying tuning fairness\.

Our tuning protocol follows a distinction that is common in recent tabular-learning practice and is summarized in Fig. [A.1](https://arxiv.org/html/2606.05441#A1.F1) in Appendix [A](https://arxiv.org/html/2606.05441#A1). On one side are PFN/ICL-style tabular foundation models, which are typically used close to off-the-shelf because much of the modeling burden is absorbed during pre-training. For example, the official TabICL (Jingang et al., [2025](https://arxiv.org/html/2606.05441#bib.bib107)) repository states that TabICL does not require preprocessing or hyperparameter tuning (TabICL official repository: [https://github.com/soda-inria/tabicl](https://github.com/soda-inria/tabicl)), and the official TabDPT (Ma et al., [2025](https://arxiv.org/html/2606.05441#bib.bib97)) repository similarly describes TabDPT as an ICL-based tabular foundation model that generalizes to new tasks without additional training or hyperparameter tuning (TabDPT official repository: [https://github.com/layer6ai-labs/TabDPT-inference](https://github.com/layer6ai-labs/TabDPT-inference)). On the other side are conventional tuned tabular learners, such as TabR (Gorishniy et al., [2024](https://arxiv.org/html/2606.05441#bib.bib108)), TabM (Gorishniy et al., [2025](https://arxiv.org/html/2606.05441#bib.bib104)), RealMLP (Holzmüller et al., [2024](https://arxiv.org/html/2606.05441#bib.bib110)), XGBoost (Chen and Guestrin, [2016](https://arxiv.org/html/2606.05441#bib.bib109)), CatBoost (Prokhorenkova et al., [2018](https://arxiv.org/html/2606.05441#bib.bib105)), and LightGBM (Ke et al., [2017](https://arxiv.org/html/2606.05441#bib.bib106)), whose standard use often involves dataset-specific hyperparameter search.
GOTabPFN naturally lies between these two regimes: the GO-LR+NSC front-end is dataset-adaptive and therefore tunable, while the downstream TabPFN-2.5 (Grinsztajn et al., [2025](https://arxiv.org/html/2606.05441#bib.bib115)) predictor remains a frozen pre-trained backbone that is neither retrained nor structurally modified. Thus, the substantive dataset-specific adaptation in GOTabPFN occurs only in the representation interface before TabPFN inference, not in the foundation-model backbone itself. Regarding the TabPFN seed in Table [H.1](https://arxiv.org/html/2606.05441#A8.T1), this seed is not a trainable parameter and does not modify the backbone weights; it is a fixed inference-time configuration selected from the same predefined set under the same outer Optuna (Akiba et al., [2019](https://arxiv.org/html/2606.05441#bib.bib43)) study as the front-end hyperparameters (see Appendix [S](https://arxiv.org/html/2606.05441#A19) for TabPFN seed sensitivity). Therefore, GOTabPFN should be viewed as a hybrid tunable front-end attached to a frozen PFN-style predictor, rather than as a fully tuned end-to-end tabular learner. See Table [T.2](https://arxiv.org/html/2606.05441#A20.T2) for the source URLs of baselines.

##### Evaluation scope and statistical validation\.

We provide a consolidated view of the expanded benchmark and statistical analyses supporting our empirical conclusions. The main paper reports results on the original 8 HDLSS datasets in Sec. [4](https://arxiv.org/html/2606.05441#S4), including the top-model summary in Table [1](https://arxiv.org/html/2606.05441#S3.T1); Appendix [G](https://arxiv.org/html/2606.05441#A7) further provides the full 55-baseline comparison in Table [G.1](https://arxiv.org/html/2606.05441#A7.T1). To broaden the benchmark beyond the original biomedical-heavy HDLSS setting, we add 8 cross-domain datasets spanning text, face-image, camera-sensor, image-feature, and RNA-seq domains, with results summarized in Table [2](https://arxiv.org/html/2606.05441#S3.T2). Thus, the final evaluation covers both the targeted HDLSS regime and a broader 16-dataset cross-domain setting. For statistical validation, Appendix [I](https://arxiv.org/html/2606.05441#A9) provides an expanded significance analysis: Fig. [I.2](https://arxiv.org/html/2606.05441#A9.F2) and Table [I.2](https://arxiv.org/html/2606.05441#A9.T2) show that GOTabPFN achieves the best average rank on the 16-dataset benchmark; Table [I.3](https://arxiv.org/html/2606.05441#A9.T3) reports a significant omnibus Friedman test across the 9-method comparison; and Tables [I.4](https://arxiv.org/html/2606.05441#A9.T4) and [I.5](https://arxiv.org/html/2606.05441#A9.T5) show that GOTabPFN remains significantly better than each strong repeated baseline after Holm correction. We also contextualize our evaluation protocol relative to closely related HDLSS and high-dimensional tabular studies. Prior HDLSS-specific work often relies on small but carefully curated benchmark suites and reports rank-based summaries to compare methods across heterogeneous datasets.
For example, ProtoGate (Jiang et al., [2024](https://arxiv.org/html/2606.05441#bib.bib98)) evaluates on 7 biomedical HDLSS datasets against 16 baselines and reports average rank as a primary aggregate measure, while LSPIN/LLSPIN (Yang et al., [2022a](https://arxiv.org/html/2606.05441#bib.bib99)) evaluates on 6 real-world high-dimensional datasets, including 3 text and 3 biomedical datasets, and summarizes performance using median rank. Following this established practice, we report average ranks and statistical tests, but we further expand the evidence by evaluating GOTabPFN against 55 baselines on the original 8 HDLSS datasets and against the strongest repeated baselines on an expanded 16-dataset benchmark. Overall, the conclusions are not based only on the original 8-dataset rank summary, but are supported by detailed 55-baseline HDLSS comparisons, an additional 8-dataset cross-domain evaluation, and both omnibus and pairwise statistical tests on the expanded 16-dataset benchmark.
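
Two of the ingredients named above, average ranks and Holm correction, can be sketched as follows. This is a generic NumPy illustration, not the analysis code behind Appendix I: the helper names `average_ranks` and `holm_adjust` are hypothetical, the inputs are made-up toy values, and the pairwise p-values are assumed to come from whichever paired test is used in Appendix I.

```python
import numpy as np


def average_ranks(scores):
    """Mean rank per method; `scores` is (n_datasets, n_methods), higher is better.

    Rank 1 is the best method on a dataset; ties share the average rank.
    """
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty_like(scores)
    for i, row in enumerate(scores):
        order = np.argsort(-row, kind="stable")
        r = np.empty(len(row))
        r[order] = np.arange(1, len(row) + 1)
        for v in np.unique(row):  # average the ranks of tied entries
            tie = row == v
            r[tie] = r[tie].mean()
        ranks[i] = r
    return ranks.mean(axis=0)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(np.argsort(p)):
        # The k-th smallest p-value is multiplied by (m - k + 1); taking the
        # running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Toy inputs for illustration only.
toy_scores = [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
print(average_ranks(toy_scores))        # [1.  2.5 2.5]
print(holm_adjust([0.01, 0.04, 0.03]))  # [0.03 0.06 0.06]
```

A baseline is then declared significantly worse at level $\alpha$ when its Holm-adjusted p-value falls below $\alpha$.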

##### GOTabPFN in extreme dimensionality\.

GOTabPFN is evaluated across a broad high-dimensional range, with feature counts spanning from $m=2{,}000$ at the lower end of our HDLSS benchmark to $m=42{,}728$ on the Cell Cycle RNA-seq dataset with $n=1067$ samples. On this largest dataset, GOTabPFN remains fully operational and achieves $79.94 \pm 2.53$ accuracy, $92.36 \pm 1.36$ AUC, and $79.95 \pm 2.51$ macro-F1. These results show that the GO-LR+NSC representation interface scales beyond moderate HDLSS feature counts and remains effective in substantially larger transcriptomic feature spaces, where direct use of TabPFN-style predictors would otherwise be constrained by the extreme dimensionality.
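
As a quick worked instance of the feature-to-sample ratio $\rho = m/n$ used in Table T.1, applied to the Cell Cycle figures quoted above (the rounding to two decimals is ours):

$$
\rho_{\text{Cell Cycle}} = \frac{m}{n} = \frac{42{,}728}{1067} \approx 40.04
$$

That is, this dataset has roughly 40 features per sample.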

##### Theoretical grounding, novelty, and HDLSS relevance\.

The theoretical results in Sec. [3](https://arxiv.org/html/2606.05441#S3.F3) are not intended as isolated complexity-theoretic contributions; rather, they formalize why GO-LR is a principled ordering mechanism. Specifically, the MinLA formulation identifies the objective that GO-LR approximates, the NP-hardness result explains why exact scalable optimization is unrealistic, and the TSP-style path construction motivates an efficient surrogate initialization before local MinLA-based refinement. The overall contribution is therefore best understood as an integrated HDLSS-oriented framework: GO-LR provides a theoretically grounded feature ordering objective, NSC converts the ordered axis into stable meta-features through structured local compression, and the resulting representation interface enables a frozen TabPFN-style backbone to operate effectively beyond its native feature-count limits. This integration yields a new HDLSS-oriented tabular foundation model interface in which feature ordering, locality-preserving compression, and frozen TabPFN-style inference are jointly aligned to overcome the dimensionality bottleneck of existing tabular foundation models.

##### Fine\-tuning as future work\.

An important future direction is to pretrain or fine\-tune a TabPFN\-style backbone directly on structured representations produced by GO\-LR\+NSC\. In principle, this could be done by generating large collections of synthetic HDLSS\-style tasks, applying graph\-guided ordering and subunit compression, and then adapting the backbone to operate natively in the resulting meta\-feature space\. A further extension would be to initialize from the existing TabPFN checkpoint and pretrain or fine\-tune the full GO\-LR\+NSC\+TabPFN pipeline end\-to\-end, so that the entire model becomes a pretrained HDLSS\-oriented foundation model rather than a tuned front\-end attached to a frozen predictor\. Such a model could potentially reduce or eliminate dataset\-specific tuning at inference time, because the ordering, compression, and prediction components would be jointly adapted during pretraining\. However, this would shift the focus from representation\-side adaptation of an existing frozen backbone to the development of a new HDLSS\-specific tabular foundation model\. We therefore view backbone adaptation or full\-pipeline pretraining on GO\-LR\+NSC representations as a promising but distinct direction beyond the present scope\.

Table T.2: List of 55 baseline models and their source URLs.
