@vintcessun: Numerical datasets don't even share column names — how can AI retrieve and align across tables? Existing embedding methods fail on heterogeneous tables, and LLMs are at a loss. This problem blocks cross-dataset RAG, algorithm selection, and simulation initialization — without common feature names, similarity matching is guesswork. The paper proposes: compute 20+ statistical descriptors (mean, quantiles, missing rate...) for each table.
Summary
This paper proposes a method for cross-table retrieval and alignment of heterogeneous numerical tabular datasets using statistical descriptors and sentence embeddings, enabling similarity matching and interpretable variable-level correspondence without shared column names.
Cached at: 05/31/26, 02:54 AM
Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets
Source: https://arxiv.org/html/2605.30289
M. Ross Kunz [1]
[1] Idaho National Laboratory
Abstract
Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing numeric datasets in a meaningful way across heterogeneous feature spaces. Existing approaches either target predictive modeling over individual datasets, which requires a shared set of variable definitions, or lack mechanisms for interpretable cross-dataset alignment. The proposed methodology characterizes numeric tabular datasets through structured exploratory data analysis descriptors, embeds those descriptors into a shared vector space using a pretrained sentence transformer, and quantifies cross-dataset similarity via Canonical Correlation Analysis (CCA). Furthermore, a penalized formulation of CCA is applied to recover sparse, interpretable variable-level correspondences between datasets, identifying which statistical descriptors or variable-level quantities drive cross-dataset alignment without requiring shared variable names or feature conventions. Differential privacy is optionally applied to the descriptor set prior to embedding, supporting deployment in sensitive data contexts without requiring access to raw observations at time of comparison. The methodology is evaluated across 15 datasets spanning general-purpose benchmarks, materials informatics, and nuclear-grade graphite characterization. Results demonstrate a total P@1 score of 0.9, with known nearest-neighbor retrieval and cluster structure remaining robust across embedding ablations and differential privacy budgets. The proposed framework provides a principled pathway for integrating heterogeneous numeric data into retrieval-augmented generation pipelines while preserving statistical context, with direct applications to data-driven algorithm selection and simulation model initialization for unknown datasets.
keywords:
tabular data embedding, dataset similarity, canonical correlation analysis, retrieval-augmented generation
1 Introduction
Numeric tabular datasets are the dominant data format in scientific practice, yet organizing, searching, and comparing such datasets across heterogeneous feature spaces remains a largely unsolved problem in data management and knowledge discovery. Scientific data repositories accumulate datasets that vary in sample size, variable composition, measurement scale, and domain context, making it difficult to determine which datasets are structurally related, which share distributional properties, or which are likely to respond similarly to a given modeling approach. Existing dataset search and profiling tools produce human-readable summaries or rely on schema matching over shared variable names, neither of which supports similarity comparison across datasets with disjoint feature spaces. This limitation is especially pronounced in natural science domains, where a single physical system may be characterized through multiple experimental modalities with incompatible variable definitions, and where the cost of redundant or mismatched modeling efforts is high. A principled representation of dataset-level statistical structure comparable across heterogeneous feature spaces, and interpretable in terms of the underlying data properties, would substantially advance the ability to organize, retrieve, and reason over large collections of scientific tabular data.
Large language model (LLM) embedding spaces offer a potential pathway toward such a representation. By encoding text into vectors of common dimension, pretrained sentence transformers enable similarity comparisons across documents from disparate domains without requiring shared vocabulary or structure. This capability motivates a statistical embedding strategy: rather than representing a dataset by its raw observations or variable names, one characterizes it through structured exploratory data analysis (EDA) descriptors, serializes those descriptors into natural language sentences, and embeds the resulting sentences into the LLM’s semantic space. The central challenge is ensuring that this embedding process preserves the statistical structure of the underlying data.
The proposed methodology computes individual statistical descriptors for each numeric column alongside multivariate summary quantities, which together serve as the basis for embedding tabular data into a shared vector space. These embeddings support hierarchical clustering and Canonical Correlation Analysis (CCA) across datasets, analogous to multi-view document embeddings [dhillon2011multi], even when data sources differ in dimension or variable composition. A penalized formulation of CCA is further employed to identify which variables or multivariate quantities drive cross-dataset correlation, yielding directly interpretable alignment structure. Differential privacy is optionally applied to the statistical descriptors prior to embedding, providing a principled mechanism for obfuscating sensitive summary information while preserving sufficient structure for meaningful cross-dataset comparison. Together, these components provide a means of interpretable tabular RAG that organizes heterogeneous numeric data while preserving statistical context.
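The descriptor-and-serialization step described above can be illustrated with a minimal sketch. The choice of descriptors, the function names `describe_column` and `serialize`, and the sentence template are assumptions made for illustration; the excerpt does not specify the paper's actual descriptor set or wording. The resulting sentences would then be passed to a pretrained sentence transformer, which is not shown here.

```python
import math
import statistics

def describe_column(values):
    """Summarize one numeric column; None or NaN entries count as missing."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(present, n=4)
    return {
        "mean": statistics.fmean(present),
        "std": statistics.stdev(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "missing_rate": 1 - len(present) / len(values),
    }

def serialize(desc):
    """Render a descriptor dict as one sentence (hypothetical template)."""
    return (
        f"A numeric column has mean {desc['mean']:.3g}, "
        f"standard deviation {desc['std']:.3g}, "
        f"quartiles {desc['q1']:.3g}, {desc['median']:.3g}, {desc['q3']:.3g}, "
        f"and missing rate {desc['missing_rate']:.1%}."
    )

if __name__ == "__main__":
    column = [1.0, 2.0, 3.0, 4.0, None]
    print(serialize(describe_column(column)))
```

The template deliberately omits the column name, reflecting the stated goal of comparing datasets without relying on shared variable names.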
The contributions of this work are as follows:
- A structured EDA pipeline that produces statistically grounded, sentence-serialized descriptors for numeric tabular datasets, with optional differential privacy for sensitive data contexts, formalized as a mapping from a raw data matrix to a collection of embedding vectors in a shared semantic space.
- A cross-dataset similarity framework based on mean canonical correlation between EDA embedding collections, supporting nearest-neighbor retrieval and hierarchical clustering across datasets with heterogeneous and disjoint feature spaces.
- A penalized CCA formulation yielding sparse canonical loadings that identify which statistical descriptors or variable-level quantities drive cross-dataset alignment, providing interpretability not available under standard CCA.
- Empirical validation across 15 datasets spanning general-purpose benchmarks, materials informatics, and nuclear-grade graphite characterization, demonstrating a total P@1 score of 0.9 with nearest-neighbor retrieval.
- A framework for retrieval-based algorithm and simulation model initialization, in which statistical similarity to cataloged datasets with known modeling outcomes supports data-driven selection of candidate methods for unknown datasets, connecting the proposed approach to meta-learning and AutoML pipelines.
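To make the retrieval metric in the contributions concrete, the following sketch computes precision at 1 (P@1) from a pairwise similarity matrix. It assumes each dataset carries a group label and that a retrieval counts as correct when the nearest other dataset shares that label; the excerpt does not give the paper's exact definition of "total P@1", so this is one plausible reading, and the labels and similarity values are made up for illustration.

```python
def precision_at_1(similarity, labels):
    """Fraction of queries whose most similar other dataset shares their label."""
    n = len(labels)
    hits = 0
    for i in range(n):
        # Exclude the query itself, then take the most similar remaining dataset.
        candidates = [j for j in range(n) if j != i]
        nearest = max(candidates, key=lambda j: similarity[i][j])
        hits += labels[nearest] == labels[i]
    return hits / n

if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]  # hypothetical dataset groups
    similarity = [
        [1.0, 0.9, 0.2, 0.1],
        [0.9, 1.0, 0.3, 0.2],
        [0.2, 0.3, 1.0, 0.8],
        [0.1, 0.9, 0.8, 1.0],  # dataset 3 retrieves dataset 1: a miss
    ]
    print(precision_at_1(similarity, labels))
```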
The remainder of this paper is organized as follows. Section 2 (https://arxiv.org/html/2605.30289#S2) reviews related work on tabular representation learning, dataset search, and differential privacy for data publishing. Section 3 (https://arxiv.org/html/2605.30289#S3) describes the EDA fingerprinting pipeline, embedding procedure, and penalized CCA similarity framework. Section 4 (https://arxiv.org/html/2605.30289#S4) details the datasets used for evaluation. Section 5 (https://arxiv.org/html/2605.30289#S5) presents nearest-neighbor retrieval results, hierarchical clustering analysis, and penalized CCA interpretability examples. Section 6 (https://arxiv.org/html/2605.30289#S6) summarizes the findings and discusses directions for future work.
2 Related Work
2.1 Tabular Representation and Foundation Models
Several lines of research have addressed the challenge of representing tabular data within learned model architectures. TabNet [arik2021tabnet] introduced sequential attention over tabular features to enable instance-wise feature selection during training, improving interpretability for prediction tasks. SAINT [somepalli2021saint] extended the transformer architecture to tabular data through inter-sample and inter-feature attention, demonstrating competitive performance on supervised learning benchmarks. TabPFNv2 [hollmann2025accurate] demonstrated that a transformer pretrained exclusively on synthetic tabular data achieves state-of-the-art predictive accuracy on small datasets via in-context learning. TabICL [qu2025tabicl] extended this paradigm to large datasets through a two-stage column-then-row attention mechanism. [van2024tabular] argue that such Large Tabular Models (LTMs) remain significantly underrepresented in the foundation model literature despite the dominance of tabular data across scientific domains. Each of these approaches targets predictive modeling performance on individual datasets and does not address the problem of representing or comparing datasets across heterogeneous feature spaces.
2.2 Dataset Search and Data Discovery
The problem of organizing and retrieving datasets from large repositories has received sustained attention in the data management community. Dataset search over data lakes, collections of heterogeneous raw data assets stored without imposed schema, has been formalized as a problem of identifying semantically or structurally related tables from among thousands of candidates [nargesian2019data]. A central challenge in this setting is that related datasets often share neither variable names nor schema structure, requiring similarity measures that operate over content rather than metadata alone. Schema matching approaches address this by identifying correspondences between column names or value distributions across pairs of tables [rahm2001survey], but these methods rely on lexical overlap or value-level comparison and do not generalize to datasets with entirely disjoint vocabularies. Dataset join and unionability discovery methods [zhu2019josie,nargesian2019data] extend schema matching to identify tables that can be joined or stacked, but again require overlapping values or column semantics and are not designed for cross-domain similarity assessment. The ARDA system [chepurko2020arda] and related augmentation frameworks treat dataset discovery as a feature engineering problem, identifying external tables that improve downstream model performance when joined to a query table, but this framing assumes a fixed target task rather than general-purpose similarity assessment. More recent work on dataset discovery for machine learning [zha2025data] has begun to treat datasets as first-class objects with learnable representations, but existing approaches remain largely confined to relational or structured tabular settings with compatible schemas. Existing profiling approaches produce human-readable summaries rather than embedding-compatible representations suitable for retrieval over heterogeneous collections.
2.3 Tabular Retrieval-Augmented Generation and Embedding
Retrieval-augmented generation (RAG) offers a scalable approach to incorporating external knowledge at inference time by retrieving relevant context and injecting it into the model’s input [lewis2020retrieval]. Applied to tabular data, RAG pipelines retrieve relevant rows, columns, or dataset chunks at query time, but standard embedding models are pretrained predominantly on text corpora and underperform on numeric or relational table content. The Tabular Embedding Model (TEM) [khanna2025tabular] addresses this by fine-tuning lightweight embedding models specifically for tabular RAG pipelines, achieving substantial retrieval gains over general-purpose embedders. However, TEM operates at the row level within a single dataset and does not address cross-dataset similarity. The vision-language model connector paradigm [zhang2024vision,wu2019unified] demonstrates that heterogeneous modalities can be aligned into a shared latent space through learned projection modules, but [li2025lost] show that such projections distort local embedding geometry substantially. This geometric distortion finding motivates the use of distributional statistics rather than raw numeric values as the embedding substrate, as statistical descriptors are more naturally expressible in the semantic space of a pretrained sentence transformer than are raw numeric observations. No existing tabular RAG framework produces dataset-level embeddings that support cross-dataset similarity comparison while preserving interpretable statistical structure.
3 Methodology
The proposed methodology is organized into two primary components. The first concerns the statistical characterization of numeric tabular datasets through a structured exploratory data analysis (EDA) framework. The second addresses the transformation of these descriptors into a common vector space suitable for cross-dataset comparison and retrieval. Pairwise dataset similarity is then quantified through Canonical Correlation Analysis (CCA), with a sparse, low-dimensional approximation used for interpretability.
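As a rough illustration of the similarity step, the sketch below computes standard (unpenalized) canonical correlations between two matrices and averages them. It uses the well-known QR-then-SVD route: the canonical correlations are the singular values of the product of orthonormal bases of the two centered matrices. The function names are hypothetical, the sketch assumes NumPy is available, that the two matrices have paired rows, and that each has full column rank; the sparse, penalized variant used in the paper is not reproduced here.

```python
import numpy as np

def canonical_correlations(A, B):
    """Canonical correlations between two matrices with paired rows."""
    # Center each column so correlations are not driven by offsets.
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    # Orthonormal bases for the two column spaces (reduced QR).
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def mean_canonical_correlation(A, B):
    """Average canonical correlation, used here as a similarity score."""
    return float(canonical_correlations(A, B).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 3))
    B = A @ np.array([[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 3.0]])
    C = rng.normal(size=(50, 4))  # unrelated, different column count
    print(mean_canonical_correlation(A, B))  # close to 1: same column space
    print(mean_canonical_correlation(A, C))  # noticeably lower
```

Because the score depends only on the two column spaces, the matrices may have different numbers of columns, which is what makes CCA usable across datasets of different dimension.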
3.1 Problem Formulation
Let $\mathcal{D}=\{X\in\mathbb{R}^{n\times p}\}$ denote a numeric tabular dataset with $n$ observations and $p$ variables. The goal is to construct a mapping $\phi:\mathcal{D}\rightarrow\mathbb{R}^{d}$ such that for a collection of datasets $\{\mathcal{D}_{1},\ldots,\mathcal{D}_{N}\}$ with potentially disjoint variable sets, the pairwise similarity $s(\mathcal{D}_{i},\mathcal{D}_{j})=f\!\left(\phi(\mathcal{D}_{i}),\phi(\mathcal{D}_{j})\right)$ reflects meaningful statistical relationships between datasets even when $p_{i}\neq p_{j}$ or the variable sets are entirely disjoint. The mapping $\phi$ must satisfy three properties. First, it must be computable without access to raw observations at comparison time, supporting privacy-preserving retrieval over sensitive data collections. Second, the resulting embedding must be compatible with a pretrained sentence transformer’s semantic space, enabling integration with downstream LLM-based retrieval pipelines. Third, the similarity measure $s$ must yield interpretable alignment structure that is recoverable via sparse canonical analysis, identifying which statistical descriptors or variable-level quantities drive cross-dataset correlation. Sections 3.2 (https://arxiv.org/html/2605.30289#S3.SS2) and 3.3 (https://arxiv.org/html/2605.30289#S3.SS3) describe the construction of $\phi$ and $s$, respectively.
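The first property, comparison without raw observations, pairs naturally with the optional differential privacy step mentioned earlier. The sketch below perturbs a column mean with the Laplace mechanism before it would be serialized and embedded. The excerpt does not state which mechanism the paper uses, so the Laplace mechanism, the clipping bounds, and the name `private_mean` are all assumptions. For values clipped to $[\text{lower}, \text{upper}]$, replacing one of $n$ records changes the mean by at most $(\text{upper}-\text{lower})/n$, which sets the noise scale.

```python
import random

def private_mean(values, lower, upper, epsilon, rng=random):
    """Release a column mean with Laplace noise (epsilon-DP for bounded data)."""
    # Clip so that a single record has bounded influence on the mean.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Sensitivity of the mean when one of n records is replaced.
    sensitivity = (upper - lower) / len(clipped)
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with rate 1/scale
    # is Laplace-distributed with that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_mean + noise

if __name__ == "__main__":
    column = [0.2, 0.4, 0.6, 0.8]
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_mean(column, 0.0, 1.0, eps, random.Random(0)))
```

Smaller values of `epsilon` give stronger privacy and noisier descriptors, which is the trade-off behind the privacy budgets referred to in the abstract.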
A common theme throughout the methodology is the use of penalized regression and matrix decomposition techniques, chosen primarily for interpretability. The unifying computational primitive is the singular value decomposition (SVD), which factorizes a data matrix $X\in\mathbb{R}^{n\times p}$ as:
$$X=U\Sigma V^{\top}, \tag{1}$$
where $U\in\mathbb{R}^{n\times r}$ contains the left singular vectors, $\Sigma\in\mathbb{R}^{r\times r}$ is a diagonal matrix of ordered singular values $\sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{r}\geq 0$, $V\in\mathbb{R}^{p\times r}$ contains the right singular vectors, and $r=\min(n,p)$ is the rank of $X$. Each application of the SVD in the proposed methodology serves a distinct interpretive purpose in different stages, e.g., imputation or cross-correlation between datasets. The general penalized regression problem that unifies these applications takes the form:
$$\min_{\beta}\;\mathcal{L}(X,\beta)+\sum_{j}p_{\lambda}(|\beta_{j}|), \tag{2}$$
where $\mathcal{L}(X,\beta)$ is a loss function measuring fit to the data, $\beta$ is the parameter vector of