TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

Hugging Face Daily Papers 05/06/26, 12:00 AM Papers

tabular-data embeddings benchmark contrastive-learning representation-learning machine-learning

Summary

This paper introduces TabEmbed, a generalist embedding model for tabular data that unifies classification and retrieval tasks, along with TabBench, a new benchmark for evaluating tabular understanding.

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
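The idea of treating classification as semantic matching can be illustrated with a small sketch. It is not the authors' implementation: the row template in `serialize_row`, the label descriptions, and the hashed bag-of-tokens `toy_embed` encoder are all hypothetical stand-ins, used only to show how a row and its candidate labels might be scored in one embedding space.

```python
import hashlib
import math


def serialize_row(row: dict) -> str:
    """Flatten a table row into text (hypothetical template, not from the paper)."""
    return "; ".join(f"{col} is {val}" for col, val in row.items())


def toy_embed(text: str, dim: int = 64) -> list:
    """Stand-in encoder: hashed bag-of-tokens, L2-normalised.

    A trained embedding model would replace this function.
    """
    vec = [0.0] * dim
    for token in text.lower().replace(";", " ").split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list, b: list) -> float:
    """Dot product; equals cosine similarity because inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))


def classify(row: dict, label_texts: dict, embed=toy_embed):
    """Score the row against each label description and return the best match."""
    query = embed(serialize_row(row))
    scores = {label: cosine(query, embed(text)) for label, text in label_texts.items()}
    return max(scores, key=scores.get), scores


# Illustrative data only.
row = {"age": 42, "income": "high"}
label_texts = {
    "approve": "the applicant has high income",
    "reject": "the applicant has low income",
}
best, scores = classify(row, label_texts)
```

Swapping `label_texts` for a pool of serialized rows turns the same scoring loop into retrieval, which is the sense in which one embedding space can serve both tasks.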

Original Article

Cached at: 05/08/26, 07:13 AM

Source: https://huggingface.co/papers/2605.04962

Abstract

A new generalist embedding model called TabEmbed is introduced that unifies tabular classification and retrieval tasks within a shared embedding space using large-scale contrastive learning with positive-aware hard negative mining.

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
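The abstract does not spell out the mining rule or the loss, so the following is only a minimal sketch of what contrastive training with positive-aware hard negative mining could look like. It assumes a percentage-of-positive threshold (candidates scoring at or above `margin * pos_score` are treated as likely false negatives and dropped) and an InfoNCE-style loss; `margin`, `k`, `temperature`, and the function names are illustrative choices, not values from the paper.

```python
import math


def mine_hard_negatives(pos_score, cand_scores, margin=0.95, k=2):
    """Return indices of the k highest-scoring candidates below margin * pos_score.

    Candidates at or above the threshold are skipped as possible false negatives.
    """
    threshold = margin * pos_score
    kept = [(i, s) for i, s in enumerate(cand_scores) if s < threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in kept[:k]]


def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive over all logits."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Illustrative similarity scores between one query and five candidate negatives.
pos_score = 0.80
cand_scores = [0.79, 0.70, 0.30, 0.75, 0.10]

hard_idx = mine_hard_negatives(pos_score, cand_scores)
loss = info_nce(pos_score, [cand_scores[i] for i in hard_idx])
```

In this toy case the candidate scoring 0.79 sits too close to the positive and is filtered out, while the next-highest candidates are kept as hard negatives. Harder negatives yield a larger loss than easy ones, which is what pushes an encoder toward fine-grained distinctions.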

Get this paper in your agent:

hf papers read 2605.04962

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

qiangminjie27/TabBench • Updated about 4 hours ago • 934

Similar Articles

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

MVEB: Massive Video Embedding Benchmark

JFinTEB: Japanese Financial Text Embedding Benchmark

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
