TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

Hugging Face Daily Papers Papers

Summary

This paper introduces TabEmbed, a generalist embedding model for tabular data that unifies classification and retrieval tasks, along with TabBench, a new benchmark for evaluating tabular understanding.

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
Original Article
View Cached Full Text

Cached at: 05/08/26, 07:13 AM

Paper page - TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

Source: https://huggingface.co/papers/2605.04962

Abstract

A new generalist embedding model called TabEmbed is introduced that unifies tabular classification and retrieval tasks within a shared embedding space using large-scale contrastive learning with positive-aware hard negative mining.

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lackretrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce theTabular Embedding Benchmark(TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifiestabular classificationandretrievalwithin a sharedembedding space. By reformulating diverse tabular tasks assemantic matching problems, TabEmbed leverageslarge-scale contrastive learningwithpositive-aware hard negative miningto discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline foruniversal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.

View arXiv pageView PDFGitHub1Add to collection

Get this paper in your agent:

hf papers read 2605\.04962

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.04962 in a model README.md to link it from this page.

Datasets citing this paper1

#### qiangminjie27/TabBench Preview• Updatedabout 4 hours ago • 934

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.04962 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

Hugging Face Daily Papers

Introduces MulTaBench, a benchmark of 40 datasets for multimodal tabular learning with text and image modalities, demonstrating that task-specific embedding tuning improves performance over frozen pretrained embeddings, particularly when modalities provide complementary predictive signals.

MVEB: Massive Video Embedding Benchmark

Hugging Face Daily Papers

This paper introduces MVEB, a large-scale benchmark for evaluating video embeddings across 23 tasks, finding that no single model dominates and that audio's contribution depends on dataset annotation provenance. It integrates into the MTEB ecosystem for unified multimodal evaluation.

JFinTEB: Japanese Financial Text Embedding Benchmark

arXiv cs.CL

JFinTEB introduces the first comprehensive benchmark for evaluating Japanese financial text embeddings, addressing a gap in domain-specific and language-specific evaluation resources. The benchmark includes retrieval and classification tasks evaluated across Japanese-specific, multilingual, and commercial embedding models, with datasets and evaluation framework publicly released.