TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Hugging Face Daily Papers 06/08/26, 12:00 AM Papers

Summary

TRL-Bench is a unified framework and library for standardizing the evaluation of tabular representation learning models across 20 encoders, 16 tasks, and 87 datasets. It provides a common interface to compare heterogeneous tabular models and reveals that no single encoder is best for all tasks.

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench

Original Article


Cached at: 06/11/26, 01:39 PM


Source: https://huggingface.co/papers/2606.09323

📊 Releasing TRL-Bench, a unified framework + library and a one-stop shop for tabular representation learning.

🧩 20 encoders · 16 tasks · 87 datasets across 3 suites

🔍 Built to make heterogeneous tabular models directly comparable, and reusable as embedding models

Tabular encoders come in every shape: different input formats, training objectives, and output heads. So even two models built for the same job are hard to compare head-to-head. We built TRL-Bench to make them comparable.

It unifies everything at the level of the representation: each model is wrapped behind one shared interface that exports row-, column-, and table-embeddings, and shared lightweight heads probe those embeddings under common task definitions, so 20 encoders from every paradigm finally sit on the same axes.
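To make the idea concrete, here is a minimal sketch of what "one shared interface plus a shared lightweight head" could look like. It is not the TRL-Bench API: `EncoderWrapper`, `CharCountEncoder`, `NearestCentroidProbe`, and the dict-of-columns table format are all hypothetical names chosen for illustration, and the toy encoder simply counts characters so the example runs with the standard library only.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Table = Dict[str, List[str]]  # hypothetical format: column name -> cell values
Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class EncoderWrapper(ABC):
    """Hypothetical shared interface: every encoder exports the same three views."""

    @abstractmethod
    def embed_rows(self, table: Table) -> List[Vector]: ...

    @abstractmethod
    def embed_columns(self, table: Table) -> List[Vector]: ...

    def embed_table(self, table: Table) -> Vector:
        # Assumed default: mean-pool the column embeddings.
        return mean_vector(self.embed_columns(table))


class CharCountEncoder(EncoderWrapper):
    """Toy stand-in for a real model: bucketed character counts of serialized text."""

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def _embed_text(self, text: str) -> Vector:
        vec = [0.0] * self.dim
        for ch in text:
            vec[ord(ch) % self.dim] += 1.0
        return vec

    def embed_rows(self, table: Table) -> List[Vector]:
        columns = list(table)
        n_rows = len(table[columns[0]])
        return [
            self._embed_text(" ".join(f"{c}={table[c][i]}" for c in columns))
            for i in range(n_rows)
        ]

    def embed_columns(self, table: Table) -> List[Vector]:
        return [self._embed_text(name + " " + " ".join(vals)) for name, vals in table.items()]


class NearestCentroidProbe:
    """Lightweight head shared by all encoders: one centroid per label."""

    def fit(self, X: List[Vector], y: List[str]) -> "NearestCentroidProbe":
        groups: Dict[str, List[Vector]] = {}
        for x, label in zip(X, y):
            groups.setdefault(label, []).append(x)
        self.centroids = {label: mean_vector(vs) for label, vs in groups.items()}
        return self

    def predict(self, X: List[Vector]) -> List[str]:
        return [min(self.centroids, key=lambda lab: sq_dist(x, self.centroids[lab])) for x in X]


# Usage: the probe only ever sees exported embeddings, never the encoder internals.
table = {"city": ["Paris", "Lagos", "Lima"], "country": ["France", "Nigeria", "Peru"]}
encoder = CharCountEncoder(dim=16)
row_embeddings = encoder.embed_rows(table)
probe = NearestCentroidProbe().fit(row_embeddings, ["europe", "africa", "south_america"])
predictions = probe.predict(row_embeddings)
```

Because the head is fixed and deliberately simple, swapping `CharCountEncoder` for any other wrapper changes only the embeddings, which is the property that lets differences in downstream scores be attributed to the representation rather than to the task-specific pipeline.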

It’s also a library: 20 different types of tabular models are adapted into embedding models that export row, column, and table embeddings for the community to reuse. It spans three suites:

🧩 TRL-CTbench: 13 column/table tasks (schema, joinability, unionability, grounding)

🔗 TRL-Rbench: multi-target row prediction (50 subtasks, 123 targets) + record linkage (16 datasets)

🌊 TRL-DLTE: a 47,772-table data-lake enrichment pipeline spanning all three granularities

The main takeaway is clear: there is no single best tabular encoder; strengths are split across different table jobs. The choice of tabular model should be task-aware.

We also find that:

📌 Off-the-shelf text encoders are surprisingly strong when the signal is in the surface text (column names and cell values); cross-table alignment and matching instead reward structure-aware specialists

📌 Predicting a value inside a table and matching the same record across tables call for different encoders: one rewards adapting to a single table, the other rewards embeddings that stay comparable across tables

📌 Stacking the best per-stage encoders does not give the best compositional pipeline, and neither does reusing one encoder end-to-end; the winning recipe matches a different specialist to each step (find related tables → align columns → match rows)
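The last point can be illustrated with a small sketch that contrasts picking the top-ranked encoder for each stage independently with searching over whole pipelines. Everything here is assumed for illustration: the stage names, the encoders `enc_a` and `enc_b`, and all numbers are invented toy values, not results from the paper, and `toy_end_to_end` is a hypothetical scoring function with an explicit interaction term standing in for a real end-to-end evaluation.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

STAGES = ("table_retrieval", "column_alignment", "row_matching")
Pipeline = Tuple[str, ...]  # one encoder name per stage, in STAGES order


def best_per_stage(stage_scores: Dict[str, Dict[str, float]]) -> Pipeline:
    """Greedy choice: the encoder with the highest marginal score at each stage."""
    return tuple(max(stage_scores[s], key=stage_scores[s].get) for s in STAGES)


def best_pipeline(
    candidates: Dict[str, List[str]],
    end_to_end: Callable[[Pipeline], float],
) -> Pipeline:
    """Exhaustive choice: score every combination end to end and keep the best."""
    return max(product(*(candidates[s] for s in STAGES)), key=end_to_end)


# Invented per-stage scores (NOT from the paper).
STAGE_SCORES = {
    "table_retrieval": {"enc_a": 0.9, "enc_b": 0.8},
    "column_alignment": {"enc_a": 0.7, "enc_b": 0.6},
    "row_matching": {"enc_a": 0.5, "enc_b": 0.8},
}

# Invented interaction between the first two stages: how well the tables one
# encoder retrieves suit the encoder that aligns their columns.
INTERACTION = {("enc_a", "enc_a"): -0.3, ("enc_b", "enc_a"): 0.1}


def toy_end_to_end(pipeline: Pipeline) -> float:
    """Sum of per-stage scores plus a non-additive retrieval/alignment term."""
    additive = sum(STAGE_SCORES[s][enc] for s, enc in zip(STAGES, pipeline))
    return additive + INTERACTION.get((pipeline[0], pipeline[1]), 0.0)


candidates = {s: list(STAGE_SCORES[s]) for s in STAGES}
greedy = best_per_stage(STAGE_SCORES)
searched = best_pipeline(candidates, toy_end_to_end)
```

In this toy setup the greedy pick is `("enc_a", "enc_a", "enc_b")`, while the exhaustive search returns `("enc_b", "enc_a", "enc_b")`, which also beats both single-encoder pipelines. The exhaustive search costs one end-to-end evaluation per combination, so it is only practical when the number of candidate encoders per stage is small.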

TRL-Bench is meant to serve both as a diagnostic benchmark and as a practical library for building on tabular representations.

📄 Paper: https://arxiv.org/abs/2606.09323

🌐 Website: https://logo-cuhksz.github.io/trl-bench.github.io/

🤗 Datasets: https://huggingface.co/datasets/logo-lab/trl-ctbench · trl-rbench · trl-dlte

💻 Code: https://github.com/LOGO-CUHKSZ/TRL-Bench


Similar Articles

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management
