TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Hugging Face Daily Papers

Summary

TRL-Bench is a unified framework and library for standardizing the evaluation of tabular representation learning models across 20 encoders, 16 tasks, and 87 datasets. It provides a common interface to compare heterogeneous tabular models and reveals that no single encoder is best for all tasks.

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench

Cached at: 06/11/26, 01:39 PM

Source: https://huggingface.co/papers/2606.09323

📊 Releasing TRL-Bench: a unified framework and library, and a one-stop shop, for tabular representation learning.

🧩 20 encoders · 16 tasks · 87 datasets across 3 suites

Built to make heterogeneous tabular models directly comparable, and reusable as embedding models.

Tabular encoders come in every shape: different input formats, training objectives, and output heads. So even two models built for the same job are hard to compare head-to-head. We built TRL-Bench to make them comparable.

It unifies everything at the level of the representation: each model is wrapped behind one shared interface that exports row-, column-, and table-embeddings, and shared lightweight heads probe those embeddings under common task definitions, so 20 encoders from every paradigm finally sit on the same axes.
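As a rough illustration of this setup, the sketch below shows one way a shared embedding interface and a lightweight probing head could fit together. Every name in it (`RowEncoder`, `HashingEncoder`, `NearestCentroidHead`, `probe_accuracy`) is hypothetical and is not TRL-Bench's actual API. The hashing encoder is a toy stand-in for a wrapped model, and nearest-centroid classification is only a stand-in for whichever heads the benchmark uses.

```python
import zlib
from typing import Dict, Hashable, List, Protocol, Sequence

Vector = List[float]
Row = Dict[str, object]


class RowEncoder(Protocol):
    """Hypothetical shared interface: any wrapped model maps rows to vectors."""

    def embed_rows(self, rows: Sequence[Row]) -> List[Vector]: ...


class HashingEncoder:
    """Toy stand-in encoder (not a real model): hashes 'column=value' strings."""

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def embed_rows(self, rows: Sequence[Row]) -> List[Vector]:
        out = []
        for row in rows:
            vec = [0.0] * self.dim
            for col, val in row.items():
                # crc32 is stable across runs, unlike the built-in hash().
                bucket = zlib.crc32(f"{col}={val}".encode()) % self.dim
                vec[bucket] += 1.0
            out.append(vec)
        return out


class NearestCentroidHead:
    """Lightweight head: one mean vector per label, predict the closest one."""

    def fit(self, X: Sequence[Vector], y: Sequence[Hashable]) -> "NearestCentroidHead":
        sums: Dict[Hashable, Vector] = {}
        counts: Dict[Hashable, int] = {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}
        return self

    def predict(self, X: Sequence[Vector]) -> List[Hashable]:
        def sq_dist(a: Vector, b: Vector) -> float:
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [min(self.centroids, key=lambda lab: sq_dist(x, self.centroids[lab]))
                for x in X]


def probe_accuracy(encoder: RowEncoder, train_rows, train_y, test_rows, test_y) -> float:
    """Embed with the (unchanged) encoder, fit the shared head, report accuracy."""
    head = NearestCentroidHead().fit(encoder.embed_rows(train_rows), train_y)
    preds = head.predict(encoder.embed_rows(test_rows))
    return sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
```

Because the head and the task data are held fixed, swapping in a different object that satisfies `RowEncoder` changes only the embeddings, which is the sense in which encoders end up "on the same axes".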

It's also a library: 20 different types of tabular models are adapted into embedding models that export row, column, and table embeddings for the community to reuse. It spans three suites:

🧩 TRL-CTbench: 13 column/table tasks (schema, joinability, unionability, grounding)

🔗 TRL-Rbench: multi-target row prediction (50 subtasks, 123 targets) plus record linkage (16 datasets)

🌊 TRL-DLTE: a 47,772-table data-lake enrichment pipeline spanning all three granularities

The main takeaway is clear: there is no single best tabular encoder; strengths are split across different table jobs. The choice of tabular model should be task-aware.

We also find that:

📌 Off-the-shelf text encoders are surprisingly strong when the signal is in the surface text (column names and cell values); cross-table alignment and matching instead reward structure-aware specialists.

📌 Predicting a value inside a table and matching the same record across tables call for different encoders: one rewards adapting to a single table, the other rewards embeddings that stay comparable across tables.

📌 Stacking the best per-stage encoders does not give the best compositional pipeline, and neither does reusing one encoder end-to-end; the winning recipe matches a different specialist to each step (find related tables → align columns → match rows).
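The last finding can be made concrete with a small sketch that contrasts picking the top encoder for each stage independently with scoring whole pipelines end to end. The stage names follow the three steps listed above; the function names, the `stage_scores` layout, and the `evaluate` callback are assumptions made for illustration and do not reflect how TRL-Bench itself selects pipelines.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

# The three DLTE steps named in the text; identifiers are illustrative only.
STAGES = ("table_retrieval", "column_alignment", "row_matching")
Pipeline = Tuple[str, ...]  # one encoder name per stage, in STAGES order


def greedy_per_stage(stage_scores: Dict[str, Dict[str, float]]) -> Pipeline:
    """Pick the top-ranked encoder for each stage independently."""
    return tuple(max(stage_scores[s], key=stage_scores[s].get) for s in STAGES)


def best_end_to_end(
    encoders: Sequence[str],
    evaluate: Callable[[Pipeline], float],
) -> Pipeline:
    """Score every encoder-to-stage assignment end to end and keep the best.

    `evaluate` is a caller-supplied (hypothetical) function that runs the
    full pipeline and returns a single quality score.
    """
    return max(product(encoders, repeat=len(STAGES)), key=evaluate)
```

When stages interact, for example when a column aligner works poorly on the candidate tables returned by a particular retriever, the two functions can return different pipelines. That gap is what "non-additive compositional fit" refers to. Exhaustive search costs one full evaluation per assignment (the number of encoders raised to the number of stages), so it is practical only for small candidate sets.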

TRL-Bench is meant to serve both as a diagnostic benchmark and as a practical library for building on tabular representations.

📄 Paper: https://arxiv.org/abs/2606.09323

🌐 Website: https://logo-cuhksz.github.io/trl-bench.github.io/

🤗 Datasets: https://huggingface.co/datasets/logo-lab/trl-ctbench · trl-rbench · trl-dlte

💻 Code: https://github.com/LOGO-CUHKSZ/TRL-Bench
