TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders
Summary
TRL-Bench is a unified framework and library for standardizing the evaluation of tabular representation learning models across 20 encoders, 16 tasks, and 87 datasets. It provides a common interface to compare heterogeneous tabular models and reveals that no single encoder is best for all tasks.
Cached at: 06/11/26, 01:39 PM
Source: https://huggingface.co/papers/2606.09323
Releasing TRL-Bench: a unified framework + library for tabular representation learning, one stop for tabular representation learning.
20 encoders · 16 tasks · 87 datasets across 3 suites
Built to make heterogeneous tabular models directly comparable, and reusable as embedding models.
Tabular encoders come in every shape: different input formats, training objectives, and output heads. So even two models built for the same job are hard to compare head-to-head. We built TRL-Bench to make them comparable.
It unifies everything at the level of the representation: each model is wrapped behind one shared interface that exports row-, column-, and table-embeddings, and shared lightweight heads probe those embeddings under common task definitions, so 20 encoders from every paradigm finally sit on the same axes.
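To make the idea of one shared interface concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `TabularEncoder` protocol, the method names `embed_rows`, `embed_columns`, and `embed_table`, the toy character-hashing encoder, and the nearest-centroid probe are hypothetical and are not taken from the TRL-Bench codebase. The point is only that once every model exposes the same three embedding exports, a single lightweight head can be reused across all of them.

```python
from typing import Dict, List, Protocol, Sequence

Table = Dict[str, List[str]]  # toy table type: column name -> cell values
Vector = List[float]


class TabularEncoder(Protocol):
    """Hypothetical shared interface (illustrative names, not TRL-Bench's API)."""

    def embed_rows(self, table: Table) -> List[Vector]: ...
    def embed_columns(self, table: Table) -> List[Vector]: ...
    def embed_table(self, table: Table) -> Vector: ...


def _hash_text(text: str, dim: int = 8) -> Vector:
    """Toy bag-of-characters embedding, a stand-in for a real model."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def _mean(vectors: Sequence[Vector], dim: int = 8) -> Vector:
    """Element-wise mean; returns a zero vector for empty input."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


class ToyTextEncoder:
    """Treats the table as surface text; satisfies TabularEncoder structurally."""

    def embed_rows(self, table: Table) -> List[Vector]:
        n_rows = len(next(iter(table.values()), []))
        return [
            _hash_text(" ".join(col[i] for col in table.values()))
            for i in range(n_rows)
        ]

    def embed_columns(self, table: Table) -> List[Vector]:
        return [_hash_text(name + " " + " ".join(cells)) for name, cells in table.items()]

    def embed_table(self, table: Table) -> Vector:
        return _mean(self.embed_columns(table))


def nearest_centroid_probe(
    train_vecs: Sequence[Vector], train_labels: Sequence[str], query: Vector
) -> str:
    """A lightweight head: predict the label whose centroid is closest to query."""
    groups: Dict[str, List[Vector]] = {}
    for vec, label in zip(train_vecs, train_labels):
        groups.setdefault(label, []).append(vec)
    centroids = {label: _mean(vecs, len(query)) for label, vecs in groups.items()}

    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(centroids[label], query))
```

Because the probe only sees vectors, swapping `ToyTextEncoder` for any other class with the same three methods leaves the evaluation code unchanged, which is the property that puts different encoders on the same axes.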
It's also a library: 20 different types of tabular models are adapted into embedding models that export row, column, and table embeddings for the community to reuse. It spans three suites:
TRL-CTbench: 13 column/table tasks (schema, joinability, unionability, grounding)
TRL-Rbench: multi-target row prediction (50 subtasks, 123 targets) + record linkage (16 datasets)
TRL-DLTE: a 47,772-table data-lake enrichment pipeline spanning all three granularities
The main takeaway is clear: there is no single best tabular encoder; strengths are split across different table jobs. The choice of tabular model should be task-aware.
We also find that:
Off-the-shelf text encoders are surprisingly strong when the signal is in the surface text (column names and cell values); cross-table alignment and matching instead reward structure-aware specialists
Predicting a value inside a table and matching the same record across tables call for different encoders: one rewards adapting to a single table, the other rewards embeddings that stay comparable across tables
Stacking the best per-stage encoders does not give the best compositional pipeline, and neither does reusing one encoder end-to-end; the winning recipe matches a different specialist to each step (find related tables → align columns → match rows)
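The three-step pipeline in the last finding (find related tables, align columns, match rows) can be sketched with one pluggable encoder per stage. This is a hedged illustration only: the `enrich` function, its parameters `table_enc`, `column_enc`, and `row_enc`, and the toy `char_embed` encoder are hypothetical names, not the TRL-DLTE implementation, and cosine similarity over serialized text is an assumed matching rule. It shows the plumbing that lets each step use a different specialist; it does not reproduce which specialists work best.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Table = Dict[str, List[str]]          # toy table type: column name -> cell values
Embed = Callable[[str], Vector]       # an encoder reduced to "text in, vector out"


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def char_embed(text: str, dim: int = 16) -> Vector:
    """Toy bag-of-characters encoder, a stand-in for a real model."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def best_index(query: str, candidates: List[str], embed: Embed) -> int:
    """Index of the candidate most similar to query under the given encoder."""
    q = embed(query)
    return max(range(len(candidates)), key=lambda i: cosine(q, embed(candidates[i])))


def enrich(
    query: Table,
    lake: Dict[str, Table],
    table_enc: Embed,    # hypothetical: encoder used for table retrieval
    column_enc: Embed,   # hypothetical: encoder used for column alignment
    row_enc: Embed,      # hypothetical: encoder used for row matching
) -> Tuple[str, Dict[str, str], List[int]]:
    # Step 1: find the related table by comparing serialized headers.
    names = list(lake)
    headers = [" ".join(lake[n]) for n in names]
    related = names[best_index(" ".join(query), headers, table_enc)]
    target = lake[related]

    # Step 2: align each query column to its most similar target column.
    target_cols = list(target)
    alignment = {
        col: target_cols[best_index(col, target_cols, column_enc)] for col in query
    }

    # Step 3: match rows, serializing both sides over the aligned columns.
    cols = list(query)
    n_query = len(query[cols[0]])
    n_target = len(target[target_cols[0]])
    q_rows = [" ".join(query[c][i] for c in cols) for i in range(n_query)]
    t_rows = [" ".join(target[alignment[c]][j] for c in cols) for j in range(n_target)]
    row_matches = [best_index(r, t_rows, row_enc) for r in q_rows]
    return related, alignment, row_matches
```

A caller would pass three different functions for `table_enc`, `column_enc`, and `row_enc` to mix specialists, or the same function three times to mimic reusing one encoder end-to-end.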
TRL-Bench is meant to serve both as a diagnostic benchmark and as a practical library for building on tabular representations.
Paper: https://arxiv.org/abs/2606.09323
Website: https://logo-cuhksz.github.io/trl-bench.github.io/
Datasets: https://huggingface.co/datasets/logo-lab/trl-ctbench · trl-rbench · trl-dlte
Code: https://github.com/LOGO-CUHKSZ/TRL-Bench
