TRL-Bench: Standardizing Representation-Level Evaluation of Tabular Encoders Across Paradigms

Abstract

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes representation-level evaluation across paradigms. Each encoder exports row-, column-, or table-level embeddings through its supported wrapper, and shared lightweight heads probe those embeddings across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, and atomic linkage performance correlates strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than on per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench
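To make the representation-level protocol concrete, the following is a minimal sketch of the general idea: frozen encoders export embeddings through a common wrapper signature, and one shared lightweight head is fit on top of each. Everything here is an assumption for illustration. The names `Encoder`, `identity_encoder`, `sum_encoder`, and `probe_accuracy` are hypothetical and are not taken from the TRL-Bench code, the two encoders are toys, and a nearest-centroid classifier stands in for whatever probing heads the benchmark actually uses.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Row = Sequence[float]
# Hypothetical wrapper signature: one table row in, one frozen embedding out.
Encoder = Callable[[Row], Vector]


def identity_encoder(row: Row) -> Vector:
    """Toy encoder A (hypothetical): passes numeric cells through unchanged."""
    return [float(x) for x in row]


def sum_encoder(row: Row) -> Vector:
    """Toy encoder B (hypothetical): collapses the row to a single feature."""
    return [float(sum(row))]


def fit_centroids(embs: List[Vector], labels: List[str]) -> Dict[str, Vector]:
    """Shared head, training step: one mean embedding per class."""
    groups: Dict[str, List[Vector]] = defaultdict(list)
    for emb, label in zip(embs, labels):
        groups[label].append(emb)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def predict(centroids: Dict[str, Vector], emb: Vector) -> str:
    """Shared head, inference step: label of the nearest class centroid."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], emb))


def probe_accuracy(
    encoder: Encoder,
    train: List[Tuple[Row, str]],
    test: List[Tuple[Row, str]],
) -> float:
    """Fit the same head on one encoder's embeddings and score it on test rows."""
    centroids = fit_centroids(
        [encoder(row) for row, _ in train], [label for _, label in train]
    )
    hits = sum(predict(centroids, encoder(row)) == label for row, label in test)
    return hits / len(test)


# Toy data: the label depends on which of the two cells is larger.
train = [((1, 0), "a"), ((2, 0), "a"), ((3, 1), "a"),
         ((0, 1), "b"), ((0, 2), "b"), ((1, 3), "b")]
test = [((2, 1), "a"), ((1, 2), "b")]

# Identical data and identical head for every encoder; only the embedding varies.
scores = {
    enc.__name__: probe_accuracy(enc, train, test)
    for enc in (identity_encoder, sum_encoder)
}
print(scores)
```

Because the data splits and the head are held fixed, any gap between the two scores can be attributed to the exported embeddings alone. In this toy setup the summed embedding discards the information that separates the classes, so it scores lower than the pass-through embedding.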