TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Abstract

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row, column, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, and atomic linkage performance correlates strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reusing a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than on per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench
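
The representation-level setup described above (frozen encoders export embeddings, and one shared lightweight head scores all of them under identical conditions) can be illustrated with a minimal sketch. This is not the TRL-Bench implementation: the two toy encoders, the nearest-centroid head, the function names, and the tiny dataset are all hypothetical stand-ins chosen only to show the shape of the protocol.

```python
from typing import Callable, Dict, List, Sequence

Row = Sequence[float]
Encoder = Callable[[Row], List[float]]


# Hypothetical stand-in encoders; a real wrapper would call a frozen model.
def encoder_identity(row: Row) -> List[float]:
    """Exports the raw feature vector as the row embedding."""
    return [float(v) for v in row]


def encoder_sum_only(row: Row) -> List[float]:
    """Exports a 1-d embedding that keeps only the sum of the features."""
    return [float(sum(row))]


def fit_centroids(embs: List[List[float]], labels: List[int]) -> Dict[int, List[float]]:
    """Shared lightweight head (assumed here): one mean vector per class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for emb, y in zip(embs, labels):
        acc = sums.setdefault(y, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids: Dict[int, List[float]], emb: List[float]) -> int:
    """Assigns the class whose centroid is closest in squared distance."""
    def sq_dist(c: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(emb, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


def probe_accuracy(encoder: Encoder, train_rows, train_y, test_rows, test_y) -> float:
    """Fits the shared head on exported train embeddings, scores on test."""
    centroids = fit_centroids([encoder(r) for r in train_rows], train_y)
    hits = sum(predict(centroids, encoder(r)) == y for r, y in zip(test_rows, test_y))
    return hits / len(test_y)


# Toy rows (made up for illustration): label is 1 when the first feature
# exceeds the second, so the feature sum alone carries no label signal.
train_rows = [[3, 1], [4, 0], [5, 1], [1, 3], [0, 4], [1, 5]]
train_y = [1, 1, 1, 0, 0, 0]
test_rows = [[4, 1], [1, 4], [5, 0], [0, 5]]
test_y = [1, 0, 1, 0]

# Same split, same head, same metric for every encoder.
encoders: Dict[str, Encoder] = {
    "identity": encoder_identity,
    "sum_only": encoder_sum_only,
}
results = {
    name: probe_accuracy(enc, train_rows, train_y, test_rows, test_y)
    for name, enc in encoders.items()
}
```

Because the split, the head, and the metric are held fixed, any gap in `results` is attributable to what each exported embedding retains: on this toy data the identity embedding separates the classes, while the sum-only embedding cannot.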