TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Abstract

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation. Each encoder exports row-, column-, or table-level embeddings through its supported wrapper, and shared lightweight heads probe those embeddings across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, and atomic linkage performance correlates strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reusing a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than on per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench
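The idea of representation-level evaluation (encoders only export embeddings, and one shared lightweight head scores all of them) can be illustrated with a toy sketch. It is not the TRL-Bench implementation: the two encoders, the nearest-centroid head, and the tiny dataset are all hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import math
from typing import Callable, Dict, List, Sequence

Row = Sequence[float]
Encoder = Callable[[Row], List[float]]


def identity_encoder(row: Row) -> List[float]:
    """Hypothetical encoder: keeps every cell as one embedding dimension."""
    return [float(x) for x in row]


def sum_encoder(row: Row) -> List[float]:
    """Hypothetical encoder: collapses the whole row into a single number."""
    return [float(sum(row))]


def fit_centroids(embs: List[List[float]], labels: List[int]) -> Dict[int, List[float]]:
    """Shared head, fitting step: average the embeddings of each class."""
    grouped: Dict[int, List[List[float]]] = {}
    for emb, label in zip(embs, labels):
        grouped.setdefault(label, []).append(emb)
    return {
        label: [sum(dim) / len(members) for dim in zip(*members)]
        for label, members in grouped.items()
    }


def predict(centroids: Dict[int, List[float]], emb: List[float]) -> int:
    """Shared head, inference step: pick the class with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], emb))


def probe_accuracy(
    encoder: Encoder,
    train_rows: Sequence[Row],
    train_labels: List[int],
    test_rows: Sequence[Row],
    test_labels: List[int],
) -> float:
    """Score a frozen encoder with the same head and the same data split."""
    centroids = fit_centroids([encoder(r) for r in train_rows], train_labels)
    hits = sum(
        predict(centroids, encoder(r)) == y for r, y in zip(test_rows, test_labels)
    )
    return hits / len(test_labels)


# Toy data (assumption): the label is 1 when the first cell exceeds the second.
TRAIN_ROWS = [(2, 0), (3, 1), (0, 2), (1, 3)]
TRAIN_LABELS = [1, 1, 0, 0]
TEST_ROWS = [(4, 1), (1, 4)]
TEST_LABELS = [1, 0]

if __name__ == "__main__":
    for name, enc in [("identity", identity_encoder), ("sum", sum_encoder)]:
        acc = probe_accuracy(enc, TRAIN_ROWS, TRAIN_LABELS, TEST_ROWS, TEST_LABELS)
        print(f"{name}: accuracy={acc:.2f}")
```

Because the head and the split are held fixed, any difference in accuracy is attributable to the exported embeddings. Here the encoder that discards the per-cell signal scores lower, which is the kind of comparison a shared downstream protocol is meant to expose.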