Beyond IID: How General Are Tabular Foundation Models Really?

Abstract

Foundation models for predictive machine learning on tabular data have recently gained significant traction in academia and industry. Research communities across disciplines are increasingly evaluating tabular foundation models on diverse datasets and tasks. However, these task- and discipline-specific evaluations remain largely inaccessible to model researchers because benchmark software and evaluation protocols are fragmented. As a result, model researchers rely on standard benchmarks, which are mostly defined for tasks where tabular foundation models already excel. Because these benchmarks exclude the most challenging scenarios, work concentrates on marginal improvements on IID data rather than on broader, more demanding challenges, which limits meaningful progress in the field. To overcome this, we introduce BeyondArena, the first unified, holistic benchmark for tabular data. It supports diverse task types (IID, temporal, grouped), spans a range of sample sizes and feature dimensionalities, and covers diverse feature types (text features, high-cardinality features) from a broad range of disciplines. To enable unified benchmarking beyond standard benchmarks, we introduce Data Foundry, a Python framework and metadata schema for curating tabular datasets for predictive machine learning. Our results across 11 models and 142 curated datasets show that existing tabular foundation models excel on tiny- to medium-sized IID data, while traditional tree-based and deep learning models still dominate on non-IID, large, and high-dimensional datasets. BeyondArena guides model research toward the most demanding challenges in tabular data, enabling progress towards truly foundational tabular models.
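
The three task types named in the abstract (IID, temporal, grouped) differ mainly in how held-out rows must be chosen. The following is a minimal sketch of that difference, assuming simple hold-out splits. The function names and the toy data are hypothetical illustrations, not the BeyondArena evaluation protocol or the Data Foundry API.

```python
import random


def iid_split(n_rows, test_fraction=0.25, seed=0):
    """IID: rows are exchangeable, so hold out a random fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_rows * test_fraction)))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def temporal_split(timestamps, test_fraction=0.25):
    """Temporal: train on the earliest rows, test on the most recent ones."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_test = max(1, int(round(len(order) * test_fraction)))
    return order[:-n_test], order[-n_test:]


def grouped_split(groups, test_fraction=0.25, seed=0):
    """Grouped: hold out whole groups so no group appears on both sides."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_test = max(1, int(round(len(unique_groups) * test_fraction)))
    test_groups = set(unique_groups[:n_test])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test


# Toy data (hypothetical): 8 rows with a timestamp and a group label each.
timestamps = [3, 1, 4, 1.5, 9, 2, 6, 5]
groups = ["a", "a", "b", "b", "c", "c", "d", "d"]

iid_train, iid_test = iid_split(len(timestamps))
time_train, time_test = temporal_split(timestamps)
group_train, group_test = grouped_split(groups)
```

A random split applied to temporal or grouped data would let information from the future, or from the same group, reach the training set. That is one reason results on IID data need not carry over to the non-IID settings the abstract describes.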