TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

Abstract

While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes target tokens as input, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.
