Large Scale Transfer Learning for Tabular Data via Language Modeling

Abstract

Tabular data (structured, heterogeneous, spreadsheet-style data with rows and columns) is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had a similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 1.6B rows from 3.1M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme for tabular prediction. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g., XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more, data. We release our model, code, and data along with the publication of this paper.
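
To make the term "binned regression" concrete, the following is a minimal sketch of one generic way to turn a continuous target into class labels by quantile binning, so that a classifier (or a language model choosing among discrete options) can predict it. The abstract does not specify the binning scheme used for TabuLa-8B, so the choice of quantile bins, the number of bins, the helper names `quantile_bin_edges` and `to_bin_label`, and the sample `prices` values are all illustrative assumptions rather than the authors' method.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bin_edges(values, n_bins=4):
    """Return the n_bins - 1 interior cut points that split values into quantile bins.

    Hypothetical helper: quantile binning is an assumption, not the paper's scheme.
    """
    return quantiles(values, n=n_bins, method="inclusive")


def to_bin_label(value, edges):
    """Map a continuous value to the index of its bin, from 0 to len(edges)."""
    return bisect_right(edges, value)


# Made-up continuous regression targets, used only for illustration.
prices = [10.0, 12.5, 15.0, 18.0, 22.0, 30.0, 45.0, 80.0]

edges = quantile_bin_edges(prices, n_bins=4)
labels = [to_bin_label(p, edges) for p in prices]

print(edges)   # three interior cut points
print(labels)  # one discrete class index per target value
```

Once the target is discretized this way, a regression task can be scored with the same accuracy metric as classification, which is how results such as "percentage points above random guessing" can be reported across both task types.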
