Large Scale Transfer Learning for Tabular Data via Language Modeling
June 17, 2024
作者: Josh Gardner, Juan C. Perdomo, Ludwig Schmidt
cs.AI
Abstract
Tabular data -- structured, heterogeneous, spreadsheet-style data with rows
and columns -- is widely used in practice across many domains. However, while
recent foundation models have reduced the need for developing task-specific
datasets and predictors in domains such as language modeling and computer
vision, this transfer learning paradigm has not had similar impact in the
tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B,
a language model for tabular prediction. We define a process for extracting a
large, high-quality training dataset from the TabLib corpus, proposing methods
for tabular data filtering and quality control. Using the resulting dataset,
which comprises over 1.6B rows from 3.1M unique tables, we fine-tune a Llama
3-8B large language model (LLM) for tabular data prediction (classification and
binned regression) using a novel packing and attention scheme for tabular
prediction. Through evaluation across a test suite of 329 datasets, we find
that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15
percentage points (pp) higher than random guessing, a feat that is not possible
with existing state-of-the-art tabular prediction models (e.g. XGBoost,
TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the
target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN
models that are explicitly trained on equal, or even up to 16x more data. We
release our model, code, and data along with the publication of this paper.
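The abstract describes the approach only at a high level. As a rough illustration of how an LLM can act as a zero- or few-shot tabular predictor, the sketch below serializes table rows into text, packs k labeled examples ("shots") ahead of an unlabeled query row, and discretizes a continuous target into quantile bins for binned regression. The serialization format and the helper names (serialize_row, build_prompt, bin_target) are illustrative assumptions, not the paper's actual scheme; in particular, TabuLa-8B's novel row-packing and attention mechanism is not reproduced here.

```python
# Minimal sketch of LLM-based few-shot tabular prediction.
# The serialization format and prompt layout are illustrative assumptions,
# not the paper's actual implementation.

import pandas as pd


def serialize_row(row: pd.Series, target: str) -> str:
    """Render one table row as 'column is value' text, omitting the target column."""
    parts = [f"{col} is {row[col]}" for col in row.index if col != target]
    return ". ".join(parts) + "."


def build_prompt(shots: pd.DataFrame, query: pd.Series, target: str) -> str:
    """Concatenate k labeled examples ('shots') followed by the unlabeled query row."""
    lines = [
        f"{serialize_row(r, target)} {target}: {r[target]}"
        for _, r in shots.iterrows()
    ]
    lines.append(f"{serialize_row(query, target)} {target}:")
    return "\n".join(lines)


def bin_target(values: pd.Series, n_bins: int = 4) -> pd.Series:
    """Binned regression: discretize a continuous target into quantile bins
    so the model predicts a bin label instead of a real number."""
    return pd.qcut(values, q=n_bins, labels=[f"bin_{i}" for i in range(n_bins)])


# Usage: a 2-shot prompt for a toy table. The model's next-token completion
# after 'defaulted:' would be read off as the prediction.
df = pd.DataFrame(
    {"age": [34, 51, 29],
     "income": [48000, 92000, 31000],
     "defaulted": ["no", "yes", "no"]}
)
prompt = build_prompt(df.iloc[:2], df.iloc[2], target="defaulted")
print(prompt)
```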