
Large Scale Transfer Learning for Tabular Data via Language Modeling

June 17, 2024
Authors: Josh Gardner, Juan C. Perdomo, Ludwig Schmidt
cs.AI

Abstract

Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 1.6B rows from 3.1M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme for tabular prediction. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g. XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more data. We release our model, code, and data along with the publication of this paper.
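To make the few-shot mechanism concrete, the sketch below shows one plausible way to serialize table rows into an LLM prompt and to discretize a continuous target for binned regression. The "column is value" format, the helper names (serialize_row, build_fewshot_prompt, bin_target), and the toy income table are illustrative assumptions, not the paper's actual serialization or its packing and attention scheme; the authors' released code defines those.

```python
import pandas as pd

def bin_target(values: pd.Series, n_bins: int = 4):
    """Discretize a continuous target into quantile bins so that
    regression can be posed as classification (binned regression)."""
    binned, edges = pd.qcut(values, q=n_bins, labels=False,
                            retbins=True, duplicates="drop")
    return binned, edges

def serialize_row(row: pd.Series, target_col: str) -> str:
    """Render one table row as 'column is value' pairs, target last."""
    features = " | ".join(f"{c} is {row[c]}" for c in row.index if c != target_col)
    return f"{features} | {target_col} is {row[target_col]}"

def build_fewshot_prompt(shots: pd.DataFrame, query: pd.Series,
                         target_col: str) -> str:
    """Concatenate k labeled example rows, then the query row with the
    target left blank for the model to complete."""
    examples = "\n".join(serialize_row(r, target_col) for _, r in shots.iterrows())
    query_features = " | ".join(
        f"{c} is {query[c]}" for c in query.index if c != target_col
    )
    return f"{examples}\n{query_features} | {target_col} is"

if __name__ == "__main__":
    df = pd.DataFrame({
        "age": [34, 51, 28, 45],
        "occupation": ["teacher", "engineer", "nurse", "lawyer"],
        "income": ["<=50K", ">50K", "<=50K", ">50K"],  # classification target
    })
    # 3-shot prompt: the fine-tuned LLM's next-token completion is read
    # off as the predicted class label for the fourth row.
    print(build_fewshot_prompt(df.iloc[:3], df.iloc[3], target_col="income"))

    # For a continuous target, discretize it first, then predict the bin:
    bins, edges = bin_target(pd.Series([12.0, 48.5, 7.3, 90.1]), n_bins=2)
    print(bins.tolist(), edges)
```

Note that nothing in this sketch is specific to one table's schema; because column names travel inside the prompt, the same model can be queried against unseen tables, which is what enables the zero- and few-shot transfer the abstract reports.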
