TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
May 23, 2025
Authors: Alan Arazi, Eilam Shapira, Roi Reichart
cs.AI
Abstract
While deep learning has achieved remarkable success across many domains, it
has historically underperformed on tabular learning tasks, which remain
dominated by gradient boosting decision trees (GBDTs). However, recent
advancements are paving the way for Tabular Foundation Models, which can
leverage real-world knowledge and generalize across diverse datasets,
particularly when the data contains free-text. Although incorporating language
model capabilities into tabular tasks has been explored, most existing methods
utilize static, target-agnostic textual representations, limiting their
effectiveness. We introduce TabSTAR: a Foundation Tabular Model with
Semantically Target-Aware Representations. TabSTAR is designed to enable
transfer learning on tabular data with textual features, with an architecture
free of dataset-specific parameters. It unfreezes a pretrained text encoder and
takes as input target tokens, which provide the model with the context needed
to learn task-specific embeddings. TabSTAR achieves state-of-the-art
performance for both medium- and large-sized datasets across known benchmarks
of classification tasks with text features, and its pretraining phase exhibits
scaling laws in the number of datasets, offering a pathway for further
performance improvements.