TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
May 23, 2025
Authors: Alan Arazi, Eilam Shapira, Roi Reichart
cs.AI
Abstract
While deep learning has achieved remarkable success across many domains, it
has historically underperformed on tabular learning tasks, which remain
dominated by gradient boosting decision trees (GBDTs). However, recent
advancements are paving the way for Tabular Foundation Models, which can
leverage real-world knowledge and generalize across diverse datasets,
particularly when the data contains free-text. Although incorporating language
model capabilities into tabular tasks has been explored, most existing methods
utilize static, target-agnostic textual representations, limiting their
effectiveness. We introduce TabSTAR: a Foundation Tabular Model with
Semantically Target-Aware Representations. TabSTAR is designed to enable
transfer learning on tabular data with textual features, with an architecture
free of dataset-specific parameters. It unfreezes a pretrained text encoder and
takes as input target tokens, which provide the model with the context needed
to learn task-specific embeddings. TabSTAR achieves state-of-the-art
performance for both medium- and large-sized datasets across known benchmarks
of classification tasks with text features, and its pretraining phase exhibits
scaling laws in the number of datasets, offering a pathway for further
performance improvements.
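To make the core idea concrete, below is a minimal, illustrative sketch (not the authors' code) of target-aware tabular encoding: each feature is verbalized as text, the candidate target labels are appended as *target tokens*, and everything passes through a trainable text encoder. The toy `embed` function is a stand-in for the pretrained transformer encoder that TabSTAR keeps unfrozen; all names and dimensions here are illustrative assumptions.

```python
# Illustrative sketch of semantically target-aware encoding (toy stand-ins,
# not TabSTAR's actual architecture).
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {}                        # word -> row index in the embedding table
EMB = rng.normal(size=(0, 8))     # trainable table; "unfrozen" during training

def embed(text):
    """Toy text encoder: mean of word embeddings. Stands in for the
    pretrained transformer encoder that TabSTAR fine-tunes end to end."""
    global EMB
    vecs = []
    for w in text.lower().split():
        if w not in VOCAB:
            VOCAB[w] = len(VOCAB)
            EMB = np.vstack([EMB, rng.normal(size=(1, 8))])
        vecs.append(EMB[VOCAB[w]])
    return np.mean(vecs, axis=0)

def encode_row(features, target_labels):
    """Encode one tabular row: verbalize each feature as "name: value",
    then append target tokens (the candidate labels), giving the model
    the context needed to learn task-specific embeddings."""
    feature_tokens = [embed(f"{k}: {v}") for k, v in features.items()]
    target_tokens = [embed(f"target: {t}") for t in target_labels]
    return np.stack(feature_tokens + target_tokens)

row = {"job title": "senior engineer", "city": "Berlin"}
reps = encode_row(row, target_labels=["high income", "low income"])
print(reps.shape)  # (num_features + num_targets, embedding_dim) = (4, 8)
```

Because the target labels themselves are encoded as text, the same dataset-agnostic architecture can be applied to a new classification task simply by passing that task's label names, which is what enables transfer learning without dataset-specific parameters.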