Prior-Aligned Data Cleaning for Tabular Foundation Models
April 28, 2026
Author: Laure Berti-Equille
cs.AI
Abstract
Tabular Foundation Models (TFMs) achieve state-of-the-art zero-shot accuracy on small tabular datasets by meta-learning over synthetic data-generating processes -- making them highly attractive for practitioners who cannot afford large annotated corpora. However, their in-context learning mechanism assumes approximately clean inputs: missing values, outliers, and duplicates in real-world data create a prior mismatch that simultaneously degrades accuracy and confidence calibration. Correcting this mismatch requires sequential decisions over cleaning operators whose interactions no static preprocessing rule can anticipate -- a natural fit for reinforcement learning (RL). We introduce L2C2, the first deep RL framework to frame tabular data cleaning as prior alignment: a learned policy sequences operators to minimize the distributional gap between the dirty input and the TFM's synthetic prior. Six experiments on ten OpenML benchmark datasets establish: 1) three of seven reward designs collapse to degenerate, trivial cleaning strategies -- principled reward engineering is scientifically non-trivial; 2) the novel TFMAwareReward we propose selects structurally distinct pipelines on 4/10 datasets and achieves higher TabPFN accuracy on those diverging cases (mean 0.851 vs. 0.843; Wilcoxon p=0.063, n=4) while never underperforming; 3) parameterized cleaning actions improve the best-found pipeline reward on 9/10 datasets (Wilcoxon p=0.004); and 4) a policy pre-trained on a single source dataset exceeds training from scratch at the 2,000-step fine-tuning checkpoint on all three held-out datasets (up to +28.8% after full fine-tuning), demonstrating cross-dataset transfer of prior-alignment knowledge. These findings establish prior alignment as a principled data preparation strategy for TFM deployment on real-world tabular data.
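The abstract frames cleaning as sequential operator selection under a prior-alignment reward. Below is a minimal, self-contained sketch of that loop; all names (make_dirty_table, the reward function, the greedy stand-in policy) are hypothetical illustrations, a logistic-regression classifier substitutes for TabPFN, and the paper's actual L2C2 policy, operator set, and TFMAwareReward internals are not specified here.

```python
# Hypothetical sketch: cleaning as sequential operator selection under a
# prior-alignment-style reward. Not the paper's implementation.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_dirty_table(n=400):
    """Synthetic table with the three error types named in the abstract."""
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(4)])
    df.loc[rng.choice(n, 40, replace=False), "f0"] = np.nan   # missing values
    df.loc[rng.choice(n, 10, replace=False), "f1"] *= 25.0    # outliers
    df = pd.concat([df, df.iloc[:30]], ignore_index=True)     # duplicates
    y = np.concatenate([y, y[:30]])
    return df, y

# Cleaning operators: each maps (df, y) -> (df, y).
def impute_mean(df, y):
    return df.fillna(df.mean()), y

def clip_outliers(df, y, z=3.0):  # parameterized action (cf. experiment 3)
    mu, sd = df.mean(), df.std().replace(0, 1)
    return df.clip(mu - z * sd, mu + z * sd, axis=1), y

def drop_duplicates(df, y):
    keep = ~df.duplicated()
    return df[keep].reset_index(drop=True), y[keep.to_numpy()]

OPERATORS = [impute_mean, clip_outliers, drop_duplicates]

def reward(df, y, prior_ref):
    """Proxy reward: downstream accuracy minus a crude distribution gap
    to a 'prior-like' reference table (the paper uses the TFM's synthetic
    prior and TabPFN accuracy/calibration instead)."""
    filled = df.fillna(df.mean())
    gap = float(np.abs(filled.mean() - prior_ref.mean()).sum()
                + np.abs(filled.std() - prior_ref.std()).sum())
    Xtr, Xte, ytr, yte = train_test_split(filled, y, random_state=0)
    acc = LogisticRegression(max_iter=500).fit(Xtr, ytr).score(Xte, yte)
    return acc - 0.1 * gap

df, y = make_dirty_table()
prior_ref = pd.DataFrame(rng.normal(size=(400, 4)), columns=df.columns)

# Greedy one-step-lookahead stand-in for the learned RL policy: pick the
# operator with the largest reward gain, for a fixed-length pipeline.
pipeline = []
for _ in range(3):
    best = max(OPERATORS,
               key=lambda op: reward(*op(df.copy(), y.copy()), prior_ref))
    df, y = best(df, y)
    pipeline.append(best.__name__)
print("pipeline:", pipeline, "| final reward:", round(reward(df, y, prior_ref), 3))
```

The sketch keeps the structure the abstract describes (a policy sequencing operators to close a distribution gap to a prior) while replacing every unspecified component with a deliberately simple proxy; L2C2's learned policy would replace the greedy loop.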