Prior-Aligned Data Cleaning for Tabular Foundation Models

Abstract

Tabular Foundation Models (TFMs) achieve state-of-the-art zero-shot accuracy on small tabular datasets by meta-learning over synthetic data-generating processes, which makes them highly attractive for practitioners who cannot afford large annotated corpora. However, their in-context learning mechanism assumes approximately clean inputs: missing values, outliers, and duplicates in real-world data create a prior mismatch that degrades both accuracy and confidence calibration. Correcting this mismatch requires sequential decisions over cleaning operators whose interactions no static preprocessing rule can anticipate, a natural fit for reinforcement learning (RL). We introduce L2C2, the first deep RL framework that frames tabular data cleaning as prior alignment: a learned policy sequences operators to minimize the distributional gap between the dirty input and the TFM's synthetic prior. Six experiments on ten OpenML benchmark datasets establish that: 1) three of seven reward designs collapse to degenerate, trivial cleaning strategies, showing that principled reward design is non-trivial; 2) our proposed TFMAwareReward selects structurally distinct pipelines on 4/10 datasets and achieves higher TabPFN accuracy on those diverging cases (mean 0.851 vs. 0.843; Wilcoxon p=0.063, n=4) while never underperforming; 3) parameterized cleaning actions improve the best-found pipeline reward on 9/10 datasets (Wilcoxon p=0.004); and 4) a policy pre-trained on a single source dataset exceeds training from scratch at the 2,000-step fine-tuning checkpoint on all three held-out datasets (up to +28.8% after full fine-tuning), demonstrating cross-dataset transfer of prior-alignment knowledge. These findings establish that prior alignment is a principled data preparation strategy for TFM deployment on real-world tabular data.
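As a reading aid for the n=4 comparison: the reported p=0.063 matches the smallest p-value that an exact one-sided Wilcoxon signed-rank test can produce on four pairs, 1/16 = 0.0625, which occurs when all four differences have the same sign. The abstract does not say whether the test was exact or one-sided, so that is an assumption here, and the per-dataset gains in the sketch below are hypothetical placeholders rather than reported values. The helper `exact_signed_rank_pvalue` is likewise illustrative and is not part of L2C2.

```python
from itertools import product


def exact_signed_rank_pvalue(diffs):
    """Exact one-sided Wilcoxon signed-rank p-value (H1: differences > 0).

    Assumes no zero differences and no ties among the absolute differences.
    """
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank

    # Test statistic: sum of ranks attached to positive differences.
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under H0 every rank is positive or negative with probability 1/2,
    # so enumerate all 2^n sign assignments and count those at least as extreme.
    as_extreme = 0
    total = 0
    for signs in product((0, 1), repeat=len(diffs)):
        total += 1
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if w_plus >= observed:
            as_extreme += 1
    return as_extreme / total


# Hypothetical accuracy gains on four datasets (illustrative, not from the paper).
hypothetical_gains = [0.004, 0.011, 0.007, 0.010]
p_one_sided = exact_signed_rank_pvalue(hypothetical_gains)
print(p_one_sided)
```

Under these assumptions, four pairs cannot reach p < 0.05 even when every comparison favors the same method, so the n=4 result is best read alongside the "never underperforming" observation rather than as a standalone significance claim.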
