Scaling Up Diffusion and Flow-based XGBoost Models

Abstract

Novel machine learning methods for tabular data generation are often developed on small datasets that do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data, which proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of the existing implementation from an engineering perspective, and show that these limitations are not fundamental to the method; with a better implementation it can be scaled to datasets 370x larger than previously used. Our efficient implementation also unlocks scaling models to much larger sizes, which we show leads directly to improved performance on benchmark tasks. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees, which are well suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge. Code is available at https://github.com/layer6ai-labs/calo-forest.
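As a rough illustration of what "function approximator in a flow-matching model" means for tabular data, the sketch below builds the regression pairs that a tree ensemble would be fit on at a single time level. It is not taken from the released code: the function name `make_flow_matching_pairs`, the parameter `n_noise`, and the linear interpolation path are assumptions chosen for illustration.

```python
import random


def make_flow_matching_pairs(data, t, n_noise=2, seed=0):
    """Build (input, target) regression pairs for one time level t in [0, 1].

    Assumption (not from the source): a linear path
    x_t = (1 - t) * noise + t * x, whose velocity target is x - noise.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for row in data:
        # Each data row is paired with several independent noise draws,
        # so the training set has len(data) * n_noise rows.
        for _ in range(n_noise):
            noise = [rng.gauss(0.0, 1.0) for _ in row]
            x_t = [(1.0 - t) * z + t * x for z, x in zip(noise, row)]
            velocity = [x - z for z, x in zip(noise, row)]
            inputs.append(x_t)
            targets.append(velocity)
    return inputs, targets


# Toy table with two rows and three features (made-up values).
data = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
inputs, targets = make_flow_matching_pairs(data, t=0.25)
```

A regressor would then be trained to map `inputs` to `targets` at each time level. Because every target is a vector with one entry per feature, a single multi-output tree model can predict all entries at once instead of one model per feature; in XGBoost 2.0 and later this is, to our understanding, selected with `multi_strategy="multi_output_tree"` together with `tree_method="hist"`.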
