
Scaling Up Diffusion and Flow-based XGBoost Models

August 28, 2024
Authors: Jesse C. Cresswell, Taewoo Kim
cs.AI

Abstract

Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data, which proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of the existing implementation from an engineering perspective, and show that these limitations are not fundamental to the method; with better implementation it can be scaled to datasets 370x larger than previously used. Our efficient implementation also unlocks scaling models to much larger sizes which we show directly leads to improved performance on benchmark tasks. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge. Code is available at https://github.com/layer6ai-labs/calo-forest.