TabReD: A Benchmark of Tabular Machine Learning in-the-Wild
June 27, 2024
Authors: Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, Artem Babenko
cs.AI
Abstract
Benchmarks that closely reflect downstream application scenarios are
essential for the streamlined adoption of new research in tabular machine
learning (ML). In this work, we examine existing tabular benchmarks and find
two common characteristics of industry-grade tabular data that are
underrepresented in the datasets available to the academic community. First,
tabular data often changes over time in real-world deployment scenarios. This
impacts model performance and requires time-based train and test splits for
correct model evaluation. Yet, existing academic tabular datasets often lack
timestamp metadata to enable such evaluation. Second, a considerable portion of
datasets in production settings stems from extensive data acquisition and
feature engineering pipelines. For each specific dataset, this can have a
different impact on the absolute and relative number of predictive,
uninformative, and correlated features, which in turn can affect model
selection. To fill the aforementioned gaps in academic benchmarks, we introduce
TabReD -- a collection of eight industry-grade tabular datasets covering a wide
range of domains from finance to food delivery services. We assess a large
number of tabular ML models in the feature-rich, temporally-evolving data
setting facilitated by TabReD. We demonstrate that evaluation on time-based
data splits leads to a different ranking of methods than evaluation on the
random splits more common in academic benchmarks. Furthermore, on the TabReD
datasets, MLP-like architectures and GBDT show the best results, while more
sophisticated DL models are yet to prove their effectiveness.
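To make the evaluation-protocol distinction concrete, the sketch below contrasts a random split with a time-based split. It is a minimal illustration, not code from the paper: the DataFrame, column names, and split ratio are hypothetical assumptions, not the TabReD data or API.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for an industry-grade table: a timestamp column plus features
# and a target. All names and values here are hypothetical, not from TabReD.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=n, freq="h"),
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "target": rng.integers(0, 2, size=n),
})

# Random split (common in academic benchmarks): rows from every time period
# land in both train and test, so temporal shift is averaged away.
train_rand, test_rand = train_test_split(df, test_size=0.2, random_state=0)

# Time-based split (the protocol the paper advocates): train on the past,
# evaluate on the future, as a deployed model would.
df = df.sort_values("timestamp")
cut = int(len(df) * 0.8)
train_time, test_time = df.iloc[:cut], df.iloc[cut:]
```

Under temporal distribution shift, a model can score well on the random split yet degrade on the time-based one, which is why the two protocols can rank methods differently.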