TabReD: A Benchmark of Tabular Machine Learning in-the-Wild
June 27, 2024
Authors: Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, Artem Babenko
cs.AI
Abstract
Benchmarks that closely reflect downstream application scenarios are
essential for the streamlined adoption of new research in tabular machine
learning (ML). In this work, we examine existing tabular benchmarks and find
two common characteristics of industry-grade tabular data that are
underrepresented in the datasets available to the academic community. First,
tabular data often changes over time in real-world deployment scenarios. This
impacts model performance and requires time-based train and test splits for
correct model evaluation. Yet, existing academic tabular datasets often lack
timestamp metadata to enable such evaluation. Second, a considerable portion of
datasets in production settings stems from extensive data acquisition and
feature engineering pipelines. For each specific dataset, this can have a
different impact on the absolute and relative number of predictive,
uninformative, and correlated features, which in turn can affect model
selection. To fill the aforementioned gaps in academic benchmarks, we introduce
TabReD -- a collection of eight industry-grade tabular datasets covering a wide
range of domains from finance to food delivery services. We assess a large
number of tabular ML models in the feature-rich, temporally-evolving data
setting facilitated by TabReD. We demonstrate that evaluation on time-based
data splits leads to a different ranking of methods compared to evaluation on
the random splits more common in academic benchmarks. Furthermore, on the TabReD
datasets, MLP-like architectures and GBDT show the best results, while more
sophisticated DL models are yet to prove their effectiveness.
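To make the evaluation protocol concrete, below is a minimal sketch contrasting the time-based split advocated above with the random split common in academic benchmarks. The file name, column names ("timestamp", "target"), and cutoff date are hypothetical placeholders, not taken from the TabReD release.

```python
# Sketch: time-based train/test split (train on the past, test on the future)
# versus a random split that ignores time ordering.
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with a timestamp column and a prediction target.
df = pd.read_csv("some_tabular_dataset.csv", parse_dates=["timestamp"])

# Time-based split: sort by time and cut at a date, so the test set
# contains only rows observed after the training period.
df = df.sort_values("timestamp")
cutoff = pd.Timestamp("2023-01-01")
train_time = df[df["timestamp"] < cutoff]
test_time = df[df["timestamp"] >= cutoff]

# Random split: rows from all time periods are mixed into train and test,
# which can overstate performance when the data drifts over time.
train_rand, test_rand = train_test_split(df, test_size=0.2, random_state=0)

feature_cols = [c for c in df.columns if c not in ("timestamp", "target")]
print(len(train_time), len(test_time), len(train_rand), len(test_rand))
```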