
Generating Skyline Datasets for Data Science Models

February 16, 2025
Authors: Mengying Wang, Hanchao Ma, Yiyang Bian, Yangxin Fan, Yinghui Wu
cs.AI

Abstract

Preparing high-quality datasets required by various data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets toward a single predefined quality measure, which may lead to bias in downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined model performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to achieve the desired performance on all of the performance measures. We formulate MODis as a multi-goal finite state transducer and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy that starts with a universal schema and iteratively prunes unpromising data. Our second algorithm further reduces the cost with a bi-directional strategy that interleaves data augmentation and reduction. We also introduce a diversification algorithm to mitigate the bias in skyline datasets. We experimentally verify the efficiency and effectiveness of our skyline data discovery algorithms, and showcase their applications in optimizing data science pipelines.
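
To make the notion of a "skyline" concrete, the sketch below shows a generic Pareto-dominance filter over candidate datasets scored on several measures. It is not the MODis algorithm described in the paper. The candidate names, the two measures (accuracy and negated training cost), and the helper names `dominates` and `skyline` are all hypothetical and chosen only for illustration. It assumes that a higher value is better on every measure.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if score vector a is at least as good as b on every
    measure and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, score in candidates.items()
        if not any(
            dominates(other_score, score)
            for other_name, other_score in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical candidate datasets scored as (accuracy, negated training cost).
candidates = {
    "D1": (0.90, -10.0),  # dominated by D4: same accuracy, higher cost
    "D2": (0.85, -5.0),   # cheapest, so nothing dominates it
    "D3": (0.80, -12.0),  # dominated by D1, D2, and D4
    "D4": (0.90, -8.0),   # best accuracy at a lower cost than D1
}

result = skyline(candidates)
print(result)
```

In this toy example the skyline contains D2 and D4: neither is better than the other on both measures, so both represent valid trade-offs. This quadratic pairwise comparison only illustrates the definition; the abstract's reduce-from-universal and bi-directional strategies address how to search the space of candidate datasets, which this sketch does not attempt.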

