Hierarchical Dataset Selection for High-Quality Data Sharing
December 11, 2025
Authors: Xiaona Zhou, Yingyan Zeng, Ran Jin, Ismini Lourentzou
cs.AI
Abstract
The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing data across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Deciding which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training, is therefore critical. Yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both the dataset level and the group level (e.g., collections, institutions), enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy while requiring significantly fewer exploration steps. Ablations show that DaSH is robust to low-resource settings and to a lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.
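To make the two-level idea concrete, the following is a minimal sketch of selecting datasets by tracking utility per group and per dataset. It is not the authors' implementation: the abstract does not specify the algorithm, so the use of Thompson sampling with Beta posteriors, the binary "helped" reward, and the names `select_datasets`, `groups`, `evaluate`, and `budget` are all assumptions made for illustration. Dataset names are assumed to be unique across groups.

```python
import random


def select_datasets(groups, evaluate, budget, seed=0):
    """Illustrative two-level Thompson sampling (an assumption, not DaSH itself).

    groups:   dict mapping group name -> list of dataset names (hypothetical layout)
    evaluate: callable(dataset_name) -> bool, True if the dataset helped
    budget:   number of evaluation steps allowed
    """
    rng = random.Random(seed)
    # Beta(1, 1) priors, stored as [successes + 1, failures + 1].
    group_post = {g: [1, 1] for g in groups}
    data_post = {d: [1, 1] for members in groups.values() for d in members}

    for _ in range(budget):
        # Sample a plausible utility for each group and pick the highest draw.
        g = max(groups, key=lambda k: rng.betavariate(*group_post[k]))
        # Repeat at the dataset level, restricted to the chosen group.
        d = max(groups[g], key=lambda k: rng.betavariate(*data_post[k]))
        # One observation updates both levels, so evidence about a dataset
        # also informs the estimate for its whole group.
        idx = 0 if evaluate(d) else 1
        group_post[g][idx] += 1
        data_post[d][idx] += 1

    def posterior_mean(p):
        return p[0] / (p[0] + p[1])

    # Rank datasets by posterior mean utility, best first.
    return sorted(data_post, key=lambda d: posterior_mean(data_post[d]), reverse=True)
```

A call might look like `select_datasets({"A": ["a1", "a2"], "B": ["b1", "b2"]}, my_eval, budget=200)`, where `my_eval` is a hypothetical function that trains or fine-tunes on a dataset and reports whether validation accuracy improved. Sharing each observation with the group-level posterior is what lets a sketch like this steer away from unhelpful sources after only a few evaluations.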