Hierarchical Dataset Selection for High-Quality Data Sharing

December 11, 2025
Authors: Xiaona Zhou, Yingyan Zeng, Ran Jin, Ismini Lourentzou
cs.AI

Abstract

The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing data across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training, are therefore critical decisions. Yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both the dataset level and the group level (e.g., collections, institutions), enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show that DaSH is robust to low-resource settings and to a lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.
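
The abstract does not say how DaSH models utility, so the following is only a toy sketch of the general idea of scoring both groups and datasets and sharing feedback between the two levels. It assumes a Beta-Bernoulli model with Thompson sampling and binary "did this dataset help?" feedback; the class `TwoLevelSelector`, its methods, and the group and dataset names are all hypothetical and are not taken from the paper.

```python
import random


class TwoLevelSelector:
    """Toy two-level selector: pick a group first, then a dataset inside it.

    Assumption: Beta-Bernoulli Thompson sampling at both levels. This is an
    illustration of hierarchical selection, not the DaSH algorithm itself.
    """

    def __init__(self, groups, seed=0):
        # groups: dict mapping a group name to a list of dataset names.
        self.groups = groups
        self.rng = random.Random(seed)
        # [successes + 1, failures + 1], i.e. a Beta(1, 1) prior everywhere.
        self.group_counts = {g: [1.0, 1.0] for g in groups}
        self.dataset_counts = {
            (g, d): [1.0, 1.0] for g, ds in groups.items() for d in ds
        }

    def _draw(self, counts):
        # Sample a plausible utility from the current Beta posterior.
        return self.rng.betavariate(counts[0], counts[1])

    def select(self):
        # Level 1: choose the group with the highest sampled utility.
        group = max(self.groups, key=lambda g: self._draw(self.group_counts[g]))
        # Level 2: choose a dataset within that group the same way.
        dataset = max(
            self.groups[group],
            key=lambda d: self._draw(self.dataset_counts[(group, d)]),
        )
        return group, dataset

    def update(self, group, dataset, helped):
        # One observation updates both levels, so evidence about a single
        # dataset also informs the estimate for its whole group.
        idx = 0 if helped else 1
        self.group_counts[group][idx] += 1.0
        self.dataset_counts[(group, dataset)][idx] += 1.0


if __name__ == "__main__":
    # Hypothetical pool: one mostly useful group and one mostly irrelevant one.
    pool = {"repo_a": ["a1", "a2"], "repo_b": ["b1", "b2"]}
    true_rate = {"repo_a": 0.9, "repo_b": 0.1}  # simulated, not real results
    env = random.Random(1)
    selector = TwoLevelSelector(pool, seed=0)
    for _ in range(200):
        g, d = selector.select()
        selector.update(g, d, env.random() < true_rate[g])
    print(selector.group_counts)
```

Because every observation also updates the group's counts, a few poor results from one repository lower the chance of sampling its other datasets. This is one simple way to read the abstract's point about generalizing from limited observations.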