Grounding and Enhancing Informativeness and Utility in Dataset Distillation

January 29, 2026
Authors: Shaobo Wang, Yantai Yang, Guo Chen, Peiru Li, Kaixin Li, Yufa Zhou, Zhaorun Chen, Linfeng Zhang
cs.AI

Abstract

Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic strategies to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge-distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, which capture the crucial information within a sample and the essential samples in the training set, respectively. Building on these principles, we define optimal dataset distillation mathematically. We then present InfoUtil, a framework that balances informativeness and utility when synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization, which uses Shapley Value attribution to extract key information from each sample, and (2) principled utility maximization, which selects globally influential samples based on their gradient norm. Together, these components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1% performance improvement over the previous state-of-the-art approach on the ImageNet-1K dataset with ResNet-18.
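
The abstract names Shapley Value attribution as the informativeness mechanism but gives no details. The sketch below is a minimal, hypothetical illustration of patch-level Shapley attribution via Monte Carlo permutation sampling; the patch size, the value function (the teacher's logit for the true class with all other patches zeroed out), and the sampling budget are assumptions for illustration, not the paper's specification.

```python
# Minimal sketch: Monte Carlo Shapley attribution over image patches.
# Hypothetical setup -- the value function and masking scheme are assumed,
# not taken from the paper: a patch coalition's "value" is the teacher's
# logit for the ground-truth class with all other patches zeroed.
import torch

def shapley_patch_attribution(teacher, image, label, patch=56, n_perm=64):
    """Estimate per-patch Shapley values by sampling patch permutations."""
    C, H, W = image.shape
    ph, pw = H // patch, W // patch            # patch grid, e.g. 4x4 for 224/56
    n = ph * pw
    phi = torch.zeros(n)

    def value(visible):
        """Teacher logit for `label` when only patches in `visible` are kept."""
        masked = torch.zeros_like(image)
        for k in visible:
            r, c = divmod(k, pw)
            masked[:, r*patch:(r+1)*patch, c*patch:(c+1)*patch] = \
                image[:, r*patch:(r+1)*patch, c*patch:(c+1)*patch]
        with torch.no_grad():
            return teacher(masked.unsqueeze(0))[0, label].item()

    for _ in range(n_perm):
        order = torch.randperm(n).tolist()
        coalition, prev = [], value([])
        for k in order:                        # marginal contribution of patch k
            coalition.append(k)
            cur = value(coalition)
            phi[k] += cur - prev
            prev = cur
    return phi / n_perm                        # averaged marginal contributions
```

High-scoring patches would be the "key information" candidates to retain when composing a synthetic sample; the exact composition rule used by InfoUtil is not described in the abstract.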
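
For the second component, a comparably hedged sketch of gradient-norm scoring: each real sample is scored by the norm of its loss gradient, restricted here to the final linear layer as a cheap proxy. The layer scope (`model.fc`, assuming a torchvision-style ResNet), the cross-entropy loss, and the top-k-per-class selection rule are assumptions; the paper may compute the norm over all parameters or select differently.

```python
# Minimal sketch: score real samples by per-sample gradient norm and keep the
# highest-scoring ones as the "globally influential" candidates.
import torch
import torch.nn.functional as F

def gradient_norm_scores(model, loader):
    """Return one score per sample: L2 norm of the loss gradient
    w.r.t. the final linear layer's weight (a cheap proxy)."""
    device = next(model.parameters()).device
    last = model.fc.weight                     # assumes a torchvision-style ResNet
    model.eval()
    scores = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        for xi, yi in zip(x, y):               # per-sample gradients
            loss = F.cross_entropy(model(xi.unsqueeze(0)), yi.unsqueeze(0))
            g, = torch.autograd.grad(loss, last)
            scores.append(g.norm().item())
    return torch.tensor(scores)

# Usage idea: rank samples per class by score and keep the top-k as the
# utility-maximizing subset that seeds the distilled dataset.
```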