Grounding and Enhancing Informativeness and Utility in Dataset Distillation

January 29, 2026
作者: Shaobo Wang, Yantai Yang, Guo Chen, Peiru Li, Kaixin Li, Yufa Zhou, Zhaorun Chen, Linfeng Zhang
cs.AI

Abstract
Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing the crucial information within a sample and the essential samples in the training set, respectively. Building on these principles, we define optimal dataset distillation mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1% performance improvement over the previous state-of-the-art approach on the ImageNet-1K dataset with ResNet-18.
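
The abstract names two algorithmic ingredients: Shapley Value attribution for per-region informativeness and Gradient Norm scoring for sample utility. The PyTorch sketch below is only an illustration of how such quantities are commonly estimated, not the paper's InfoUtil implementation; the function names (gradient_norm_scores, monte_carlo_shapley), the patch-based masking scheme, and the Monte-Carlo permutation sampling are assumptions made for exposition.

```python
import torch
import torch.nn.functional as F


def gradient_norm_scores(model, loader, device="cpu"):
    """Score each sample by the L2 norm of its loss gradient w.r.t. the model
    parameters; samples with larger gradient norms are treated as more influential.
    (Illustrative utility proxy, not the paper's exact criterion.)"""
    model.to(device).eval()
    scores = []
    for x, y in loader:
        for xi, yi in zip(x.to(device), y.to(device)):
            loss = F.cross_entropy(model(xi.unsqueeze(0)), yi.unsqueeze(0))
            grads = torch.autograd.grad(
                loss, [p for p in model.parameters() if p.requires_grad]
            )
            scores.append(torch.sqrt(sum(g.pow(2).sum() for g in grads)).item())
    return scores


def monte_carlo_shapley(model, image, label, patch=56, n_perm=32, device="cpu"):
    """Monte-Carlo estimate of per-patch Shapley values: the average marginal gain
    in the target-class probability when a patch is revealed in a random order.
    Assumes image height/width are divisible by `patch` and uses zeroing as the
    'absent patch' baseline (both are illustrative choices)."""
    model.to(device).eval()
    _, H, W = image.shape
    ph, pw = H // patch, W // patch
    n = ph * pw
    values = torch.zeros(n)

    def prob(mask):
        # Keep only the patches marked in `mask`, zero out everything else.
        m = mask.view(ph, pw).repeat_interleave(patch, 0).repeat_interleave(patch, 1)
        with torch.no_grad():
            logits = model((image * m.to(image.dtype)).unsqueeze(0).to(device))
        return F.softmax(logits, dim=1)[0, label].item()

    for _ in range(n_perm):
        order = torch.randperm(n)
        mask = torch.zeros(n)
        prev = prob(mask)
        for idx in order:
            mask[idx] = 1.0       # reveal this patch
            cur = prob(mask)
            values[idx] += cur - prev  # marginal contribution of the patch
            prev = cur
    return values / n_perm
```

In a pipeline of this kind, one would rank training samples by gradient_norm_scores to pick globally influential samples per class, and use the per-patch Shapley estimates to retain only the most informative regions when synthesizing the distilled set; how InfoUtil actually combines the two objectives is detailed in the paper itself.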