Heavy Labels Out! Dataset Distillation with Label Space Lightening
August 15, 2024
Authors: Ruonan Yu, Songhua Liu, Zigeng Chen, Jingwen Ye, Xinchao Wang
cs.AI
Abstract
Dataset distillation or condensation aims to condense a large-scale training
dataset into a much smaller synthetic one such that the training performance of
distilled and original sets on neural networks is similar. Although the number
of training samples can be reduced substantially, current state-of-the-art
methods heavily rely on enormous soft labels to achieve satisfactory
performance. As a result, the required storage can be comparable even to
original datasets, especially for large-scale ones. To solve this problem,
instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO that learns effective image-to-label projectors, with which synthetic labels can be generated online directly from synthetic images.
Specifically, to construct such projectors, we leverage prior knowledge in
open-source foundation models, e.g., CLIP, and introduce a LoRA-like
fine-tuning strategy to mitigate the gap between pre-trained and target
distributions, so that original models for soft-label generation can be
distilled into a group of low-rank matrices. Moreover, we propose an effective image optimization method to further reduce the potential error between the original and distilled label generators. Extensive experiments demonstrate
that with only about 0.003% of the original storage required for a complete set
of soft labels, we achieve comparable performance to current state-of-the-art
dataset distillation methods on large-scale datasets. Our code will be
available.
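To make the label-lightening idea concrete, below is a minimal PyTorch sketch of an online image-to-label projector: a frozen linear head adapted by LoRA-style low-rank factors, so that only the low-rank matrices need storing. All names here (LowRankLabelProjector, the toy encoder, the rank and dimensions) are illustrative assumptions, not the paper's implementation; in HeLlO the frozen components would come from a pre-trained CLIP image encoder and its class text embeddings.

```python
# Minimal sketch (not the authors' code) of a LoRA-style soft-label projector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankLabelProjector(nn.Module):
    """Maps image features to soft labels via a frozen linear head
    plus a trainable low-rank (LoRA-style) correction."""

    def __init__(self, feat_dim: int, num_classes: int, rank: int = 8):
        super().__init__()
        # Frozen base projection, e.g., initialized from CLIP text embeddings.
        self.base = nn.Linear(feat_dim, num_classes, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors: the only parameters that must be stored.
        self.lora_a = nn.Parameter(torch.randn(rank, feat_dim) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_classes, rank))

    def forward(self, feats: torch.Tensor, temperature: float = 1.0):
        # Logits = frozen head + low-rank correction; softmax gives soft labels.
        logits = self.base(feats) + feats @ self.lora_a.T @ self.lora_b.T
        return F.softmax(logits / temperature, dim=-1)

# Usage sketch: `encoder` is a toy stand-in for a frozen CLIP image encoder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512)).eval()
projector = LowRankLabelProjector(feat_dim=512, num_classes=100)
images = torch.randn(4, 3, 32, 32)  # synthetic (distilled) images
with torch.no_grad():
    soft_labels = projector(encoder(images))
print(soft_labels.shape)  # torch.Size([4, 100])
```

Because the soft labels are regenerated online from the synthetic images, only the images and the two low-rank factors are stored, rather than a full label tensor per sample.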
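The abstract also mentions optimizing the synthetic images themselves to shrink the remaining error between the original and distilled label generators. The sketch below assumes, purely for illustration, a KL-divergence objective between the two generators' soft-label outputs; the paper's actual objective and schedule may differ, and `teacher` is a hypothetical stand-in for the original label generator.

```python
# Hedged sketch of the image-alignment step, under an assumed KL objective.
import torch
import torch.nn.functional as F

def refine_images(images, encoder, projector, teacher, steps=100, lr=0.1):
    """Gradient descent on the synthetic images, shrinking the KL gap between
    the original label generator (`teacher`) and the distilled projector."""
    images = images.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([images], lr=lr)
    for _ in range(steps):
        # Soft labels from the lightweight distilled generator.
        student = projector(encoder(images))
        # Soft labels from the original (full) generator; no gradient needed.
        with torch.no_grad():
            target = teacher(images)
        # kl_div expects log-probabilities as input and probabilities as target.
        loss = F.kl_div(student.clamp_min(1e-8).log(), target,
                        reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return images.detach()
```

In this reading, `refine_images` would run after fitting the low-rank factors, so the stored synthetic images compensate for any residual error of the distilled projector.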