Heavy Labels Out! Dataset Distillation with Label Space Lightening

August 15, 2024
Authors: Ruonan Yu, Songhua Liu, Zigeng Chen, Jingwen Ye, Xinchao Wang
cs.AI

Abstract

Dataset distillation, or condensation, aims to condense a large-scale training dataset into a much smaller synthetic one such that the training performance of the distilled and original sets on neural networks is similar. Although the number of training samples can be reduced substantially, current state-of-the-art methods rely heavily on enormous soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to that of the original dataset, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO, which learns effective image-to-label projectors with which synthetic labels can be generated online, directly from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between the pre-trained and target distributions, so that the original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, we propose an effective image optimization method to further mitigate the potential error between the original and distilled label generators. Extensive experiments demonstrate that, with only about 0.003% of the original storage required for a complete set of soft labels, we achieve performance comparable to current state-of-the-art dataset distillation methods on large-scale datasets. Our code will be available.
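To make the label-lightening idea concrete, below is a minimal PyTorch sketch of an online soft-label projector: a frozen CLIP-style image encoder followed by a linear head carrying a LoRA-style low-rank residual, fitted against the original label generator's outputs with a KL objective. The stand-in encoder, the `LoRALinear` and `OnlineLabelProjector` classes, and all dimensions here are illustrative assumptions, not the paper's actual architecture; HeLlO's exact placement of the low-rank matrices and its training objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank residual: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

class OnlineLabelProjector(nn.Module):
    """Maps synthetic images to soft labels on the fly, so no label set needs to be stored."""
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int, rank: int = 4):
        super().__init__()
        self.encoder = encoder                   # frozen CLIP-style visual backbone (assumed)
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.head = LoRALinear(nn.Linear(feat_dim, num_classes), rank=rank)

    def forward(self, images: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(images)         # pre-trained features, no gradient needed
        return F.softmax(self.head(feats) / temperature, dim=-1)

# Distillation sketch: fit only the low-rank matrices so the projector matches the
# soft labels produced by the original (heavy) label generator.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))   # stand-in for CLIP
projector = OnlineLabelProjector(encoder, feat_dim=512, num_classes=10)
opt = torch.optim.Adam([p for p in projector.parameters() if p.requires_grad], lr=1e-3)

images = torch.randn(8, 3, 32, 32)                          # a batch of synthetic images
teacher_probs = torch.softmax(torch.randn(8, 10), dim=-1)   # stand-in for original soft labels
probs = projector(images)
loss = F.kl_div(probs.clamp_min(1e-8).log(), teacher_probs, reduction="batchmean")
loss.backward()
opt.step()
```

In this sketch, only the low-rank matrices (and the synthetic images themselves) would need to be stored after distillation; soft labels are regenerated on demand by the projector, which is how storing a heavy label set is avoided.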
