DataDream: Few-shot Guided Dataset Generation
July 15, 2024
Authors: Jae Myung Kim, Jessica Bader, Stephan Alaniz, Cordelia Schmid, Zeynep Akata
cs.AI
Abstract
While text-to-image diffusion models have been shown to achieve
state-of-the-art results in image synthesis, they have yet to prove their
effectiveness in downstream applications. Previous work has proposed to
generate data for image classifier training given limited real data access.
However, these methods struggle to generate in-distribution images or depict
fine-grained features, thereby hindering the generalization of classification
models trained on synthetic datasets. We propose DataDream, a framework for
synthesizing classification datasets that more faithfully represent the real
data distribution when guided by few-shot examples of the target classes.
DataDream fine-tunes LoRA weights for the image generation model on the few
real images before generating the training data using the adapted model. We
then fine-tune LoRA weights for CLIP using the synthetic data to improve
downstream image classification over previous approaches on a large variety of
datasets. We demonstrate the efficacy of DataDream through extensive
experiments, surpassing state-of-the-art classification accuracy with few-shot
data across 7 out of 10 datasets, while being competitive on the other 3.
Additionally, we provide insights into the impact of various factors, such as
the number of real-shot and generated images as well as the fine-tuning compute
on model performance. The code is available at
https://github.com/ExplainableML/DataDream.
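
The abstract relies on LoRA fine-tuning twice: once for the image generator and once for CLIP. As a rough illustration of the general LoRA idea only (this is not the authors' code, and the class and helper names below are hypothetical), the following sketch keeps a pretrained weight matrix frozen and learns a small low-rank update that is added to its output. It assumes the standard LoRA formulation, in which the update is scaled by alpha / rank and one factor starts at zero so that the adapted layer initially matches the pretrained one.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=1, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (out_dim, in_dim)
        # Trainable factor A, shape (rank, in_dim), small random initialization.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        # Trainable factor B, shape (out_dim, rank), zero initialization.
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        """Compute x @ W^T + scale * (x @ A^T) @ B^T for a batch of row vectors x."""
        base = matmul(x, transpose(self.weight))
        update = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [
            [b + self.scale * u for b, u in zip(base_row, update_row)]
            for base_row, update_row in zip(base, update)
        ]

    def num_trainable(self):
        """Count the entries of A and B, the only parameters that would be trained."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy usage: a 2x3 frozen weight with a rank-1 adapter.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
layer = LoRALinear(W, rank=1)
x = [[1.0, 0.0, -1.0]]
output_before_training = layer.forward(x)  # equals x @ W^T because B is all zeros
```

Because B starts at zero, the adapted layer reproduces the pretrained output until training moves A and B. In a real model the adapter has far fewer parameters than the weight it modifies, which is what makes fine-tuning on only a few real images per class practical.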