
DataDream: Few-shot Guided Dataset Generation

July 15, 2024
作者: Jae Myung Kim, Jessica Bader, Stephan Alaniz, Cordelia Schmid, Zeynep Akata
cs.AI

Abstract

While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications. Previous work has proposed to generate data for image classifier training given limited real data access. However, these methods struggle to generate in-distribution images or depict fine-grained features, thereby hindering the generalization of classification models trained on synthetic datasets. We propose DataDream, a framework for synthesizing classification datasets that more faithfully represents the real data distribution when guided by few-shot examples of the target classes. DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model. We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets. We demonstrate the efficacy of DataDream through extensive experiments, surpassing state-of-the-art classification accuracy with few-shot data across 7 out of 10 datasets, while being competitive on the other 3. Additionally, we provide insights into the impact of various factors, such as the number of real-shot and generated images as well as the fine-tuning compute on model performance. The code is available at https://github.com/ExplainableML/DataDream.
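The adaptation step described above relies on LoRA (low-rank adaptation), which is applied both to the image generation model and to CLIP: the pretrained weights stay frozen while small low-rank matrices are trained on the few-shot data. The following is a minimal NumPy sketch of that low-rank update; the class name, rank, and alpha values are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Sketch of a LoRA-adapted linear layer. The pretrained weight W is frozen;
# only the low-rank factors A (down-projection) and B (up-projection) train.
class LoRALinear:
    def __init__(self, w, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                         # frozen weight, shape (out, in)
        self.a = rng.normal(0.0, 0.02, (rank, w.shape[1])) # trainable down-projection
        self.b = np.zeros((w.shape[0], rank))              # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        # Effective weight is W + scale * B @ A; gradients flow only to A and B.
        return x @ (self.w + self.scale * self.b @ self.a).T

w = np.random.default_rng(1).normal(size=(8, 16))
layer = LoRALinear(w)
x = np.ones((2, 16))
# Because B starts at zero, the adapted layer initially matches the
# frozen pretrained layer exactly, so fine-tuning starts from the
# pretrained behavior:
assert np.allclose(layer(x), x @ w.T)
```

Zero-initializing the up-projection is the standard LoRA convention: it guarantees the adapted model is identical to the pretrained one before any few-shot gradient steps are taken.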
