Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation
June 13, 2024
Authors: Yufan Zhou, Ruiyi Zhang, Kaizhi Zheng, Nanxuan Zhao, Jiuxiang Gu, Zichao Wang, Xin Eric Wang, Tong Sun
cs.AI
Abstract
In subject-driven text-to-image generation, recent works have achieved
superior performance by training the model on synthetic datasets containing
numerous image pairs. Trained on these datasets, generative models can produce
text-aligned images for a specific subject from an arbitrary test image in a
zero-shot manner. They even outperform methods that require additional
fine-tuning on test images. However, the cost of creating such datasets is
prohibitive for most researchers. To generate a single training pair, current
prohibitive for most researchers. To generate a single training pair, current
methods fine-tune a pre-trained text-to-image model on the subject image to
capture fine-grained details, then use the fine-tuned model to create images
for the same subject based on creative text prompts. Consequently, constructing
a large-scale dataset with millions of subjects can require hundreds of
thousands of GPU hours. To tackle this problem, we propose Toffee, an efficient
method to construct datasets for subject-driven editing and generation.
Specifically, our dataset construction does not need any subject-level
fine-tuning. After pre-training two generative models, we are able to generate
an unlimited number of high-quality samples. We construct the first large-scale
dataset for subject-driven image editing and generation, which contains 5
million image pairs, text prompts, and masks. Our dataset is 5 times the size
of the previous largest dataset, yet our cost is tens of thousands of GPU hours
lower. To test the proposed dataset, we also propose a model that is capable
of both subject-driven image editing and generation. Simply trained on our
proposed dataset, the model obtains competitive results, demonstrating the
effectiveness of the proposed dataset construction framework.
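The abstract's cost claim can be sanity-checked with back-of-envelope arithmetic. The sketch below uses the 5-million-subject scale stated above; the per-subject fine-tuning time is an assumed illustrative figure, not a number from the paper:

```python
# Rough cost estimate for the per-subject fine-tuning approach.
SUBJECTS = 5_000_000          # dataset scale reported in the abstract
GPU_MIN_PER_SUBJECT = 5       # assumed GPU-minutes to fine-tune one subject (illustrative)

gpu_hours = SUBJECTS * GPU_MIN_PER_SUBJECT / 60
print(f"{gpu_hours:,.0f} GPU hours")  # on the order of hundreds of thousands
```

Even under this modest per-subject assumption, the total lands in the hundreds of thousands of GPU hours, which is consistent with the motivation for a construction method that avoids subject-level fine-tuning entirely.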