Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation
June 13, 2024
Authors: Yufan Zhou, Ruiyi Zhang, Kaizhi Zheng, Nanxuan Zhao, Jiuxiang Gu, Zichao Wang, Xin Eric Wang, Tong Sun
cs.AI
Abstract
In subject-driven text-to-image generation, recent works have achieved
superior performance by training the model on synthetic datasets containing
numerous image pairs. Trained on these datasets, generative models can produce
text-aligned images for a specific subject from an arbitrary test image in a
zero-shot manner; they even outperform methods that require additional
fine-tuning on test images. However, the cost of creating such datasets is
prohibitive for most researchers. To generate a single training pair, current
methods fine-tune a pre-trained text-to-image model on the subject image to
capture fine-grained details, then use the fine-tuned model to create images
for the same subject based on creative text prompts. Consequently, constructing
a large-scale dataset with millions of subjects can require hundreds of
thousands of GPU hours. To tackle this problem, we propose Toffee, an efficient
method to construct datasets for subject-driven editing and generation.
Specifically, our dataset construction does not need any subject-level
fine-tuning. After pre-training two generative models, we can generate an
unlimited number of high-quality samples. We construct the first large-scale
dataset for subject-driven image editing and generation, which contains 5
million image pairs, text prompts, and masks. Our dataset is 5 times larger
than the previous largest dataset, yet it costs tens of thousands of GPU hours
less to build. To test the proposed dataset, we also propose a model capable
of both subject-driven image editing and generation. By simply training the
model on our proposed dataset, it obtains competitive results, illustrating the
effectiveness of the proposed dataset construction framework.
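
The "hundreds of thousands of GPU hours" estimate for the per-subject fine-tuning approach can be sanity-checked with a back-of-envelope calculation. The figures below (fine-tuning minutes per subject, subject count) are illustrative assumptions, not numbers taken from the paper:

```python
# Back-of-envelope cost estimate for building a subject-driven dataset
# by fine-tuning a text-to-image model once per subject.
# The per-subject timing is an illustrative assumption.

def per_subject_pipeline_gpu_hours(num_subjects: int,
                                   minutes_per_subject: float) -> float:
    """Total GPU hours if every subject requires its own fine-tuning run."""
    return num_subjects * minutes_per_subject / 60.0

num_subjects = 5_000_000       # target scale: 5M training pairs
minutes_per_subject = 6.0      # assumed fine-tuning time per subject

cost = per_subject_pipeline_gpu_hours(num_subjects, minutes_per_subject)
print(f"Per-subject fine-tuning: {cost:,.0f} GPU hours")
# Under these assumptions the total is ~500,000 GPU hours, i.e. in the
# "hundreds of thousands" range the abstract cites. Toffee avoids this
# by paying a fixed pre-training cost for two generative models and
# amortizing it across all generated samples.
```

Even if the per-subject time were halved, the total would remain in the hundreds of thousands of GPU hours, which is why eliminating subject-level fine-tuning dominates the cost savings.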