BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models
January 25, 2024
Authors: Senthil Purushwalkam, Akash Gokul, Shafiq Joty, Nikhil Naik
cs.AI
Abstract
Recent text-to-image generation models have demonstrated incredible success
in generating images that faithfully follow input prompts. However, the
requirement of using words to describe a desired concept provides limited
control over the appearance of the generated concepts. In this work, we address
this shortcoming by proposing an approach to enable personalization
capabilities in existing text-to-image diffusion models. We propose a novel
architecture (BootPIG) that allows a user to provide reference images of an
object in order to guide the appearance of a concept in the generated images.
The proposed BootPIG architecture makes minimal modifications to a pretrained
text-to-image diffusion model and utilizes a separate UNet model to steer the
generations toward the desired appearance. We introduce a training procedure
that allows us to bootstrap personalization capabilities in the BootPIG
architecture using data generated from pretrained text-to-image models, LLM
chat agents, and image segmentation models. In contrast to existing methods
that require several days of pretraining, the BootPIG architecture can be
trained in approximately 1 hour. Experiments on the DreamBooth dataset
demonstrate that BootPIG outperforms existing zero-shot methods while being
comparable with test-time finetuning approaches. Through a user study, we
validate the preference for BootPIG generations over existing methods both in
maintaining fidelity to the reference object's appearance and aligning with
textual prompts.
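The abstract describes a separate reference UNet whose features steer the base diffusion UNet toward the reference object's appearance. As a rough illustration of how such steering can work, the sketch below shows self-attention in which keys and values extracted from a reference branch are appended to the base model's own, so generated features can attend to the reference features. This is a minimal, hypothetical sketch of the general mechanism, not the paper's exact implementation; all names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reference_injected_attention(q, k, v, k_ref, v_ref):
    """Self-attention augmented with reference features (illustrative).

    q, k, v        -- (n, d) queries/keys/values from the base UNet layer
    k_ref, v_ref   -- (m, d) keys/values from the reference UNet layer

    Appending the reference keys/values lets each generated token
    attend to the reference object's features, pulling the output
    toward the reference appearance.
    """
    k_all = np.concatenate([k, k_ref], axis=0)  # (n + m, d)
    v_all = np.concatenate([v, v_ref], axis=0)  # (n + m, d)
    d = q.shape[-1]
    attn = softmax(q @ k_all.T / np.sqrt(d))    # (n, n + m)
    return attn @ v_all                          # (n, d)

# Toy shapes: 4 generated tokens, 6 reference tokens, feature dim 8.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
k_ref, v_ref = (rng.normal(size=(6, 8)) for _ in range(2))
out = reference_injected_attention(q, k, v, k_ref, v_ref)
```

Because only the attention inputs change, such an injection leaves the pretrained UNet weights and layer structure otherwise untouched, consistent with the abstract's claim of "minimal modifications" to the base model.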