BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models

Abstract

Recent text-to-image generation models have demonstrated incredible success in generating images that faithfully follow input prompts. However, the requirement of using words to describe a desired concept provides limited control over the appearance of the generated concepts. In this work, we address this shortcoming by proposing an approach to enable personalization capabilities in existing text-to-image diffusion models. We propose a novel architecture (BootPIG) that allows a user to provide reference images of an object in order to guide the appearance of a concept in the generated images. The proposed BootPIG architecture makes minimal modifications to a pretrained text-to-image diffusion model and utilizes a separate UNet model to steer the generations toward the desired appearance. We introduce a training procedure that allows us to bootstrap personalization capabilities in the BootPIG architecture using data generated from pretrained text-to-image models, LLM chat agents, and image segmentation models. In contrast to existing methods that require several days of pretraining, the BootPIG architecture can be trained in approximately 1 hour. Experiments on the DreamBooth dataset demonstrate that BootPIG outperforms existing zero-shot methods while being comparable with test-time finetuning approaches. Through a user study, we validate the preference for BootPIG generations over existing methods both in maintaining fidelity to the reference object's appearance and aligning with textual prompts.
