ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation

Abstract

Personalized text-to-image generation with diffusion models has recently been proposed and has attracted considerable attention. Given a handful of images containing a novel concept (e.g., a unique toy), we aim to tune the generative model so that it captures the fine visual details of the concept and generates photorealistic images that follow a text condition. We present ViCo, a plug-in method for fast and lightweight personalized generation. Specifically, we propose an image attention module that conditions the diffusion process on patch-wise visual semantics. We also introduce an attention-based object mask that the attention module provides at almost no additional cost. In addition, we design a simple regularization based on the intrinsic properties of text-image attention maps to alleviate the degradation commonly caused by overfitting. Unlike many existing models, our method does not fine-tune any parameters of the original diffusion model, which allows more flexible and transferable model deployment. With only lightweight parameter training (~6% of the diffusion U-Net), our method achieves performance comparable to or better than all state-of-the-art models, both qualitatively and quantitatively.
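To make the idea of conditioning on patch-wise visual semantics more concrete, the following is a minimal, dependency-free sketch of generic scaled dot-product cross-attention between patches of a generated image (queries) and patches of a reference image (keys and values). It also shows one way a binary object mask could be read off the same attention weights without extra computation. This is an illustration only, not the ViCo implementation: the function names, the toy feature vectors, and the mean-threshold rule for the mask are all assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: features of generated-image patches (hypothetical).
    keys/values: features of reference-image patches (hypothetical).
    Returns (outputs, weights); weights[i][k] is how strongly
    generated patch i attends to reference patch k.
    """
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights.append(softmax(scores))
    dim_v = len(values[0])
    outputs = [[sum(w * v[j] for w, v in zip(row, values))
                for j in range(dim_v)]
               for row in weights]
    return outputs, weights

def attention_mask(weights):
    """Binary mask over reference patches from existing attention weights.

    Assumption for this sketch: a patch counts as foreground when the
    total attention it receives exceeds the mean over all patches.
    """
    n_keys = len(weights[0])
    received = [sum(row[k] for row in weights) for k in range(n_keys)]
    threshold = sum(received) / n_keys
    return [1 if r > threshold else 0 for r in received]

# Toy example: two generated patches, three reference patches.
queries = [[1.0, 0.0], [1.0, 0.0]]
keys = [[4.0, 0.0], [0.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = cross_attention(queries, keys, values)
mask = attention_mask(weights)
```

In this toy setup both query patches align with the first reference patch, so that patch receives most of the attention and is the only one marked as foreground. Because the mask reuses weights that the attention step already computed, it adds only a sum and a comparison per patch, which is the sense in which such a mask can come at almost no cost.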
