Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding

Abstract

As large-scale text-to-image generation models have made remarkable progress, many fine-tuning methods have been proposed. However, these models often struggle with novel objects, especially in one-shot scenarios. Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way, using only a single input image and the object-specific regions of interest. To improve generalizability and mitigate overfitting, our paradigm initializes a prototypical embedding based on the object's appearance and its class before fine-tuning the diffusion model. During fine-tuning, we propose a class-characterizing regularization to preserve prior knowledge of object classes. To further improve fidelity, we introduce an object-specific loss, which can also be used to implant multiple objects. Overall, our proposed object-driven method for implanting new objects integrates seamlessly with existing concepts while achieving high fidelity and generalization. Our method outperforms several existing works. The code will be released.
