Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Abstract

Text-to-image generation is a significant domain in modern computer vision and has improved substantially as generative architectures have evolved. Among these architectures, diffusion-based models have delivered notable gains in quality. They generally fall into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture that combines the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to CLIP image embeddings. Another distinctive feature of the proposed model is a modified MoVQ implementation, which serves as the image autoencoder component. Overall, the model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes: text-to-image generation, image fusion, text and image fusion, image variation generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluation shows an FID score of 8.03 on the COCO-30K dataset, making our model the top open-source performer in terms of measurable image generation quality.
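For readers unfamiliar with the reported metric: FID is the Fréchet distance between two Gaussians fitted to feature statistics of real and generated images, and lower values are better. The sketch below is not the evaluation code behind the 8.03 figure. It computes the distance under the simplifying assumption of diagonal covariances, where the closed form reduces to a per-dimension sum of squared mean and standard-deviation differences. The function name `frechet_distance_diag` and the toy feature vectors are hypothetical; a real FID computation uses full covariance matrices over Inception-v3 features.

```python
import math
from statistics import fmean, variance


def frechet_distance_diag(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Assumption: covariances are treated as diagonal, so the matrix square
    root in the general formula collapses to a product of per-dimension
    standard deviations. Real FID does not make this simplification.
    """
    dims = len(feats_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [row[d] for row in feats_a]
        col_b = [row[d] for row in feats_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        # Sample standard deviation of each feature dimension.
        sd_a = math.sqrt(variance(col_a, mu_a))
        sd_b = math.sqrt(variance(col_b, mu_b))
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total


if __name__ == "__main__":
    # Toy 2-D "features"; the second set is the first shifted by 1 per dimension.
    real = [[0.0, 1.0], [2.0, 3.0]]
    fake = [[1.0, 2.0], [3.0, 4.0]]
    print(frechet_distance_diag(real, fake))  # 2.0
```

Identical feature sets give a distance of zero, and the distance grows as the means or spreads of the two sets diverge, which is why a lower score indicates generated images whose statistics are closer to those of real images.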
