Kandinsky 3.0 Technical Report

Abstract

We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion. It continues the Kandinsky series of text-to-image models and reflects our progress toward higher quality and realism in image generation. Compared to the previous Kandinsky 2.x versions, Kandinsky 3.0 uses a U-Net backbone that is two times larger and a text encoder that is ten times larger, and it removes diffusion mapping. We describe the model architecture, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as identified through a large number of experiments, had the most significant impact on the quality of our model compared to the others. In our side-by-side comparisons, Kandinsky shows better text understanding and performs better in specific domains. Project page: https://ai-forever.github.io/Kandinsky-3
