Zero-shot spatial layout conditioning for text-to-image diffusion models

Abstract

Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling and offer an intuitive and powerful user interface for driving the image generation process. However, expressing spatial constraints in text, e.g. positioning specific objects at particular locations, is cumbersome, and current text-based image generation models cannot follow such instructions accurately. In this paper we consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content. We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models and requires no additional training. It leverages implicit segmentation maps that can be extracted from cross-attention layers and uses them to align the generation with the input masks. In our experiments, ZestGuide combines high image quality with accurate alignment of the generated content to the input segmentations, and improves over prior work both quantitatively and qualitatively, including methods that require training on images with corresponding segmentations. Compared to Paint with Words, the previous state of the art in image generation with zero-shot segmentation conditioning, we improve by 5 to 10 mIoU points on the COCO dataset with similar FID scores.
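For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of mean intersection-over-union between two label maps. It is not taken from the paper: the function name `mean_iou`, the toy label maps, and the choice to skip classes absent from both maps are assumptions made for illustration, and the abstract does not say how the predicted segmentation of a generated image is obtained.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in either label map.

    pred and target are 2D lists of integer class labels of equal shape.
    (Hypothetical helper for illustration; not the paper's evaluation code.)
    """
    flat_pred = [label for row in pred for label in row]
    flat_target = [label for row in target for label in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("pred and target must have the same number of pixels")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps, and pixels labelled c in at least one.
        inter = sum(1 for p, t in zip(flat_pred, flat_target) if p == c and t == c)
        union = sum(1 for p, t in zip(flat_pred, flat_target) if p == c or t == c)
        if union == 0:
            continue  # assumption: classes absent from both maps are skipped
        ious.append(inter / union)

    if not ious:
        raise ValueError("no class from range(num_classes) appears in either map")
    return sum(ious) / len(ious)


# Toy example: the input layout versus a segmentation of the generated image.
target = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
pred = [[0, 1, 1, 1],
        [0, 0, 1, 1]]
score = mean_iou(pred, target, num_classes=2)
```

A score reported in "mIoU points" is this value on a 0 to 100 scale, so a gain of 5 to 10 points corresponds to an increase of 0.05 to 0.10 in the value returned here.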
