Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

Abstract

Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task because it allows control over both the content and the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches such as GANs (Generative Adversarial Networks) are more efficient, as they need only a single feed-forward pass for generation, but their image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that enables the generation of highly realistic images by exploiting feature backbone networks pre-trained for tasks such as image classification. We also introduce a new generator architecture that improves context modeling and uses cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.
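To make the idea of injecting noise through cross-attention more concrete, the following is a minimal, dependency-free sketch, not the DP-SIMS implementation. It assumes that latent feature vectors act as queries and a set of noise tokens acts as keys and values, with the attended noise added back through a residual connection. The function name `inject_noise`, the random projection matrices, and all dimensions are hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random projection standing in for a learned weight matrix (assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def inject_noise(latents, noise_tokens, w_q, w_k, w_v):
    """Cross-attention with latents as queries and noise tokens as keys/values.

    Returns the residually updated latents and the attention weights.
    """
    keys = [matvec(w_k, z) for z in noise_tokens]
    values = [matvec(w_v, z) for z in noise_tokens]
    attn_dim = len(keys[0])
    out_dim = len(values[0])
    updated, all_weights = [], []
    for h in latents:
        q = matvec(w_q, h)
        # Scaled dot-product score between this latent and every noise token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(attn_dim) for k in keys]
        weights = softmax(scores)
        # Attention-weighted mixture of the noise values.
        mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(out_dim)]
        # Residual connection: the latent is perturbed, not replaced.
        updated.append([x + m for x, m in zip(h, mixed)])
        all_weights.append(weights)
    return updated, all_weights


if __name__ == "__main__":
    rng = random.Random(0)
    latent_dim, noise_dim, attn_dim = 4, 3, 2  # illustrative sizes only
    latents = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(5)]
    noise_tokens = [[rng.gauss(0, 1) for _ in range(noise_dim)] for _ in range(6)]
    w_q = random_matrix(attn_dim, latent_dim, rng)
    w_k = random_matrix(attn_dim, noise_dim, rng)
    w_v = random_matrix(latent_dim, noise_dim, rng)
    updated, weights = inject_noise(latents, noise_tokens, w_q, w_k, w_v)
    print(len(updated), len(updated[0]))  # number of latents, latent dimension
```

In this sketch, resampling `noise_tokens` while keeping the latents fixed changes the output, which is the mechanism by which a generator of this kind could produce diverse images for the same label map. A real model would use learned projections and typically multiple attention heads.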
