

Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

December 20, 2023
Authors: Tariq Berrada, Jakob Verbeek, Camille Couprie, Karteek Alahari
cs.AI

Abstract

Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task as it allows control over both the content and the spatial layout of the generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches such as GANs are more efficient, as they only need a single feed-forward pass for generation, but image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that enables highly realistic image generation by exploiting feature backbone networks pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling that uses cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.
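The discriminator idea highlighted in the abstract, reusing a frozen, pre-trained feature backbone and supervising it with per-pixel labels aligned to the input segmentation map, can be sketched roughly as below. This is an illustrative assumption, not the paper's released DP-SIMS code: the choice of ResNet-50, the cut-off after layer2, and the "N semantic classes + 1 fake class" head are hypothetical stand-ins for whichever backbone and head the authors actually use.

```python
# Minimal sketch (assumed, not the paper's implementation): a GAN discriminator
# built on a frozen ImageNet-pre-trained backbone that outputs a per-pixel
# label map over the semantic classes plus one extra "fake" class.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights


class BackboneDiscriminator(nn.Module):
    def __init__(self, num_classes: int, freeze_backbone: bool = True):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        # Keep the early convolutional stages only; drop the classifier head.
        self.stem = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2,
        )
        if freeze_backbone:
            for p in self.stem.parameters():
                p.requires_grad = False
        # Lightweight trainable head: per-pixel logits over the semantic
        # classes plus one extra "fake" class (hypothetical design).
        self.head = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=3, padding=1),
            nn.GroupNorm(32, 256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes + 1, kernel_size=1),
        )
        # The stem downsamples by 8x (conv1 + maxpool + layer2 strides).
        self.upsample = nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.stem(image)       # (B, 512, H/8, W/8)
        logits = self.head(feats)      # per-pixel class + fake logits
        return self.upsample(logits)   # back to input resolution


if __name__ == "__main__":
    disc = BackboneDiscriminator(num_classes=150)  # e.g. ADE-20K has 150 classes
    out = disc(torch.randn(2, 3, 256, 256))
    print(out.shape)  # torch.Size([2, 151, 256, 256])
```

Framing real/fake feedback as a per-pixel classification over the semantic classes ties the adversarial signal directly to the input label map, which is one way a backbone-based discriminator can enforce both realism and layout consistency.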