
Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

December 20, 2023
Authors: Tariq Berrada, Jakob Verbeek, Camille Couprie, Karteek Alahari
cs.AI

Abstract
Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task, as it allows control over both the content and the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches, such as GANs, are more efficient because they need only a single feed-forward pass for generation, but their image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images by exploiting feature backbone networks pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling that uses cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.
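The cross-attention noise-injection idea mentioned in the abstract can be illustrated with a minimal NumPy sketch: latent feature vectors act as queries and attend over a set of noise vectors (keys/values), so each latent receives its own learned mixture of noise rather than a single globally shared noise vector. All shapes, projection matrices, and function names below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_noise(latents, noise, Wq, Wk, Wv):
    """Inject noise into latents via cross-attention (illustrative sketch).

    latents: (n, d)  -- n latent feature vectors (queries)
    noise:   (m, dn) -- m sampled noise vectors (keys/values)
    Wq, Wk, Wv       -- hypothetical projection matrices
    """
    q = latents @ Wq                                 # (n, dk) query projections
    k = noise @ Wk                                   # (m, dk) key projections
    v = noise @ Wv                                   # (m, d)  value projections
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n, m), rows sum to 1
    return latents + attn @ v                        # residual noise injection

rng = np.random.default_rng(0)
n, m, d, dn, dk = 16, 8, 32, 64, 32
latents = rng.normal(size=(n, d))
noise = rng.normal(size=(m, dn))
Wq = rng.normal(size=(d, dk))
Wk = rng.normal(size=(dn, dk))
Wv = rng.normal(size=(dn, d))
out = cross_attention_noise(latents, noise, Wq, Wk, Wv)  # (16, 32)
```

Because the attention weights depend on the latent content, different spatial positions can draw on different noise vectors, which is one plausible route to the more diverse outputs the abstract describes.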