Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

Abstract

Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task because it allows control over both the content and the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches such as generative adversarial networks (GANs) are more efficient, as they need only a single feed-forward pass for generation, but their image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that exploits feature backbone networks pre-trained for tasks such as image classification, enabling the generation of highly realistic images. We also introduce a new generator architecture with better context modeling that uses cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.
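
To make the idea of injecting noise through cross-attention concrete, the following is a minimal sketch, not the DP-SIMS generator itself. It assumes that latent feature vectors act as queries and that a set of noise vectors act as keys and values, with identity projections and a residual connection. The function name `cross_attention_noise_injection`, the shapes, and the absence of learned weights are all illustrative assumptions. The abstract does not specify these details.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_noise_injection(latents, noise_tokens):
    """Add an attention-weighted mix of noise tokens to each latent vector.

    latents:      list of N vectors (queries), each of length d
    noise_tokens: list of M vectors (keys and values), each of length d
    Assumption: identity query/key/value projections; a trained model
    would use learned projection matrices instead.
    """
    d = len(latents[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for query in latents:
        # Scaled dot-product score between this latent and every noise token.
        scores = [
            scale * sum(q * k for q, k in zip(query, token))
            for token in noise_tokens
        ]
        weights = softmax(scores)
        # Attention output: weighted sum of the noise tokens (the values).
        mixed = [
            sum(w * token[i] for w, token in zip(weights, noise_tokens))
            for i in range(d)
        ]
        # Residual connection keeps the original latent content.
        updated.append([q + m for q, m in zip(query, mixed)])
    return updated


rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
noise_tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
out = cross_attention_noise_injection(latents, noise_tokens)
print(len(out), len(out[0]))  # prints "3 4"
```

In this sketch each latent position chooses its own mixture of noise vectors, so resampling `noise_tokens` changes the output while the latent content is preserved through the residual term.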
