Diffusion Self-Guidance for Controllable Image Generation

Abstract

Large-scale generative models can produce high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides greater control over generated images by guiding the internal representations of diffusion models. We demonstrate that properties such as the shape, location, and appearance of objects can be extracted from these representations and used to steer sampling. Self-guidance works similarly to classifier guidance, but uses signals present in the pretrained model itself and therefore requires no additional models or training. We show how a simple set of properties can be composed to perform challenging image manipulations, such as modifying the position or size of objects, merging the appearance of objects in one image with the layout of another, and composing objects from many images into one. We also show that self-guidance can be used to edit real images. For results and an interactive demo, see our project page (https://dave.ml/selfguidance/).
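The guidance idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D attention-like map, the centroid as the "location" property, the squared-distance energy, and all function names are assumptions made for illustration only. In an actual diffusion model the gradient would presumably be taken with respect to the noisy latent and applied during sampling; here plain gradient descent on a vector of logits stands in for that step.

```python
import math

def softmax(logits):
    """Turn raw scores into a normalized, attention-like map."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def centroid(attention):
    """Attention-weighted mean position on a 1-D grid spanning [0, 1]."""
    n = len(attention)
    return sum(a * i / (n - 1) for i, a in enumerate(attention))

def energy(logits, target):
    """Squared distance between the map's centroid and the target position."""
    return (centroid(softmax(logits)) - target) ** 2

def energy_grad(logits, target):
    """Analytic gradient of the energy with respect to the logits."""
    attention = softmax(logits)
    n = len(attention)
    c = centroid(attention)
    # d(centroid)/d(logit_j) = a_j * (x_j - centroid) for a softmax map.
    return [2.0 * (c - target) * a * (j / (n - 1) - c)
            for j, a in enumerate(attention)]

def guide(logits, target, step=0.5, iters=2000):
    """Plain gradient descent on the energy (a stand-in for guided sampling)."""
    logits = list(logits)
    for _ in range(iters):
        grad = energy_grad(logits, target)
        logits = [v - step * g for v, g in zip(logits, grad)]
    return logits

# Hypothetical example: attention mass starts near the left edge.
start = [4.0, 2.0] + [0.0] * 6
moved = guide(start, target=0.75)  # steer the centroid toward x = 0.75
print(centroid(softmax(start)), centroid(softmax(moved)))
```

The point of the sketch is only that a property read off an internal map can be turned into a differentiable energy, and that following its gradient moves the property toward a chosen value without any extra model or training.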
