Zero-shot spatial layout conditioning for text-to-image diffusion models

Abstract

Large-scale text-to-image diffusion models have significantly advanced the state of the art in generative image modelling and provide an intuitive and powerful user interface for driving the image generation process. However, expressing spatial constraints in text, e.g. to position specific objects in particular locations, is cumbersome, and current text-based image generation models cannot follow such instructions accurately. In this paper we consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content. We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models and does not require any additional training. It leverages implicit segmentation maps that can be extracted from cross-attention layers, and uses them to align the generation with input masks. Our experimental results combine high image quality with accurate alignment of the generated content with the input segmentations, and improve over prior work both quantitatively and qualitatively, including methods that require training on images with corresponding segmentations. Compared to Paint with Words, the previous state of the art in image generation with zero-shot segmentation conditioning, we improve by 5 to 10 mIoU points on the COCO dataset with similar FID scores.
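
The alignment figures above are reported in mIoU (mean intersection over union) points. As a rough illustration of what that metric measures, the following is a minimal sketch of mIoU between an input label map and a label map predicted for a generated image. The function name `mean_iou`, the nested-list representation, and the toy 2x2 maps are assumptions made for this example; how the predicted label map is obtained, and the exact evaluation protocol, are not specified in this abstract.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> float:
    """Mean intersection-over-union between two label maps of equal shape.

    Hypothetical helper for illustration only. Classes that appear in
    neither map are left out of the average.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # The pixel belongs to class p in both maps.
                intersection[p] += 1
                union[p] += 1
            else:
                # The pixel counts towards the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: the input mask and a predicted map that disagree on one pixel.
input_mask = [[0, 0], [1, 1]]
predicted = [[0, 1], [1, 1]]
score = mean_iou(predicted, input_mask, num_classes=2)
print(f"mIoU: {100 * score:.1f} points")
```

In this toy case class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12, or about 58.3 points. A gain of 5 to 10 points on this scale corresponds to an absolute increase of 0.05 to 0.10 in the averaged ratio.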
