InstanceGen: Image Generation with Instance-level Instructions

Abstract

Despite rapid advances in the capabilities of generative models, pretrained text-to-image models still struggle to capture the semantics conveyed by complex prompts that combine multiple objects and instance-level attributes. Consequently, there is growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by observing that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances.
