Instruct-Imagen: Image Generation with Multi-modal Instruction

Abstract

This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject), so that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance the model's capability to ground its generation on external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
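To make the idea of a multi-modal instruction more concrete, the following is a minimal illustrative sketch of how a natural-language instruction might tie together several conditioning inputs. It is not the authors' implementation: the class names (`ContextItem`, `MultiModalInstruction`), the bracketed placeholder syntax (`[subject]`, `[style]`, `[edge]`), the modality labels, and the file paths are all assumptions made for illustration, and no images are loaded and no model is called.

```python
import re
from dataclasses import dataclass


@dataclass
class ContextItem:
    """One non-text condition, e.g. an edge map, a style image, or a subject photo."""
    modality: str    # hypothetical label such as "edge", "style", or "subject"
    image_path: str  # placeholder path; this sketch never opens the file


@dataclass
class MultiModalInstruction:
    """A natural-language instruction whose placeholders point at context items."""
    text: str
    context: dict

    def referenced_keys(self):
        # Placeholders are written as [name]; this syntax is an assumption.
        return re.findall(r"\[(\w+)\]", self.text)

    def validate(self):
        # Every placeholder needs a context item, and every item must be used.
        referenced = self.referenced_keys()
        missing = [k for k in referenced if k not in self.context]
        unused = [k for k in self.context if k not in referenced]
        if missing or unused:
            raise ValueError(f"missing={missing}, unused={unused}")


# One uniform format expressing a combined subject + style + edge intent.
instruction = MultiModalInstruction(
    text="Generate an image of [subject] in the style of [style], "
         "following the outline in [edge].",
    context={
        "subject": ContextItem("subject", "examples/dog.png"),
        "style": ContextItem("style", "examples/watercolor.png"),
        "edge": ContextItem("edge", "examples/layout_edges.png"),
    },
)
instruction.validate()
```

The point of the sketch is only that a single text string plus a small set of referenced images can express intents that would otherwise need separate task-specific input formats. How the model actually encodes and attends to these inputs is not described in the abstract and is not modeled here.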
