Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation

Abstract

We introduce "Idea to Image" (Idea2Img), a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation. Humans can quickly identify the characteristics of different text-to-image (T2I) models through iterative exploration. This lets them efficiently convert high-level generation ideas into effective T2I prompts that produce good images. We investigate whether systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities that enable them to explore unknown models or environments through self-refining attempts. Idea2Img cyclically generates revised T2I prompts to synthesize draft images and provides directional feedback for prompt revision, both conditioned on its memory of the probed T2I model's characteristics. Iterative self-refinement gives Idea2Img several advantages over vanilla T2I models. Notably, Idea2Img can process input ideas given as interleaved image-text sequences, follow ideas that contain design instructions, and generate images of better semantic and visual quality. A user preference study validates the efficacy of multimodal iterative self-refinement for automatic image design and generation.
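
The cycle described in the abstract (revise the prompt, synthesize a draft, give feedback, record the round in memory) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed number of rounds, the single draft per round, and the string placeholders standing in for images are all assumptions, and the three callables are stubs where a real system would call an LMM such as GPT-4V and a T2I model.

```python
from typing import Callable, List, Tuple

# One round of history: (prompt, draft image, feedback).
# Plain strings stand in for images here (assumption for illustration).
Round = Tuple[str, str, str]


def refine(
    idea: str,
    revise_prompt: Callable[[str, str, List[Round]], str],
    synthesize: Callable[[str], str],
    give_feedback: Callable[[str, str, List[Round]], str],
    max_rounds: int = 3,  # hypothetical budget; the abstract gives no stopping rule
) -> List[Round]:
    """Run a fixed number of prompt-revision rounds and return the history."""
    memory: List[Round] = []
    feedback = ""
    for _ in range(max_rounds):
        # Prompt revision is conditioned on the idea, the last feedback, and memory.
        prompt = revise_prompt(idea, feedback, memory)
        # The T2I model under exploration turns the prompt into a draft image.
        draft = synthesize(prompt)
        # Feedback is likewise conditioned on memory of earlier rounds.
        feedback = give_feedback(idea, draft, memory)
        memory.append((prompt, draft, feedback))
    return memory


# Hypothetical stubs standing in for the LMM and the T2I model.
def stub_revise(idea: str, feedback: str, memory: List[Round]) -> str:
    return f"{idea} (revision {len(memory) + 1})"


def stub_synthesize(prompt: str) -> str:
    return f"<draft for: {prompt}>"


def stub_feedback(idea: str, draft: str, memory: List[Round]) -> str:
    return f"round {len(memory) + 1}: add more detail"


history = refine("a logo for a coffee shop", stub_revise, stub_synthesize, stub_feedback)
for prompt, draft, feedback in history:
    print(prompt, "->", feedback)
```

Passing the model calls in as plain callables keeps the loop independent of any particular LMM or T2I backend, which matches the abstract's framing of the T2I model as an unknown system to be probed.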