IA-T2I: Internet-Augmented Text-to-Image Generation

Abstract

Current text-to-image (T2I) generation models achieve promising results, but they fail in scenarios where the knowledge implied by the text prompt is uncertain. For example, a T2I model released in February would struggle to generate a suitable poster for a movie premiering in April, because the character designs and styles are uncertain to the model. To solve this problem, we propose an Internet-Augmented text-to-image generation (IA-T2I) framework that makes such uncertain knowledge clear to T2I models by providing them with reference images. Specifically, an active retrieval module is designed to determine whether a reference image is needed based on the given text prompt; a hierarchical image selection module is introduced to find the most suitable image returned by an image search engine to enhance the T2I model; and a self-reflection mechanism is presented to continuously evaluate and refine the generated image to ensure faithful alignment with the text prompt. To evaluate the proposed framework's performance, we collect a dataset named Img-Ref-T2I, in which text prompts include three types of uncertain knowledge: (1) known but rare, (2) unknown, and (3) ambiguous. Moreover, we carefully craft a complex prompt to guide GPT-4o in making preference evaluations, and show that its evaluation accuracy is similar to that of human preference evaluation. Experimental results demonstrate the effectiveness of our framework, which outperforms GPT-4o by about 30% in human evaluation.
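The three modules named in the abstract can be read as a simple control flow: decide whether to retrieve, pick a reference image, then generate and refine. The following is a minimal sketch of that flow only, not the authors' implementation. Every name in it (IAT2IPipeline, needs_reference, search_images, select_reference, generate, evaluate, max_rounds) is hypothetical, images are represented as plain strings, and the refinement step (appending evaluator feedback to the prompt) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class IAT2IPipeline:
    """Hypothetical wiring of the three modules; not the paper's API."""
    # Active retrieval: does this prompt need a reference image?
    needs_reference: Callable[[str], bool]
    # Image search engine stand-in: prompt -> candidate image ids.
    search_images: Callable[[str], List[str]]
    # Image selection stand-in: pick one candidate (or none).
    select_reference: Callable[[str, List[str]], Optional[str]]
    # T2I model stand-in: (prompt, optional reference) -> image.
    generate: Callable[[str, Optional[str]], str]
    # Self-reflection stand-in: (prompt, image) -> (is_aligned, feedback).
    evaluate: Callable[[str, str], Tuple[bool, str]]
    # Assumed cap on refinement rounds so the loop always terminates.
    max_rounds: int = 3

    def run(self, prompt: str) -> str:
        reference = None
        if self.needs_reference(prompt):
            candidates = self.search_images(prompt)
            reference = self.select_reference(prompt, candidates)

        image = self.generate(prompt, reference)
        for _ in range(self.max_rounds):
            aligned, feedback = self.evaluate(prompt, image)
            if aligned:
                break
            # Assumption: refine by regenerating with feedback appended.
            image = self.generate(f"{prompt}\n{feedback}", reference)
        return image


def demo() -> str:
    """Run the sketch with toy stubs in place of real models."""
    pipeline = IAT2IPipeline(
        needs_reference=lambda p: "poster" in p,
        search_images=lambda p: ["img_a", "img_b"],
        select_reference=lambda p, c: c[0] if c else None,
        generate=lambda p, r: f"image(prompt={p!r}, ref={r})",
        evaluate=lambda p, img: (True, ""),
    )
    return pipeline.run("movie poster")


if __name__ == "__main__":
    print(demo())
```

Passing the modules in as callables keeps the sketch independent of any particular search engine, T2I model, or evaluator; a real system would replace each stub with a model or service call.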