Improving Text-to-Image Consistency via Automatic Prompt Optimization

Abstract

Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high-performing models that can generate aesthetically appealing, photorealistic images. Despite this progress, these models still struggle to produce images that are consistent with the input prompt, often failing to capture object quantities, relations, and attributes properly. Existing solutions for improving prompt-image consistency face three challenges: (1) they often require model fine-tuning, (2) they focus only on nearby prompt samples, and (3) they suffer from unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce OPT2I, a T2I optimization-by-prompting framework that leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
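
The iterative loop described in the abstract (start from the user prompt, ask an LLM for revised prompts, keep whatever maximizes a consistency score) can be illustrated with a minimal sketch. This is not the OPT2I implementation: `optimize_prompt`, `propose`, `score`, and the toy stand-ins below are hypothetical names introduced only for illustration. In a real system, `propose` would presumably call an LLM with the prompt-and-score history, and `score` would generate images with a T2I model and evaluate them against the user's prompt with a metric such as DSG.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def optimize_prompt(
    user_prompt: str,
    propose: Callable[[History], List[str]],
    score: Callable[[str, str], float],
    num_iterations: int = 5,
) -> Tuple[str, float]:
    """Return the best (prompt, score) pair found, starting from user_prompt.

    propose: hypothetical stand-in for the LLM; maps the history of
             (prompt, score) pairs to new candidate prompts.
    score:   hypothetical stand-in for "generate images, then measure
             consistency"; called as score(user_prompt, candidate).
    """
    history: History = [(user_prompt, score(user_prompt, user_prompt))]
    for _ in range(num_iterations):
        candidates = propose(history)
        if not candidates:
            break  # nothing new to try
        # Always score candidates against the ORIGINAL user prompt.
        history.extend((c, score(user_prompt, c)) for c in candidates)
        # Keep the history sorted so the proposer sees the best prompt first.
        history.sort(key=lambda pair: pair[1], reverse=True)
    return max(history, key=lambda pair: pair[1])


# Toy stand-ins (assumptions, only here to make the sketch runnable).
DETAILS = ["exactly two dogs", "a red ball", "on the left"]


def toy_score(user_prompt: str, candidate: str) -> float:
    # Toy proxy for a consistency metric: fraction of DETAILS spelled out.
    return sum(d in candidate for d in DETAILS) / len(DETAILS)


def toy_propose(history: History) -> List[str]:
    # Toy proxy for the LLM: append one missing detail to the best prompt.
    best_prompt, _ = history[0]
    missing = [d for d in DETAILS if d not in best_prompt]
    return [f"{best_prompt}, {missing[0]}"] if missing else []


best_prompt, best_score = optimize_prompt(
    "two dogs with a red ball", toy_propose, toy_score
)
```

Note that the sketch optimizes only the prompt text and never touches model weights, which matches the abstract's point that fine-tuning is not required. How OPT2I actually builds the LLM instruction and selects candidates is not specified in this section.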
