IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

Abstract

With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising the question of whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across these expanding domains. To thoroughly evaluate these models, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.