ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; and (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
