GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Abstract

Recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly strong capabilities in image generation and editing, generating significant excitement in the community. This technical report presents a first-look evaluation benchmark, named GPT-ImgEval, that quantitatively and qualitatively diagnoses GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o performs strongly, significantly surpassing existing methods in both image generation control and output quality, while also showing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate its underlying architecture. Our empirical results suggest that the model combines an auto-regressive (AR) model with a diffusion-based head for image decoding, rather than using a VAR-like architecture. We also offer a speculative account of GPT-4o's complete architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its generated images. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work offers valuable insight and provides a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in image generation and beyond. The code and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.
