GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Abstract

Recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly strong capabilities in image generation and editing, generating significant excitement in the community. This technical report presents a first-look evaluation benchmark, GPT-ImgEval, that quantitatively and qualitatively diagnoses GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, using data generated by GPT-4o, we propose a classification-model-based approach to investigate the model's underlying architecture. Our empirical results suggest that it combines an auto-regressive (AR) component with a diffusion-based head for image decoding, rather than using a VAR-like architecture. We also speculate on GPT-4o's complete overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The code and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.