GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
April 3, 2025
Authors: Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan
cs.AI
Abstract
Recent breakthroughs in OpenAI's GPT-4o model have demonstrated
surprisingly good capabilities in image generation and editing, generating
significant excitement in the community. This technical report presents a
first-look evaluation benchmark, GPT-ImgEval, that quantitatively and
qualitatively diagnoses GPT-4o's performance across three critical dimensions:
(1) generation quality, (2) editing proficiency, and (3) world
knowledge-informed semantic synthesis. Across all three tasks, GPT-4o
demonstrates strong performance, significantly surpassing existing methods in
both image generation control and output quality, while also showcasing
exceptional knowledge reasoning capabilities. Furthermore, using data
generated by GPT-4o, we propose a classification-model-based approach to
investigate its underlying architecture. Our empirical results suggest that
the model combines an auto-regressive (AR) component with a diffusion-based
head for image decoding, rather than using a VAR-like architecture. We also
offer a full, though speculative, account of GPT-4o's overall architecture.
In addition, we conduct a series of analyses to identify and
visualize GPT-4o's specific limitations and the synthetic artifacts commonly
observed in its image generation. We also present a comparative study of
multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the
safety implications of GPT-4o's outputs, particularly their detectability by
existing image forensic models. We hope that our work can offer valuable
insight and provide a reliable benchmark to guide future research, foster
reproducibility, and accelerate innovation in the field of image generation and
beyond. The code and datasets used for evaluating GPT-4o can be found at
https://github.com/PicoTrex/GPT-ImgEval.