GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Abstract

Recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly strong capabilities in image generation and editing, generating significant excitement in the community. This technical report presents a first-look evaluation benchmark, GPT-ImgEval, which quantitatively and qualitatively diagnoses GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o performs strongly, significantly surpassing existing methods in both image generation control and output quality while also showing exceptional knowledge reasoning capabilities. Furthermore, using GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o. Our empirical results suggest that the model combines an auto-regressive (AR) component with a diffusion-based head for image decoding, rather than using a VAR-like architecture. We also offer a speculative account of GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its generated images. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work offers valuable insight and a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in image generation and beyond. The code and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.
