An Empirical Study of GPT-4o Image Generation Capabilities

Abstract

The landscape of image generation has evolved rapidly, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Although recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, their architectural design remains opaque and unpublished. This raises the question of whether image and text generation have already been successfully integrated into a unified framework in such systems. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories: text-to-image, image-to-image, image-to-3D, and image-to-X generation, spanning more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling.