Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
August 13, 2025
Authors: Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li
cs.AI
Abstract
Recently, GPT-4o has garnered significant attention for its strong
performance in image generation, yet open-source models still lag behind.
Several studies have explored distilling image data from GPT-4o to enhance
open-source models, achieving notable progress. However, a key question
remains: given that real-world image datasets already constitute a natural
source of high-quality data, why should we use GPT-4o-generated synthetic data?
In this work, we identify two key advantages of synthetic images. First, they
can complement rare scenarios in real-world datasets, such as surreal fantasy
or multi-reference image generation, which frequently occur in user queries.
Second, they provide clean and controllable supervision. Real-world data often
contains complex background noise and inherent misalignment between text
descriptions and image content, whereas synthetic images offer pure backgrounds
and long-tailed supervision signals, facilitating more accurate text-to-image
alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale
synthetic dataset generated by GPT-4o, harnessing the power of synthetic image
data to address blind spots in real-world coverage. Using this dataset, we
fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o.
In addition, we propose two new evaluation benchmarks for a more accurate and
challenging assessment of image generation capabilities: GenEval++, which
increases instruction complexity to mitigate score saturation, and
Imagine-Bench, which focuses on evaluating both the understanding and
generation of imaginative content. Echo-4o demonstrates strong performance
across standard benchmarks. Moreover, applying Echo-4o-Image to other
foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains
across multiple metrics, highlighting the dataset's strong transferability.