Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

Abstract

Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement scenarios that are rare in real-world datasets but occur frequently in user queries, such as surreal fantasy or multi-reference image generation. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, which harnesses synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset's strong transferability.