OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
September 29, 2025
Authors: Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang
cs.AI
Abstract
The performance of unified multimodal models for image generation and editing
is fundamentally constrained by the quality and comprehensiveness of their
training data. While existing datasets have covered basic tasks like style
transfer and simple object manipulation, they often lack the systematic
structure and challenging scenarios required for real-world applications. To
address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset
constructed using a novel methodology that combines hierarchical task taxonomy
with automated data generation. Our taxonomy not only includes fundamental
capabilities such as text rendering and style control but also introduces
highly practical yet challenging categories like scientific imagery for
chemistry illustrations and complex instruction editing requiring simultaneous
execution of multiple operations. Through an automated pipeline leveraging
structured resource pools and GPT-4o, we generate 80k high-quality
instruction-image pairs with controlled diversity, covering 11 major domains
and 51 subtasks. Extensive experiments show that fine-tuning leading models on
our dataset achieves significant performance gains across multiple benchmarks,
with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench)
and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that
systematic data construction is key to advancing multimodal AI capabilities.