OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

Abstract

The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets cover basic tasks such as style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed with a novel methodology that combines a hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories, such as scientific imagery for chemistry illustrations and complex instruction editing that requires executing multiple operations simultaneously. Through an automated pipeline that leverages structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset yields significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.