OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

Abstract

The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. Existing datasets cover basic tasks such as style transfer and simple object manipulation, but they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed with a novel methodology that combines a hierarchical task taxonomy with automated data generation. Our taxonomy includes fundamental capabilities such as text rendering and style control, and it also introduces highly practical yet challenging categories such as scientific imagery for chemistry illustrations and complex instruction editing that requires executing multiple operations simultaneously. Using an automated pipeline that leverages structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset yields significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.