ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Abstract

Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems such as GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image samples, all synthesized with GPT-4o in order to distill its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive text-and-image-to-image performance from scratch, using only 91K synthetic samples and 6 hours of training on a single machine with 8 A800 GPUs. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.