ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Abstract

Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems such as GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image samples, all synthesized with GPT-4o's image generation capabilities in order to distill them. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive text-and-image-to-image performance from scratch, using only 91K synthetic samples and 6 hours of training on a single machine with 8 A800 GPUs. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.