ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Abstract

Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image samples, all synthesized with GPT-4o's image generation capabilities in order to distill its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on a single machine with 8 A800 GPUs. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.