ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
June 22, 2025
Authors: Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang
cs.AI
Abstract
Recent advances in multimodal generative models have unlocked photorealistic,
instruction-aligned image generation, yet leading systems like GPT-4o-Image
remain proprietary and inaccessible. To democratize these capabilities, we
present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and
46K text-and-image-to-image samples, all synthesized with GPT-4o's image
generation capabilities in order to distill those capabilities.
Leveraging this dataset, we develop Janus-4o, a multimodal large language model
capable of both text-to-image and text-and-image-to-image generation. Janus-4o
not only significantly improves text-to-image generation over its predecessor,
Janus-Pro, but also newly supports text-and-image-to-image generation. Notably,
it achieves impressive text-and-image-to-image generation performance from
scratch, using only 91K synthetic samples and 6 hours of training on a single
machine with 8 A800 GPUs. We hope the release of ShareGPT-4o-Image and Janus-4o will
foster open research in photorealistic, instruction-aligned image generation.