ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

June 22, 2025
Authors: Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang
cs.AI

Abstract

Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems such as GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image samples, all synthesized with GPT-4o's image generation capabilities in order to distill them. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive text-and-image-to-image performance from scratch, using only 91K synthetic samples and 6 hours of training on a single machine with 8 A800 GPUs. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
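As a quick sanity check on the figures in the abstract, the following Python sketch adds the two subset sizes and converts the reported training time into GPU-hours. The constant and function names are illustrative only, and the subset sizes are the rounded values quoted in the abstract rather than exact counts.

```python
# Rounded figures quoted in the abstract; exact counts may differ.
TEXT_TO_IMAGE_SAMPLES = 45_000
TEXT_AND_IMAGE_TO_IMAGE_SAMPLES = 46_000
NUM_GPUS = 8        # A800 GPUs in the training machine
TRAINING_HOURS = 6  # reported wall-clock training time


def total_samples() -> int:
    """Combined size of the two synthetic subsets."""
    return TEXT_TO_IMAGE_SAMPLES + TEXT_AND_IMAGE_TO_IMAGE_SAMPLES


def gpu_hours() -> int:
    """Training budget expressed in GPU-hours (GPUs x wall-clock hours)."""
    return NUM_GPUS * TRAINING_HOURS


if __name__ == "__main__":
    print(f"total synthetic samples: {total_samples():,}")
    print(f"training budget: {gpu_hours()} GPU-hours")
```

The sum matches the 91K samples the abstract reports for training, and the budget comes to 48 A800 GPU-hours.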