Seedream 4.0: Toward Next-generation Multimodal Image Generation
September 24, 2025
Authors: Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
cs.AI
Abstract
We introduce Seedream 4.0, an efficient and high-performance multimodal image
generation system that unifies text-to-image (T2I) synthesis, image editing,
and multi-image composition within a single framework. We develop a highly
efficient diffusion transformer paired with a powerful variational autoencoder
(VAE) that considerably reduces the number of image tokens. This enables
efficient training of our model and fast generation of native high-resolution
images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning
diverse taxonomies and knowledge-centric concepts. Comprehensive data
collection across hundreds of vertical scenarios, coupled with optimized
strategies, ensures stable and large-scale training, with strong
generalization. By incorporating a carefully fine-tuned vision-language model
(VLM), we perform multimodal post-training that trains the T2I and image
editing tasks jointly. For inference acceleration, we integrate adversarial distillation,
distribution matching, and quantization, as well as speculative decoding. It
achieves an inference time of up to 1.8 seconds for generating a 2K image
(without an LLM/VLM as a PE model). Comprehensive evaluations reveal that Seedream
4.0 can achieve state-of-the-art results on both T2I and multimodal image
editing. In particular, it demonstrates exceptional multimodal capabilities in
complex tasks, including precise image editing and in-context reasoning; it
also supports multi-image reference and can generate multiple output images.
This extends traditional T2I systems into a more interactive and
multidimensional creative tool, pushing the boundaries of generative AI for both
creative and professional applications. Seedream 4.0 is now accessible at
https://www.volcengine.com/experience/ark?launch=seedream.
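
For intuition on why a higher-compression VAE shortens both training and native 1K-4K generation, the sketch below estimates how many image tokens a diffusion transformer must process at different spatial downsampling factors. This is a minimal illustration under assumed values: the function name, the 2x2 patch size, and the 8x/16x factors are hypothetical and are not taken from the Seedream 4.0 paper.

```python
# Illustrative sketch only: relates VAE spatial downsampling and transformer
# patch size to the number of image tokens. The concrete factors below are
# assumptions for illustration, not details reported for Seedream 4.0.

def num_image_tokens(height: int, width: int, vae_downsample: int, patch_size: int = 2) -> int:
    """Number of latent patches (tokens) the diffusion transformer attends over."""
    latent_h = height // vae_downsample
    latent_w = width // vae_downsample
    return (latent_h // patch_size) * (latent_w // patch_size)

# A 2K (2048x2048) image:
#   8x VAE  -> 256x256 latent -> 128x128 patches -> 16,384 tokens
#   16x VAE -> 128x128 latent ->  64x64 patches  ->  4,096 tokens (4x fewer)
for factor in (8, 16):
    print(f"{factor}x downsampling: {num_image_tokens(2048, 2048, factor):,} tokens")
```

Because self-attention cost grows roughly quadratically with the token count, even a modest increase in VAE compression yields a large reduction in per-step compute, which is consistent with the abstract's claims of efficient training and fast high-resolution generation.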