Seedream 4.0: Toward Next-generation Multimodal Image Generation
September 24, 2025
Authors: Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
cs.AI
Abstract
We introduce Seedream 4.0, an efficient and high-performance multimodal image
generation system that unifies text-to-image (T2I) synthesis, image editing,
and multi-image composition within a single framework. We develop a highly
efficient diffusion transformer with a powerful VAE that also considerably
reduces the number of image tokens. This allows for efficient training of our
model and enables it to quickly generate native high-resolution images (e.g.,
1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning
diverse taxonomies and knowledge-centric concepts. Comprehensive data
collection across hundreds of vertical scenarios, coupled with optimized
strategies, ensures stable and large-scale training, with strong
generalization. By incorporating a carefully fine-tuned VLM, we perform
multimodal post-training that trains both T2I and image editing tasks
jointly. For inference acceleration, we integrate adversarial distillation,
distribution matching, quantization, and speculative decoding. Seedream 4.0
achieves an inference time as low as 1.8 seconds for generating a 2K image
(without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream
4.0 can achieve state-of-the-art results on both T2I and multimodal image
editing. In particular, it demonstrates exceptional multimodal capabilities in
complex tasks, including precise image editing and in-context reasoning. It
also supports multi-image reference and can generate multiple output images.
This extends traditional T2I systems into a more interactive and
multidimensional creative tool, pushing the boundary of generative AI for both
creative and professional applications. Seedream 4.0 is now accessible at
https://www.volcengine.com/experience/ark?launch=seedream.
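
The abstract notes that a stronger VAE reduces the number of image tokens the diffusion transformer has to process. As a rough illustration of why this matters at 1K-4K resolution, the sketch below counts tokens for a square image under two spatial downsampling factors. The factors (8 and 16), the patch size of 2, and the function name image_token_count are assumptions chosen for illustration only; the abstract does not state Seedream 4.0's actual compression ratio or patch size.

```python
def image_token_count(height: int, width: int,
                      vae_downsample: int, patch_size: int = 1) -> int:
    """Number of transformer tokens for one image.

    Hypothetical helper: the VAE shrinks each spatial side by
    `vae_downsample`, and the transformer then groups latent pixels
    into patch_size x patch_size patches, one token per patch.
    """
    stride = vae_downsample * patch_size
    if height % stride or width % stride:
        raise ValueError("image sides must be divisible by the total stride")
    return (height // stride) * (width // stride)


if __name__ == "__main__":
    # Illustrative values only, not Seedream 4.0's real configuration.
    for side in (1024, 2048, 4096):
        baseline = image_token_count(side, side, vae_downsample=8, patch_size=2)
        compressed = image_token_count(side, side, vae_downsample=16, patch_size=2)
        print(f"{side}x{side}: {baseline} tokens at 8x, "
              f"{compressed} tokens at 16x "
              f"({baseline // compressed}x fewer)")
```

Under these assumed settings, doubling the per-side downsampling factor cuts the token count by a factor of four at every resolution. Because self-attention cost grows roughly quadratically with sequence length, a reduction of that kind would be expected to help most at the 2K and 4K end of the range.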