Seedream 4.0: Toward Next-generation Multimodal Image Generation
September 24, 2025
Authors: Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
cs.AI
Abstract
We introduce Seedream 4.0, an efficient and high-performance multimodal image
generation system that unifies text-to-image (T2I) synthesis, image editing,
and multi-image composition within a single framework. We develop a highly
efficient diffusion transformer paired with a powerful variational autoencoder
(VAE) that considerably reduces the number of image tokens. This enables
efficient training of our model and fast generation of native high-resolution
images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning
diverse taxonomies and knowledge-centric concepts. Comprehensive data
collection across hundreds of vertical scenarios, coupled with optimized
strategies, ensures stable and large-scale training, with strong
generalization. By incorporating a carefully fine-tuned vision-language model
(VLM), we perform multimodal post-training that trains the T2I and image
editing tasks jointly. For inference acceleration, we integrate adversarial distillation,
distribution matching, and quantization, as well as speculative decoding. It
achieves an inference time of up to 1.8 seconds for generating a 2K image
(without an LLM/VLM as a PE model). Comprehensive evaluations reveal that Seedream
4.0 can achieve state-of-the-art results on both T2I and multimodal image
editing. In particular, it demonstrates exceptional multimodal capabilities in
complex tasks, including precise image editing and in-context reasoning; it
also supports multi-image reference and can generate multiple output images.
This extends traditional T2I systems into a more interactive and
multidimensional creative tool, pushing the boundaries of generative AI for both
creative and professional applications. Seedream 4.0 is now accessible at
https://www.volcengine.com/experience/ark?launch=seedream.
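
For intuition on why a higher-compression VAE shortens both training and native 1K-4K generation, the sketch below estimates how many image tokens a diffusion transformer must process at different spatial downsampling factors. This is a minimal illustration under assumed values: the function name, the 2x2 patch size, and the 8x/16x factors are hypothetical and are not taken from the Seedream 4.0 paper.

```python
# Illustrative sketch only: relates VAE spatial downsampling and transformer
# patch size to the number of image tokens. The concrete factors below are
# assumptions for illustration, not details reported for Seedream 4.0.

def num_image_tokens(height: int, width: int, vae_downsample: int, patch_size: int = 2) -> int:
    """Number of latent patches (tokens) the diffusion transformer attends over."""
    latent_h = height // vae_downsample
    latent_w = width // vae_downsample
    return (latent_h // patch_size) * (latent_w // patch_size)

# A 2K (2048x2048) image:
#   8x VAE  -> 256x256 latent -> 128x128 patches -> 16,384 tokens
#   16x VAE -> 128x128 latent ->  64x64 patches  ->  4,096 tokens (4x fewer)
for factor in (8, 16):
    print(f"{factor}x downsampling: {num_image_tokens(2048, 2048, factor):,} tokens")
```

Because self-attention cost grows roughly quadratically with the token count, even a modest increase in VAE compression yields a large reduction in per-step compute, which is consistent with the abstract's claims of efficient training and fast high-resolution generation.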