Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
March 10, 2025
Authors: Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
cs.AI
Abstract
The rapid advancement of diffusion models has catalyzed remarkable progress in
the field of image generation. However, prevalent models such as Flux, SD3.5,
and Midjourney still grapple with issues like model bias, limited text
rendering capabilities, and insufficient understanding of Chinese cultural
nuances. To address these limitations, we present Seedream 2.0, a native
Chinese-English bilingual image generation foundation model that excels across
diverse dimensions and adeptly handles text prompts in both Chinese and
English, supporting bilingual image generation and text rendering. We develop a
powerful data system that facilitates knowledge integration, and a caption
system that balances the accuracy and richness of image descriptions.
Particularly, Seedream is integrated with a self-developed bilingual large
language model as a text encoder, allowing it to learn native knowledge
directly from massive data. This enables it to generate high-fidelity images
with accurate cultural nuances and aesthetic expressions described in either
Chinese or English. Besides, Glyph-Aligned ByT5 is applied for flexible
character-level text rendering, while a Scaled ROPE generalizes well to
untrained resolutions. Multi-phase post-training optimizations, including SFT
and RLHF iterations, further improve the overall capability. Through extensive
experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art
performance across multiple aspects, including prompt-following, aesthetics,
text rendering, and structural correctness. Furthermore, Seedream 2.0 has been
optimized through multiple RLHF iterations to closely align its output with
human preferences, as revealed by its outstanding ELO score. In addition, it
can be readily adapted to an instruction-based image editing model, such as
SeedEdit, with strong editing capability that balances instruction-following
and image consistency.
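The abstract reports alignment with human preferences through an ELO score but does not say how that score is computed. As a rough illustration only, the sketch below applies the standard Elo update to pairwise preference votes between two image generators. The model names `model_x` and `model_y`, the starting rating of 1000, and the K-factor of 32 are all assumptions made for this example and are not taken from the paper.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's image was preferred, 0.0 if B's was, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value from the paper.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def rate_models(comparisons, initial=1000.0, k=32.0):
    """Process (model_a, model_b, score_a) votes in order and return ratings."""
    ratings = {}
    for a, b, score_a in comparisons:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = update_elo(ra, rb, score_a, k)
    return ratings


# Hypothetical votes: model_x preferred twice, model_y once.
votes = [("model_x", "model_y", 1.0),
         ("model_x", "model_y", 1.0),
         ("model_x", "model_y", 0.0)]
ratings = rate_models(votes)
```

Each update moves the two ratings by equal and opposite amounts, so the total rating is conserved. A model that wins against a higher-rated opponent gains more than one that wins against a lower-rated opponent.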