Seedream 4.0: Toward Next-generation Multimodal Image Generation

Abstract

We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer paired with a powerful VAE that considerably reduces the number of image tokens. This makes training efficient and enables fast generation of native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable, large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multimodal post-training that trains the T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, quantization, and speculative decoding. The system achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as the PE model). Comprehensive evaluations show that Seedream 4.0 achieves state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities on complex tasks, including precise image editing and in-context reasoning, and it supports multi-image reference as well as the generation of multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creative and professional applications. Seedream 4.0 is now accessible at https://www.volcengine.com/experience/ark?launch=seedream.
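To illustrate why a stronger VAE reduces the number of image tokens, the sketch below counts the tokens a diffusion transformer would process for a square image. The downsampling factors (8 and 16) and the patch size (2) are hypothetical values chosen for illustration, and reading "1K-4K" as 1024 to 4096 pixels per side is also an assumption. None of these figures are taken from the abstract.

```python
def image_token_count(height: int, width: int, vae_downsample: int, patch_size: int = 1) -> int:
    """Count the latent tokens for one image.

    The VAE shrinks each spatial side by `vae_downsample`, and the
    transformer then groups the latent into `patch_size` x `patch_size` patches.
    """
    stride = vae_downsample * patch_size
    if height % stride or width % stride:
        raise ValueError("height and width must be divisible by vae_downsample * patch_size")
    return (height // stride) * (width // stride)


# Hypothetical comparison: an 8x VAE versus a 16x VAE, both with 2x2 patches.
for side in (1024, 2048, 4096):
    baseline = image_token_count(side, side, vae_downsample=8, patch_size=2)
    compressed = image_token_count(side, side, vae_downsample=16, patch_size=2)
    print(f"{side}x{side}: {baseline} tokens at 8x, {compressed} tokens at 16x")
```

Under these assumed settings, doubling the per-side compression cuts the token count by a factor of four at every resolution. Because self-attention cost grows roughly quadratically with sequence length, a reduction of this kind would matter most at the 4K end of the range.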