Unveiling Unified Multimodal Models for Free-Form Interleaved Text-Image Generation

Abstract

Generative AI models that can produce both text and images mark a critical step forward for multimodal intelligence, particularly for tasks that interleave the two modalities. To advance this intelligence to the next stage, models must be able to autonomously generate free-form interleaved text-image sequences. In this paper, we introduce ILLUME-X, an advanced unified multimodal paradigm that enables high-quality, free-form interleaved text-image generation by improving multimodal data efficiency and stabilizing the multimodal training process. ILLUME-X comprises three key components: (i) an expanded training data pipeline optimized for interleaved text-image generation, (ii) a progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, and (iii) ILScore, an objective and comprehensive evaluation method for interleaved text-image sequences. Notably, ILLUME-X outperforms previous unified models across multiple interleaved text-image generation tasks, including style transfer, image decomposition, and storytelling.