Interleaving Reasoning for Better Text-to-Image Generation

Abstract

Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show state-of-the-art performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation.
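The think, generate, reflect, refine loop described in the abstract can be illustrated with a minimal sketch. Everything named here is a hypothetical placeholder chosen for illustration (the `InterleavedModel` interface, its four methods, the `Trajectory` container, and the `StubModel` that returns strings instead of images); none of it is taken from the released IRG code, and a single reflection round is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


class InterleavedModel(Protocol):
    """Hypothetical interface; NOT the API of the released IRG code."""

    def think(self, prompt: str) -> str: ...
    def generate_image(self, prompt: str, thinking: str) -> str: ...
    def reflect(self, prompt: str, thinking: str, image: str) -> str: ...
    def refine_image(self, prompt: str, image: str, reflection: str) -> str: ...


@dataclass
class Trajectory:
    """One full thinking-image trajectory: text and image steps in order."""
    prompt: str
    steps: List[Tuple[str, str]] = field(default_factory=list)


def interleaved_generate(model: InterleavedModel, prompt: str) -> Trajectory:
    """Run one round of think -> image -> reflect -> refined image."""
    traj = Trajectory(prompt)

    # Stage 1: text-based thinking guides the initial image.
    thinking = model.think(prompt)
    traj.steps.append(("thinking", thinking))
    image = model.generate_image(prompt, thinking)
    traj.steps.append(("image", image))

    # Stage 2: reflect on the initial image, then apply the refinements.
    reflection = model.reflect(prompt, thinking, image)
    traj.steps.append(("reflection", reflection))
    refined = model.refine_image(prompt, image, reflection)
    traj.steps.append(("image", refined))
    return traj


class StubModel:
    """Toy stand-in (assumption): strings play the role of images."""

    def think(self, prompt: str) -> str:
        return f"plan for: {prompt}"

    def generate_image(self, prompt: str, thinking: str) -> str:
        return "<image v1>"

    def reflect(self, prompt: str, thinking: str, image: str) -> str:
        return f"improve details of {image}"

    def refine_image(self, prompt: str, image: str, reflection: str) -> str:
        return "<image v2>"
```

Calling `interleaved_generate(StubModel(), "a red cube on a blue sphere")` returns a trajectory whose steps alternate between text and image outputs, which is the interleaved output format the abstract refers to.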
