Interleaving Reasoning for Better Text-to-Image Generation
September 8, 2025
Authors: Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
cs.AI
Abstract
Unified multimodal understanding and generation models have recently achieved
significant improvements in image generation capability, yet a large gap remains
in instruction following and detail preservation compared to systems that
tightly couple comprehension with generation, such as GPT-4o. Motivated by
recent advances in interleaving reasoning, we explore whether such reasoning
can further improve Text-to-Image (T2I) generation. We introduce Interleaving
Reasoning Generation (IRG), a framework that alternates between text-based
thinking and image synthesis: the model first produces a text-based thinking step to
guide an initial image, then reflects on the result to refine fine-grained
details, visual quality, and aesthetics while preserving semantics. To train
IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL),
which targets two sub-goals: (1) strengthening the initial think-and-generate
stage to establish core content and base quality, and (2) enabling high-quality
textual reflection and faithful implementation of those refinements in a
subsequent image. We curate IRGL-300K, a dataset organized into six decomposed
learning modes that jointly cover learning text-based thinking and full
thinking-image trajectories. Starting from a unified foundation model that
natively emits interleaved text-image outputs, our two-stage training first
builds robust thinking and reflection, then efficiently tunes the IRG pipeline
on the full thinking-image trajectory data. Extensive experiments show state-of-the-art (SoTA)
performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF,
GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality
and fine-grained fidelity. The code, model weights, and datasets will be
released at https://github.com/Osilly/Interleaving-Reasoning-Generation.
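
The abstract describes an inference loop that alternates text-based thinking, image synthesis, and textual reflection. The Python sketch below illustrates only that control flow; the model interface (`generate_text`, `generate_image`) is a hypothetical placeholder and not the released implementation in the repository linked above.

```python
# Minimal sketch of the IRG inference loop described in the abstract.
# The `model` interface used here (generate_text / generate_image) is an
# assumption for illustration; the actual repository may expose a different API.

def interleaving_reasoning_generation(model, prompt, num_refinement_rounds=1):
    """Alternate text-based thinking and image synthesis for a T2I prompt."""
    trajectory = []

    # 1. Text-based thinking that plans the core content of the image.
    thinking = model.generate_text(
        f"Think step by step about how to render: {prompt}"
    )
    trajectory.append(("think", thinking))

    # 2. Initial image generation guided by the thinking.
    image = model.generate_image(prompt=prompt, context=thinking)
    trajectory.append(("image", image))

    # 3. Reflection rounds: critique the current image in text, then generate
    #    a refined image that implements the critique while preserving semantics.
    for _ in range(num_refinement_rounds):
        reflection = model.generate_text(
            "Reflect on the generated image: note missing details, "
            "visual-quality issues, and aesthetic improvements.",
            context=(prompt, thinking, image),
        )
        trajectory.append(("reflect", reflection))

        image = model.generate_image(
            prompt=prompt, context=(thinking, reflection, image)
        )
        trajectory.append(("image", image))

    return image, trajectory
```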