Interleaving Reasoning for Better Text-to-Image Generation
September 8, 2025
Authors: Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
cs.AI
Abstract
Unified multimodal understanding and generation models have recently achieved
significant improvements in image generation capability, yet a large gap remains
in instruction following and detail preservation compared to systems that
tightly couple comprehension with generation, such as GPT-4o. Motivated by
recent advances in interleaving reasoning, we explore whether such reasoning
can further improve Text-to-Image (T2I) generation. We introduce Interleaving
Reasoning Generation (IRG), a framework that alternates between text-based
thinking and image synthesis: the model first produces a text-based thinking step to
guide an initial image, then reflects on the result to refine fine-grained
details, visual quality, and aesthetics while preserving semantics. To train
IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL),
which targets two sub-goals: (1) strengthening the initial think-and-generate
stage to establish core content and base quality, and (2) enabling high-quality
textual reflection and faithful implementation of those refinements in a
subsequent image. We curate IRGL-300K, a dataset organized into six decomposed
learning modes that jointly cover learning text-based thinking and full
thinking-image trajectories. Starting from a unified foundation model that
natively emits interleaved text-image outputs, our two-stage training first
builds robust thinking and reflection, then efficiently tunes the IRG pipeline
on the full thinking-image trajectory data. Extensive experiments show state-of-the-art (SoTA)
performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF,
GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality
and fine-grained fidelity. The code, model weights, and datasets will be
released at https://github.com/Osilly/Interleaving-Reasoning-Generation.
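
The abstract describes an inference loop that alternates text-based thinking, image synthesis, and textual reflection. The Python sketch below illustrates only that control flow; the model interface (`generate_text`, `generate_image`) is a hypothetical placeholder and not the released implementation in the repository linked above.

```python
# Minimal sketch of the IRG inference loop described in the abstract.
# The `model` interface used here (generate_text / generate_image) is an
# assumption for illustration; the actual repository may expose a different API.

def interleaving_reasoning_generation(model, prompt, num_refinement_rounds=1):
    """Alternate text-based thinking and image synthesis for a T2I prompt."""
    trajectory = []

    # 1. Text-based thinking that plans the core content of the image.
    thinking = model.generate_text(
        f"Think step by step about how to render: {prompt}"
    )
    trajectory.append(("think", thinking))

    # 2. Initial image generation guided by the thinking.
    image = model.generate_image(prompt=prompt, context=thinking)
    trajectory.append(("image", image))

    # 3. Reflection rounds: critique the current image in text, then generate
    #    a refined image that implements the critique while preserving semantics.
    for _ in range(num_refinement_rounds):
        reflection = model.generate_text(
            "Reflect on the generated image: note missing details, "
            "visual-quality issues, and aesthetic improvements.",
            context=(prompt, thinking, image),
        )
        trajectory.append(("reflect", reflection))

        image = model.generate_image(
            prompt=prompt, context=(thinking, reflection, image)
        )
        trajectory.append(("image", image))

    return image, trajectory
```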