
Interleaving Reasoning for Better Text-to-Image Generation

September 8, 2025
作者: Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
cs.AI

Abstract

Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared with systems that tightly couple comprehension and generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on full thinking-image trajectory data. Extensive experiments show state-of-the-art performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation .
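
To make the alternating think, generate, and reflect loop described above concrete, here is a minimal Python sketch of one possible IRG-style inference procedure. The `UnifiedModel` interface, its method names (`think`, `generate_image`), the `Image` placeholder, and the two-round default are illustrative assumptions for this sketch, not the authors' released API.

```python
# Minimal sketch of the Interleaving Reasoning Generation (IRG) inference loop
# described in the abstract. All class and method names here are hypothetical
# placeholders chosen for illustration; they are not the authors' released API.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Image:
    """Placeholder for a generated image (e.g., a tensor or a PIL image)."""
    data: bytes = b""


class UnifiedModel:
    """Hypothetical unified multimodal model emitting interleaved text/image outputs."""

    def think(self, prompt: str, history: List[str], prev_image: Optional[Image]) -> str:
        """Produce text-based reasoning: a plan for the first image, or a reflection
        on the previous image that proposes fine-grained refinements."""
        raise NotImplementedError

    def generate_image(self, prompt: str, thought: str, prev_image: Optional[Image]) -> Image:
        """Synthesize an initial image, or refine the previous one, conditioned on
        the prompt and the latest text-based thinking."""
        raise NotImplementedError


def interleaved_generation(model: UnifiedModel, prompt: str, rounds: int = 2) -> Tuple[Image, List[str]]:
    """Alternate text-based thinking and image synthesis.

    Round 1 establishes core content and base quality (think, then generate);
    later rounds reflect on the previous image and refine details, visual
    quality, and aesthetics while preserving semantics.
    """
    thoughts: List[str] = []
    image: Optional[Image] = None
    for _ in range(rounds):
        thought = model.think(prompt, thoughts, image)        # thinking or reflection
        image = model.generate_image(prompt, thought, image)  # initial or refined image
        thoughts.append(thought)
    assert image is not None
    return image, thoughts
```

In this reading of the abstract, the same unified model supplies both the reasoning text and the image synthesis, and later rounds condition on both the prompt and the previously generated image, so reflection can correct details without discarding established semantics.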