
Interleaving Reasoning for Better Text-to-Image Generation

September 8, 2025
作者: Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
cs.AI

Abstract

Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared with systems that tightly couple comprehension and generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on full thinking-image trajectory data. Extensive experiments show state-of-the-art performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation .
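
To make the alternating think, generate, and reflect loop described above concrete, here is a minimal Python sketch of one possible IRG-style inference procedure. The `UnifiedModel` interface, its method names (`think`, `generate_image`), the `Image` placeholder, and the two-round default are illustrative assumptions for this sketch, not the authors' released API.

```python
# Minimal sketch of the Interleaving Reasoning Generation (IRG) inference loop
# described in the abstract. All class and method names here are hypothetical
# placeholders chosen for illustration; they are not the authors' released API.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Image:
    """Placeholder for a generated image (e.g., a tensor or a PIL image)."""
    data: bytes = b""


class UnifiedModel:
    """Hypothetical unified multimodal model emitting interleaved text/image outputs."""

    def think(self, prompt: str, history: List[str], prev_image: Optional[Image]) -> str:
        """Produce text-based reasoning: a plan for the first image, or a reflection
        on the previous image that proposes fine-grained refinements."""
        raise NotImplementedError

    def generate_image(self, prompt: str, thought: str, prev_image: Optional[Image]) -> Image:
        """Synthesize an initial image, or refine the previous one, conditioned on
        the prompt and the latest text-based thinking."""
        raise NotImplementedError


def interleaved_generation(model: UnifiedModel, prompt: str, rounds: int = 2) -> Tuple[Image, List[str]]:
    """Alternate text-based thinking and image synthesis.

    Round 1 establishes core content and base quality (think, then generate);
    later rounds reflect on the previous image and refine details, visual
    quality, and aesthetics while preserving semantics.
    """
    thoughts: List[str] = []
    image: Optional[Image] = None
    for _ in range(rounds):
        thought = model.think(prompt, thoughts, image)        # thinking or reflection
        image = model.generate_image(prompt, thought, image)  # initial or refined image
        thoughts.append(thought)
    assert image is not None
    return image, thoughts
```

In this reading of the abstract, the same unified model supplies both the reasoning text and the image synthesis, and later rounds condition on both the prompt and the previously generated image, so reflection can correct details without discarding established semantics.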