Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

November 20, 2025
Authors: Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng
cs.AI

Abstract

Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. Existing methods incorporate textual reasoning (i.e., "thinking") either before the generation process (as pre-planning) or after it (as post-refinement), yet they lack on-the-fly multimodal interaction during generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables textual reasoning to co-evolve with the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
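The interleaving described in the abstract (reason about the next local region, generate it, then reflect on what has been synthesized so far) can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `think_while_generating`, `InterleavedTrace`, `think`, `generate_region`, and `reflect` are hypothetical, regions are represented as plain strings, and the toy stubs stand in for a real multimodal model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InterleavedTrace:
    """Record of one interleaved run (hypothetical structure for illustration)."""
    thoughts: List[str] = field(default_factory=list)     # textual guidance per region
    regions: List[str] = field(default_factory=list)      # placeholder visual regions
    reflections: List[str] = field(default_factory=list)  # textual critique per step


def think_while_generating(
    prompt: str,
    num_regions: int,
    think: Callable[[str, List[str], List[str]], str],
    generate_region: Callable[[str, List[str]], str],
    reflect: Callable[[str, List[str]], str],
) -> InterleavedTrace:
    """Alternate text reasoning and region generation for num_regions steps."""
    trace = InterleavedTrace()
    for _ in range(num_regions):
        # Guide: reason in text about the upcoming local region.
        thought = think(prompt, trace.thoughts, trace.regions)
        trace.thoughts.append(thought)
        # Generate: synthesize the next region conditioned on that thought.
        region = generate_region(thought, trace.regions)
        trace.regions.append(region)
        # Reflect: comment in text on everything synthesized so far.
        trace.reflections.append(reflect(prompt, trace.regions))
    return trace


# Toy stand-ins (assumptions): a real system would call a multimodal model here.
def toy_think(prompt: str, thoughts: List[str], regions: List[str]) -> str:
    return f"step {len(regions) + 1}: plan next part of '{prompt}'"


def toy_generate(thought: str, regions: List[str]) -> str:
    return f"<region {len(regions) + 1}>"


def toy_reflect(prompt: str, regions: List[str]) -> str:
    return f"checked {len(regions)} region(s) against '{prompt}'"


trace = think_while_generating(
    "a red cube on a table", 3, toy_think, toy_generate, toy_reflect
)
```

In this toy form the reflection is only recorded. How reflections feed back into later regions, or trigger revision of earlier ones, is left open here because the abstract does not specify it.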