Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

November 20, 2025
Authors: Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng
cs.AI

Abstract

Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. Existing methods incorporate textual reasoning ("thinking") either before generation (as pre-planning) or after it (as post-refinement), yet they lack on-the-fly multimodal interaction during generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables textual reasoning to co-evolve with the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
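
The interleaving described in the abstract can be pictured as a loop that alternates between thinking about the next local region, generating it, and reflecting on what has been synthesized so far. The following is only a minimal sketch of that control flow, not the authors' implementation: `interleaved_generate`, `Trace`, and the three callbacks are hypothetical names, and the toy stand-ins return strings where a real system would call a multimodal model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """Hypothetical record of one interleaved generation run."""
    thoughts: List[str] = field(default_factory=list)
    regions: List[str] = field(default_factory=list)
    reflections: List[str] = field(default_factory=list)


def interleaved_generate(
    prompt: str,
    num_regions: int,
    think: Callable[[str, Trace, int], str],
    generate_region: Callable[[str, str, Trace], str],
    reflect: Callable[[str, Trace], str],
) -> Trace:
    """Alternate textual reasoning and region generation (assumed control flow)."""
    trace = Trace()
    for step in range(num_regions):
        # 1. Reason about the upcoming local region, given everything so far.
        thought = think(prompt, trace, step)
        trace.thoughts.append(thought)
        # 2. Generate the next region, guided by that thought.
        trace.regions.append(generate_region(prompt, thought, trace))
        # 3. Reflect on the content synthesized so far; later steps can read this.
        trace.reflections.append(reflect(prompt, trace))
    return trace


# Toy stand-ins (assumptions for illustration only; no real model is called).
def toy_think(prompt: str, trace: Trace, step: int) -> str:
    return f"plan region {step} for '{prompt}'"


def toy_generate(prompt: str, thought: str, trace: Trace) -> str:
    return f"<region {len(trace.regions)}>"


def toy_reflect(prompt: str, trace: Trace) -> str:
    return f"checked {len(trace.regions)} region(s)"


if __name__ == "__main__":
    result = interleaved_generate(
        "a red cube on a table", 3, toy_think, toy_generate, toy_reflect
    )
    for t, r, c in zip(result.thoughts, result.regions, result.reflections):
        print(t, "|", r, "|", c)
```

Passing the reasoning and generation steps in as callbacks keeps the sketch independent of any particular model; how regions are defined and how reflection feeds back into generation are left open here because the abstract does not specify them.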