Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

April 8, 2026
Authors: Lei Zhang, Junjiao Tian, Zhipeng Fan, Kunpeng Li, Jialiang Wang, Weifeng Chen, Markos Georgopoulos, Felix Juefei-Xu, Yuxiang Bao, Julian McAuley, Manling Li, Zecheng He
cs.AI

Abstract

Humans paint images incrementally: they plan a global layout, sketch a coarse draft, then inspect and refine details, and, most importantly, each step is grounded in the evolving visual state. However, can unified multimodal models trained on text-image interleaved datasets also imagine such a chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating an image in a single step, our approach unfolds across multiple iterations, each consisting of four stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can a model evaluate each partially complete image? We address this with dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce spatial and semantic consistency; for the textual intermediate states, we preserve prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments on a variety of text-to-image generation benchmarks.
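The four-stage iteration described in the abstract can be summarized as a simple control loop. The sketch below is a hypothetical illustration of that structure only, not the authors' released code: the model interface (plan, draft, reflect, refine), the critique object, and the stopping criterion are assumptions made for exposition.

```python
# Hypothetical sketch of the process-driven generation loop from the abstract.
# The `model` methods and the `critique.is_satisfied` flag are illustrative
# assumptions, not the paper's actual API.

def process_driven_generate(model, prompt, max_rounds=4):
    image = None  # the evolving visual state, empty before the first round
    for _ in range(max_rounds):
        # Stage 1: textual planning, conditioned on the current visual state
        plan = model.plan(prompt, image)
        # Stage 2: visual drafting guided by the textual plan
        image = model.draft(prompt, plan, image)
        # Stage 3: textual reflection, identifying prompt-violating elements
        critique = model.reflect(prompt, image)
        if critique.is_satisfied:
            break  # no violations found; stop early
        # Stage 4: visual refinement grounded in the critique
        image = model.refine(prompt, critique, image)
    return image
```

Note how each textual stage conditions the next visual action, and each visual output grounds the next round of textual reasoning, mirroring the interleaved trajectory the paper describes.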