

Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

April 8, 2026
作者: Lei Zhang, Junjiao Tian, Zhipeng Fan, Kunpeng Li, Jialiang Wang, Weifeng Chen, Markos Georgopoulos, Felix Juefei-Xu, Yuxiang Bao, Julian McAuley, Manling Li, Zecheng He
cs.AI

Abstract

Humans paint images incrementally: they plan a global layout, sketch a coarse draft, then inspect and refine details; most importantly, each step is grounded in the evolving visual state. But can unified multimodal models trained on text-image interleaved datasets likewise imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating an image in a single step, our approach unfolds across multiple iterations, each consisting of four stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can a model evaluate each partially complete image? We address this with dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce spatial and semantic consistency; for the textual intermediate states, we preserve prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments on various text-to-image generation benchmarks.
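To make the four-stage loop concrete, here is a minimal Python sketch of the interleaved reasoning trajectory the abstract describes. The model interface (`plan`, `draft`, `reflect`, `refine`) and the stopping criterion are hypothetical illustrations of the paradigm, not the paper's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Interleaved reasoning trajectory of textual thoughts and visual actions."""
    steps: list = field(default_factory=list)


def process_driven_generation(model, prompt, max_rounds=4):
    """Unfold synthesis across multiple rounds, each with four stages:
    textual planning, visual drafting, textual reflection, visual refinement.

    `model` is a hypothetical unified multimodal model exposing
    plan/draft/reflect/refine methods; this is a sketch of the paradigm,
    not the authors' API.
    """
    trajectory = Trajectory()
    image = None  # the evolving visual state

    for _ in range(max_rounds):
        # 1. Textual planning: reasoning conditioned on the prompt and
        #    the current visual state.
        plan = model.plan(prompt, image)
        trajectory.steps.append(("plan", plan))

        # 2. Visual drafting: produce or update a coarse intermediate image.
        image = model.draft(prompt, plan, image)
        trajectory.steps.append(("draft", image))

        # 3. Textual reflection: the visual intermediate grounds the next
        #    round of reasoning; flag prompt-violating elements.
        critique = model.reflect(prompt, image)
        trajectory.steps.append(("reflect", critique))

        # 4. Visual refinement: correct the flagged elements while keeping
        #    the rest of the image spatially and semantically consistent.
        image = model.refine(prompt, critique, image)
        trajectory.steps.append(("refine", image))

        # Hypothetical stopping criterion: no remaining violations.
        if not critique.violations:
            break

    return image, trajectory
```

The returned trajectory records every intermediate thought and image, which is what makes the process explicit, interpretable, and directly supervisable as the abstract claims.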