
Show, Don't Tell: Morphing Latent Reasoning into Image Generation

February 2, 2026
Authors: Harold Haodong Chen, Xinxiang Yin, Wen-Jie Shu, Hongfei Zhang, Zixin Zhang, Chenfei Liao, Litao Guo, Qifeng Chen, Ying-Cong Chen
cs.AI

Abstract

Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation, a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps, accompanied by frequent image decoding and re-encoding, leading to inefficiency, information loss, and cognitive mismatches. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser that summarizes intermediate generation states into compact visual memory, (ii) a translator that converts latent thoughts into actionable guidance, (iii) a shaper that dynamically steers next-image-token prediction, and (iv) an RL-trained invoker that adaptively determines when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by 15% and 11% on abstract reasoning tasks such as WISE and IPV-Txt; (III) reduces inference time by 44% and token consumption by 51%; and (IV) exhibits 71% cognitive alignment with human intuition on reasoning invocation.
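To make the four components easier to picture, the Python sketch below shows one possible way a condenser, translator, shaper, and invoker could interleave with autoregressive image-token decoding. It is a minimal illustration only: the class names, tensor shapes, hyperparameters, and the `backbone.step`/`backbone.head` interface are assumptions made for this sketch, not the paper's released implementation.

```python
# Illustrative sketch of a LatentMorph-style decoding loop over an assumed
# autoregressive T2I backbone. All module designs here are hypothetical.
import torch
import torch.nn as nn


class Condenser(nn.Module):
    """Compresses hidden states of already-generated image tokens into a
    small set of memory vectors (the 'compact visual memory')."""
    def __init__(self, dim: int, mem_tokens: int = 8, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(mem_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, states: torch.Tensor) -> torch.Tensor:  # (B, T, D)
        q = self.queries.unsqueeze(0).expand(states.size(0), -1, -1)
        memory, _ = self.attn(q, states, states)
        return memory                                          # (B, M, D)


class Translator(nn.Module):
    """Maps latent thoughts (visual memory plus prompt embedding) into a
    guidance vector that can act on the decoder."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, memory: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
        pooled = memory.mean(dim=1)                            # (B, D)
        return self.mlp(torch.cat([pooled, prompt_emb], dim=-1))


class Shaper(nn.Module):
    """Steers the next-image-token prediction by biasing the decoder state."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        return hidden + torch.tanh(self.gate(guidance))        # additive latent steering


class Invoker(nn.Module):
    """Decision head (RL-trained in the paper; a plain classifier here)
    that scores whether to invoke latent reasoning at the current step."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(hidden)).squeeze(-1)    # invocation probability


@torch.no_grad()
def generate(backbone, prompt_emb, cond, trans, shape, invoke,
             max_tokens: int = 576, threshold: float = 0.5) -> torch.Tensor:
    """Greedy image-token decoding with optional latent reasoning.
    Assumes batch size 1 and a hypothetical backbone exposing
    step(tokens) -> (hidden, logits) and head(hidden) -> logits."""
    tokens, states = [], []
    for _ in range(max_tokens):
        hidden, logits = backbone.step(tokens)                 # (B, D), (B, V)
        states.append(hidden)
        if invoke(hidden).item() > threshold:                  # reason only when useful
            memory = cond(torch.stack(states, dim=1))          # condense generation history
            guidance = trans(memory, prompt_emb)               # latent thought -> guidance
            hidden = shape(hidden, guidance)                   # steer next prediction
            logits = backbone.head(hidden)
        tokens.append(logits.argmax(dim=-1))                   # greedy choice for brevity
    return torch.stack(tokens, dim=1)                          # (B, max_tokens)
```

Because the loop never decodes intermediate images or emits textual thoughts, it reflects the abstract's claim that reasoning stays entirely in continuous latent space and is invoked only when the invoker deems it worthwhile.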