EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

March 12, 2026
Authors: Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
cs.AI

Abstract

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations. (i) The MLLM text encoder exhibits insufficient reasoning depth: single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance on complex tasks. (ii) The guidance remains invariant during decoding, which prevents the Diffusion Transformer (DiT) from progressively decomposing complex instructions into actionable denoising steps, even when the MLLM encoding is correct. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework with two components. First, EndoCoT activates the MLLM's reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module keeps the reasoning trajectory grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks step by step. In extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku), EndoCoT achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
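The two components described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the update rule, the schedule that maps thought states to denoising steps, or the grounding loss. Every function name below, the tanh residual update, the 50/50 blend that stands in for a DiT step, and the mean-squared-error loss are assumptions chosen only to make the control flow concrete.

```python
import math
import random


def refine_thought(state, weights):
    """One hypothetical refinement step: residual tanh update of the latent state."""
    return [
        s + math.tanh(sum(w * x for w, x in zip(row, state)))
        for s, row in zip(state, weights)
    ]


def iterative_thought_guidance(initial_state, weights, num_thoughts):
    """Return the trajectory of latent thought states, one per reasoning iteration."""
    trajectory = [initial_state]
    for _ in range(num_thoughts):
        trajectory.append(refine_thought(trajectory[-1], weights))
    return trajectory


def denoise_with_evolving_guidance(latent, trajectory, num_steps):
    """Toy denoising loop whose guidance changes across steps.

    Each step is conditioned on the thought state scheduled for it,
    instead of one fixed encoding (the schedule is an assumption).
    """
    for step in range(num_steps):
        idx = min(step * len(trajectory) // num_steps, len(trajectory) - 1)
        guidance = trajectory[idx]
        # Stand-in for a real DiT denoising step: blend latent with guidance.
        latent = [0.5 * z + 0.5 * g for z, g in zip(latent, guidance)]
    return latent


def terminal_grounding_loss(final_state, answer_embedding):
    """Assumed loss: MSE between the final thought state and an answer embedding."""
    diffs = [(f - a) ** 2 for f, a in zip(final_state, answer_embedding)]
    return sum(diffs) / len(diffs)


random.seed(0)
dim = 4
weights = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
trajectory = iterative_thought_guidance([0.1] * dim, weights, num_thoughts=3)
noise = [random.gauss(0.0, 1.0) for _ in range(dim)]
image_latent = denoise_with_evolving_guidance(noise, trajectory, num_steps=8)
loss = terminal_grounding_loss(trajectory[-1], answer_embedding=[0.0] * dim)
```

In this sketch the contrast with the invariant-guidance setup is a single line: a fixed-guidance baseline would always read `trajectory[0]` inside the denoising loop, whereas here the index advances with the step.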