
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

December 14, 2025
Authors: Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu, Xin Eric Wang
cs.AI

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning and suffer from unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into the latent think tokens to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
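To make the described test-time loop more concrete, below is a minimal, hypothetical PyTorch sketch of the two components named in the abstract: per-token retrieval and injection of visual patches, and confidence-guided refinement of latent think tokens. It is not the authors' implementation; the function names (`dynamic_visual_injection`, `refine_latent_tokens`, `confidence_fn`), the cosine-similarity retrieval, the additive injection, and the use of a differentiable confidence surrogate in place of the paper's policy-gradient estimator are all illustrative assumptions.

```python
# Hypothetical sketch of the test-time loop described in the abstract.
# All names and design choices here are illustrative assumptions, not the
# authors' actual API or algorithm.
import torch
import torch.nn.functional as F

def dynamic_visual_injection(latent_token, visual_feats, best_patches, top_k=4):
    """Retrieve the visual patches most similar to the current latent think
    token, update the running set of best patches, and inject a pooled
    summary of that set back into the latent token."""
    # Cosine similarity between the latent token and every visual patch.
    sims = F.cosine_similarity(latent_token.unsqueeze(0), visual_feats, dim=-1)
    top_idx = sims.topk(top_k).indices
    # Keep only the most recently retrieved top-k patches as the "best" set.
    best_patches = torch.cat([best_patches, visual_feats[top_idx]], dim=0)[-top_k:]
    # Additive injection of the pooled visual evidence (an assumption).
    injected = latent_token + best_patches.mean(dim=0)
    return injected, best_patches

def refine_latent_tokens(latent_tokens, visual_feats, confidence_fn,
                         steps=8, lr=1e-2):
    """Confidence-guided refinement of latent think tokens at test time.
    confidence_fn maps latent tokens to the model's answer confidence; here
    it is treated as a differentiable reward and ascended by gradient,
    a simplification of the policy-gradient update described in the abstract."""
    latent_tokens = latent_tokens.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([latent_tokens], lr=lr)
    best_patches = visual_feats[:0]  # start with an empty patch set
    for _ in range(steps):
        injected = []
        for t in latent_tokens:  # interleave perception at every think token
            inj, best_patches = dynamic_visual_injection(t, visual_feats, best_patches)
            injected.append(inj)
        injected = torch.stack(injected)
        reward = confidence_fn(injected)   # scalar confidence score
        loss = -reward                     # maximize confidence
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return latent_tokens.detach()

if __name__ == "__main__":
    torch.manual_seed(0)
    d = 64
    latent_tokens = torch.randn(6, d)      # 6 latent think tokens
    visual_feats = torch.randn(196, d)     # e.g., 14x14 ViT patch features
    target = torch.randn(d)
    # Stand-in confidence: similarity of the mean latent state to a fixed direction.
    confidence_fn = lambda toks: F.cosine_similarity(
        toks.mean(0, keepdim=True), target.unsqueeze(0)).squeeze()
    refined = refine_latent_tokens(latent_tokens, visual_feats, confidence_fn)
    print("refined latent think tokens:", refined.shape)
```

In this sketch the retrieval and injection happen once per latent think token on every refinement step, which is one plausible reading of "dynamic visual-textual interleaving"; the actual confidence signal, patch-set update rule, and policy-gradient formulation are specified in the paper itself.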