Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

December 14, 2025
Authors: Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu, Xin Eric Wang
cs.AI

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods still face three limitations: dependence on explicit step-by-step reasoning, unstable perception-reasoning interaction, and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, we introduce a Dynamic Visual Injection Strategy, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches; the updated patches are then injected into the latent think tokens to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
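The abstract names two mechanisms: confidence-guided refinement of latent think tokens at test time, and per-token retrieval and injection of visual patch features. Below is a minimal PyTorch sketch of how these two ideas could fit together. Everything here is an illustrative assumption rather than the authors' implementation: the function names (`retrieve_and_inject`, `refine_latents`), the additive injection with cosine-similarity retrieval, the max-softmax confidence proxy, and plain gradient ascent standing in for the paper's latent policy gradient optimization.

```python
# Minimal sketch of the two DMLR ideas described in the abstract.
# All shapes, the additive fusion, and the confidence signal are
# assumptions for illustration, not the paper's implementation.
import torch
import torch.nn.functional as F

def retrieve_and_inject(latent_tokens, visual_patches, top_k=4, alpha=0.5):
    """For each latent think token, retrieve the top-k most similar visual
    patch features (cosine similarity) and inject their mean additively."""
    # latent_tokens: (T, d); visual_patches: (P, d)
    sim = F.normalize(latent_tokens, dim=-1) @ F.normalize(visual_patches, dim=-1).T  # (T, P)
    top_idx = sim.topk(top_k, dim=-1).indices          # (T, k) best patches per token
    retrieved = visual_patches[top_idx].mean(dim=1)    # (T, d) pooled visual evidence
    return latent_tokens + alpha * retrieved           # visual-textual interleaving

def refine_latents(answer_head, latent_tokens, visual_patches, steps=8, lr=1e-2):
    """Test-time refinement: treat latent think tokens as free parameters and
    ascend a confidence signal (mean max-softmax probability of a hypothetical
    answer head) -- a simplified stand-in for the paper's confidence-guided
    latent policy gradient optimization."""
    z = latent_tokens.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        fused = retrieve_and_inject(z, visual_patches)
        logits = answer_head(fused)                              # (T, vocab)
        confidence = logits.softmax(-1).max(-1).values.mean()    # scalar reward proxy
        loss = -confidence                                       # maximize confidence
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

# Toy usage with random features and a linear stand-in for the answer head.
d, vocab = 64, 100
head = torch.nn.Linear(d, vocab)
think = torch.randn(6, d)        # 6 latent think tokens
patches = torch.randn(49, d)     # 7x7 grid of visual patch features
refined = refine_latents(head, think, patches)
print(refined.shape)             # torch.Size([6, 64])
```

Note the design choice in this sketch: the retrieval indices are recomputed at every refinement step, so the set of "best" visual patches changes as the latent tokens move, which is the dynamic interleaving the abstract describes.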