When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

February 9, 2026
Authors: Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal
cs.AI

Abstract

Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
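
The abstract does not spell out AVIC's algorithm, so the following is only an illustrative sketch of the general idea it describes: check whether the current visual evidence is sufficient, and call the world model only while it is not, up to a budget. Every name here (`adaptive_imagination`, `is_sufficient`, `imagine_view`, `answer_fn`, `max_calls`) is hypothetical and not taken from the paper; in a real system the three callables would presumably wrap an MLLM and a world model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ImaginationTrace:
    """Result of one adaptive run (hypothetical container, not from the paper)."""
    answer: str
    world_model_calls: int
    views: List[str]


def adaptive_imagination(
    question: str,
    observed_views: List[str],
    is_sufficient: Callable[[str, List[str]], bool],
    imagine_view: Callable[[str, List[str]], str],
    answer_fn: Callable[[str, List[str]], str],
    max_calls: int = 3,
) -> ImaginationTrace:
    """Invoke the world model only while the evidence is judged insufficient.

    Assumption: sufficiency is a boolean judgment and imagination is scaled
    one view at a time, capped by max_calls.
    """
    views = list(observed_views)  # copy so the caller's list is not mutated
    calls = 0
    while calls < max_calls and not is_sufficient(question, views):
        views.append(imagine_view(question, views))  # one world-model call
        calls += 1
    return ImaginationTrace(answer_fn(question, views), calls, views)


if __name__ == "__main__":
    # Toy stand-ins: evidence counts as sufficient once two views are available.
    trace = adaptive_imagination(
        question="What is to the left of the sofa after turning around?",
        observed_views=["front"],
        is_sufficient=lambda q, v: len(v) >= 2,
        imagine_view=lambda q, v: f"imagined_{len(v)}",
        answer_fn=lambda q, v: f"answer from {len(v)} views",
    )
    print(trace.world_model_calls, trace.views)
```

With a fixed imagination strategy the loop body would run `max_calls` times for every question; here a question whose observed views already pass the sufficiency check costs zero world-model calls, which is the kind of saving the abstract reports.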