When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Abstract

Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
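
The control flow described in the abstract (check whether the current visual evidence suffices, and only then invoke and scale imagination) might be sketched roughly as follows. This is an illustrative sketch only, not the AVIC implementation: the function name `adaptive_imagination`, the callables `is_sufficient`, `imagine_view`, and `answer`, and the `max_calls` budget are all hypothetical placeholders for components the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ImaginationResult:
    answer: str
    world_model_calls: int  # how many imagined views were generated


def adaptive_imagination(
    question: str,
    views: List[str],
    is_sufficient: Callable[[str, List[str]], bool],  # hypothetical sufficiency check
    imagine_view: Callable[[str, List[str]], str],    # hypothetical world-model call
    answer: Callable[[str, List[str]], str],          # hypothetical MLLM answer step
    max_calls: int = 3,                               # assumed imagination budget
) -> ImaginationResult:
    """Answer a spatial question, imagining new views only while evidence is judged insufficient."""
    evidence = list(views)  # copy so the caller's list is not mutated
    calls = 0
    # Imagination is skipped entirely when the observed views already suffice.
    while calls < max_calls and not is_sufficient(question, evidence):
        evidence.append(imagine_view(question, evidence))
        calls += 1
    return ImaginationResult(answer(question, evidence), calls)


if __name__ == "__main__":
    # Toy stand-ins: evidence counts as "sufficient" once three views are available.
    result = adaptive_imagination(
        question="What is to the left of the chair after turning around?",
        views=["front_view"],
        is_sufficient=lambda q, ev: len(ev) >= 3,
        imagine_view=lambda q, ev: f"imagined_view_{len(ev)}",
        answer=lambda q, ev: f"answer based on {len(ev)} views",
    )
    print(result)
```

A fixed imagination strategy corresponds to an `is_sufficient` that always returns `False`, so the budget is always spent in full; the selective variant can stop early or skip imagination altogether, which is where the reported savings in world-model calls would come from under these assumptions.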
