CoV: Chain-of-View Prompting for Spatial Reasoning

January 8, 2026
Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang
cs.AI

Abstract

Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
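The coarse-to-fine loop described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: every interface here (`select_anchor_views`, `reason`, `render`, the `MockVLM`/`MockScene` stubs, and the action vocabulary) is a hypothetical stand-in for whatever the paper's actual agents and 3D scene representation expose.

```python
# Hypothetical sketch of a Chain-of-View-style test-time loop.
# All names and interfaces below are illustrative assumptions.

def chain_of_view(question, frames, vlm, scene, max_steps=8):
    """Coarse-to-fine exploration: anchor-view selection, then iterative
    reasoning interleaved with discrete camera actions."""
    # Coarse stage: drop redundant frames, keep question-aligned anchor views.
    views = vlm.select_anchor_views(question, frames)
    for _ in range(max_steps):
        # Fine stage: the VLM either answers or requests a camera action.
        decision = vlm.reason(question, views)
        if "answer" in decision:
            return decision["answer"]
        # Execute the discrete action against the underlying 3D scene
        # representation to obtain a new observation.
        views.append(scene.render(decision["action"]))
    # Step budget exhausted: answer from the context gathered so far.
    return vlm.reason(question, views, force_answer=True)["answer"]


class MockScene:
    """Stands in for a 3D scene that can render a view after a camera action."""
    def render(self, action):
        return f"view_after_{action}"


class MockVLM:
    """Toy stand-in that requests one zoom action before answering."""
    def select_anchor_views(self, question, frames):
        return frames[:2]  # pretend the first two frames are the anchors

    def reason(self, question, views, force_answer=False):
        if force_answer or len(views) > 2:
            return {"answer": "on the table"}
        return {"action": "zoom_in"}
```

With the stubs above, `chain_of_view("Where are the keys?", ["f0", "f1", "f2"], MockVLM(), MockScene())` selects two anchor views, takes one `zoom_in` action to gather a third observation, and then answers.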