Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

May 31, 2026
Authors: Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen
cs.AI

Abstract

Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image -- and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success, respectively. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study how this gap can be narrowed, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
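
To make the closed-loop structure of the task concrete, the following is a minimal toy sketch in Python, not the TVRBench implementation. The abstract does not specify the action space, the observation format, or the success criterion, so everything here is an assumption made for illustration: the planar pose `(x, y, yaw)`, the three actions, the step sizes, the pose-based tolerances in `is_success`, and both policies. In particular, a real agent would see only images, whereas `oracle_policy` reads the target pose directly. The rotation-only policy illustrates why a target that requires body translation cannot be reached by turning in place.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    x: float
    y: float
    yaw_deg: float  # heading in degrees, counter-clockwise from the +x axis


# Hypothetical discrete action set; the real benchmark's actions may differ.
STEP_M = 0.25
TURN_DEG = 15.0


def signed_angle_deg(target_deg, current_deg):
    """Smallest signed rotation (degrees) taking current_deg to target_deg."""
    return (target_deg - current_deg + 180.0) % 360.0 - 180.0


def apply_action(pose, action):
    """Advance the toy agent by one discrete action."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, (pose.yaw_deg + TURN_DEG) % 360.0)
    if action == "turn_right":
        return Pose(pose.x, pose.y, (pose.yaw_deg - TURN_DEG) % 360.0)
    if action == "forward":
        rad = math.radians(pose.yaw_deg)
        return Pose(pose.x + STEP_M * math.cos(rad),
                    pose.y + STEP_M * math.sin(rad),
                    pose.yaw_deg)
    raise ValueError(f"unknown action: {action}")


def is_success(pose, target, pos_tol_m=0.3, yaw_tol_deg=20.0):
    """Assumed pose-based success test; the tolerances are illustrative only."""
    dist = math.hypot(pose.x - target.x, pose.y - target.y)
    yaw_err = abs(signed_angle_deg(target.yaw_deg, pose.yaw_deg))
    return dist <= pos_tol_m and yaw_err <= yaw_tol_deg


def turn_toward(delta_deg):
    """Pick the turn direction that reduces a signed heading error."""
    return "turn_left" if delta_deg > 0 else "turn_right"


def oracle_policy(history, target):
    """Privileged baseline: walk to the target position, then match its heading."""
    pose = history[-1]
    dist = math.hypot(target.x - pose.x, target.y - pose.y)
    if dist > STEP_M / 2:
        bearing = math.degrees(math.atan2(target.y - pose.y, target.x - pose.x))
        delta = signed_angle_deg(bearing, pose.yaw_deg)
        return turn_toward(delta) if abs(delta) > TURN_DEG / 2 else "forward"
    delta = signed_angle_deg(target.yaw_deg, pose.yaw_deg)
    return turn_toward(delta) if abs(delta) > TURN_DEG / 2 else "stop"


def rotate_only_policy(history, target):
    """Baseline that only turns in place and never translates."""
    delta = signed_angle_deg(target.yaw_deg, history[-1].yaw_deg)
    return turn_toward(delta) if abs(delta) > TURN_DEG / 2 else "stop"


def run_episode(policy, start, target, max_steps=50):
    """Closed loop: the policy sees the full multi-turn history at every step."""
    history = [start]
    for _ in range(max_steps):
        action = policy(history, target)
        if action == "stop":
            break
        history.append(apply_action(history[-1], action))
    return is_success(history[-1], target), history


start = Pose(0.0, 0.0, 0.0)
target = Pose(1.0, 0.0, 90.0)  # needs 1 m of translation plus a 90-degree turn
ok_oracle, oracle_history = run_episode(oracle_policy, start, target)
ok_rotate, rotate_history = run_episode(rotate_only_policy, start, target)
```

In this toy setup the oracle reaches the target while the rotation-only baseline matches the heading but stays 1 m away, so it fails the assumed success test. A benchmark-level success rate would then be the fraction of episodes for which the first return value of `run_episode` is true.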