Where to Look: Can Foundation Models Reach a Target Viewpoint through Active Exploration?

Abstract

Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task in which an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image -- and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success, respectively. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study how this gap can be narrowed, we build a unified TVR post-training framework covering expert-trajectory supervised fine-tuning (SFT), rationale-supervised chain-of-thought SFT (CoT-SFT), offline Single-turn Group Relative Policy Optimization (GRPO), and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
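
For readers unfamiliar with GRPO, the following is a minimal sketch of the group-relative advantage it is generally built on: several rollouts are sampled for the same prompt, and each rollout's reward is normalized against the group's mean and standard deviation. The abstract does not specify the reward or how TVRBench computes advantages, so the function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the 0/1 success rewards below are illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper for illustration only: `rewards` holds one scalar
    reward per rollout sampled for the same prompt (or episode start).
    """
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, some code uses sample std
    # eps keeps the division defined when every rollout gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Assumed example: four rollouts, reward 1.0 if the target view was reached.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Under this normalization, a group in which every rollout receives the same reward yields all-zero advantages and therefore no learning signal, which is one reason the choice of rollout group (single-turn versus full multi-turn episodes) can matter.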