Where to Look: Can Foundation Models Reach Target Viewpoints Through Active Exploration?

Abstract

Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR), an active task in which an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image, and TVRBench, an indoor-simulation benchmark that spans a range of scene scales and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach success rates of only 7.8% and 12.0%, respectively. Fine-grained analysis identifies two consistent bottlenecks. First, off-the-shelf models struggle with multi-turn visual history. Second, performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, which exposes a gap in mapping spatial discrepancies to embodied movement. To study how this gap can be reduced, we build a unified TVR post-training framework covering supervised fine-tuning (SFT) on expert trajectories, rationale-supervised chain-of-thought SFT (CoT-SFT), offline Single-turn GRPO (Group Relative Policy Optimization), and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success. Multi-turn GRPO provides targeted refinement on multi-room cases and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
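
To make the closed-loop setting concrete, the following is a minimal toy sketch of a TVR-style episode and its success-rate computation. It is not the TVRBench implementation: the action names, step sizes, and pose-based success thresholds are all assumptions chosen for illustration, and the agent's pose stands in for the image observations that a real model would receive. The hypothetical `rotate_only_policy` shows why a policy that can only turn in place fails once the target requires translation.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float        # metres (toy units, assumed)
    y: float
    yaw_deg: float  # heading in degrees


def step(pose, action, stride=0.25, turn=30.0):
    """Apply one discrete action (hypothetical action set and step sizes)."""
    if action == "forward":
        rad = math.radians(pose.yaw_deg)
        return Pose(pose.x + stride * math.cos(rad),
                    pose.y + stride * math.sin(rad),
                    pose.yaw_deg)
    if action == "turn_left":
        return Pose(pose.x, pose.y, (pose.yaw_deg + turn) % 360.0)
    if action == "turn_right":
        return Pose(pose.x, pose.y, (pose.yaw_deg - turn) % 360.0)
    raise ValueError(f"unknown action: {action}")


def signed_yaw_diff(target_yaw, current_yaw):
    """Smallest signed angle from current to target, in [-180, 180)."""
    return (target_yaw - current_yaw + 180.0) % 360.0 - 180.0


def is_success(pose, target, pos_tol=0.3, yaw_tol=15.0):
    """Assumed criterion: final pose within position and heading tolerances."""
    close = math.hypot(pose.x - target.x, pose.y - target.y) <= pos_tol
    aligned = abs(signed_yaw_diff(target.yaw_deg, pose.yaw_deg)) <= yaw_tol
    return close and aligned


def run_episode(policy, start, target, max_steps=20):
    """Closed loop: the policy sees the full history and acts until it stops."""
    history = [start]
    for _ in range(max_steps):
        action = policy(history, target)
        if action == "stop":
            break
        history.append(step(history[-1], action))
    return is_success(history[-1], target)


def rotate_only_policy(history, target):
    """Toy policy that fixes heading only and never translates."""
    diff = signed_yaw_diff(target.yaw_deg, history[-1].yaw_deg)
    if abs(diff) <= 15.0:
        return "stop"
    return "turn_left" if diff > 0 else "turn_right"


def success_rate(outcomes):
    """Percentage of successful episodes."""
    return 100.0 * sum(outcomes) / len(outcomes)


start = Pose(0.0, 0.0, 0.0)
rotation_case = run_episode(rotate_only_policy, start, Pose(0.0, 0.0, 90.0))
translation_case = run_episode(rotate_only_policy, start, Pose(1.0, 0.0, 0.0))
rate = success_rate([rotation_case, translation_case])
```

In this toy setup the rotation-only target is reached while the translation target is not, so the rate over the two episodes is 50%. The policy receives the whole history rather than only the latest observation, mirroring the multi-turn setting described above.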