Where to Look: Can Foundation Models Reach Target Viewpoints via Active Exploration?

Abstract

Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR), an active task in which an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image, and TVRBench, an indoor-simulation benchmark that spans scene scale and the visual richness of the target view. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success, respectively. Fine-grained analysis identifies two consistent bottlenecks. First, off-the-shelf models struggle to use multi-turn visual history. Second, performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study how this gap can be narrowed, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
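To make the closed-loop structure of the task concrete, the following is a minimal toy sketch of one TVR-style episode on a 2D floor plan. Everything in it is an illustrative assumption rather than the TVRBench interface: the `Pose` type, the action names, the step and turn sizes, the success tolerances, and in particular `oracle_policy`, which reads ground-truth poses. An actual agent would receive only its current observation and the target image, and the benchmark's real success criterion is not specified here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    x: float        # metres
    y: float        # metres
    yaw_deg: float  # heading in degrees, counter-clockwise from +x


# Hypothetical discrete action sizes (assumptions, not TVRBench values).
STEP_M = 0.25
TURN_DEG = 15.0


def angle_diff(a: float, b: float) -> float:
    """Signed smallest difference a - b in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def step(pose: Pose, action: str) -> Pose:
    """Apply one discrete action and return the new pose."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, (pose.yaw_deg + TURN_DEG) % 360.0)
    if action == "turn_right":
        return Pose(pose.x, pose.y, (pose.yaw_deg - TURN_DEG) % 360.0)
    if action == "forward":
        r = math.radians(pose.yaw_deg)
        return Pose(pose.x + STEP_M * math.cos(r),
                    pose.y + STEP_M * math.sin(r),
                    pose.yaw_deg)
    raise ValueError(f"unknown action: {action}")


def is_success(pose: Pose, target: Pose,
               pos_tol: float = 0.3, yaw_tol: float = 15.0) -> bool:
    """Assumed criterion: close in both position and heading."""
    dist = math.hypot(pose.x - target.x, pose.y - target.y)
    return dist <= pos_tol and abs(angle_diff(pose.yaw_deg, target.yaw_deg)) <= yaw_tol


def oracle_policy(pose: Pose, target: Pose, pos_tol: float = 0.3) -> str:
    """Privileged stand-in for a model: translate first, then align heading."""
    dx, dy = target.x - pose.x, target.y - pose.y
    far = math.hypot(dx, dy) > pos_tol
    desired = math.degrees(math.atan2(dy, dx)) if far else target.yaw_deg
    err = angle_diff(desired, pose.yaw_deg)
    if abs(err) > TURN_DEG / 2:
        return "turn_left" if err > 0 else "turn_right"
    return "forward" if far else "stop"


def run_episode(start: Pose, target: Pose, policy, max_steps: int = 100) -> bool:
    """Closed loop: query the policy, act, repeat until it stops or time runs out."""
    pose = start
    for _ in range(max_steps):
        action = policy(pose, target)
        if action == "stop":
            break
        pose = step(pose, action)
    return is_success(pose, target)


def success_rate(episodes, policy) -> float:
    """Fraction of (start, target) pairs on which the policy succeeds."""
    results = [run_episode(s, t, policy) for s, t in episodes]
    return sum(results) / len(results)
```

In this sketch, a target that differs from the start only in heading can be reached with turn actions alone, while a target at a different position requires the policy to first infer a direction of travel and then translate. That second case corresponds to the translation setting in which the abstract reports the sharpest performance drop.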