Reason, then Re-reason: Cross-View Revisiting Improves Spatial Reasoning

Abstract

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, which forces models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases. In the Reason Phase, a multimodal large language model (MLLM) forms a spatial hypothesis from the original video. In the Re-reason Phase, it verifies or revises that hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views take an elevated, oblique perspective with scene-spanning coverage, and they preserve the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench show that ReRe substantially improves open-source MLLMs, bringing them to performance that rivals proprietary state-of-the-art models. Project page: https://zhenjiemao.github.io/ReRe/
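The two-phase control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `reason_then_rereason`, `ask`, `render_novel_view`, and `ReReResult` are hypothetical, the wording of the second-phase prompt is an assumption, and the Geometry-to-Video pipeline is represented only by a caller-supplied function.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical stand-ins: a "video" is any sequence of frames.
Video = Sequence[object]
# Hypothetical MLLM interface: (video, prompt) -> answer text.
AskFn = Callable[[Video, str], str]
# Hypothetical renderer: original video -> synthesized novel-view video.
RenderFn = Callable[[Video], Video]


@dataclass
class ReReResult:
    hypothesis: str  # answer formed in the Reason phase
    final: str       # answer after the Re-reason phase
    revised: bool    # True if the second phase changed the answer


def reason_then_rereason(
    video: Video,
    question: str,
    ask: AskFn,
    render_novel_view: RenderFn,
) -> ReReResult:
    # Reason phase: form a spatial hypothesis from the original video.
    hypothesis = ask(video, question)

    # Synthesize a complementary viewpoint (assumed to be done elsewhere,
    # e.g. by rendering from predicted 3D geometry).
    novel_video = render_novel_view(video)

    # Re-reason phase: show the novel view together with the earlier
    # hypothesis and ask the model to verify or revise it.
    # The prompt text below is an illustrative assumption.
    prompt = (
        f"{question}\n"
        f"An earlier answer based on the original video was: {hypothesis}\n"
        "Using this additional viewpoint, confirm or correct that answer."
    )
    final = ask(novel_video, prompt)

    revised = final.strip() != hypothesis.strip()
    return ReReResult(hypothesis=hypothesis, final=final, revised=revised)
```

Because both the model and the renderer are passed in as plain callables, the sketch involves no training and no change to the model itself, which mirrors the training-free, inference-time framing of the abstract.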