Reason, then Re-reason: Improving Spatial Reasoning through Cross-View Revisiting

Abstract

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, which forces models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, a multimodal large language model (MLLM) forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises that hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views take an elevated, oblique perspective with scene-spanning coverage and preserve the MLLM's native video interface, so no architectural modifications are required. Extensive evaluations on VSI-Bench and STI-Bench show that ReRe substantially improves open-source MLLMs, bringing them to performance that rivals proprietary state-of-the-art models. Project page: https://zhenjiemao.github.io/ReRe/
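
The two-phase control flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `rere`, `ask_mllm`, `render_novel_view`, and `ReReResult` are hypothetical, the follow-up prompt wording is invented for the example, and the sketch assumes the Re-reason Phase is shown only the synthesized video, which the abstract does not specify. The model and the Geometry-to-Video renderer are passed in as plain callables so that no particular MLLM or 3D library API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Placeholder type: a video is any sequence of frames (assumption for this sketch).
Video = Sequence[object]


@dataclass
class ReReResult:
    hypothesis: str  # answer formed in the Reason Phase (original video)
    final: str       # answer after the Re-reason Phase (novel-view video)
    revised: bool    # True if the second pass changed the answer


def rere(
    question: str,
    video: Video,
    ask_mllm: Callable[[Video, str], str],        # hypothetical: (video, prompt) -> answer
    render_novel_view: Callable[[Video], Video],  # hypothetical: Geometry-to-Video stand-in
) -> ReReResult:
    # Reason Phase: form a spatial hypothesis from the original egocentric video.
    hypothesis = ask_mllm(video, question)

    # Synthesize a complementary view (e.g. elevated and oblique) as another video,
    # so the model is queried through the same video interface both times.
    novel_video = render_novel_view(video)

    # Re-reason Phase: ask the same model to verify or revise its hypothesis.
    # The prompt text below is an assumption made for illustration.
    followup = (
        f"{question}\n"
        f"Your earlier answer from the original video was: {hypothesis}\n"
        "Check it against this new viewpoint and either confirm or correct it."
    )
    final = ask_mllm(novel_video, followup)

    return ReReResult(hypothesis, final, revised=final.strip() != hypothesis.strip())
```

Because both phases call the same `ask_mllm` callable with a video and a text prompt, the sketch reflects the training-free, inference-time character of the framework: nothing about the model itself changes between the two passes, only the evidence it is shown.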