Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
June 10, 2026
Authors: Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao
cs.AI
Abstract
Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, which forces models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases. In the Reason Phase, a multimodal large language model (MLLM) forms a spatial hypothesis from the original video. In the Re-reason Phase, the same model verifies or revises that hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views use an elevated, oblique perspective with scene-spanning coverage, and they are delivered through the MLLM's native video interface, so no architectural modifications are needed. Extensive evaluations on VSI-Bench and STI-Bench show that ReRe substantially improves open-source MLLMs, bringing them to a level that rivals proprietary state-of-the-art models. Project page: https://zhenjiemao.github.io/ReRe/
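The two-phase flow and the elevated, oblique viewpoint described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`elevated_oblique_camera`, `reason_then_rereason`), the callables `ask_mllm` and `render_novel_view`, the z-up convention, the default angles, and the wording of the follow-up prompt are all assumptions made for illustration. The abstract does not specify how the camera is placed or how the first answer is passed back to the model.

```python
import math
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def elevated_oblique_camera(
    points: Sequence[Vec3],
    elevation_deg: float = 45.0,  # assumed default, not from the paper
    azimuth_deg: float = 0.0,     # assumed default, not from the paper
    margin: float = 1.5,          # assumed scale factor on the scene radius
) -> Tuple[Vec3, Vec3]:
    """Return (eye, target) for a camera raised above the scene.

    Assumes z is the up axis. The target is the centre of the points'
    bounding box; the eye sits on a sphere around that centre.
    """
    xs, ys, zs = zip(*points)
    lo = (min(xs), min(ys), min(zs))
    hi = (max(xs), max(ys), max(zs))
    center = tuple((a + b) / 2 for a, b in zip(lo, hi))
    radius = 0.5 * math.dist(lo, hi)  # half the bounding-box diagonal
    dist = margin * radius
    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    eye = (
        center[0] + dist * math.cos(el) * math.cos(az),
        center[1] + dist * math.cos(el) * math.sin(az),
        center[2] + dist * math.sin(el),
    )
    return eye, center


def reason_then_rereason(
    ask_mllm: Callable[[object, str], str],
    render_novel_view: Callable[[object], object],
    video: object,
    question: str,
) -> str:
    """Two-phase inference: answer once, then verify or revise on a new view.

    ask_mllm(video, prompt) and render_novel_view(video) are hypothetical
    callables supplied by the caller.
    """
    # Reason Phase: form a hypothesis from the original egocentric video.
    hypothesis = ask_mllm(video, question)

    # Re-reason Phase: show a synthesized novel-view video and ask the
    # model to confirm or revise its earlier answer.
    novel_video = render_novel_view(video)
    followup = (
        f"{question}\n"
        f"Your earlier answer from the original video was: {hypothesis}\n"
        "This video shows the same scene from a new viewpoint. "
        "Confirm or revise your answer."
    )
    return ask_mllm(novel_video, followup)
```

Passing the model and renderer in as callables keeps the sketch independent of any particular MLLM or 3D reconstruction backend, which matches the training-free, interface-preserving framing of the abstract.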