Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
June 10, 2026
Authors: Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao
cs.AI
Abstract
Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/
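The two phases described in the abstract could be wired together roughly as follows. This is a minimal sketch, not the authors' implementation: the names `rere`, `ReReResult`, `mllm`, and `render_novel_view` are hypothetical, the model and the Geometry-to-Video renderer are passed in as plain callables, and the wording of the second-phase prompt is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder type: a video is treated as a list of frames (assumption).
Video = List[object]


@dataclass
class ReReResult:
    hypothesis: str  # answer formed in the Reason Phase
    final: str       # answer after the Re-reason Phase
    revised: bool    # True if the second phase changed the answer


def rere(
    question: str,
    video: Video,
    mllm: Callable[[str, Video], str],
    render_novel_view: Callable[[Video], Video],
) -> ReReResult:
    """Two-phase, training-free inference loop (hypothetical sketch)."""
    # Reason Phase: form a spatial hypothesis from the original video.
    hypothesis = mllm(question, video)

    # Geometry-to-Video step, abstracted here as a single callable that
    # returns a synthesized novel-view video of the same scene.
    novel_view = render_novel_view(video)

    # Re-reason Phase: show the model the new view together with its earlier
    # answer and ask it to verify or revise. The prompt text is an assumption.
    prompt = (
        f"{question}\n"
        f"An earlier answer based on the original video was: {hypothesis}\n"
        "Using this new viewpoint of the same scene, verify or revise it."
    )
    final = mllm(prompt, novel_view)

    return ReReResult(hypothesis=hypothesis, final=final,
                      revised=(final != hypothesis))
```

Because the novel view is passed to the same `mllm` callable as an ordinary video, the sketch reflects the abstract's point that the model's native video interface is reused without architectural changes.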