Reason, then Re-reason: Improving Spatial Reasoning through Cross-View Revisiting

Abstract

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases. In the Reason Phase, a multimodal large language model (MLLM) forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises that hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views offer an elevated, oblique perspective with scene-spanning coverage, and because they are delivered as video they preserve the MLLM's native video interface and require no architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially improves open-source MLLMs, bringing them to a level that rivals proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/
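The two-phase control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask_model` and `render_novel_view` are hypothetical callables standing in for an MLLM's video interface and for the Geometry-to-Video pipeline, the prompt wording is invented, and showing only the novel-view video in the second phase is one possible reading of the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical types: a "video" is treated as an opaque sequence of frames.
Video = Sequence[object]
# Hypothetical MLLM interface: (video, prompt) -> answer text.
AskModel = Callable[[Video, str], str]
# Hypothetical stand-in for the Geometry-to-Video pipeline.
RenderNovelView = Callable[[Video], Video]


@dataclass
class ReReResult:
    hypothesis: str  # answer formed in the Reason Phase
    final: str       # answer after the Re-reason Phase
    revised: bool    # True if the second phase changed the answer


def reason_then_rereason(
    video: Video,
    question: str,
    ask_model: AskModel,
    render_novel_view: RenderNovelView,
) -> ReReResult:
    # Reason Phase: form a spatial hypothesis from the original video only.
    hypothesis = ask_model(video, question)

    # Synthesize a complementary view (assumed to be elevated and oblique).
    novel_view = render_novel_view(video)

    # Re-reason Phase: verify or revise the hypothesis against the new view.
    # The prompt text below is illustrative only.
    prompt = (
        f"Question: {question}\n"
        f"Initial answer from the original video: {hypothesis}\n"
        "This video shows the same scene from an elevated, oblique viewpoint. "
        "Verify the initial answer against it and give the final answer."
    )
    final = ask_model(novel_view, prompt)

    return ReReResult(
        hypothesis=hypothesis,
        final=final,
        revised=final.strip() != hypothesis.strip(),
    )
```

Because both phases go through the same `ask_model` call, the sketch needs no change to the model itself, which mirrors the training-free, inference-time framing of the abstract. Any real use would replace the two callables with an actual MLLM client and renderer.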