SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Abstract

Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model's own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model's weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.
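To make the central idea concrete, the following is a minimal sketch of how ground truth for two spatial questions can be computed from geometry alone, and how sampling can be weighted toward weak task categories. It is not the SpatialEvo implementation: the abstract does not specify the DGE's validation rules or the scheduler's update rule. All function names, the two example tasks, the world-to-camera extrinsics convention (camera x-axis pointing right), the 5% numeric tolerance, and the error-proportional weighting are assumptions made for illustration only.

```python
import math

def centroid(points):
    """Mean of a list of (x, y, z) points belonging to one object."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def object_distance(points_a, points_b):
    """Centroid-to-centroid distance, computed from the point cloud only."""
    return math.dist(centroid(points_a), centroid(points_b))

def world_to_camera(point, rotation, translation):
    """Apply assumed world-to-camera extrinsics: p_cam = R * p_world + t."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def left_or_right(points, rotation, translation):
    """Side of the image centre the object falls on.

    Assumes the camera x-axis points right; other conventions flip the sign.
    """
    x_cam = world_to_camera(centroid(points), rotation, translation)[0]
    return "right" if x_cam > 0 else "left"

def verify(answer, truth, rel_tol=0.05):
    """Check a solver answer against geometric ground truth.

    Categorical answers must match exactly; numeric answers must fall
    within a hypothetical relative tolerance.
    """
    if isinstance(truth, str):
        return answer == truth
    return math.isclose(answer, truth, rel_tol=rel_tol)

def task_weights(accuracy, floor=0.05):
    """Sampling weights proportional to each task's error rate.

    `accuracy` maps task name to recent solver accuracy in [0, 1]. The floor
    keeps mastered tasks from dropping out of the curriculum entirely.
    """
    raw = {task: max(1.0 - acc, floor) for task, acc in accuracy.items()}
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}
```

In this sketch no model is consulted when producing the label: `object_distance` and `left_or_right` depend only on the points and the camera pose, which is the property the abstract relies on to replace consensus-based pseudo-labels. Calling `task_weights({"distance": 0.9, "direction": 0.5})` assigns the larger weight to `"direction"`, the weaker category.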