PanoWorld: Toward Spatial Supersensing in the 360° Panoramic World

Abstract

Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of the equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding: semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and we instantiate these signals as capability-aligned instruction-tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen. These results indicate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.
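To make concrete what treating an ERP panorama as a continuous, observer-centered space involves, the following minimal sketch maps ERP pixels to viewing angles and measures the angular separation between two pixels. It is not the paper's implementation: the function names, the angle convention (yaw 0 at the image center, pitch +90 degrees at the top row), and the axis layout are all assumptions chosen for illustration.

```python
import math


def erp_pixel_to_angles(u, v, width, height):
    """Map an ERP pixel (u, v) to (yaw, pitch) in degrees.

    Assumed convention (hypothetical, not taken from the paper):
    yaw is 0 at the image center and grows to the right, in [-180, 180);
    pitch is +90 at the top row and -90 at the bottom row.
    """
    yaw = ((u + 0.5) / width - 0.5) * 360.0
    pitch = (0.5 - (v + 0.5) / height) * 180.0
    return yaw, pitch


def angles_to_unit_vector(yaw_deg, pitch_deg):
    """Convert (yaw, pitch) to a unit direction (x right, y up, z forward)."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    x = math.cos(pitch) * math.sin(yaw)
    y = math.sin(pitch)
    z = math.cos(pitch) * math.cos(yaw)
    return x, y, z


def angular_distance(p, q, width, height):
    """Great-circle angle in degrees between two ERP pixels p and q."""
    a = angles_to_unit_vector(*erp_pixel_to_angles(p[0], p[1], width, height))
    b = angles_to_unit_vector(*erp_pixel_to_angles(q[0], q[1], width, height))
    dot = sum(i * j for i, j in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


if __name__ == "__main__":
    w, h = 2048, 1024
    # Pixels on the far-left and far-right columns are 2047 columns apart
    # in the image, yet they face almost the same direction on the sphere.
    print(angular_distance((0, 512), (2047, 512), w, h))
```

The usage example illustrates why the spherical structure matters: the left and right image borders are adjacent on the sphere, so a model that reads the ERP image as an ordinary flat picture would misjudge the spatial relation between objects on either side of the seam.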