PanoWorld: Towards Spatial Supersensing in 360° Panorama World

Abstract

Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction-tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.
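To make the "spherical structure" of an ERP panorama concrete, the following is a minimal sketch of the standard equirectangular mapping between pixel coordinates and viewing directions. It is not taken from PanoWorld: the function names, the axis convention (longitude zero at the image center, y pointing up), and the pixel-center offset are all assumptions chosen for illustration.

```python
import math


def erp_pixel_to_direction(u, v, width, height):
    """Map an ERP pixel (u, v) to (longitude, latitude, unit direction).

    Assumed convention (illustrative only): longitude 0 is the image
    center column, latitude +pi/2 is the top row, y is the up axis.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi   # range (-pi, pi)
    lat = (0.5 - (v + 0.5) / height) * math.pi        # range (-pi/2, pi/2)
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return lon, lat, (x, y, z)


def direction_to_erp_pixel(direction, width, height):
    """Inverse mapping: a 3D direction back to continuous ERP pixel coords."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)
    lat = math.asin(max(-1.0, min(1.0, y / norm)))
    u = (lon / (2.0 * math.pi) + 0.5) * width - 0.5
    v = (0.5 - lat / math.pi) * height - 0.5
    return u, v


def angular_distance(d1, d2):
    """Great-circle angle (radians) between two unit directions."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return math.acos(max(-1.0, min(1.0, dot)))


if __name__ == "__main__":
    W, H = 2048, 1024
    # The left and right image borders are neighbours on the sphere,
    # even though they are W - 1 pixels apart in the 2D image.
    _, _, left = erp_pixel_to_direction(0, H // 2, W, H)
    _, _, right = erp_pixel_to_direction(W - 1, H // 2, W, H)
    print(angular_distance(left, right))
```

The wrap-around case in the usage example is the kind of relationship that stays implicit when a panorama is treated as an ordinary 2D image: two pixels at opposite image borders are adjacent directions for the observer.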