PanoWorld: Toward Spatial Supersensing in 360-Degree Panoramic Worlds

Abstract

Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose a panorama into multiple perspective views, which leaves the spherical structure of the equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities that pano-native understanding requires: semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and we instantiate these signals as capability-aligned instruction-tuning data. On the model side, we introduce PanoWorld, which uses Spherical Spatial Cross-Attention to inject spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.
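
To make the geometric terms above concrete, the following is a minimal sketch of how an ERP pixel can be related to an observer-centered direction, how a viewing direction can be re-expressed after a yaw rotation, and how a per-pixel depth value yields a 3D point. It is an illustration only and is not taken from the PanoWorld implementation. The function names, the axis convention (x right, y up, z forward), and the choice of placing longitude 0 at the image center are all assumptions made for this example.

```python
import math


def erp_pixel_to_angles(u, v, width, height):
    """Map an ERP pixel (u, v) to (longitude, latitude) in radians.

    Assumed convention (not from the paper): longitude lies in [-pi, pi)
    with 0 at the image center and increases to the right; latitude lies
    in [-pi/2, pi/2] and is positive toward the top of the image.
    Pixel centers sit at (u + 0.5, v + 0.5).
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return lon, lat


def angles_to_unit_vector(lon, lat):
    """Convert spherical angles to a unit direction (x right, y up, z forward)."""
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return x, y, z


def rotate_yaw(lon, yaw):
    """Re-express a longitude in a frame turned by `yaw` about the vertical axis.

    The result is wrapped back into [-pi, pi), which reflects the fact that
    the left and right borders of an ERP image are adjacent on the sphere.
    """
    return (lon - yaw + math.pi) % (2.0 * math.pi) - math.pi


def erp_pixel_to_point(u, v, depth, width, height):
    """Lift an ERP pixel with a radial depth value to a 3D point around the observer."""
    lon, lat = erp_pixel_to_angles(u, v, width, height)
    x, y, z = angles_to_unit_vector(lon, lat)
    return depth * x, depth * y, depth * z
```

Under these assumed conventions, the image center maps to the forward direction, and the leftmost and rightmost pixel columns map to nearly identical directions behind the observer. This continuity across the image border is the kind of spherical structure that is lost when a panorama is cut into separate perspective views.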