Panoramic Affordance Prediction

Abstract

Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, whose narrow field of view (FoV) and fragmented observations often miss critical holistic environmental context. In this paper, we present the first exploration of Panoramic Affordance Prediction, which uses 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12K, 11904 x 5952) panoramic images with over 12,000 carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system, to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and uses a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
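The coarse-to-fine idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `split_grid`, `route`, and `gaze_angles`, the 2 x 2 grid, and the 1024-pixel stopping size are all hypothetical choices made for illustration, and `choose_cell` is a stand-in for whatever model answers the grid prompt. The only non-assumed ingredient is the standard equirectangular mapping from pixel coordinates to yaw and pitch, which gives the viewing direction a distortion-correcting perspective crop would be centered on.

```python
from typing import Callable, List, Tuple

# (left, top, right, bottom) in panorama pixel coordinates
Box = Tuple[float, float, float, float]


def split_grid(box: Box, rows: int, cols: int) -> List[Box]:
    """Split a box into rows x cols equal cells, in row-major order."""
    left, top, right, bottom = box
    cell_w = (right - left) / cols
    cell_h = (bottom - top) / rows
    return [
        (left + c * cell_w, top + r * cell_h,
         left + (c + 1) * cell_w, top + (r + 1) * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


def route(box: Box,
          choose_cell: Callable[[List[Box]], int],
          rows: int = 2,
          cols: int = 2,
          min_size: float = 1024.0) -> Box:
    """Repeatedly narrow the region until its longer side is <= min_size.

    choose_cell is a hypothetical callback that returns the index of the
    cell most likely to contain the target (assumed here to be a model
    queried with a grid prompt).
    """
    while max(box[2] - box[0], box[3] - box[1]) > min_size:
        cells = split_grid(box, rows, cols)
        box = cells[choose_cell(cells)]
    return box


def gaze_angles(box: Box, width: float, height: float) -> Tuple[float, float]:
    """Map the box center to (yaw, pitch) in degrees.

    Uses the standard equirectangular convention: yaw in [-180, 180]
    across the image width, pitch in [-90, 90] with +90 at the top row.
    """
    center_x = (box[0] + box[2]) / 2
    center_y = (box[1] + box[3]) / 2
    yaw = (center_x / width - 0.5) * 360.0
    pitch = (0.5 - center_y / height) * 180.0
    return yaw, pitch
```

A caller would start from the full 11904 x 5952 frame, pass a `choose_cell` that wraps the model query, and feed the resulting yaw and pitch to a perspective reprojection step. This sketch uses hard, non-overlapping cells, so a target straddling a cell boundary would be cut; a practical version would likely need overlapping cells or padding.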