Panoramic Affordance Prediction

Abstract

Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, whose narrow field of view (FoV) yields fragmented observations that often miss critical holistic environmental context. In this paper, we present the first exploration of Panoramic Affordance Prediction, which uses 360-degree imagery to capture global spatial relationships and support holistic scene understanding. To facilitate this new task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 × 5952) panoramic images with over 12,000 carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system, to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and uses a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail under the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
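
The coarse-to-fine idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation; it only shows one plausible shape for recursive grid routing followed by a gaze direction. Everything named here is an assumption made for illustration: the 2 × 2 grid, the `min_size` stopping rule, the `pick_cell` callback (a stand-in for whatever model is prompted with the grid), and the standard equirectangular convention in which image width spans 360 degrees of longitude and image height spans 180 degrees of latitude.

```python
from typing import Callable, List, Tuple

# (left, top, right, bottom) in pixels of the equirectangular panorama
Box = Tuple[int, int, int, int]


def split_grid(box: Box, rows: int, cols: int) -> List[Box]:
    """Split a box into rows x cols cells, listed in row-major order."""
    left, top, right, bottom = box
    width, height = right - left, bottom - top
    cells = []
    for r in range(rows):
        for c in range(cols):
            cells.append((
                left + width * c // cols,
                top + height * r // rows,
                left + width * (c + 1) // cols,
                top + height * (r + 1) // rows,
            ))
    return cells


def route(box: Box,
          pick_cell: Callable[[List[Box]], int],
          rows: int = 2,
          cols: int = 2,
          min_size: int = 1024) -> Box:
    """Recursively narrow the search region (hypothetical routing loop).

    pick_cell is an assumed callback that returns the index of the cell
    judged most likely to contain the target. The loop stops once the
    shorter side of the region is at most min_size pixels.
    """
    while min(box[2] - box[0], box[3] - box[1]) > min_size:
        cells = split_grid(box, rows, cols)
        box = cells[pick_cell(cells)]
    return box


def gaze_direction(box: Box, width: int, height: int) -> Tuple[float, float]:
    """Map the box center to (longitude, latitude) in degrees.

    Assumes an equirectangular image: x covers [-180, 180) degrees and
    y covers [90, -90] degrees from top to bottom.
    """
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    lon = (cx / width - 0.5) * 360.0
    lat = (0.5 - cy / height) * 180.0
    return lon, lat


if __name__ == "__main__":
    W, H = 11904, 5952          # panorama size quoted in the abstract
    target = (9000, 1000)       # made-up target pixel for the demo

    def pick_containing(cells: List[Box]) -> int:
        # Toy stand-in for a model's choice: pick the cell holding the target.
        for i, (l, t, r, b) in enumerate(cells):
            if l <= target[0] < r and t <= target[1] < b:
                return i
        raise ValueError("target outside all cells")

    region = route((0, 0, W, H), pick_containing)
    print(region, gaze_direction(region, W, H))
```

In a real system the returned direction would presumably serve as the center of a perspective re-projection, so that later grounding steps see a locally undistorted view; that step is omitted here because the abstract does not specify it.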