

Panoramic Affordance Prediction

March 16, 2026
Authors: Zixin Zhang, Chenfei Liao, Hongfei Zhang, Harold Haodong Chen, Kanghao Chen, Zichen Wen, Litao Guo, Bin Ren, Xu Zheng, Yinchuan Li, Xuming Hu, Nicu Sebe, Ying-Cong Chen
cs.AI

Abstract

Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow fields of view (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration of Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12K, 11904 × 5952) panoramic images with over 12,000 carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system that tackles the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and uses a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation, or fail outright, under the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the potential of panoramic perception for robust embodied intelligence.
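The recursive visual routing step can be pictured as a coarse-to-fine search over the panorama: the image is split into a grid, a model is prompted to pick the most promising cell, and the process repeats inside that cell until the region is small enough for precise grounding. The sketch below is purely illustrative and is not the paper's implementation; the `score_cell` callable, the 3×3 grid, and the 512-pixel stopping size are all assumptions standing in for the VLM's grid-prompted choice.

```python
# Illustrative coarse-to-fine grid routing (NOT the paper's code).
# A region of the panorama is recursively divided into a grid; a
# scoring function (a stand-in for the VLM's grid-prompt answer)
# selects the most promising cell until the region is small enough.

def route(region, score_cell, grid=3, min_size=512):
    """Recursively narrow an (x, y, w, h) region via grid selection.

    region     : (x, y, w, h) in pixels
    score_cell : callable mapping a cell (x, y, w, h) -> float,
                 standing in for the VLM's grid-prompted choice
    grid       : cells per side at each recursion step (assumed)
    min_size   : stop once the region's width is at most this (assumed)
    """
    x, y, w, h = region
    if w <= min_size:
        return region
    cw, ch = w / grid, h / grid
    cells = [(x + i * cw, y + j * ch, cw, ch)
             for j in range(grid) for i in range(grid)]
    best = max(cells, key=score_cell)
    return route(best, score_cell, grid, min_size)

# Toy usage: a "target" at pixel (9000, 4000) in an 11904x5952
# panorama; the toy scorer prefers the cell whose center is nearest
# the target, mimicking a correct grid-prompt answer at every level.
target = (9000, 4000)

def toy_score(cell):
    cx, cy = cell[0] + cell[2] / 2, cell[1] + cell[3] / 2
    return -((cx - target[0]) ** 2 + (cy - target[1]) ** 2)

final = route((0, 0, 11904, 5952), toy_score)
```

Each recursion shrinks the searched area by the grid factor squared, so only a handful of prompts are needed to go from the full 12K panorama down to a fovea-sized crop, which is the point of the coarse-to-fine design.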