
AFUN: Towards an Affordance Foundation Model for Functionality Understanding

June 1, 2026
Authors: Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao
cs.AI

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open, unstructured real-world environments. Yet building an affordance foundation model that not only understands where and how an interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge: they either localize task-relevant regions without specifying executable motion, or predict motion with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditioned functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects. For affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3. For contact-point prediction, it predicts substantially more accurate points, with a hit-rate gain of 12.7% to 61.3% over the best baseline. For 3D motion prediction, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for the robot embodiment or using task-specific heuristics, demonstrating its ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
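The abstract reports segmentation results as gIoU/cIoU without defining them. As a rough illustration only, the sketch below assumes the convention common in language-conditioned segmentation benchmarks: gIoU is the mean of per-sample IoU values, and cIoU is the cumulative intersection divided by the cumulative union over the whole test set. The function name `giou_ciou`, the nested-list mask format, and the handling of empty masks are assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence, Tuple

# Hypothetical mask format for this sketch: a 2D grid of 0/1 values,
# where 1 marks a pixel inside the affordance region.
Mask = Sequence[Sequence[int]]


def _intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united pixels between two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def giou_ciou(preds: Sequence[Mask], gts: Sequence[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) under the assumed definitions.

    gIoU: average of per-sample IoU.
    cIoU: total intersection over total union across all samples.
    """
    if len(preds) == 0 or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")

    per_sample_ious = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = _intersection_and_union(pred, gt)
        # Assumed convention: empty prediction on an empty target scores 1.0.
        per_sample_ious.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union

    giou = sum(per_sample_ious) / len(per_sample_ious)
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou


# Two toy 2x2 samples: the first overlaps on 1 of 2 pixels, the second matches exactly.
example_preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
example_gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
example_giou, example_ciou = giou_ciou(example_preds, example_gts)
```

Under these definitions, gIoU weights every sample equally, while cIoU is dominated by samples with large regions, which is one reason benchmarks often report both.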