AFUN: An Affordance Foundation Model for Functionality Understanding

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open, unstructured real-world environments. Yet building an affordance foundation model that not only understands where and how an interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge: they either localize task-relevant regions without specifying executable motion, or predict motion with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditioned functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, it outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a hit-rate gain of 12.7% to 61.3% over the best baseline; and for 3D motion prediction, it achieves the best performance on all three test sets. AFUN can be deployed directly for real-world robot manipulation without finetuning for the robot embodiment or task-specific heuristics, demonstrating its ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
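
The segmentation and contact-point results above are reported as gIoU/cIoU and hit rate, which the abstract does not define. As a minimal sketch, assuming the convention common in reasoning-segmentation benchmarks (gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union) and assuming a contact point counts as a hit when it falls inside the ground-truth mask, the metrics could be computed as follows. The function names and the list-of-rows mask format are illustrative and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask stored as rows of 0/1 values
Point = Tuple[int, int]         # (row, col) pixel coordinate


def _intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersection and union pixels of two same-sized binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def giou_ciou(preds: List[Mask], gts: List[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) under the assumed convention described above."""
    per_image = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = _intersection_union(pred, gt)
        # Assumption: two empty masks count as a perfect match (IoU = 1).
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


def hit_rate(points: List[Point], gts: List[Mask]) -> float:
    """Fraction of predicted points that land inside the ground-truth mask."""
    hits = sum(1 for (r, c), gt in zip(points, gts) if gt[r][c])
    return hits / len(points)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    print(giou_ciou(preds, gts))
    print(hit_rate([(0, 0), (1, 1)], gts))
```

Under this convention, cIoU weights images by mask area while gIoU treats every image equally, so reporting both gives a view of performance on large and small affordance regions.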