AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open, unstructured real-world environments. Yet building an affordance foundation model that understands both where and how an interaction should happen, and that also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge: they either localize task-relevant regions without specifying executable motion, or predict motion with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditioned functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects. For affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3. For contact-point prediction, it predicts substantially more accurate points, with a hit-rate gain of 12.7% to 61.3% over the best baseline. For 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for the robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
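The abstract reports gIoU, cIoU, and contact-point hit rate without defining them. The sketch below shows one common reading of these metrics, not the paper's evaluation code. It assumes gIoU is the mean of per-sample IoU values, cIoU is the cumulative intersection divided by the cumulative union over the whole test set, and a contact point counts as a hit when it falls inside the ground-truth mask. The mask format (rows of 0/1 values) and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical mask format: a binary mask given as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def giou_ciou(preds: List[Mask], targets: List[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) under the assumed definitions.

    gIoU: mean of per-sample IoU values.
    cIoU: summed intersection divided by summed union over all samples.
    """
    per_sample = []
    total_inter = total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumed convention: two empty masks count as a perfect match.
        per_sample.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_sample) / len(per_sample)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


def hit_rate(points: List[Tuple[int, int]], targets: List[Mask]) -> float:
    """Fraction of predicted (row, col) points that land inside the target mask."""
    hits = sum(1 for (r, c), mask in zip(points, targets) if mask[r][c])
    return hits / len(points)


# Toy data: two 2x2 predictions against two 2x2 ground-truth masks.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
giou, ciou = giou_ciou(preds, targets)
rate = hit_rate([(0, 0), (1, 1)], targets)
```

On the toy data the per-sample IoUs are 0.5 and 1.0, so gIoU is 0.75, while cIoU pools the pixel counts and gives 2/3. The two metrics differ because cIoU weights samples by mask size, which is why results are often reported with both.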