AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open, unstructured real-world environments. Yet building an affordance foundation model that understands both where and how an interaction should happen, and that also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge: they either localize task-relevant regions without specifying an executable motion, or predict motion with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditioned functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN along three axes. For affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3. For contact-point prediction, it predicts substantially more accurate points, with a hit-rate gain of 12.7% to 61.3% over the best baseline. For 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for the robot embodiment and without task-specific heuristics, demonstrating its ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
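
The abstract reports gIoU, cIoU, and hit rate without defining them. As a rough illustration only, the sketch below assumes the definitions common in referring and reasoning segmentation work: gIoU as the mean of per-image IoUs, and cIoU as the cumulative intersection over the cumulative union across the whole test set. Hit rate is assumed here to be the fraction of predicted contact points that land inside the ground-truth mask. The function names `giou_ciou` and `hit_rate` are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
import numpy as np


def giou_ciou(preds, gts):
    """Return (gIoU, cIoU) for paired sequences of binary masks.

    Assumed definitions (not confirmed by the abstract):
      gIoU = mean of per-image IoU
      cIoU = sum of intersections / sum of unions over all images
    """
    per_image_iou = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        # Assumed convention: empty prediction on an empty target scores 1.
        per_image_iou.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    giou = float(np.mean(per_image_iou))
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou


def hit_rate(points, gts):
    """Fraction of (row, col) points that fall inside their ground-truth mask."""
    hits = [bool(np.asarray(gt, dtype=bool)[r, c])
            for (r, c), gt in zip(points, gts)]
    return float(np.mean(hits))


# Toy data: two 2x2 masks sharing the same ground truth.
example_gts = [np.array([[1, 0], [0, 0]]), np.array([[1, 0], [0, 0]])]
example_preds = [np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]])]
example_giou, example_ciou = giou_ciou(example_preds, example_gts)
example_hits = hit_rate([(0, 0), (1, 1)], example_gts)
```

The two metrics weight errors differently: gIoU treats every image equally, while cIoU is dominated by images with large masks, which is one reason results are often reported as a gIoU/cIoU pair.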