AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a hit-rate gain of 12.7% to 61.3% over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
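
The abstract reports gIoU, cIoU, and a contact-point hit rate without defining them. The sketch below illustrates one common reading of these metrics and is not taken from the paper: it assumes gIoU is the mean of per-sample IoU, cIoU is the cumulative intersection divided by the cumulative union, and a contact point counts as a hit when it falls inside the ground-truth mask. The function names (`iou_terms`, `giou_ciou`, `hit_rate`) and the handling of empty masks are hypothetical choices made for illustration.

```python
import numpy as np


def iou_terms(pred, gt):
    """Return (intersection, union) pixel counts for two boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = int(np.logical_and(pred, gt).sum())
    union = int(np.logical_or(pred, gt).sum())
    return inter, union


def giou_ciou(preds, gts):
    """gIoU: mean of per-sample IoU. cIoU: summed intersection / summed union."""
    terms = [iou_terms(p, g) for p, g in zip(preds, gts)]
    # Assumption: a sample whose union is empty (both masks empty) scores 1.0.
    per_sample = [i / u if u > 0 else 1.0 for i, u in terms]
    giou = float(np.mean(per_sample))
    total_inter = sum(i for i, _ in terms)
    total_union = sum(u for _, u in terms)
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou


def hit_rate(points, gts):
    """Fraction of predicted (row, col) points that land inside the ground-truth mask."""
    hits = [bool(np.asarray(g, dtype=bool)[r, c]) for (r, c), g in zip(points, gts)]
    return float(np.mean(hits))


if __name__ == "__main__":
    # Two toy 4x4 samples: one perfect prediction, one covering half the target.
    gt_a = np.zeros((4, 4), dtype=bool)
    gt_a[0:2, 0:2] = True
    pred_a = gt_a.copy()
    gt_b = np.zeros((4, 4), dtype=bool)
    gt_b[:, 0:2] = True
    pred_b = np.zeros((4, 4), dtype=bool)
    pred_b[:, 0:1] = True

    print(giou_ciou([pred_a, pred_b], [gt_a, gt_b]))
    print(hit_rate([(0, 0), (3, 3)], [gt_a, gt_b]))
```

Under these assumptions the two metrics can diverge: gIoU weights every sample equally, while cIoU is dominated by samples with large masks, which is one reason both are often reported together.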