INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding

Abstract

Affordance denotes the potential interactions inherent in objects. Perceiving affordances enables intelligent agents to navigate and interact with new environments efficiently. Weakly supervised affordance grounding teaches agents the concept of affordance from exocentric images, without costly pixel-level annotations. Although recent advances in weakly supervised affordance grounding have yielded promising results, challenges remain, including the requirement for paired exocentric and egocentric image datasets and the complexity of grounding diverse affordances for a single object. To address these, we propose INTeraction Relationship-aware weakly supervised Affordance grounding (INTRA). Unlike prior work, INTRA recasts the problem as representation learning, identifying the unique features of interactions through contrastive learning with exocentric images only, which eliminates the need for paired datasets. Moreover, we leverage vision-language model embeddings to perform affordance grounding flexibly with any text, design text-conditioned affordance map generation that reflects interaction relationships for contrastive learning, and enhance robustness with text synonym augmentation. Our method outperforms prior work on diverse datasets such as AGD20K, IIT-AFF, CAD, and UMD. Additionally, experimental results demonstrate that our method scales remarkably well to new domains such as synthesized images and illustrations, and that it can perform affordance grounding for novel interactions and objects.
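To make the idea of contrastive learning over interaction features concrete, the following is a minimal sketch of a generic supervised-contrastive-style loss. It is not INTRA's actual objective, which the abstract does not specify. The function names, the temperature value, the toy 2-D features, and the interaction labels "hold" and "sip" are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (plain Python lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def interaction_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised-contrastive-style loss (illustrative assumption,
    not the paper's loss): features that share an interaction label are
    pulled together, features with different labels are pushed apart."""
    n = len(features)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        # Scaled similarities from anchor i to every other sample.
        logits = {
            j: cosine(features[i], features[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits.values())
        log_denom = peak + math.log(
            sum(math.exp(value - peak) for value in logits.values())
        )
        # Average negative log-probability of the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


# Toy features: two tight clusters in 2-D (hypothetical values).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = interaction_contrastive_loss(features, ["hold", "hold", "sip", "sip"])
shuffled = interaction_contrastive_loss(features, ["hold", "sip", "hold", "sip"])
print(aligned < shuffled)
```

In this toy setting the loss is small when the interaction labels agree with the feature clusters and large when they do not. This is the behavior a representation-learning formulation relies on to separate interactions without paired egocentric images.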
