
Selective Contrastive Learning for Weakly Supervised Affordance Grounding

August 11, 2025
Authors: WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
cs.AI

Abstract

Facilitating an entity's interaction with objects requires accurately identifying the parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies that incorporate a part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common class-specific patterns that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, by cross-referencing the objects discovered in these complementary views, we excavate precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method. Code is available at github.com/hynnsk/SelectiveCL.
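
As a rough illustration of the selective contrastive idea described above, the sketch below implements a prototype-versus-background, InfoNCE-style objective over dense features: pixels marked as affordance-relevant (e.g., mined by cross-referencing CLIP-localized objects across egocentric and exocentric views) are pulled toward their per-image prototype, while background pixels are pushed away. The function name, tensor shapes, mask source, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal PyTorch sketch of a selective prototype-vs-background contrastive loss.
# Shapes, the mask source, and the temperature are assumptions for illustration.
import torch
import torch.nn.functional as F


def selective_prototype_contrast(feats, relevant_mask, temperature=0.07):
    """
    feats:         (B, C, H, W) dense features from the grounding backbone.
    relevant_mask: (B, 1, H, W) binary mask of affordance-relevant pixels
                   (hypothetically derived from CLIP-based object discovery
                   cross-referenced between views).
    Returns a scalar loss pulling relevant pixels toward their prototype
    and pushing background pixels away from it.
    """
    B, C, H, W = feats.shape
    feats = F.normalize(feats, dim=1)              # unit-norm pixel embeddings
    flat = feats.flatten(2)                        # (B, C, HW)
    mask = relevant_mask.flatten(2).float()        # (B, 1, HW)

    # Affordance prototype: masked average of relevant pixel embeddings.
    proto = (flat * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)   # (B, C)
    proto = F.normalize(proto, dim=1)

    # Cosine similarity of every pixel to its image's prototype.
    sim = torch.einsum("bc,bcn->bn", proto, flat) / temperature   # (B, HW)

    pos = mask.squeeze(1)        # affordance-relevant pixels are positives
    neg = 1.0 - pos              # background pixels are negatives

    # InfoNCE-style term: each positive pixel competes against the
    # background pixels of the same image.
    exp_sim = sim.exp()
    denom = exp_sim * pos + (exp_sim * neg).sum(-1, keepdim=True)
    loss = -(torch.log(exp_sim / denom.clamp(min=1e-8)) * pos).sum(-1)
    loss = loss / pos.sum(-1).clamp(min=1.0)       # average over positives
    return loss.mean()
```

In a full pipeline this term would be combined with the classification and distillation objectives mentioned in the abstract, and applied selectively at the object or part level depending on how precise the mined mask is.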