Selective Contrastive Learning for Weakly Supervised Affordance Grounding
August 11, 2025
Authors: WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
cs.AI
Abstract
Facilitating an entity's interaction with objects requires accurately
identifying parts that afford specific actions. Weakly supervised affordance
grounding (WSAG) seeks to imitate human learning from third-person
demonstrations, where humans intuitively grasp functional parts without needing
pixel-level annotations. To achieve this, grounding is typically learned using
a shared classifier across images from different perspectives, along with
distillation strategies incorporating a part discovery process. However, since
affordance-relevant parts are not always easily distinguishable, models
primarily rely on classification, often focusing on common class-specific
patterns that are unrelated to affordance. To address this limitation, we move
beyond isolated part-level learning by introducing selective prototypical and
pixel contrastive objectives that adaptively learn affordance-relevant cues at
both the part and object levels, depending on the granularity of the available
information. Initially, we find the action-associated objects in both
egocentric (object-focused) and exocentric (third-person example) images by
leveraging CLIP. Then, by cross-referencing the discovered objects of
complementary views, we excavate the precise part-level affordance clues in
each perspective. By consistently learning to distinguish affordance-relevant
regions from affordance-irrelevant background context, our approach effectively
shifts activation from irrelevant areas toward meaningful affordance cues.
Experimental results demonstrate the effectiveness of our method. Code is
available at github.com/hynnsk/SelectiveCL.
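
To make the CLIP-based discovery step concrete, here is a minimal sketch, assuming the openai CLIP package: it ranks candidate object names against an egocentric or exocentric image by image-text similarity. The prompt template, the candidate object list, and the helper name action_object_scores are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): pick the action-associated
# object for an image by CLIP image-text similarity.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def action_object_scores(image_path, action, candidate_objects):
    """Score candidate objects for an action via CLIP image-text similarity."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    # Prompt wording is an assumption for this example.
    prompts = [f"a photo of a {obj} to {action}" for obj in candidate_objects]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)  # cosine similarity per object
    return dict(zip(candidate_objects, sims.tolist()))

# Example usage (paths and object names are placeholders):
# scores = action_object_scores("egocentric.jpg", "cut", ["knife", "cup", "bicycle"])
```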
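
The selective prototypical/pixel contrastive idea can likewise be sketched as a loss that pulls affordance-relevant pixels toward their prototype while pushing background pixels away. This is an illustrative PyTorch sketch, not the released objective; the pseudo-mask source, the function name selective_prototype_contrast, and the temperature value are assumptions.

```python
# Illustrative sketch of a prototype-vs-background contrastive objective.
import torch
import torch.nn.functional as F

def selective_prototype_contrast(feats, mask, tau=0.1):
    """
    feats: (D, H, W) pixel embeddings from one image.
    mask:  (H, W) binary pseudo-mask, 1 = affordance-relevant, 0 = background.
    Pulls relevant pixels toward their prototype, pushes background away.
    """
    D, H, W = feats.shape
    feats = F.normalize(feats.view(D, -1), dim=0)            # (D, H*W), unit norm
    mask = mask.view(-1).bool()
    if mask.sum() == 0 or (~mask).sum() == 0:
        return feats.new_zeros(())                           # skip degenerate masks
    proto = F.normalize(feats[:, mask].mean(dim=1), dim=0)   # (D,) prototype
    pos_sim = (proto @ feats[:, mask]) / tau                 # relevant pixels
    neg_sim = (proto @ feats[:, ~mask]) / tau                # background pixels
    # InfoNCE-style: each relevant pixel against all background pixels.
    neg_term = torch.logsumexp(neg_sim, dim=0)
    loss = -(pos_sim - torch.logaddexp(pos_sim, neg_term)).mean()
    return loss
```

In practice the pseudo-mask would come from the cross-referenced object/part discovery described in the abstract; here it is simply assumed as an input.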