Selective Contrastive Learning for Weakly Supervised Affordance Grounding
August 11, 2025
Authors: WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
cs.AI
Abstract
Facilitating an entity's interaction with objects requires accurately
identifying parts that afford specific actions. Weakly supervised affordance
grounding (WSAG) seeks to imitate human learning from third-person
demonstrations, where humans intuitively grasp functional parts without needing
pixel-level annotations. To achieve this, grounding is typically learned using
a shared classifier across images from different perspectives, along with
distillation strategies that incorporate a part discovery process. However, since
affordance-relevant parts are not always easily distinguishable, models
primarily rely on classification, often focusing on common class-specific
patterns that are unrelated to affordance. To address this limitation, we move
beyond isolated part-level learning by introducing selective prototypical and
pixel contrastive objectives that adaptively learn affordance-relevant cues at
both the part and object levels, depending on the granularity of the available
information. Initially, we find the action-associated objects in both
egocentric (object-focused) and exocentric (third-person example) images by
leveraging CLIP. Then, by cross-referencing the discovered objects of
complementary views, we excavate the precise part-level affordance clues in
each perspective. By consistently learning to distinguish affordance-relevant
regions from affordance-irrelevant background context, our approach effectively
shifts activation from irrelevant areas toward meaningful affordance cues.
Experimental results demonstrate the effectiveness of our method. Code is
available at github.com/hynnsk/SelectiveCL.
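
The abstract's first step, finding the action-associated object in both egocentric and exocentric images with CLIP, could look roughly like the sketch below. This is a minimal illustration assuming the OpenAI CLIP package; the grid-of-crops scoring, the prompt template, and the function name clip_object_score_map are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch: coarse CLIP-based localization of the action-associated
# object in an egocentric or exocentric image. Not the paper's implementation.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_object_score_map(image: Image.Image, action: str, grid: int = 7):
    """Score a grid of crops against an action prompt; higher = more action-relevant."""
    w, h = image.size
    # Prompt template is an assumption; any action phrase could be substituted.
    text = clip.tokenize([f"a photo of an object used to {action}"]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        scores = torch.zeros(grid, grid)
        for i in range(grid):
            for j in range(grid):
                crop = image.crop((j * w // grid, i * h // grid,
                                   (j + 1) * w // grid, (i + 1) * h // grid))
                img_feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
                img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
                scores[i, j] = (img_feat @ text_feat.T).item()
    return scores  # threshold this map to obtain a coarse object mask in either view
```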
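The second step, cross-referencing the objects discovered in the two complementary views to excavate part-level affordance clues, might be approximated as below. The dense-feature matching and the function name cross_view_part_cues are assumptions used only to make the idea concrete.

```python
# Hypothetical sketch: mine part-level cues in the egocentric view by matching
# its object pixels against the exocentric object region. Illustrative only.
import torch
import torch.nn.functional as F

def cross_view_part_cues(ego_feat, exo_feat, ego_obj_mask, exo_obj_mask):
    """
    ego_feat, exo_feat: (C, H, W) dense features from a shared backbone.
    ego_obj_mask, exo_obj_mask: (H, W) boolean masks of the discovered objects.
    Returns an (H, W) map scoring how strongly each egocentric position matches
    the exocentric object region, used here as a proxy for part-level cues.
    """
    C, H, W = ego_feat.shape
    ego = F.normalize(ego_feat.flatten(1), dim=0)   # (C, H*W), unit-norm per pixel
    exo = F.normalize(exo_feat.flatten(1), dim=0)   # (C, H*W)
    exo_obj = exo[:, exo_obj_mask.flatten()]        # (C, N_exo) pixels inside the exo object
    sim = ego.T @ exo_obj                           # (H*W, N_exo) cosine similarities
    cue = sim.max(dim=1).values.reshape(H, W)       # best cross-view match per ego position
    return cue * ego_obj_mask.float()               # keep cues inside the ego object only
```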
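Finally, the selective prototypical and pixel contrastive objectives contrast affordance-relevant regions against the irrelevant background. The following is an InfoNCE-style stand-in written under stated assumptions (non-empty masks, a single prototype per side, temperature tau); the paper's exact formulation, weighting, and selection rule may differ.

```python
# Hypothetical sketch: prototype-level contrast between affordance-relevant
# pixels and background pixels. A generic InfoNCE-style loss, not the paper's.
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(pix, pos_mask, neg_mask, tau=0.1):
    """
    pix: (N, C) pixel embeddings; pos_mask / neg_mask: (N,) booleans marking
    affordance-relevant and background pixels (e.g., from the mined cues).
    Assumes both masks are non-empty. Pulls relevant pixels toward their
    prototype and pushes them away from the background prototype.
    """
    pix = F.normalize(pix, dim=-1)
    pos_proto = F.normalize(pix[pos_mask].mean(0, keepdim=True), dim=-1)  # (1, C)
    neg_proto = F.normalize(pix[neg_mask].mean(0, keepdim=True), dim=-1)  # (1, C)
    anchors = pix[pos_mask]                                               # (P, C)
    logits = torch.cat([anchors @ pos_proto.T, anchors @ neg_proto.T], dim=1) / tau
    labels = torch.zeros(anchors.size(0), dtype=torch.long, device=pix.device)
    return F.cross_entropy(logits, labels)
```

The "selective" aspect described in the abstract, switching between part-level and object-level supervision depending on the granularity of the available cues, would amount to choosing which masks feed this loss; the sketch only shows the contrastive form itself.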