

DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

May 7, 2025
作者: Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian
cs.AI

摘要

Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.
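
To make the decoupling described in the abstract concrete, here is a minimal PyTorch-style sketch of the idea. It is not the authors' implementation (see the linked repository for that): the specific branch choices below (value-to-value attention for the "content" features, query-key attention for the "context" features), the correlation-distillation loss, and all class and function names are illustrative assumptions.

```python
# Illustrative sketch only -- not the DeCLIP implementation.
# Assumes a ViT-style CLIP backbone whose patch tokens are fed into one
# attention layer that is split into a "content" and a "context" branch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledSelfAttention(nn.Module):
    """Splits one self-attention layer into a 'content' and a 'context' branch."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (B, N, C) patch tokens from the penultimate CLIP layer
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t):
            return t.view(B, N, self.num_heads, -1).transpose(1, 2)  # (B, H, N, d)

        q, k, v = map(split_heads, (q, k, v))

        # "Content" branch (assumed): value-to-value attention keeps each token
        # focused on itself and visually similar tokens; these features would be
        # aligned with CLIP embeddings of image crops for local discriminability.
        content_attn = (v @ v.transpose(-2, -1)) * self.scale
        content = (content_attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, C)

        # "Context" branch (assumed): standard query-key attention, whose
        # affinities can be supervised to match spatial correlations from a
        # vision foundation model such as DINO.
        context_attn = (q @ k.transpose(-2, -1)) * self.scale
        context = (context_attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, C)

        return self.proj(content), self.proj(context), context_attn


def correlation_distillation_loss(context_attn, dino_feats):
    """Hypothetical loss: push context affinities toward DINO feature correlations."""
    # dino_feats: (B, N, D) patch features from a frozen DINO model, assumed to
    # be aligned to the same N tokens as the CLIP branch.
    dino_feats = F.normalize(dino_feats, dim=-1)
    target = dino_feats @ dino_feats.transpose(-2, -1)   # (B, N, N) correlations
    pred = context_attn.softmax(-1).mean(dim=1)          # average over heads
    # The row-normalization choice here is arbitrary and for illustration only.
    return F.mse_loss(pred, target.softmax(-1))
```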
