FG-CLIP: Fine-Grained Visual and Textual Alignment
May 8, 2025
Authors: Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks
such as image-text retrieval and zero-shot classification but struggles with
fine-grained understanding due to its focus on coarse-grained short captions.
To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances
fine-grained understanding through three key innovations. First, we leverage
large multimodal models to generate 1.6 billion long caption-image pairs for
capturing global-level semantic details. Second, a high-quality dataset is
constructed with 12 million images and 40 million region-specific bounding
boxes aligned with detailed captions to ensure precise, context-rich
representations. Third, 10 million hard fine-grained negative samples are
incorporated to improve the model's ability to distinguish subtle semantic
differences. Corresponding training methods are meticulously designed for these
data. Extensive experiments demonstrate that FG-CLIP outperforms the original
CLIP and other state-of-the-art methods across various downstream tasks,
including fine-grained understanding, open-vocabulary object detection,
image-text retrieval, and general multimodal benchmarks. These results
highlight FG-CLIP's effectiveness in capturing fine-grained image details and
improving overall model performance. The related data, code, and models are
available at https://github.com/360CVGroup/FG-CLIP.