FG-CLIP: Fine-Grained Visual and Textual Alignment
May 8, 2025
Authors: Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks
such as image-text retrieval and zero-shot classification but struggles with
fine-grained understanding due to its focus on coarse-grained short captions.
To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances
fine-grained understanding through three key innovations. First, we leverage
large multimodal models to generate 1.6 billion long caption-image pairs for
capturing global-level semantic details. Second, a high-quality dataset is
constructed with 12 million images and 40 million region-specific bounding
boxes aligned with detailed captions to ensure precise, context-rich
representations. Third, 10 million hard fine-grained negative samples are
incorporated to improve the model's ability to distinguish subtle semantic
differences. Corresponding training methods are meticulously designed for these
data. Extensive experiments demonstrate that FG-CLIP outperforms the original
CLIP and other state-of-the-art methods across various downstream tasks,
including fine-grained understanding, open-vocabulary object detection,
image-text retrieval, and general multimodal benchmarks. These results
highlight FG-CLIP's effectiveness in capturing fine-grained image details and
improving overall model performance. The related data, code, and models are
available at https://github.com/360CVGroup/FG-CLIP.
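
The role of hard fine-grained negatives can be illustrated with a generic contrastive objective. What follows is a minimal sketch and not the authors' training code: the function names `cosine` and `hard_negative_loss`, the temperature of 0.07, and the single-image formulation are all assumptions made for illustration. It scores one image (or region) embedding against its matching caption and a set of near-miss captions, then applies a softmax cross-entropy with the matching caption as the target.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_loss(image_emb, pos_text_emb, neg_text_embs, temperature=0.07):
    """Cross-entropy over [positive caption, hard negative captions].

    Hypothetical helper for illustration only; the temperature and the
    per-image formulation are assumptions, not values from the paper.
    """
    # Index 0 holds the positive caption; the rest are hard negatives.
    logits = [cosine(image_emb, pos_text_emb) / temperature]
    logits += [cosine(image_emb, t) / temperature for t in neg_text_embs]

    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))

    # Negative log-probability assigned to the positive caption.
    return log_z - logits[0]
```

Under this sketch, a negative caption whose embedding sits close to the image embedding raises the loss sharply, which is the pressure that pushes a model to separate subtly different descriptions.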