FG-CLIP: Fine-Grained Visual and Textual Alignment

Abstract

Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, we construct a high-quality dataset of 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, we incorporate 10 million hard fine-grained negative samples to improve the model's ability to distinguish subtle semantic differences. Training methods are designed specifically for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The related data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.
