FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
October 13, 2025
Authors: Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin
cs.AI
Abstract
Fine-grained vision-language understanding requires precise alignment between
visual content and linguistic descriptions, a capability that remains limited
in current models, particularly in non-English settings. While models like CLIP
perform well on global alignment, they often struggle to capture fine-grained
details in object attributes, spatial relations, and linguistic expressions,
with limited support for bilingual comprehension. To address these challenges,
we introduce FG-CLIP 2, a bilingual vision-language model designed to advance
fine-grained alignment for both English and Chinese. Our approach leverages
rich fine-grained supervision, including region-text matching and long-caption
modeling, alongside multiple discriminative objectives. We further introduce
the Textual Intra-modal Contrastive (TIC) loss to better distinguish
semantically similar captions. Trained on a carefully curated mixture of
large-scale English and Chinese data, FG-CLIP 2 achieves strong bilingual
performance. To enable rigorous evaluation, we present a new benchmark for
Chinese multimodal understanding, featuring long-caption retrieval and bounding
box classification. Extensive experiments on 29 datasets across 8 tasks show
that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results
in both languages. We release the model, code, and benchmark to facilitate
future research on bilingual fine-grained alignment.
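
The abstract names two kinds of objective without giving formulas: CLIP-style global image-text alignment and a text-only contrastive term (TIC). The sketch below is one plausible reading, not the authors' formulation. The function names, the temperature of 0.07, and the choice to treat every other caption in the batch as a negative for the text-only term are all assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(rows, cols, temperature):
    """Cosine similarity between every row vector and every column vector, divided by temperature."""
    rows = [normalize(r) for r in rows]
    cols = [normalize(c) for c in cols]
    return [[sum(a * b for a, b in zip(r, c)) / temperature for c in cols]
            for r in rows]

def cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row_i)[i]; entry (i, i) is treated as the positive."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-to-text and text-to-image InfoNCE loss (image i pairs with text i)."""
    logits = similarity_matrix(image_emb, text_emb, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(transposed))

def text_intra_modal_loss(text_emb, temperature=0.07):
    """Hypothetical text-only term: each caption is its own positive and all
    other captions in the batch are negatives, so near-duplicate captions
    are penalized and pushed apart in embedding space."""
    logits = similarity_matrix(text_emb, text_emb, temperature)
    return cross_entropy_on_diagonal(logits)
```

Under these assumptions, two captions whose embeddings are nearly identical produce a larger text-only loss than two orthogonal ones, which matches the stated goal of separating semantically similar captions. How the actual TIC loss selects negatives and how it is weighted against the other objectives is not specified in the abstract.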