FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
October 13, 2025
Authors: Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin
cs.AI
Abstract
Fine-grained vision-language understanding requires precise alignment between
visual content and linguistic descriptions, a capability that remains limited
in current models, particularly in non-English settings. While models like CLIP
perform well on global alignment, they often struggle to capture fine-grained
details in object attributes, spatial relations, and linguistic expressions,
with limited support for bilingual comprehension. To address these challenges,
we introduce FG-CLIP 2, a bilingual vision-language model designed to advance
fine-grained alignment for both English and Chinese. Our approach leverages
rich fine-grained supervision, including region-text matching and long-caption
modeling, alongside multiple discriminative objectives. We further introduce
the Textual Intra-modal Contrastive (TIC) loss to better distinguish
semantically similar captions. Trained on a carefully curated mixture of
large-scale English and Chinese data, FG-CLIP 2 achieves strong bilingual
performance. To enable rigorous evaluation, we present a new benchmark for
Chinese multimodal understanding, featuring long-caption retrieval and bounding
box classification. Extensive experiments on 29 datasets across 8 tasks show
that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results
in both languages. We release the model, code, and benchmark to facilitate
future research on bilingual fine-grained alignment.
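
The abstract describes the Textual Intra-modal Contrastive (TIC) loss only at a high level. The sketch below is a minimal, hypothetical illustration of what such a text-to-text contrastive objective could look like, assuming an InfoNCE-style formulation in which each caption embedding is pulled toward a paired positive (e.g. a paraphrase) and pushed away from other captions in the batch. The function name `tic_loss`, the source of `pos_emb`, and the temperature value are all assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def tic_loss(text_emb: torch.Tensor,
             pos_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style intra-modal loss: text_emb[i] should match pos_emb[i].

    text_emb: (B, D) anchor caption embeddings.
    pos_emb:  (B, D) paired positive caption embeddings (e.g. paraphrases).
    """
    # L2-normalize so dot products are cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the positive pairs.
    logits = text_emb @ pos_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)

    # Symmetric cross-entropy, as in CLIP's image-text objective,
    # but applied purely within the text modality.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage: 4 caption embeddings of dim 512; positives are
    # slightly perturbed copies standing in for paraphrases.
    anchors = torch.randn(4, 512)
    positives = anchors + 0.1 * torch.randn(4, 512)
    print(tic_loss(anchors, positives).item())
```

Under this reading, TIC sharpens the text encoder's ability to separate semantically similar captions, since captions that differ only in fine-grained attributes must still land far apart in embedding space to keep the in-batch negatives distinguishable.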