Half-Truths Break Similarity-Based Retrieval
February 27, 2026
Authors: Bora Kargi, Arnas Uselis, Seong Joon Oh
cs.AI
Abstract
When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
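The evaluation protocol and training signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the functions `score`, `make_half_truth`, `half_truth_accuracy`, and `component_margin_loss` are hypothetical names, and `score(image, text)` stands in for any image-text similarity model such as a CLIP dual encoder.

```python
def make_half_truth(caption: str, wrong_detail: str) -> str:
    """Append a plausible but incorrect detail to an otherwise correct caption."""
    return f"{caption.rstrip('.')} {wrong_detail}"

def prefers_correct(score, image, caption: str, wrong_detail: str) -> bool:
    """True if the model scores the correct shorter caption above its half-truth.

    A well-behaved model should satisfy this for every wrong detail; the
    abstract reports that CLIP does so only 40.6% of the time on COCO.
    """
    half_truth = make_half_truth(caption, wrong_detail)
    return score(image, caption) > score(image, half_truth)

def half_truth_accuracy(score, examples) -> float:
    """Fraction of (image, caption, wrong_detail) triples where the model
    prefers the correct shorter description."""
    hits = sum(prefers_correct(score, img, cap, d) for img, cap, d in examples)
    return hits / len(examples)

def component_margin_loss(s_correct: float, s_foil: float, margin: float = 0.2) -> float:
    """Hinge-style penalty in the spirit of CS-CLIP's component supervision:
    the correct entity/relation unit should outscore its minimally edited
    foil by at least `margin`. The margin value here is an assumption."""
    return max(0.0, margin - (s_correct - s_foil))
```

A usage sketch: with a caption "a dog on a couch" and the wrong detail "next to a red ball", `make_half_truth` yields the half-truth text, and `half_truth_accuracy` measures how often a given scorer resists it across a dataset. In CS-CLIP the per-unit margin terms would be summed over the entity and relation foils of each caption during fine-tuning, while inference remains the standard dual-encoder dot product.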