Half-Truths Break Similarity-Based Retrieval
February 27, 2026
Authors: Bora Kargi, Arnas Uselis, Seong Joon Oh
cs.AI
Abstract
When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
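The evaluation protocol and training signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the functions `score`, `make_half_truth`, `half_truth_accuracy`, and `component_margin_loss` are hypothetical names, and `score(image, text)` stands in for any image-text similarity model such as a CLIP dual encoder.

```python
def make_half_truth(caption: str, wrong_detail: str) -> str:
    """Append a plausible but incorrect detail to an otherwise correct caption."""
    return f"{caption.rstrip('.')} {wrong_detail}"

def prefers_correct(score, image, caption: str, wrong_detail: str) -> bool:
    """True if the model scores the correct shorter caption above its half-truth.

    A well-behaved model should satisfy this for every wrong detail; the
    abstract reports that CLIP does so only 40.6% of the time on COCO.
    """
    half_truth = make_half_truth(caption, wrong_detail)
    return score(image, caption) > score(image, half_truth)

def half_truth_accuracy(score, examples) -> float:
    """Fraction of (image, caption, wrong_detail) triples where the model
    prefers the correct shorter description."""
    hits = sum(prefers_correct(score, img, cap, d) for img, cap, d in examples)
    return hits / len(examples)

def component_margin_loss(s_correct: float, s_foil: float, margin: float = 0.2) -> float:
    """Hinge-style penalty in the spirit of CS-CLIP's component supervision:
    the correct entity/relation unit should outscore its minimally edited
    foil by at least `margin`. The margin value here is an assumption."""
    return max(0.0, margin - (s_correct - s_foil))
```

A usage sketch: with a caption "a dog on a couch" and the wrong detail "next to a red ball", `make_half_truth` yields the half-truth text, and `half_truth_accuracy` measures how often a given scorer resists it across a dataset. In CS-CLIP the per-unit margin terms would be summed over the entity and relation foils of each caption during fine-tuning, while inference remains the standard dual-encoder dot product.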