

Half-Truths Break Similarity-Based Retrieval

February 27, 2026
Authors: Bora Kargi, Arnas Uselis, Seong Joon Oh
cs.AI

Abstract

When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
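The per-unit training signal described above can be sketched as a hinge-style ranking objective: for each caption unit (an entity or relation) and its minimally edited foil, the correct unit should score higher against the image than the foil by some margin. This is a minimal NumPy illustration under assumed cosine similarities and random toy embeddings; the function name `unit_foil_loss`, the margin value, and the averaging scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def unit_foil_loss(img, units, foils, margin=0.2):
    """Hinge loss encouraging each correct caption unit to score
    above its minimally edited foil for the same image.
    (Illustrative sketch; not the paper's exact objective.)"""
    losses = []
    for u, f in zip(units, foils):
        gap = cosine(img, u) - cosine(img, f)
        losses.append(max(0.0, margin - gap))
    return sum(losses) / len(losses)

rng = np.random.default_rng(0)
img = rng.normal(size=8)                                     # toy image embedding
units = [img + 0.1 * rng.normal(size=8) for _ in range(3)]   # well-grounded units
foils = [rng.normal(size=8) for _ in range(3)]               # mismatched foils
loss = unit_foil_loss(img, units, foils)
```

In this toy setup the grounded units sit close to the image embedding while the foils are unrelated, so the loss is small; a half-truth failure corresponds to a foil unit scoring above the correct one, which this objective penalizes.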