

Half-Truths Break Similarity-Based Retrieval

February 27, 2026
Authors: Bora Kargi, Arnas Uselis, Seong Joon Oh
cs.AI

Abstract

When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
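The per-unit training signal described above can be sketched as a hinge-style ranking objective: for each caption unit (an entity or relation) and its minimally edited foil, the correct unit should score higher against the image than the foil by some margin. This is a minimal NumPy illustration under assumed cosine similarities and random toy embeddings; the function name `unit_foil_loss`, the margin value, and the averaging scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def unit_foil_loss(img, units, foils, margin=0.2):
    """Hinge loss encouraging each correct caption unit to score
    above its minimally edited foil for the same image.
    (Illustrative sketch; not the paper's exact objective.)"""
    losses = []
    for u, f in zip(units, foils):
        gap = cosine(img, u) - cosine(img, f)
        losses.append(max(0.0, margin - gap))
    return sum(losses) / len(losses)

rng = np.random.default_rng(0)
img = rng.normal(size=8)                                     # toy image embedding
units = [img + 0.1 * rng.normal(size=8) for _ in range(3)]   # well-grounded units
foils = [rng.normal(size=8) for _ in range(3)]               # mismatched foils
loss = unit_foil_loss(img, units, foils)
```

In this toy setup the grounded units sit close to the image embedding while the foils are unrelated, so the loss is small; a half-truth failure corresponds to a foil unit scoring above the correct one, which this objective penalizes.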