Half-Truths Break Similarity-Based Retrieval

Abstract

When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
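To make the half-truth test and the unit-versus-foil idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes image and text embeddings are already available as plain lists of floats; the vectors here are hand-written toy values, not CLIP outputs. The function names `half_truth_accuracy` and `unit_foil_hinge`, the margin of 0.2, and the hinge form of the loss are all illustrative assumptions, since the abstract does not specify the exact fine-tuning objective.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def half_truth_accuracy(
    image_embs: Sequence[Vector],
    short_embs: Sequence[Vector],
    half_truth_embs: Sequence[Vector],
) -> float:
    """Fraction of examples where the correct short caption scores
    strictly higher than the same caption extended with a wrong detail."""
    wins = sum(
        cosine(img, short) > cosine(img, ht)
        for img, short, ht in zip(image_embs, short_embs, half_truth_embs)
    )
    return wins / len(image_embs)


def unit_foil_hinge(
    image_emb: Vector, unit_emb: Vector, foil_emb: Vector, margin: float = 0.2
) -> float:
    """Assumed stand-in for a unit-level objective: the loss is zero once
    the correct unit beats its foil by at least `margin`."""
    gap = cosine(image_emb, unit_emb) - cosine(image_emb, foil_emb)
    return max(0.0, margin - gap)


# Toy data (assumption): two image/caption pairs with 2-d embeddings.
images = [[1.0, 0.0], [1.0, 1.0]]
short_captions = [[1.0, 0.0], [1.0, 0.0]]
half_truths = [[1.0, 1.0], [1.0, 1.0]]

# The first pair ranks the short caption higher; the second is a half-truth failure.
print(half_truth_accuracy(images, short_captions, half_truths))  # 0.5
```

Because both functions only compare cosine scores from two independently computed embeddings, the sketch is consistent with the abstract's point that standard dual-encoder inference is preserved: nothing beyond the usual image and text embeddings is needed at test time.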
