Half-Truths Break Similarity-Based Retrieval

Abstract

When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
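
The half-truth test described above can be illustrated with a minimal sketch. It is not taken from the CS-CLIP repository. It assumes each example is a triple of precomputed embeddings (image, correct short description, half-truth description) and counts how often the correct description receives the higher cosine similarity. The names `half_truth_accuracy` and `foil_hinge_loss`, the margin value, and the toy vectors are all hypothetical. The abstract does not specify the training objective, so the hinge form shown here is only one plausible way to "score the correct unit above its foil".

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def half_truth_accuracy(triples: Sequence[Tuple[Vector, Vector, Vector]]) -> float:
    """Fraction of (image, correct, half_truth) triples in which the
    correct shorter description scores strictly higher than the half-truth."""
    wins = sum(
        cosine(img, correct) > cosine(img, half)
        for img, correct, half in triples
    )
    return wins / len(triples)


def foil_hinge_loss(sim_unit: float, sim_foil: float, margin: float = 0.2) -> float:
    """Hypothetical ranking loss (an assumption, not the paper's stated loss):
    zero once the correct unit beats its foil by at least `margin`."""
    return max(0.0, margin - (sim_unit - sim_foil))


# Toy 2-D embeddings, made up for illustration (not model outputs).
toy_triples = [
    ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]),  # correct description wins
    ([1.0, 0.0], [1.0, 1.0], [1.0, 0.0]),  # half-truth wins
]
accuracy = half_truth_accuracy(toy_triples)
print(f"half-truth accuracy on toy data: {accuracy:.2f}")
```

In practice the embeddings would come from the image and text encoders of a dual-encoder model. The comparison itself stays this simple because inference is unchanged: only the two similarity scores are needed.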
