Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

Abstract

Vision-language models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analysis of the InfoNCE loss further reveals a severe gradient imbalance, in which easily distinguishable pairs disproportionately dominate the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated using a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalty applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single- and multi-label linear probing.
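To make the two ideas in the abstract more tangible, the following is a minimal sketch, not the ConcretePlant or Cement implementation. It assumes a toy concreteness lexicon on a 1 to 5 scale (the values are invented for illustration), a hypothetical swap table for replacement words, and a hinge loss whose margin grows linearly with the concreteness of the replaced word. The direction and form of that scaling are assumptions; the abstract states only that the penalty is calibrated from psycholinguistic scores.

```python
# Toy concreteness lexicon, 1 (abstract) to 5 (concrete).
# Values are invented for illustration, not taken from any published norm.
TOY_CONCRETENESS = {"dog": 4.9, "red": 4.2, "ball": 4.8, "chases": 3.5, "idea": 1.6}


def most_concrete_token(caption, lexicon):
    """Return (index, token) of the most concrete token, or None if none is scored."""
    tokens = caption.lower().split()
    # Store -index so that ties on score resolve to the earliest token.
    scored = [(lexicon[t], -i) for i, t in enumerate(tokens) if t in lexicon]
    if not scored:
        return None
    _, neg_idx = max(scored)
    return -neg_idx, tokens[-neg_idx]


def make_negative(caption, lexicon, swaps):
    """Build a negative caption by swapping the most concrete token.

    `swaps` is a hypothetical replacement table; returns (caption, score) or None.
    """
    found = most_concrete_token(caption, lexicon)
    if found is None or found[1] not in swaps:
        return None
    idx, token = found
    tokens = caption.lower().split()
    tokens[idx] = swaps[token]
    return " ".join(tokens), lexicon[token]


def concreteness_margin_loss(sim_pos, sim_neg, concreteness,
                             base_margin=0.2, lo=1.0, hi=5.0):
    """Hinge loss with a margin scaled by concreteness (assumed linear scaling)."""
    scale = (concreteness - lo) / (hi - lo)   # maps [lo, hi] to [0, 1]
    margin = base_margin * scale
    return max(0.0, margin - (sim_pos - sim_neg))


negative, score = make_negative("A dog chases a red ball",
                                TOY_CONCRETENESS, {"dog": "cat"})
loss = concreteness_margin_loss(sim_pos=0.50, sim_neg=0.45, concreteness=score)
```

In this sketch the most concrete word ("dog") is the one replaced, and the resulting pair is penalized whenever the image-text similarity gap falls short of the concreteness-scaled margin. A real pipeline would draw scores from an established concreteness norm and compute similarities from the model's embeddings.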
