Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

Abstract

Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, showing particular weakness in word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms for deciding which linguistic elements to modify. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms produces more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. An analysis of the InfoNCE objective further reveals a severe gradient imbalance, in which easily distinguishable pairs disproportionately dominate the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated using a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalty applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single- and multi-label linear probing.
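The abstract does not state the Cement loss formula. Purely as an illustration of how a margin tied to a concreteness score could reweight individual pairs in an InfoNCE-style objective, the Python sketch below adds a per-negative margin before the softmax. The function name `margin_info_nce`, the parameters `base_margin` and `temperature`, the rescaling of concreteness scores to [0, 1], and the choice to give more concrete edits a larger margin are all assumptions made for this example, not details taken from the paper.

```python
import math


def margin_info_nce(pos_sim, neg_sims, neg_concreteness,
                    base_margin=0.2, temperature=0.07):
    """Illustrative InfoNCE-style loss for one image with per-negative margins.

    pos_sim          -- similarity between the image and its true caption
    neg_sims         -- similarities between the image and negative captions
    neg_concreteness -- hypothetical scores in [0, 1], one per negative,
                        e.g. a rescaled concreteness rating of the edited word
    """
    if len(neg_sims) != len(neg_concreteness):
        raise ValueError("need one concreteness score per negative")

    # Assumption (not from the paper): a more concrete edit earns a larger
    # additive margin, so that negative must be pushed further away.
    margins = [base_margin * c for c in neg_concreteness]

    # Positive logit first, then the margin-shifted negative logits.
    logits = [pos_sim / temperature]
    logits += [(s + m) / temperature for s, m in zip(neg_sims, margins)]

    # Numerically stable log-sum-exp, then -log softmax of the positive.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    sims = [0.7, 0.1]      # one hard negative, one easy negative
    scores = [1.0, 0.2]    # hypothetical rescaled concreteness scores
    print(margin_info_nce(0.8, sims, scores, base_margin=0.0))
    print(margin_info_nce(0.8, sims, scores, base_margin=0.2))
```

With `base_margin=0.0` the function reduces to the ordinary InfoNCE term for a single anchor. A positive margin raises the loss for negatives whose edited word is more concrete, so under this sketch those pairs must be separated by a wider gap before the penalty vanishes.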
