Concrete Jungle: Toward Concreteness-Guided Contrastive Negative Mining for Compositional Understanding

Abstract

Vision-language models demonstrate remarkable capabilities but often struggle with compositional reasoning, showing particular weakness in handling word order and attribute binding. This limitation arises from a scarcity of informative samples that distinguish subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms for deciding which linguistic elements to modify. Instead of engineering complex generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Experiments show that modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a stronger learning signal. Leveraging this principle, we propose ConcretePlant, which systematically isolates and manipulates perceptually grounded concepts. Analysis of the InfoNCE loss further reveals a severe gradient imbalance: easily distinguishable pairs carry disproportionate weight in optimization, restricting the bandwidth available for fine-grained learning. To resolve this degradation, we formulate the Cement loss using a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalty applied to individual training pairs. Comprehensive evaluations substantiate these claims. The integrated framework, Slipform, achieves state-of-the-art accuracy across diverse compositional reasoning benchmarks, general cross-modal retrieval, and single-label and multi-label linear probing.
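
The two ideas summarized above, editing the most concrete word to build a hard negative and scaling a margin by that word's concreteness, can be illustrated with a minimal sketch. This is not the ConcretePlant or Cement implementation: the concreteness ratings, the substitution table, the function names, and the assumption that the margin grows linearly with concreteness are all hypothetical choices made for illustration.

```python
# Hypothetical concreteness ratings on a 1-5 scale (illustrative values only,
# not taken from any published norm list or from the paper).
CONCRETENESS = {"dog": 4.9, "ball": 4.8, "red": 4.2, "chases": 3.5, "a": 1.5, "the": 1.4}

# Hypothetical substitution table used to build the negative caption.
SWAPS = {"dog": "cat", "ball": "box", "red": "blue"}


def make_hard_negative(caption):
    """Swap the most concrete swappable word; return (negative, its rating)."""
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if w in SWAPS]
    if not candidates:
        return None, 0.0
    # Pick the candidate with the highest concreteness rating.
    target = max(candidates, key=lambda i: CONCRETENESS.get(words[i], 0.0))
    score = CONCRETENESS.get(words[target], 0.0)
    words[target] = SWAPS[words[target]]
    return " ".join(words), score


def margin_loss(sim_pos, sim_neg, concreteness, base_margin=0.2, max_rating=5.0):
    """Hinge loss whose margin scales with the edited word's concreteness.

    Assumption: a linear scaling; the paper's actual calibration may differ.
    """
    margin = base_margin * concreteness / max_rating
    return max(0.0, margin - (sim_pos - sim_neg))


negative, score = make_hard_negative("the red dog chases a ball")
print(negative)  # the red cat chases a ball
print(margin_loss(sim_pos=0.30, sim_neg=0.25, concreteness=score))
```

In this sketch a pair that is already well separated (positive similarity far above negative similarity) contributes zero loss, so the training signal comes only from pairs that still violate their concreteness-dependent margin.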