Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

April 14, 2026
Authors: Eun Woo Im, Dhruv Madhwal, Vivek Gupta
cs.AI

Abstract

Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities in handling word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analysis of the InfoNCE loss further reveals a severe gradient imbalance, where easily distinguishable pairs disproportionately dominate the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated using a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single- and multi-label linear probing.
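
The abstract does not state the Cement loss formula, so the following is only an illustrative sketch of the general idea of a margin-based InfoNCE variant, not the authors' method. It assumes a single anchor with one positive and several negative similarity logits, concreteness scores already normalized to [0, 1], and an additive margin on each negative logit proportional to that score. The names `info_nce`, `negative_gradient_shares`, and `margin_scale`, as well as the direction and form of the margin, are hypothetical.

```python
import numpy as np

def _softmax_parts(pos_logit, neg_logits, margins=None):
    """Return (logits, log-sum-exp) with optional additive margins on negatives."""
    neg = np.asarray(neg_logits, dtype=float)
    if margins is not None:
        neg = neg + np.asarray(margins, dtype=float)
    logits = np.concatenate(([float(pos_logit)], neg))
    m = logits.max()  # subtract the max for numerical stability
    lse = m + np.log(np.exp(logits - m).sum())
    return logits, lse

def info_nce(pos_logit, neg_logits, margins=None):
    """InfoNCE loss for one anchor: -log softmax probability of the positive."""
    logits, lse = _softmax_parts(pos_logit, neg_logits, margins)
    return lse - logits[0]

def negative_gradient_shares(pos_logit, neg_logits, margins=None):
    """Gradient of the loss w.r.t. each negative logit (its softmax probability)."""
    logits, lse = _softmax_parts(pos_logit, neg_logits, margins)
    return np.exp(logits[1:] - lse)

# Hypothetical example values (not from the paper).
pos = 5.0
negs = np.array([4.5, 1.0, 0.5])          # first negative is the "hard" one
concreteness = np.array([0.9, 0.2, 0.1])  # assumed normalized scores in [0, 1]
margin_scale = 1.0                        # assumed hyperparameter
margins = margin_scale * concreteness

plain_loss = info_nce(pos, negs)
margin_loss = info_nce(pos, negs, margins)
plain_shares = negative_gradient_shares(pos, negs)
margin_shares = negative_gradient_shares(pos, negs, margins)
```

Under these assumptions, a larger margin on a negative raises both the loss and the share of gradient that negative receives, which is one way a per-pair penalty could be calibrated by a psycholinguistic score.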