Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models
May 26, 2025
Authors: Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou, Juanzi Li
cs.AI
Abstract
Benefiting from visual encoders contrastively trained on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across various visual perception tasks. However, the inherent limitations of contrastive learning on summarized descriptions fundamentally restrict models' capabilities in fine-grained reasoning, particularly in crucial scenarios such as geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning, using generation-based hard negatives created by perturbing diagram generation code, with text-based contrastive learning, using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected by caption similarity. We train CLIP with this hard negative learning method, obtaining MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that the resulting model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even at 7B parameters, it rivals powerful closed-source models such as GPT-4o. We further study how different negative sample construction methods and the number of negative samples affect the geometric reasoning performance of LMMs, yielding fruitful conclusions. The code and dataset are available at https://github.com/THU-KEG/MMGeoLM.
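
The abstract describes contrasting each sample against explicitly constructed hard negatives rather than only in-batch negatives. As a minimal sketch of that general idea (not the paper's released implementation), the following PyTorch function computes an InfoNCE-style loss in which each image embedding is scored against its positive caption and K hard negative captions; the function name, tensor shapes, and temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def hard_negative_info_nce(image_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    # image_emb:    (B, D)    image embeddings from the vision encoder
    # pos_text_emb: (B, D)    embeddings of the matching captions
    # neg_text_emb: (B, K, D) embeddings of K hard negative captions per image
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    pos_text_emb = F.normalize(pos_text_emb, dim=-1)
    neg_text_emb = F.normalize(neg_text_emb, dim=-1)

    # Similarity to the positive caption: shape (B, 1).
    pos_sim = (image_emb * pos_text_emb).sum(dim=-1, keepdim=True)
    # Similarity to each hard negative caption: shape (B, K).
    neg_sim = torch.einsum("bd,bkd->bk", image_emb, neg_text_emb)

    # Column 0 holds the positive, so the cross-entropy target is index 0.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

In this sketch, the rule-based and retrieval-based text negatives (and, analogously, the generation-based image negatives) described in the abstract would only change how the hard negative tensor is populated, not the form of the loss.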