Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

Abstract

Benefiting from visual encoders trained contrastively on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across a range of visual perception tasks. However, contrastive learning on summarized descriptions has inherent limitations that fundamentally restrict a model's capacity for meticulous reasoning, particularly in geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder. It combines two components: image-based contrastive learning, which uses generation-based hard negatives created by perturbing diagram generation code; and text-based contrastive learning, which uses rule-based negatives derived from modified geometric descriptions together with retrieval-based negatives selected by caption similarity. We train CLIP with this hard negative learning method, yielding MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that the resulting model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even at the 7B scale, it rivals powerful closed-source models such as GPT-4o. We further study how different negative sample construction methods and the number of negative samples affect the geometric reasoning performance of LMMs. The code and dataset are available at https://github.com/THU-KEG/MMGeoLM.
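To make the role of hard negatives concrete, the following is a minimal sketch of a contrastive objective with explicitly supplied negatives. It assumes an InfoNCE-style loss over cosine similarities, which the abstract does not specify; the function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical and are not taken from the MMGeoLM code.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not from the paper
) -> float:
    """InfoNCE-style loss for one anchor, one positive, and explicit negatives.

    Assumption: the loss is the negative log-probability of the positive
    under a softmax over temperature-scaled cosine similarities.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # The positive sits at index 0.
    return log_denominator - logits[0]


# Toy embeddings (hypothetical): a diagram, its matching caption,
# an unrelated caption, and a caption differing in one geometric detail.
image = [1.0, 0.0]
caption = [0.9, 0.1]
easy_negative = [-1.0, 0.0]
hard_negative = [0.8, 0.3]

easy_loss = hard_negative_info_nce(image, caption, [easy_negative])
hard_loss = hard_negative_info_nce(image, caption, [hard_negative])
print(f"easy negative loss: {easy_loss:.6f}")
print(f"hard negative loss: {hard_loss:.6f}")
```

In this toy setting the hard negative produces a much larger loss than the easy one, because its embedding lies close to the positive. This illustrates why negatives that differ from the positive only in fine details can give the encoder a stronger training signal than unrelated ones.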
