Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

Abstract

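Benefiting from visual encoders trained with contrastive learning on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across a variety of visual perception tasks. However, contrastive learning on summarized descriptions has inherent limitations that fundamentally restrict a model's capacity for fine-grained reasoning, particularly in important scenarios such as geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder. The framework combines image-based contrastive learning, which uses generation-based hard negatives created by perturbing diagram generation code, with text-based contrastive learning, which uses rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected by caption similarity. We train CLIP with our hard negative learning method, named MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that the resulting model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even at a size of 7B, it can rival powerful closed-source models such as GPT-4o. We further study how different negative sample construction methods and the number of negative samples affect the geometric reasoning performance of LMMs, and draw meaningful conclusions from the analysis. The code and dataset are available at https://github.com/THU-KEG/MMGeoLM.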
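
To make the idea of a hard negative concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss in plain Python. It is an illustration under stated assumptions, not the implementation from the paper or its repository: the function names, the two-dimensional toy embeddings, and the temperature of 0.07 are all hypothetical. It only shows that a negative lying close to the anchor (a "hard" negative) yields a larger loss, and hence a stronger training signal, than a negative that is trivially far away.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: negative log-softmax score of the positive
    among the positive and the supplied negatives.
    The temperature value is an assumed default, not taken from the paper."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Hypothetical toy embeddings: one diagram image and candidate captions.
image = [1.0, 0.0]
caption = [0.9, 0.1]          # matching caption (positive)
easy_negative = [-1.0, 0.0]   # unrelated caption, far from the image
hard_negative = [0.8, 0.3]    # slightly altered caption, close to the image

loss_easy = contrastive_loss(image, caption, [easy_negative])
loss_hard = contrastive_loss(image, caption, [hard_negative])
print(f"easy negative loss: {loss_easy:.6f}")
print(f"hard negative loss: {loss_hard:.6f}")
```

In this toy setting the easy negative contributes almost nothing to the loss, while the hard negative forces the encoder to separate two nearly identical candidates, which is the general motivation for constructing negatives by small perturbations.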
