G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Abstract

Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation, which has encouraged extensive research on their application to mathematical problem solving. However, current work has focused largely on text-based mathematical problems, with limited investigation of problems involving geometric information. To address this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as their unique geometric logical form and geometric scalability) and the capacity of textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Using the Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
