G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Abstract

Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation, which has encouraged extensive research on their application to mathematical problem solving. However, current work has largely focused on text-based mathematical problems, with limited investigation into problems involving geometric information. To address this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the characteristics of geometric problems (such as their unique geometric logical form and geometric scalability) and the capacity of textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Using the Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
