G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Abstract

Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation, which has encouraged extensive research on their application to mathematical problem solving. However, current work has focused largely on text-based mathematical problems, with limited investigation of problems involving geometric information. To address this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as their unique geometric logical form and geometric scalability) and the capacity of textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Using Geo170K, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
