G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
December 18, 2023
Authors: Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong
cs.AI
Abstract
Large language models (LLMs) have shown remarkable proficiency in human-level
reasoning and generation, which has encouraged extensive research on
their application in mathematical problem solving. However, current work has
been largely focused on text-based mathematical problems, with limited
investigation into problems involving geometric information. To address this gap,
we aim to enable LLMs to solve geometric problems by understanding image input.
We first analyze the limitations of current Multimodal Large Language Models
(MLLMs) in this area: they struggle to accurately comprehend basic geometric
elements and their relationships. To overcome these challenges, we take
advantage of the unique characteristics of geometric problems (such as unique
geometric logical form and geometric scalability) and the capacity of the
textual LLMs to build an enriched multimodal geometry dataset based on existing
data. The augmented dataset, Geo170K, contains more than 170K geometric
image-caption and question-answer pairs. Utilizing our constructed Geo170K
dataset, we develop G-LLaVA, which demonstrates exceptional performance in
solving geometric problems, significantly outperforming GPT-4-V on the
MathVista benchmark with only 7B parameters.