G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
December 18, 2023
Authors: Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong
cs.AI
Abstract
Large language models (LLMs) have shown remarkable proficiency in human-level
reasoning and generation capabilities, which encourages extensive research on
their application in mathematical problem solving. However, current work has
largely focused on text-based mathematical problems, with limited
investigation into problems involving geometric information. To address this gap,
we aim to enable LLMs to solve geometric problems by understanding image input.
We first analyze the limitations of current Multimodal Large Language Models
(MLLMs) in this area: they struggle to accurately comprehend basic geometric
elements and their relationships. To overcome these challenges, we take
advantage of the unique characteristics of geometric problems (such as their
unique geometric logical form and geometric scalability) and the capacity of the
textual LLMs to build an enriched multimodal geometry dataset based on existing
data. The augmented dataset, Geo170K, contains more than 170K geometric
image-caption and question-answer pairs. Utilizing our constructed Geo170K
dataset, we develop G-LLaVA, which demonstrates exceptional performance in
solving geometric problems, significantly outperforming GPT-4-V on the
MathVista benchmark with only 7B parameters.