GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

August 21, 2024
Authors: Jonathan Roberts, Kai Han, Samuel Albanie
cs.AI

Abstract

Large multimodal models (LMMs) have exhibited proficiency across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area in which LMMs show potential is graph analysis, specifically the tasks an analyst might typically perform when interpreting figures, such as estimating the mean, intercepts, or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB comprises 2170 questions covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB and find it to be a challenging benchmark, with the highest-performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.
