GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
August 21, 2024
Authors: Jonathan Roberts, Kai Han, Samuel Albanie
cs.AI
Abstract
Large multimodal models (LMMs) have exhibited proficiencies across many
visual tasks. Although numerous well-known benchmarks exist to evaluate model
performance, they increasingly have insufficient headroom. As such, there is a
pressing need for a new generation of benchmarks challenging enough for the
next generation of LMMs. One area in which LMMs show potential is graph analysis,
specifically, the tasks an analyst might typically perform when interpreting
figures, such as estimating the mean, intercepts, or correlations of functions
and data series. In this work, we introduce GRAB, a graph analysis benchmark,
fit for current and future frontier LMMs. Our benchmark is entirely synthetic,
ensuring high-quality, noise-free questions. GRAB comprises 2170
questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on
GRAB, finding it to be a challenging benchmark, with the highest performing
model attaining a score of just 21.7%. Finally, we conduct various ablations to
investigate where the models succeed and struggle. We release GRAB to encourage
progress in this important, growing domain.
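To give a flavor of the graph properties the abstract mentions, the sketch below generates a synthetic, noise-free data series and computes the kind of ground-truth values (mean, intercept, correlation) a benchmark question could ask about. This is a hypothetical illustration only, not GRAB's actual generation pipeline; the linear function `y = 2x + 1` and the sample grid are assumptions for demonstration.

```python
import numpy as np

# Hypothetical example (not GRAB's pipeline): a synthetic series whose
# graph properties are known exactly because there is no noise.
x = np.linspace(0.0, 10.0, 50)   # sample grid including x = 0
y = 2.0 * x + 1.0                # assumed linear function y = 2x + 1

mean_y = y.mean()                # mean of the data series
intercept = y[0]                 # value at x = 0 (y-intercept)
corr = np.corrcoef(x, y)[0, 1]   # Pearson correlation of x and y

print(mean_y, intercept, corr)   # mean 11.0, intercept 1.0, corr 1.0
```

Because the data are generated analytically, the ground-truth answers are exact, which is what makes fully synthetic questions noise-free to grade.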