CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

Abstract

Recent years have seen significant interest in developing large multimodal models (LMMs) capable of performing a variety of visual reasoning and understanding tasks. This has led to the introduction of multiple benchmarks that evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language, which is spoken by a population of over 400 million people. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, in order to evaluate generalizability across a broad range of scenarios. CAMEL-Bench contains around 29,036 questions filtered from a larger pool of samples, with quality manually verified by native speakers to ensure reliable model assessment. We evaluate both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of only 62%. Our benchmark and evaluation scripts are open-sourced.
