Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Grounded Evaluation of Vision-Language Models

Abstract

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, which obscures critical cognitive weaknesses and provides little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series and the first cognitively grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. It is built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, a design intended to ensure scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models reach high performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This indicates that current general multimodal proficiency masks deeper limitations at specific cognitive levels. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at https://github.com/qcri/Almieyar-Oryx-BloomBench.