Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Inspired Evaluation of Vision-Language Models

Abstract

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series and the first bilingual (English-Arabic) multimodal benchmark for VLMs grounded in human cognition. Built on Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. It is constructed with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, which together ensure scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models reach a high performance ceiling in semantic understanding, they struggle substantially with factual recall and creative synthesis. This indicates that current general multimodal proficiency masks deeper limitations at specific cognitive levels. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.