Almieyar-Oryx-BloomBench: A Cognitively Inspired Bilingual Multimodal Evaluation Benchmark for Vision-Language Models

Abstract

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, which obscures critical cognitive weaknesses and offers little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series and the first bilingual (English-Arabic) multimodal benchmark for VLMs grounded in human cognition. Based on Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Using this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: state-of-the-art models reach a high performance ceiling in semantic understanding, yet they struggle substantially with factual recall and creative synthesis. This indicates that the general multimodal proficiency of current models masks deeper limitations at specific cognitive levels. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.