Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Abstract

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, which obscures critical cognitive weaknesses and offers little guidance for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series and the first bilingual (English-Arabic) multimodal benchmark for VLMs grounded in a model of human cognition. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. It is built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, a design aimed at scalability, cultural inclusivity, and linguistic fidelity. Using this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: state-of-the-art models perform strongly on semantic understanding but struggle substantially with factual recall and creative synthesis. This indicates that current general multimodal proficiency masks deeper limitations at specific cognitive levels. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.