Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Grounded Evaluation of Vision-Language Models

Abstract

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series and the first bilingual (English-Arabic) multimodal benchmark for VLMs grounded in human cognition. Built on Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. The benchmark is built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol to ensure scalability, cultural inclusivity, and linguistic fidelity. Using this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while these models reach strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This shows that current general multimodal proficiency masks deeper limitations at specific cognitive levels. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.