HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Abstract

Large multimodal models (LMMs) now excel on many vision-language benchmarks, but they still struggle with human-centered criteria such as fairness, ethics, empathy, and inclusivity, which are key to aligning with human values. We introduce HumaniBench, a holistic benchmark of 32K real-world image-question pairs, annotated via a scalable GPT-4o-assisted pipeline and exhaustively verified by domain experts. HumaniBench evaluates seven Human-Centered AI (HCAI) principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness. It does so across seven diverse tasks, including open- and closed-ended visual question answering (VQA), multilingual QA, visual grounding, empathetic captioning, and robustness tests. Benchmarking 15 state-of-the-art LMMs (open and closed source) reveals that proprietary models generally lead, though robustness and visual grounding remain weak points. Some open-source models also struggle to balance accuracy with adherence to human-aligned principles. HumaniBench is the first benchmark purpose-built around HCAI principles. It provides a rigorous testbed for diagnosing alignment gaps and guiding LMMs toward behavior that is both accurate and socially responsible. Dataset, annotation prompts, and evaluation code are available at: https://vectorinstitute.github.io/HumaniBench

