HumaniBench: A Human-Centric Framework for Evaluating Large Multimodal Models

Abstract

Large multimodal models (LMMs) now excel on many vision-language benchmarks, but they still struggle with human-centered criteria such as fairness, ethics, empathy, and inclusivity, which are key to aligning with human values. We introduce HumaniBench, a holistic benchmark of 32K real-world image-question pairs, annotated via a scalable GPT-4o-assisted pipeline and exhaustively verified by domain experts. HumaniBench evaluates seven Human-Centered AI (HCAI) principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness. It does so across seven diverse tasks, including open- and closed-ended visual question answering (VQA), multilingual QA, visual grounding, empathetic captioning, and robustness tests. Benchmarking 15 state-of-the-art LMMs (open- and closed-source) reveals that proprietary models generally lead, though robustness and visual grounding remain weak points. Some open-source models also struggle to balance accuracy with adherence to human-aligned principles. HumaniBench is the first benchmark purpose-built around HCAI principles. It provides a rigorous testbed for diagnosing alignment gaps and guiding LMMs toward behavior that is both accurate and socially responsible. The dataset, annotation prompts, and evaluation code are available at: https://vectorinstitute.github.io/HumaniBench