

HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

May 16, 2025
作者: Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya
cs.AI

Abstract

Large multimodal models (LMMs) now excel on many vision-language benchmarks; however, they still struggle with human-centered criteria such as fairness, ethics, empathy, and inclusivity, which are key to aligning with human values. We introduce HumaniBench, a holistic benchmark of 32K real-world image-question pairs, annotated via a scalable GPT-4o-assisted pipeline and exhaustively verified by domain experts. HumaniBench evaluates seven Human-Centered AI (HCAI) principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness, across seven diverse tasks, including open- and closed-ended visual question answering (VQA), multilingual QA, visual grounding, empathetic captioning, and robustness tests. Benchmarking 15 state-of-the-art LMMs (open- and closed-source) reveals that proprietary models generally lead, though robustness and visual grounding remain weak points. Some open-source models also struggle to balance accuracy with adherence to human-aligned principles. HumaniBench is the first benchmark purpose-built around HCAI principles. It provides a rigorous testbed for diagnosing alignment gaps and guiding LMMs toward behavior that is both accurate and socially responsible. Dataset, annotation prompts, and evaluation code are available at: https://vectorinstitute.github.io/HumaniBench
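Reporting results per HCAI principle, as the benchmark does, amounts to a simple aggregation once each model response has been graded. Below is a minimal, hypothetical sketch of that aggregation step: it assumes a JSONL file of already-graded responses with `principle` and `correct` fields, which are illustrative only; the released evaluation code at the project URL defines HumaniBench's actual schema and metrics.

```python
import json
from collections import defaultdict


def per_principle_accuracy(path: str) -> dict[str, float]:
    """Aggregate graded responses into an accuracy score per HCAI principle.

    Assumes one JSON object per line, e.g.
    {"principle": "fairness", "task": "closed_vqa", "correct": true}
    (hypothetical fields; the official evaluation code defines the real format).
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            principle = record["principle"]
            totals[principle] += 1
            hits[principle] += int(record["correct"])
    return {p: hits[p] / totals[p] for p in totals}


if __name__ == "__main__":
    scores = per_principle_accuracy("graded_responses.jsonl")  # hypothetical file
    for principle, acc in sorted(scores.items()):
        print(f"{principle:20s} {acc:.3f}")
```

Such a per-principle breakdown is what surfaces the gaps the abstract describes, e.g. strong aggregate accuracy masking weak robustness or visual-grounding scores.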
