HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Abstract

Large multimodal models (LMMs) now excel on many vision-language benchmarks, but they still struggle with human-centered criteria such as fairness, ethics, empathy, and inclusivity, which are key to aligning with human values. We introduce HumaniBench, a holistic benchmark of 32K real-world image-question pairs, annotated via a scalable GPT-4o-assisted pipeline and exhaustively verified by domain experts. HumaniBench evaluates seven Human-Centered AI (HCAI) principles (fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness) across seven diverse tasks, including open- and closed-ended visual question answering (VQA), multilingual QA, visual grounding, empathetic captioning, and robustness tests. Benchmarking 15 state-of-the-art LMMs (open- and closed-source) reveals that proprietary models generally lead, though robustness and visual grounding remain weak points. Some open-source models also struggle to balance accuracy with adherence to human-aligned principles. HumaniBench is the first benchmark purpose-built around HCAI principles. It provides a rigorous testbed for diagnosing alignment gaps and guiding LMMs toward behavior that is both accurate and socially responsible. The dataset, annotation prompts, and evaluation code are available at: https://vectorinstitute.github.io/HumaniBench
