
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

November 26, 2025
Authors: Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni, Kaishen Wang, Qi He, Zeying Zhu, Chenxi Liu, Ruibo Chen, Tong Zheng, Yanshuo Chen, Xiyao Wang, Renrui Zhang, Wenhu Chen, Heng Huang
cs.AI

Abstract

Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria--especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
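The three metrics described above can be sketched in code. The data layout, metric names, and definitions below are illustrative assumptions, not the paper's exact formulas: here "pluralistic adherence" is taken as the fraction of response pairs where the judge matches the human preference on every criterion at once, and "conflict recognition" as the fraction of human-conflicting pairs where the judge's per-criterion verdicts also differ.

```python
# Hedged sketch of criterion-level judge metrics in the spirit of Multi-Crit.
# Verdicts are "A" or "B" per (response pair, criterion); structures and
# metric definitions are assumptions for illustration.

def pluralistic_adherence(judge, human):
    """Fraction of pairs where the judge agrees with the human
    preference on every criterion simultaneously."""
    hits = sum(all(judge[p][c] == human[p][c] for c in human[p]) for p in human)
    return hits / len(human)

def conflict_recognition(judge, human):
    """Among pairs whose human preferences conflict across criteria,
    fraction where the judge's verdicts also switch across criteria."""
    conflicts = [p for p in human if len(set(human[p].values())) > 1]
    if not conflicts:
        return 0.0
    hits = sum(len(set(judge[p].values())) > 1 for p in conflicts)
    return hits / len(conflicts)

# Toy data: hypothetical criteria names, two response pairs.
human = {
    "pair1": {"helpfulness": "A", "faithfulness": "B"},  # conflicting pair
    "pair2": {"helpfulness": "B", "faithfulness": "B"},
}
judge = {
    "pair1": {"helpfulness": "A", "faithfulness": "A"},  # misses the switch
    "pair2": {"helpfulness": "B", "faithfulness": "B"},
}

print(pluralistic_adherence(judge, human))  # 0.5: only pair2 fully matches
print(conflict_recognition(judge, human))   # 0.0: judge never switches on pair1
```

A low conflict-recognition score under this definition captures the failure mode the abstract highlights: a judge that applies one holistic preference regardless of the criterion it is asked to follow.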