

Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

November 26, 2025
作者: Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni, Kaishen Wang, Qi He, Zeying Zhu, Chenxi Liu, Ruibo Chen, Tong Zheng, Yanshuo Chen, Xiyao Wang, Renrui Zhang, Wenhu Chen, Heng Huang
cs.AI

Abstract

Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction-following ability and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
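The abstract names three metrics (pluralistic adherence, criterion-switching flexibility, conflict recognition) but does not define them. The following minimal Python sketch illustrates one plausible way criterion-level pairwise judgments could be scored against human annotations; all names, data structures, and scoring rules below are illustrative assumptions, not the paper's actual metric definitions.

```python
# Hypothetical sketch of criterion-level judgment scoring, loosely inspired by
# the metrics Multi-Crit describes. The exact definitions are not given in the
# abstract; everything here is an illustrative assumption.

from dataclasses import dataclass


@dataclass
class PairJudgment:
    """Per-criterion preferences ('A' or 'B') for one response pair."""
    judge: dict[str, str]   # criterion -> response preferred by the LMM judge
    human: dict[str, str]   # criterion -> response preferred by human annotators


def pluralistic_adherence(samples: list[PairJudgment]) -> float:
    """Fraction of pairs where the judge matches humans on *every* criterion."""
    ok = sum(
        all(s.judge.get(c) == h for c, h in s.human.items()) for s in samples
    )
    return ok / len(samples)


def conflict_recognition(samples: list[PairJudgment]) -> float:
    """Among pairs whose human labels disagree across criteria (a criterion-level
    preference conflict), fraction where the judge's verdict also switches."""
    conflicted = [s for s in samples if len(set(s.human.values())) > 1]
    if not conflicted:
        return 0.0
    hit = sum(len(set(s.judge.values())) > 1 for s in conflicted)
    return hit / len(conflicted)


# Toy usage: one pair judged on two criteria, with a human-labelled conflict.
sample = PairJudgment(
    judge={"helpfulness": "A", "visual_grounding": "A"},
    human={"helpfulness": "A", "visual_grounding": "B"},
)
print(pluralistic_adherence([sample]))  # 0.0: judge misses the grounding flip
print(conflict_recognition([sample]))   # 0.0: judge never switches preference
```

Under this toy formulation, a judge that always reports one holistic winner regardless of the criterion scores zero on conflict recognition, which matches the abstract's observation that holistic judgment signals fail to generalize to criterion-level judgment.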