CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
August 5, 2025
Authors: Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, Kai Chen
cs.AI
Abstract
Answer verification is crucial not only for evaluating large language models
(LLMs) by matching their unstructured outputs against standard answers, but
also for serving as a reward model to guide LLM optimization. Most evaluation
frameworks rely on regex-based matching or employ general-purpose LLMs for
answer verification, which demands extensive, repetitive customization of regex
rules or evaluation prompts. Two fundamental limitations persist in current
methodologies: 1) the absence of comprehensive benchmarks that systematically
evaluate verification capabilities across different LLMs; and 2) the nascent
stage of verifier development, where existing approaches lack both the
robustness to handle complex edge cases and the generalizability across
different domains. In this work, we develop CompassVerifier, an accurate and
robust lightweight verifier model for evaluation and outcome reward. It
demonstrates multi-domain competency spanning math, knowledge, and diverse
reasoning tasks, with the capability to process various answer types, including
multi-subproblems, formulas, and sequence answers, while effectively
identifying abnormal/invalid responses. We introduce the VerifierBench
benchmark, comprising model outputs collected from multiple data sources and
augmented through manual analysis of meta-error patterns, to enhance CompassVerifier. We
anticipate that CompassVerifier and VerifierBench will facilitate answer
verification, evaluation protocols, and reinforcement learning research. Code
and dataset are available at https://github.com/open-compass/CompassVerifier.
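To make the abstract's contrast concrete, the sketch below shows the kind of regex-based answer matching that evaluation frameworks commonly use, and why it is brittle: semantically equivalent answers in a different surface form are marked wrong. This is a minimal illustration, not code from the CompassVerifier repository; the function name and regex pattern are hypothetical.

```python
import re

def regex_verify(model_output: str, gold: str) -> bool:
    """Hypothetical rule-based verifier: extract the final stated number
    from a model's free-form output and compare it to the gold answer."""
    # Look for a pattern like "answer is 42" or "= 42" (a typical hand-written rule).
    match = re.search(r"(?:answer is|=)\s*([-+]?\d+(?:\.\d+)?)", model_output)
    if match is None:
        return False  # unparseable outputs count as wrong -> false negatives
    return match.group(1).strip() == gold.strip()

# An exact surface match passes, but an equivalent fraction does not:
print(regex_verify("The answer is 0.5", "0.5"))  # matches
print(regex_verify("The answer is 1/2", "0.5"))  # fails, though mathematically equal
```

Rules like this must be re-customized per benchmark and per answer format (fractions, formulas, multi-subproblem answers), which is the gap a learned, domain-general verifier such as CompassVerifier aims to close.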