CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards
July 12, 2025
Authors: Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, Kai Chen
cs.AI
Abstract
Recently, the role of LLM-as-judge in evaluating large language models has
gained prominence. However, current judge models suffer from narrow
specialization and limited robustness, undermining their capacity for
comprehensive evaluations. In this work, we present CompassJudger-2, a novel
generalist judge model that overcomes these limitations via a task-driven,
multi-domain data curation strategy. Central to our approach is supervising
judgment tasks with verifiable rewards, guiding intrinsic critical reasoning
through rejection sampling to foster robust, generalizable judgment
capabilities. We introduce a refined learning objective with margin policy
gradient loss to enhance performance. Empirically, CompassJudger-2 achieves
superior results across multiple judge and reward benchmarks, and our 7B model
achieves judgment accuracy competitive with significantly larger models such as
DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a
comprehensive benchmark evaluating cross-domain judgment accuracy and rank
consistency to standardize judge model evaluation. These contributions advance
robust, scalable LLM judgment and establish new performance and evaluation
standards.
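
The abstract names two training ingredients: rejection sampling of judge critiques filtered by a verifiable reward, and a margin policy gradient loss. The PyTorch sketch below illustrates one plausible reading of both; the binary reward, the filtering helper, and the exact hinge-style form of the margin loss are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (assumed, not the paper's code) of the two ideas named in the
# abstract: (1) rejection sampling of judge critiques kept only when a
# verifiable reward fires, and (2) a margin-style policy gradient loss that
# separates reward-1 critiques from reward-0 critiques.
import torch
import torch.nn.functional as F


def verifiable_reward(predicted_verdict: str, gold_verdict: str) -> float:
    """Binary verifiable reward: 1 if the judge's final verdict matches the
    ground-truth preference label, else 0. (Assumed reward design.)"""
    return 1.0 if predicted_verdict == gold_verdict else 0.0


def rejection_sample(samples, gold_verdict):
    """Keep only sampled critiques whose final verdict earns reward 1.
    `samples` is a list of (critique_text, verdict) pairs drawn from the model."""
    return [s for s in samples if verifiable_reward(s[1], gold_verdict) == 1.0]


def margin_policy_gradient_loss(logp_chosen, logp_rejected, margin=0.1):
    """One plausible margin objective: push the sequence log-probability of
    reward-1 (chosen) critiques above reward-0 (rejected) ones by `margin`.

    logp_chosen, logp_rejected: per-sequence summed token log-probs, shape (batch,).
    """
    return F.relu(margin - (logp_chosen - logp_rejected)).mean()


# Toy usage with fake log-probabilities standing in for model outputs.
logp_chosen = torch.tensor([-12.0, -9.5], requires_grad=True)
logp_rejected = torch.tensor([-11.0, -13.0])
loss = margin_policy_gradient_loss(logp_chosen, logp_rejected)
loss.backward()
print(f"margin loss = {loss.item():.3f}")
```

In this reading, rejection sampling supplies the "chosen" side of each pair (critiques whose verdict is verifiably correct), while the margin term supplies a gradient only when a correct critique is not yet sufficiently more likely than an incorrect one.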