

CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

July 12, 2025
Authors: Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, Kai Chen
cs.AI

Abstract

Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
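The abstract's two core training ideas, rejection sampling against verifiable rewards and a margin-based policy objective, can be illustrated with a minimal sketch. All function names and the binary reward are assumptions for illustration; the paper's actual objective and reward design may differ.

```python
# Illustrative sketch (hypothetical names): verifiable-reward rejection
# sampling plus a hinge-style margin term on log-probabilities.

def verifiable_reward(verdict: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the judge's verdict matches the verifiable
    label (e.g. which response is better), else 0.0. Assumed design."""
    return 1.0 if verdict.strip().lower() == ground_truth.strip().lower() else 0.0

def rejection_sample(candidates: list[dict], ground_truth: str) -> list[dict]:
    """Keep only sampled critiques whose final verdict earns a positive
    reward; the survivors supervise the judge's critical reasoning."""
    return [c for c in candidates
            if verifiable_reward(c["verdict"], ground_truth) > 0]

def margin_policy_gradient_loss(logp_chosen: float,
                                logp_rejected: float,
                                margin: float = 0.5) -> float:
    """Hinge-style margin loss: zero once the chosen judgment's
    log-probability exceeds the rejected one's by at least `margin`."""
    return max(0.0, margin - (logp_chosen - logp_rejected))

candidates = [
    {"critique": "Response A is grounded in the prompt.", "verdict": "A"},
    {"critique": "Response B seems longer, so B.", "verdict": "B"},
]
kept = rejection_sample(candidates, ground_truth="A")
```

Here `kept` retains only the first candidate, and the margin loss vanishes when the model already prefers the accepted judgment by the required gap.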