CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

Abstract

Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model achieves judgment accuracy competitive with significantly larger models such as DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
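
As a rough illustration of how a verifiable reward can drive rejection sampling for a judge model, the sketch below keeps only those sampled critiques whose final verdict matches a known ground-truth label. It is not the authors' implementation: the `Verdict: A` / `Verdict: B` output format, the function names, and the `generate` callable (a stand-in for sampling from the judge model) are all assumptions made for this example.

```python
import re
from typing import Callable, List, Optional


def extract_verdict(critique: str) -> Optional[str]:
    """Return the last 'A' or 'B' verdict stated in a critique, or None.

    The 'Verdict: X' format is an assumed convention for this sketch.
    """
    matches = re.findall(r"Verdict:\s*([AB])\b", critique)
    return matches[-1] if matches else None


def verifiable_reward(critique: str, gold: str) -> float:
    """Binary reward: 1.0 if the final verdict equals the gold label, else 0.0."""
    return 1.0 if extract_verdict(critique) == gold else 0.0


def rejection_sample(
    generate: Callable[[str], str],  # hypothetical sampler wrapping the judge model
    prompt: str,
    gold: str,
    num_samples: int = 4,
) -> List[str]:
    """Draw several critiques and keep only those with a correct final verdict."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    return [c for c in candidates if verifiable_reward(c, gold) == 1.0]
```

The critiques that survive filtering could then serve as training targets, since their reasoning ends in a verdict that can be checked automatically against the label.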
