CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

Abstract

The role of LLM-as-judge in evaluating large language models (LLMs) has recently gained prominence. However, current judge models suffer from narrow specialization and limited robustness, which undermines their capacity for comprehensive evaluation. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations through a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards and guiding intrinsic critical reasoning through rejection sampling, which fosters robust, generalizable judgment capabilities. To further enhance performance, we introduce a refined learning objective that includes a margin policy gradient loss. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model reaches judgment accuracy competitive with significantly larger models such as DeepSeek-V3 and Qwen3-235B-A22B. Additionally, to standardize judge model evaluation, we propose JudgerBenchV2, a comprehensive benchmark that evaluates cross-domain judgment accuracy and rank consistency. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
