Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skills

Abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, and a unified mechanism for integrating all of these types of evidence remains unexplored. To address this gap, we propose the Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface for orchestrating heterogeneous resources, dynamically selecting and aggregating the evidence that each input requires. This approach lets the reward model move beyond static evaluation while keeping its judgments consistent and transparent across diverse tasks. Extensive experiments on reward benchmarks and on downstream applications, including best-of-N selection and reinforcement learning, show that Skill-RM consistently outperforms traditional judge baselines. These findings suggest that Skill-RM not only offers a unified solution for reward modeling but also achieves superior performance through strategic, dynamic orchestration of evidence. The code is available at https://github.com/Qwen-Applications/Skill-RM.
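
To make the idea of selecting and aggregating heterogeneous evidence per input more concrete, the following is a minimal illustrative sketch, not the Skill-RM implementation. Every name in it (`Sample`, `reference_match`, `checklist_coverage`, `non_empty_rule`, `evaluate`, `best_of_n`) is hypothetical, and the uniform average used for aggregation is an assumption made for brevity. In the framework described above, selection and aggregation are carried out by an agent executing a Reward-Evaluation Skill, which the abstract does not specify in detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Sample:
    """One (prompt, response) pair plus whatever evaluation resources exist."""
    prompt: str
    response: str
    reference: Optional[str] = None      # ground-truth answer, if available
    checklist: Tuple[str, ...] = ()      # required phrases, if available


# Hypothetical convention: an evidence source returns a score in [0, 1],
# or None when it does not apply to the given sample.
EvidenceSource = Callable[[Sample], Optional[float]]


def non_empty_rule(s: Sample) -> Optional[float]:
    """Rule-based verifier (toy): the response must not be blank."""
    return 1.0 if s.response.strip() else 0.0


def reference_match(s: Sample) -> Optional[float]:
    """Ground-truth check (toy): the reference must appear in the response."""
    if s.reference is None:
        return None
    return 1.0 if s.reference.strip().lower() in s.response.lower() else 0.0


def checklist_coverage(s: Sample) -> Optional[float]:
    """Procedural checklist (toy): fraction of required phrases present."""
    if not s.checklist:
        return None
    hits = sum(item.lower() in s.response.lower() for item in s.checklist)
    return hits / len(s.checklist)


SOURCES: Dict[str, EvidenceSource] = {
    "rule": non_empty_rule,
    "reference": reference_match,
    "checklist": checklist_coverage,
}


def evaluate(s: Sample, sources: Dict[str, EvidenceSource] = SOURCES):
    """Collect evidence from applicable sources and average it.

    Assumption: uniform averaging stands in for the aggregation step.
    Returns (reward, evidence) so the score stays traceable to its inputs.
    """
    evidence = {}
    for name, source in sources.items():
        score = source(s)
        if score is not None:            # keep only sources that apply
            evidence[name] = score
    reward = sum(evidence.values()) / len(evidence) if evidence else 0.0
    return reward, evidence


def best_of_n(prompt: str, candidates: List[str], **resources) -> str:
    """Return the candidate response with the highest reward."""
    return max(candidates, key=lambda r: evaluate(Sample(prompt, r, **resources))[0])
```

Returning the per-source evidence alongside the scalar reward is one simple way to keep the evaluation inspectable. For example, `evaluate(Sample("2+2?", "The answer is 4.", reference="4"))` uses only the rule and reference sources, because no checklist is supplied for that input.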