Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
June 2, 2026
Authors: Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang
cs.AI
Abstract
Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, and a unified mechanism for integrating all of these evidence types remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.
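To make the idea of "dynamically selecting and aggregating evidence" concrete, the following is a minimal illustrative sketch, not the authors' implementation and not the API of the linked repository. Every name in it (Sample, reference_match, checklist_coverage, non_empty_rule, SOURCES, score, best_of_n) is hypothetical, the evidence sources are toy stand-ins for real verifiers and rubrics, and the unweighted mean is an assumed aggregation rule chosen only for simplicity.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Sample:
    """One item to score; `reference` and `checklist` are optional evidence."""
    prompt: str
    response: str
    reference: Optional[str] = None                      # ground-truth answer, if any
    checklist: list[str] = field(default_factory=list)   # required phrases, if any


# An evidence source returns a score in [0, 1], or None when it does not apply.
EvidenceFn = Callable[[Sample], Optional[float]]


def reference_match(s: Sample) -> Optional[float]:
    """Toy ground-truth check: exact match after trimming whitespace."""
    if s.reference is None:
        return None
    return 1.0 if s.response.strip() == s.reference.strip() else 0.0


def checklist_coverage(s: Sample) -> Optional[float]:
    """Toy checklist: fraction of required phrases found in the response."""
    if not s.checklist:
        return None
    hits = sum(item.lower() in s.response.lower() for item in s.checklist)
    return hits / len(s.checklist)


def non_empty_rule(s: Sample) -> Optional[float]:
    """Toy rule-based verifier: always applicable, rejects empty responses."""
    return 1.0 if s.response.strip() else 0.0


# Hypothetical registry of heterogeneous evidence sources behind one interface.
SOURCES: dict[str, EvidenceFn] = {
    "rule": non_empty_rule,
    "reference": reference_match,
    "checklist": checklist_coverage,
}


def score(s: Sample) -> tuple[float, dict[str, float]]:
    """Collect only the evidence that applies to this input, then average it."""
    evidence: dict[str, float] = {}
    for name, fn in SOURCES.items():
        value = fn(s)
        if value is not None:
            evidence[name] = value
    # "rule" always applies, so `evidence` is never empty here.
    reward = sum(evidence.values()) / len(evidence)
    return reward, evidence


def best_of_n(prompt: str, candidates: list[str], **evidence_kwargs) -> str:
    """Best-of-N selection: return the candidate with the highest reward."""
    return max(
        candidates,
        key=lambda r: score(Sample(prompt, r, **evidence_kwargs))[0],
    )
```

For example, best_of_n("2+2?", ["5", "4", ""], reference="4") selects "4", because it is the only candidate that satisfies both the rule check and the reference check. Returning the per-source evidence dictionary alongside the scalar reward is one simple way to keep the score inspectable; the abstract's transparency claim concerns the actual Skill-RM system, which this sketch does not reproduce.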