Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

June 2, 2026
Authors: Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang
cs.AI

Abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, and a unified mechanism for integrating all of these evidence types remains unexplored. To this end, we propose the Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface for orchestrating heterogeneous resources, dynamically selecting and aggregating the evidence suited to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation while keeping its judgments consistent and transparent across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.
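The idea of selecting and aggregating heterogeneous evidence per input, and then using the resulting reward for best-of-N selection, can be illustrated with a small sketch. This is not the authors' implementation and does not use the Skill-RM codebase: in the paper the orchestration is carried out as an agentic task, whereas here it is swapped for plain applicability checks and an unweighted mean. All names (`Sample`, `Evidence`, `rule_verifier`, `reference_match`, `checklist_check`, `reward`, `best_of_n`) and the scoring rules are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Sample:
    prompt: str
    response: str
    reference: Optional[str] = None        # ground-truth answer, if available
    checklist: Optional[List[str]] = None  # required phrases, if available


@dataclass
class Evidence:
    source: str   # which evaluator produced this evidence
    score: float  # normalized to [0, 1]


# An evaluator returns Evidence, or None when it does not apply to the input.
Evaluator = Callable[[Sample], Optional[Evidence]]


def rule_verifier(s: Sample) -> Optional[Evidence]:
    # Hypothetical rule: the response must be non-empty.
    return Evidence("rule", 1.0 if s.response.strip() else 0.0)


def reference_match(s: Sample) -> Optional[Evidence]:
    # Applies only when a ground-truth reference exists.
    if s.reference is None:
        return None
    hit = s.reference.strip().lower() in s.response.lower()
    return Evidence("reference", 1.0 if hit else 0.0)


def checklist_check(s: Sample) -> Optional[Evidence]:
    # Applies only when a checklist exists; score is the fraction of items met.
    if not s.checklist:
        return None
    met = sum(item.lower() in s.response.lower() for item in s.checklist)
    return Evidence("checklist", met / len(s.checklist))


EVALUATORS: List[Evaluator] = [rule_verifier, reference_match, checklist_check]


def reward(s: Sample) -> Tuple[float, List[Evidence]]:
    # Select the evidence sources that apply to this input, then aggregate.
    evidence = []
    for evaluator in EVALUATORS:
        result = evaluator(s)
        if result is not None:
            evidence.append(result)
    if not evidence:
        return 0.0, evidence
    # Assumption: unweighted mean; a real system would aggregate differently.
    return sum(e.score for e in evidence) / len(evidence), evidence


def best_of_n(prompt: str, candidates: List[str],
              reference: Optional[str] = None,
              checklist: Optional[List[str]] = None) -> str:
    # Return the candidate response with the highest aggregated reward.
    return max(
        candidates,
        key=lambda r: reward(Sample(prompt, r, reference, checklist))[0],
    )
```

Returning the evidence list alongside the scalar score keeps each reward traceable to the sources that produced it, which is one simple way to read the transparency claim in the abstract. For example, `best_of_n("2+2?", ["It is 5", "The answer is 4"], reference="4")` picks the second candidate, because only it satisfies the reference check.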