DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

July 1, 2024
Authors: Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences that require expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. The experiments demonstrate that DogeRM enhances performance across different benchmarks. We also provide a detailed analysis showcasing the effects of model merging, highlighting its great potential for facilitating model alignment.
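
The abstract does not spell out the merging recipe, but a common form of model merging is element-wise linear interpolation of matching parameters in weight space. The minimal sketch below illustrates that idea with a hypothetical `merge_state_dicts` helper and an illustrative weighting coefficient `lam`; the toy models, shapes, and coefficient value are assumptions for demonstration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

def merge_state_dicts(reward_sd, domain_sd, lam=0.5):
    """Linearly interpolate parameters shared by the two models.

    Parameters missing from the domain model or with mismatched shapes
    (e.g., a scalar reward head vs. a language-modeling head) are kept
    from the reward model unchanged.
    """
    merged = {}
    with torch.no_grad():
        for name, rm_param in reward_sd.items():
            dm_param = domain_sd.get(name)
            if dm_param is not None and dm_param.shape == rm_param.shape:
                # theta_merged = (1 - lam) * theta_reward + lam * theta_domain
                merged[name] = (1.0 - lam) * rm_param + lam * dm_param
            else:
                merged[name] = rm_param.clone()
    return merged

# Toy usage: two models with an identical backbone but different heads.
def make_backbone():
    return nn.Sequential(nn.Linear(8, 8), nn.ReLU())

reward_model = nn.Sequential(make_backbone(), nn.Linear(8, 1))   # scalar reward head
domain_model = nn.Sequential(make_backbone(), nn.Linear(8, 32))  # stand-in domain model head

merged = merge_state_dicts(reward_model.state_dict(),
                           domain_model.state_dict(),
                           lam=0.3)
reward_model.load_state_dict(merged)
```

Because only parameters with matching names and shapes are interpolated, the reward head stays intact while the shared backbone absorbs domain knowledge; the interpolation weight controls how strongly the domain model influences the merged reward model.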
