DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
July 1, 2024
Authors: Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors, and reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences that require expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. The experiments demonstrate that DogeRM enhances performance across different benchmarks, and a detailed analysis showcases the effects of model merging, highlighting its great potential for facilitating model alignment.
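For context on the merging step mentioned in the abstract, below is a minimal sketch of parameter-level model merging via linear interpolation between a general reward model and a domain-specific language model sharing the same backbone. The checkpoint names are hypothetical and the interpolation scheme is an illustrative assumption, not necessarily the paper's exact procedure.

```python
# Minimal sketch of merging a domain-specific LM into a general reward model.
# Checkpoint names below are hypothetical placeholders, not released models.
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification


def merge_state_dicts(reward_sd, domain_sd, alpha=0.5):
    """Linearly interpolate parameters present in both models.

    Parameters found only in the reward model (e.g., the reward head)
    are kept unchanged.
    """
    merged = {}
    for name, rm_param in reward_sd.items():
        if name in domain_sd and rm_param.shape == domain_sd[name].shape:
            merged[name] = (1 - alpha) * rm_param + alpha * domain_sd[name]
        else:
            merged[name] = rm_param
    return merged


# Hypothetical checkpoints used purely for illustration.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "general-reward-model", num_labels=1
)
domain_model = AutoModelForCausalLM.from_pretrained("domain-specific-llm")

with torch.no_grad():
    merged_sd = merge_state_dicts(
        reward_model.state_dict(), domain_model.state_dict(), alpha=0.5
    )
    reward_model.load_state_dict(merged_sd)
```

The mixing coefficient `alpha` controls how much domain knowledge is injected; `alpha = 0` recovers the original reward model, while larger values weight the domain-specific parameters more heavily.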