DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
July 1, 2024
Authors: Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is a popular strategy for
aligning large language models (LLMs) with desired behaviors. Reward modeling
is a crucial step in RLHF. However, collecting paired preference data for
training reward models is often costly and time-consuming, especially for
domain-specific preferences requiring expert annotation. To address this
challenge, we propose the Domain knowledge merged
Reward Model (DogeRM), a novel framework that integrates
domain-specific knowledge into a general reward model by model merging. The
experiments demonstrate that DogeRM enhances performance across different
benchmarks, and we provide a detailed analysis showcasing the effects of model
merging, highlighting its great potential for facilitating model alignment.
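In this setting, model merging typically amounts to interpolating the backbone weights of a general reward model with those of a language model fine-tuned on domain data, while keeping the reward model's scoring head. The following is a minimal sketch of such linear weight interpolation, assuming two models with identically named and shaped backbone parameters; the helper merge_state_dicts, the toy Linear backbones, and the mixing weight lam are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn as nn

def merge_state_dicts(reward_sd, domain_sd, lam=0.5):
    # Linearly interpolate parameters shared by both models; parameters that
    # exist only in the reward model (e.g., a scalar reward head) are kept as-is.
    merged = {}
    with torch.no_grad():
        for name, rm_param in reward_sd.items():
            if name in domain_sd and domain_sd[name].shape == rm_param.shape:
                merged[name] = (1.0 - lam) * rm_param + lam * domain_sd[name]
            else:
                merged[name] = rm_param.clone()
    return merged

# Toy demonstration: two identically shaped modules stand in for the general
# reward model backbone and the domain-finetuned LM backbone (hypothetical stand-ins).
reward_backbone = nn.Linear(16, 16)
domain_backbone = nn.Linear(16, 16)
merged_sd = merge_state_dicts(reward_backbone.state_dict(), domain_backbone.state_dict(), lam=0.3)
reward_backbone.load_state_dict(merged_sd)

In practice the two models would share the same transformer architecture, and the mixing weight would be chosen on a small validation set of domain preferences.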