PILAF: Optimal Human Preference Sampling for Reward Modeling
February 6, 2025
Authors: Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng, Julia Kempe, Yaqi Duan
cs.AI
Abstract
As large language models increasingly drive real-world applications, aligning
them with human values becomes paramount. Reinforcement Learning from Human
Feedback (RLHF) has emerged as a key technique, translating preference data
into reward models when oracle human values remain inaccessible. In practice,
RLHF mostly relies on approximate reward models, which may not consistently
guide the policy toward maximizing the underlying human values. We propose
Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response
sampling strategy for preference labeling that explicitly aligns preference
learning with maximizing the underlying oracle reward. PILAF is theoretically
grounded, demonstrating optimality from both an optimization and a statistical
perspective. The method is straightforward to implement and demonstrates strong
performance in iterative and online RLHF settings where feedback curation is
critical.
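The abstract describes PILAF only at a high level: responses for preference labeling are sampled from interpolations of policies rather than from the current policy alone. The toy sketch below is not taken from the paper; the logit-level geometric interpolation, the parameter beta, and the single-token "responses" are all illustrative assumptions, meant only to show what an interpolated sampler for preference-pair generation could look like.

```python
import numpy as np

# Minimal sketch (illustrative assumption, not the paper's implementation):
# sample candidate responses for preference labeling from interpolations
# between the current policy pi_theta and a reference policy pi_ref,
# instead of sampling both responses from pi_theta alone.

rng = np.random.default_rng(0)
VOCAB = 8                                   # toy vocabulary size
pi_theta_logits = rng.normal(size=VOCAB)    # stand-in for the current policy
pi_ref_logits = rng.normal(size=VOCAB)      # stand-in for the reference policy


def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()


def sample_interpolated(beta):
    """Sample one toy 'response' (a single token) from a policy whose
    logits are (1 + beta) * logits_theta - beta * logits_ref, i.e. the
    geometric mixture pi_theta^(1+beta) * pi_ref^(-beta) up to
    normalization. The form of this interpolation is an assumption."""
    logits = (1.0 + beta) * pi_theta_logits - beta * pi_ref_logits
    return rng.choice(VOCAB, p=softmax(logits))


# Produce a candidate pair for human (or oracle) annotation by tilting
# the sampler to either side of pi_theta.
beta = 0.3
y_plus = sample_interpolated(+beta)    # exploratory sample, pushed away from pi_ref
y_minus = sample_interpolated(-beta)   # conservative sample, pulled toward pi_ref
print("candidate pair for annotation:", y_plus, y_minus)
```

In this sketch the two tilted samplers play the role of the "policy-interpolated" response generators; how PILAF actually parameterizes the interpolation is specified in the paper itself, not in this abstract.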