
PILAF: Optimal Human Preference Sampling for Reward Modeling

February 6, 2025
Authors: Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng, Julia Kempe, Yaqi Duan
cs.AI

Abstract

As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, demonstrating optimality from both an optimization and a statistical perspective. The method is straightforward to implement and demonstrates strong performance in iterative and online RLHF settings where feedback curation is critical.
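The abstract describes PILAF only at a high level, as a "response sampling strategy for preference labeling" that interpolates policies. As a rough illustration of what such a sampling step could look like, the sketch below mixes the logits of the current policy and a reference policy before drawing a token. The function name `interpolated_sample`, the mixing coefficient `beta`, and the specific logit-interpolation scheme are assumptions made for illustration, not the paper's exact formulation.

```python
# Minimal sketch of policy-interpolated response sampling for preference labeling.
# Assumption: "interpolation" is implemented here as a convex mix of the current
# policy's and a reference policy's logits; the actual PILAF scheme may differ.
import torch
import torch.nn.functional as F


def interpolated_sample(policy_logits: torch.Tensor,
                        ref_logits: torch.Tensor,
                        beta: float) -> torch.Tensor:
    """Sample one token id from softmax(beta * policy + (1 - beta) * reference)."""
    mixed = beta * policy_logits + (1.0 - beta) * ref_logits
    probs = F.softmax(mixed, dim=-1)
    return torch.multinomial(probs, num_samples=1)


if __name__ == "__main__":
    vocab = 32
    pol = torch.randn(1, vocab)   # stand-in for current-policy logits
    ref = torch.randn(1, vocab)   # stand-in for reference-policy logits
    # Draw the two responses of a preference pair from differently interpolated
    # policies, then send (prompt, response_a, response_b) to the labeler.
    tok_a = interpolated_sample(pol, ref, beta=0.9)
    tok_b = interpolated_sample(pol, ref, beta=0.5)
    print(tok_a.item(), tok_b.item())
```

In an iterative or online RLHF loop, such a sampler would replace on-policy-only generation when curating the pairs that are sent for preference labeling.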
