Policy Filtration in RLHF to Fine-Tune LLM for Code Generation
September 11, 2024
Authors: Wei Shen, Chuheng Zhang
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) is one of the key
techniques that helps large language models (LLMs) to follow instructions and
provide helpful and harmless responses. While direct policy optimization
methods exist, state-of-the-art LLMs adopt RL-based methods (usually PPO) in
RLHF to train the policy to generate good responses guided by a reward model
learned from preference data. The main challenge of these methods is the
inaccuracy of the intermediate reward model, especially in code generation
tasks that require long and complex reasoning to score a response. We find that
the reliability of the reward model varies across responses assigned
different rewards. This motivates us to filter out the samples whose rewards
may be unreliable to improve the signal-to-noise ratio during policy learning, resulting
in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a
proper policy filtration strategy for a given reward model, the coefficient of
determination (R^2) between rewards and actual scores on filtered samples
serves as a good metric and helps us find several promising strategies. We
provide extensive experiments to validate the effectiveness of PF-PPO in code
generation tasks, and find that some variants of PF-PPO are highly effective
and achieve new state-of-the-art performance across 7-billion-parameter models
on HumanEval, MBPP, and a new and more challenging LeetCode Contest benchmark.
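
The abstract describes two ingredients: filtering rollout samples by their reward before the PPO update, and using the coefficient of determination (R^2) between rewards and actual scores on the filtered samples to choose a filtration strategy. The following is a minimal Python sketch of these two ideas, not the paper's implementation: the function names, the top-reward-per-prompt rule, the keep_frac value, and computing R^2 as the squared Pearson correlation are all illustrative assumptions.

```python
import numpy as np

def filtration_r2(rewards, actual_scores):
    """R^2 between reward-model scores and ground-truth scores (e.g., unit-test
    pass rates) on a set of samples. Computed here as the squared Pearson
    correlation, i.e., the R^2 of a simple linear fit -- an assumption; the
    paper's exact setup may differ."""
    r = np.corrcoef(np.asarray(rewards, float), np.asarray(actual_scores, float))[0, 1]
    return r ** 2

def filter_top_reward(prompt_ids, rewards, keep_frac=0.5):
    """Return indices of samples kept for the PPO update: for each prompt, keep
    the top `keep_frac` fraction of responses by reward. This specific rule and
    keep_frac=0.5 are illustrative; the paper compares several strategies."""
    by_prompt = {}
    for i, p in enumerate(prompt_ids):
        by_prompt.setdefault(p, []).append(i)
    kept = []
    for idxs in by_prompt.values():
        idxs = sorted(idxs, key=lambda i: rewards[i], reverse=True)
        kept.extend(idxs[:max(1, int(len(idxs) * keep_frac))])
    return kept

# Strategy selection sketch: on a validation set where ground-truth scores are
# available, compare the R^2 of rewards vs. actual scores on the samples each
# candidate strategy would keep, and use the strategy with the highest R^2
# during PPO training.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompt_ids = np.repeat(np.arange(8), 8)         # 8 prompts, 8 responses each
    actual = rng.uniform(0, 1, size=64)             # stand-in pass rates
    rewards = actual + rng.normal(0, 0.3, size=64)  # noisy reward-model scores
    kept = filter_top_reward(prompt_ids, rewards, keep_frac=0.5)
    print("R^2 on all samples:     ", round(filtration_r2(rewards, actual), 3))
    print("R^2 on filtered samples:", round(filtration_r2(rewards[kept], actual[kept]), 3))
```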