

ODIN: Disentangled Reward Mitigates Hacking in RLHF

February 11, 2024
作者: Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
cs.AI

Abstract

In this work, we study the issue of reward hacking on response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose but less helpful response from an LLM can often deceive LLM or even human evaluators into assigning high scores. The same issue also holds for some reward models in RL. To address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, which inspects the trade-off between LLM evaluation score and response length obtained by varying training hyperparameters. Based on this evaluation, we conduct large-scale studies whose results offer insights into the efficacy of the hyperparameters and tricks used in RL for mitigating length bias. We further propose to improve the reward model by jointly training two linear heads on shared feature representations to predict the reward, one trained to correlate with length and the other trained to decorrelate with length and therefore focus more on the actual content. We then discard the length head in RL to prevent reward hacking on length. Experiments demonstrate that our approach almost eliminates the correlation of the reward with length and improves the obtained policy by a significant margin.
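
To make the two-head design concrete, the following is a minimal PyTorch-style sketch based only on the description above: a shared backbone feeds two linear heads, a pairwise ranking loss is applied to their sum, and correlation terms push length information into one head so that the other can serve as the length-free reward kept for RL. All names (`DisentangledRewardHeads`, `pearson_corr`, `reward_losses`), the pooling of backbone features, and the loss weighting `lam` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledRewardHeads(nn.Module):
    """Two linear reward heads on a shared feature representation.

    `backbone` is assumed to map (input_ids, attention_mask) to one pooled
    feature vector per sequence, e.g. the final hidden state of an LLM at the
    last token. This is an illustrative sketch, not the ODIN codebase.
    """

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.quality_head = nn.Linear(hidden_size, 1)  # trained to decorrelate with length
        self.length_head = nn.Linear(hidden_size, 1)   # trained to correlate with length

    def forward(self, input_ids, attention_mask):
        feats = self.backbone(input_ids, attention_mask)  # (batch, hidden_size)
        r_quality = self.quality_head(feats).squeeze(-1)  # kept as the reward for RL
        r_length = self.length_head(feats).squeeze(-1)    # discarded before RL
        return r_quality, r_length


def pearson_corr(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pearson correlation between two 1-D tensors."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + eps)


def reward_losses(r_q_chosen, r_l_chosen, r_q_rejected, r_l_rejected,
                  len_chosen, len_rejected, lam: float = 1.0) -> torch.Tensor:
    """Pairwise ranking loss on the summed heads plus correlation terms that
    steer length information into the length head and out of the quality head.
    `lam` is a placeholder weighting, not a value from the paper."""
    # Bradley-Terry style ranking loss on the combined reward.
    ranking = -F.logsigmoid(
        (r_q_chosen + r_l_chosen) - (r_q_rejected + r_l_rejected)
    ).mean()

    # Correlation terms computed over all responses in the batch.
    r_q = torch.cat([r_q_chosen, r_q_rejected])
    r_l = torch.cat([r_l_chosen, r_l_rejected])
    lengths = torch.cat([len_chosen, len_rejected]).float()

    length_term = -pearson_corr(r_l, lengths)      # length head should track length
    decor_term = pearson_corr(r_q, lengths).abs()  # quality head should not
    return ranking + lam * (length_term + decor_term)
```

During RL fine-tuning, only `quality_head` would be used to score rollouts; `length_head` exists solely to absorb length-correlated signal while the reward model is trained, matching the abstract's statement that the length head is discarded in RL.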