Taming Overconfidence in LLMs: Reward Calibration in RLHF
October 13, 2024
Authors: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
cs.AI
Abstract
Language model calibration refers to the alignment between the confidence of
the model and the actual performance of its responses. While previous studies
point out the overconfidence phenomenon in Large Language Models (LLMs) and
show that LLMs trained with Reinforcement Learning from Human Feedback (RLHF)
are overconfident, exhibiting sharper output probabilities, in this study, we
reveal that RLHF tends to lead models to express verbalized overconfidence in
their own responses. We investigate the underlying cause of this overconfidence
and demonstrate that reward models used for Proximal Policy Optimization (PPO)
exhibit inherent biases towards high-confidence scores regardless of the actual
quality of responses. Building upon this insight, we propose two PPO variants:
PPO-M: PPO with Calibrated Reward Modeling and PPO-C: PPO with Calibrated
Reward Calculation. PPO-M integrates explicit confidence scores in reward model
training, which calibrates reward models to better capture the alignment
between response quality and verbalized confidence. PPO-C adjusts the reward
score during PPO based on the difference between the current reward and the
moving average of past rewards. Both PPO-M and PPO-C can be seamlessly
integrated into the current PPO pipeline and do not require additional golden
labels. We evaluate our methods on both Llama3-8B and Mistral-7B across six
diverse datasets including multiple-choice and open-ended generation.
Experimental results demonstrate that both of our methods reduce calibration
error and maintain performance comparable to standard PPO. We further show that
they do not compromise model capabilities in open-ended conversation settings.
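The abstract describes PPO-M only at a high level, so the following is a speculative sketch of what a confidence-aware reward-modeling objective could look like: preference pairs are assumed to be augmented with high- and low-confidence statements, and the loss pushes the reward model to score high confidence above low confidence on chosen responses, and the reverse on rejected ones. The function name `calibrated_rm_loss`, the four augmented inputs, and the Hugging Face-style `rm(...).logits` interface are illustrative assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def calibrated_rm_loss(rm, chosen_hi, chosen_lo, rejected_hi, rejected_lo):
    """Speculative calibrated reward-modeling loss in the spirit of PPO-M.

    Each argument (besides ``rm``) is a dict of batched tensors for a
    prompt + response sequence augmented with a verbalized confidence
    statement: chosen/rejected responses paired with a high ("hi") or
    low ("lo") confidence phrase. ``rm`` is assumed to be a
    sequence-classification reward model returning one logit per sequence.
    """
    s_c_hi = rm(**chosen_hi).logits.squeeze(-1)    # chosen, high confidence
    s_c_lo = rm(**chosen_lo).logits.squeeze(-1)    # chosen, low confidence
    s_r_hi = rm(**rejected_hi).logits.squeeze(-1)  # rejected, high confidence
    s_r_lo = rm(**rejected_lo).logits.squeeze(-1)  # rejected, low confidence

    # Reward high confidence on good responses and low confidence on bad
    # ones, so the score tracks quality/confidence alignment rather than
    # confident wording alone (assumed pairwise-logistic form).
    loss_chosen = -F.logsigmoid(s_c_hi - s_c_lo).mean()
    loss_rejected = -F.logsigmoid(s_r_lo - s_r_hi).mean()
    return loss_chosen + loss_rejected
```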
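To make the reward-calibration idea concrete, here is a minimal Python sketch of a PPO-C-style adjustment, based only on the description above: the reward for a response is modified according to the gap between the current reward and a moving average of past rewards, combined with the model's verbalized confidence. The class name `CalibratedRewardCalculator`, the momentum and scale hyperparameters, and the exact combination rule are assumptions for illustration, not the authors' implementation.

```python
class CalibratedRewardCalculator:
    """Adjusts raw reward-model scores using a moving average of past rewards.

    Assumed rule: high verbalized confidence is rewarded only when the
    current response scores above the recent average, and penalized when
    it scores below it.
    """

    def __init__(self, momentum: float = 0.99, scale: float = 1.0):
        self.momentum = momentum    # smoothing factor for the moving average
        self.scale = scale          # strength of the calibration adjustment
        self.running_mean = 0.0     # moving average of past raw rewards
        self.initialized = False

    def __call__(self, raw_reward: float, verbalized_confidence: float) -> float:
        # Gap between the current reward and the historical average
        # (zero on the first call, before any history exists).
        if self.initialized:
            gap = raw_reward - self.running_mean
            self.running_mean = (self.momentum * self.running_mean
                                 + (1.0 - self.momentum) * raw_reward)
        else:
            gap = 0.0
            self.running_mean = raw_reward
            self.initialized = True

        # Assumed combination rule: scale the adjustment by how far the
        # verbalized confidence (in [0, 1]) sits above or below a neutral 0.5.
        centered_conf = verbalized_confidence - 0.5
        return raw_reward + self.scale * gap * centered_conf
```

In a PPO loop, such a calibrator would sit between the reward model and advantage estimation, e.g. `adjusted = calibrator(rm_score, confidence)` before each policy update; the momentum value controls how quickly the baseline tracks recent rewards.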