Discretizing Reward Models

Abstract

Although reward models are widely used, their role in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike "verifiable rewards," which typically produce binary scores, reward models usually produce continuous scores, which lets them register fine-grained differences between responses. However, we show that this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, we show that this oversensitivity can lead to poor policies. In place of existing notions of "reward model accuracy," we propose evaluating reward models using distinct measures of "discriminative ability" and "specificity" (the complement of oversensitivity). As a solution, we describe a training-free algorithm that applies Monte Carlo dropout to any neural reward model to produce discrete reward clusters. Theoretically, we prove that there exist discretizations that reduce oversensitivity at minimal cost to discriminative ability; empirically, we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards.
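
To make the idea of discretizing rewards with Monte Carlo dropout concrete, the following is a minimal sketch and not the algorithm from the paper. The abstract does not say how clusters are formed, so everything here is an assumption made for illustration: the toy linear scorer in `mc_dropout_scores`, the rule in `discretize` that merges responses whose mean ± z·std intervals overlap, and the parameter names `p_drop`, `n_samples`, and `z`.

```python
import random
import statistics


def mc_dropout_scores(features, weights, n_samples=200, p_drop=0.1, seed=0):
    """Score one response n_samples times, drawing a fresh dropout mask each pass.

    Hypothetical stand-in for a neural reward model: a linear scorer with
    inverted dropout applied to its input units.
    """
    rng = random.Random(seed)
    keep = 1.0 - p_drop
    scores = []
    for _ in range(n_samples):
        # Drop each unit with probability p_drop and rescale the survivors.
        score = sum(
            w * x / keep
            for w, x in zip(weights, features)
            if rng.random() >= p_drop
        )
        scores.append(score)
    return scores


def discretize(sample_sets, z=1.0):
    """Map each response to a discrete cluster index (0 = lowest reward).

    Assumed rule: sort responses by mean sampled score, then start a new
    cluster only when a response's interval [mean - z*std, mean + z*std]
    lies strictly above everything already in the current cluster.
    """
    stats = [(statistics.fmean(s), statistics.pstdev(s)) for s in sample_sets]
    order = sorted(range(len(stats)), key=lambda i: stats[i][0])
    labels = [0] * len(stats)
    cluster = 0
    upper = None  # upper edge of the current cluster
    for i in order:
        mean, std = stats[i]
        low, high = mean - z * std, mean + z * std
        if upper is not None and low > upper:
            cluster += 1
            upper = high
        else:
            upper = high if upper is None else max(upper, high)
        labels[i] = cluster
    return labels


# Two responses with statistically indistinguishable scores and one clearly better.
samples = [[1.0, 1.1, 0.9], [1.05, 0.95, 1.0], [5.0, 5.1, 4.9]]
labels = discretize(samples)
```

Under this assumed rule, the first two responses share a cluster and therefore receive the same discrete reward, while the third lands in a higher one. Raising `z` merges more responses, trading discriminative ability for specificity.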