Discretizing Reward Models

Abstract

Despite their widespread use, the role of reward models in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike "verifiable rewards," which typically produce binary scores, reward models usually produce continuous scores, which makes them sensitive to fine-grained differences between responses. However, we show that this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, we show that this oversensitivity can lead to bad policies. In place of existing notions of "reward model accuracy," we propose evaluating reward models using distinct measures of "discriminative ability" and "specificity" (the complement of oversensitivity). As a solution, we describe a training-free algorithm that applies Monte Carlo dropout to any neural reward model to produce discrete reward clusters. Theoretically, we prove that there exist discretizations that reduce oversensitivity at minimal expense of discriminative ability; empirically, we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards.
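
To make the idea of dropout-based discretization concrete, the following is a minimal sketch of one way such a procedure could look; it is not the algorithm from the paper, whose details the abstract does not give. Everything here is an assumption for illustration: the toy two-layer reward model, the function names `mc_dropout_scores` and `discretize`, the parameters `p_drop` and `z`, and the merging rule, which places two neighboring responses in the same cluster when the gap between their mean scores is small relative to their dropout-induced spread.

```python
import numpy as np


def mc_dropout_scores(features, w1, w2, n_samples=64, p_drop=0.1, seed=0):
    """Score every response n_samples times with dropout kept on at inference.

    Hypothetical toy reward model: one ReLU hidden layer, scalar output.
    Returns an array of shape (n_samples, n_responses).
    """
    rng = np.random.default_rng(seed)
    hidden = np.maximum(features @ w1, 0.0)  # (n_responses, hidden_dim)
    samples = np.empty((n_samples, features.shape[0]))
    for s in range(n_samples):
        # Fresh dropout mask per pass; inverted-dropout rescaling keeps the mean.
        mask = rng.random(hidden.shape) >= p_drop
        samples[s] = (hidden * mask / (1.0 - p_drop)) @ w2
    return samples


def discretize(samples, z=2.0):
    """Map continuous scores to ordered cluster labels (0 = lowest reward).

    Assumed rule, not taken from the paper: walk through the responses in
    order of mean score and open a new cluster only when the gap to the
    previous response exceeds z times the combined dropout standard deviation.
    """
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    order = np.argsort(mean)
    labels = np.empty(mean.shape[0], dtype=int)
    cluster = 0
    labels[order[0]] = cluster
    for prev, cur in zip(order[:-1], order[1:]):
        noise = z * np.hypot(std[prev], std[cur])
        if mean[cur] - mean[prev] > noise:
            cluster += 1
        labels[cur] = cluster
    return labels
```

Under these assumptions, the cluster label (or the mean score of each cluster) would replace the raw continuous score as the RL reward, so responses the model cannot reliably tell apart receive the same reward. Because the rule only compares neighbors, long chains of slightly different scores can merge into one cluster; a real implementation would need to choose its clustering criterion with that trade-off in mind.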