Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Abstract

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
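
To make the geometry concrete, the following is a minimal NumPy sketch of the two ingredients named above: measuring directional change between dominant singular subspaces, and projecting a gradient onto a clean reference subspace. It is an illustration under stated assumptions, not the authors' implementation. The function names (`top_directions`, `subspace_drift`, `project_gradient`), the use of left singular vectors of a single 2-D weight update, the rank `k`, and the specific drift measure are all hypothetical choices made for this example.

```python
import numpy as np


def top_directions(update, k):
    """Top-k left singular vectors of a 2-D parameter update (assumed layout: out x in)."""
    u, _, _ = np.linalg.svd(update, full_matrices=False)
    return u[:, :k]  # columns are orthonormal


def subspace_drift(basis_a, basis_b):
    """Directional change between two orthonormal bases with k columns each.

    Returns 1 - mean squared cosine of the principal angles:
    about 0 when the subspaces coincide, 1 when they are orthogonal.
    This particular measure is an assumption of the sketch.
    """
    k = basis_a.shape[1]
    overlap = np.linalg.norm(basis_a.T @ basis_b, ord="fro") ** 2 / k
    return 1.0 - overlap


def project_gradient(grad, trusted_basis):
    """Orthogonally project the columns of grad onto span(trusted_basis)."""
    return trusted_basis @ (trusted_basis.T @ grad)


# Toy usage with synthetic matrices (no real training data involved).
rng = np.random.default_rng(0)
clean_update = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))  # rank 2
trusted = top_directions(clean_update, k=2)

new_update = rng.standard_normal((8, 6))          # stands in for a later update
drift = subspace_drift(trusted, top_directions(new_update, k=2))

raw_grad = rng.standard_normal((8, 6))
safe_grad = project_gradient(raw_grad, trusted)   # component inside the clean subspace
```

In this sketch, a large `drift` value would flag an update whose dominant directions have moved away from the clean reference, and `safe_grad` discards every gradient component outside that reference subspace. How the reference subspace is built and applied per layer in practice is not specified in the abstract.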