Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Abstract

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
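
The projection idea described above can be illustrated with a minimal sketch. The abstract does not specify how the clean reference subspace is built or applied, so everything here is an assumption made for illustration: the helper names (`trusted_basis`, `project_gradient`, `directional_drift`), the use of the top-k right singular vectors of stacked clean-run updates as the subspace, and the drift score are hypothetical and are not taken from the paper.

```python
import numpy as np


def trusted_basis(clean_updates: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis for the dominant update directions.

    Assumption: `clean_updates` stacks n flattened parameter updates from a
    clean run as rows of shape (n, d), and the "trusted" subspace is spanned
    by the top-k right singular vectors of that matrix.
    """
    _, _, vt = np.linalg.svd(clean_updates, full_matrices=False)
    return vt[:k].T


def project_gradient(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonally project a flattened gradient onto the span of `basis`."""
    return basis @ (basis.T @ grad)


def directional_drift(grad: np.ndarray, basis: np.ndarray) -> float:
    """Hypothetical drift score: share of the gradient norm outside the subspace.

    Returns 0.0 when the gradient lies entirely inside the subspace and
    approaches 1.0 as it becomes orthogonal to it.
    """
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return 0.0
    return float(1.0 - np.linalg.norm(basis.T @ grad) / norm)


# Toy example: clean updates only move the first two coordinates.
clean = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])
basis = trusted_basis(clean, k=2)

# A gradient with a component along the third coordinate, outside the subspace.
grad = np.array([1.0, 2.0, 3.0, 0.0])
projected = project_gradient(grad, basis)
drift = directional_drift(grad, basis)
```

In this toy setting the projection removes the third-coordinate component and leaves the rest of the gradient unchanged. A real language-model setup would more plausibly work per layer or on low-rank factors rather than on one flattened vector, which is again an assumption rather than something the abstract states.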