Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Abstract

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through the dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
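To make the two geometric ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single 2-D weight matrix, a reference subspace built from the top-k singular vectors of one clean update, a two-sided projection, and a drift score based on principal angles. The function names (`top_singular_basis`, `project_onto_subspace`, `directional_change`) and all of these choices are hypothetical illustrations; the abstract does not specify them.

```python
import numpy as np

def top_singular_basis(update, k):
    """Top-k left and right singular vectors of a 2-D update matrix."""
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u[:, :k], vt[:k, :].T

def project_onto_subspace(grad, u_ref, v_ref):
    """Assumed two-sided projection: keep only the part of grad that lies
    in the span of the reference left and right singular directions."""
    return u_ref @ (u_ref.T @ grad @ v_ref) @ v_ref.T

def directional_change(u_a, u_b):
    """Assumed drift score: 1 minus the mean squared cosine of the principal
    angles between two orthonormal bases (0 means identical subspaces)."""
    cosines = np.linalg.svd(u_a.T @ u_b, compute_uv=False)
    return 1.0 - float(np.mean(cosines ** 2))

# Toy data standing in for a clean reference update (rank 3) and a new gradient.
rng = np.random.default_rng(0)
clean_update = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))
u_ref, v_ref = top_singular_basis(clean_update, k=3)

grad = rng.standard_normal((8, 6))
projected = project_onto_subspace(grad, u_ref, v_ref)

# Drift of the raw gradient's dominant left directions from the reference.
u_grad, _ = top_singular_basis(grad, k=3)
drift = directional_change(u_ref, u_grad)
```

In this sketch the projected gradient never has a larger Frobenius norm than the raw one, and projecting twice gives the same result as projecting once, which is what one would expect of a constraint that only removes components outside the reference subspace.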