Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Abstract

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling the true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception-reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of three forces: objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena observed across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can develop into broader misalignment. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.


