Large Language Models, Reward Hacking, and Society

Abstract

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions: they define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps, and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode that we call societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that reward hacking emerges naturally within these environments and leads to regulatory loophole discovery. Models learn to hack the social rules, generating strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Collecting in-the-wild feedback for model training therefore requires greater caution, and a next-generation post-training paradigm is needed for safely iterating on LLMs in real society.