Reward Hacking in Large Language Models and Society

Abstract

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions: they define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps, and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode, which we call societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments. We find that within these environments reward hacking emerges naturally and leads to the discovery of regulatory loopholes. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Collecting in-the-wild feedback for model training therefore requires greater caution, and a next-generation post-training paradigm is needed for safely iterating LLMs in real society.