Reward Hacking in Large Language Models and Its Societal Impact

Abstract

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions: they define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that RL training may exploit these gaps, and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode, which we call societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments. Within these environments, reward hacking emerges naturally and leads to the discovery of regulatory loopholes. Models learn to hack social rules, generating strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Collecting in-the-wild feedback for model training therefore requires greater caution, and a next-generation post-training paradigm is needed for safely iterating LLMs in real society.