Large Language Models, Reward Hacking, and Society

Abstract

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions: they define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps, and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode, which we call societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments reward hacking emerges naturally and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and a next-generation post-training paradigm is needed for safely iterating on LLMs in real society.