Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

Abstract

We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement), comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.
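The sparse versus dense distinction can be made concrete with a small sketch of how a feedback message might be assembled for the LLM. This is an illustration under stated assumptions, not the authors' implementation: the function names `gini` and `build_feedback`, the message layout, and the choice of one minus the Gini coefficient as the equality metric are all hypothetical here, since the abstract names the metrics without defining them. The remaining metrics are passed in as precomputed values because they would depend on environment logs.

```python
from typing import Dict, List


def gini(returns: List[float]) -> float:
    """Gini coefficient of non-negative per-agent returns (0 = perfectly equal)."""
    n = len(returns)
    total = sum(returns)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs of agents.
    diff_sum = sum(abs(a - b) for a in returns for b in returns)
    return diff_sum / (2 * n * total)


def build_feedback(
    returns: List[float],
    extra_metrics: Dict[str, float],
    dense: bool,
) -> str:
    """Format evaluation results as text to show the LLM during refinement.

    Sparse mode reports only the scalar mean reward. Dense mode also reports
    an assumed equality metric (1 - Gini) and any precomputed social metrics
    supplied by the caller (hypothetical keys such as "peace").
    """
    mean_reward = sum(returns) / len(returns)
    lines = [f"mean_reward: {mean_reward:.2f}"]
    if dense:
        lines.append(f"equality: {1.0 - gini(returns):.2f}")
        for name, value in sorted(extra_metrics.items()):
            lines.append(f"{name}: {value:.2f}")
    return "\n".join(lines)
```

With per-agent returns of 10 and 30, the sparse message contains only the mean reward, while the dense message additionally exposes how unevenly that reward was distributed, which is the kind of signal the abstract credits with guiding the LLM toward better coordination.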
