

Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

March 19, 2026
Author: Víctor Gallego
cs.AI

Abstract

We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement), comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward-hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.
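The generate-evaluate-refine loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' actual implementation: `query_llm` is a stub standing in for a frontier-LLM call, `evaluate_selfplay` is a toy environment rather than Gathering or Cleanup, and the metric names simply mirror the four social metrics from the abstract.

```python
# Minimal sketch of iterative LLM policy synthesis with dense feedback.
# All names here (query_llm, evaluate_selfplay) are illustrative assumptions.

def query_llm(prompt: str) -> str:
    """Stub for a frontier-LLM call; returns a Python policy as source code."""
    return (
        "def policy(observation):\n"
        "    # toy policy: harvest only when resources are plentiful\n"
        "    return 'harvest' if observation['apples'] > 3 else 'wait'\n"
    )

def evaluate_selfplay(policy) -> dict:
    """Toy self-play evaluation returning scalar reward plus social metrics."""
    obs_stream = [{'apples': n} for n in (5, 4, 2, 6, 1)]
    reward = sum(1 for obs in obs_stream if policy(obs) == 'harvest')
    return {
        'reward': reward,
        'efficiency': reward / len(obs_stream),
        'equality': 1.0,                            # identical agents in self-play
        'sustainability': 1 - reward / len(obs_stream),
        'peace': 1.0,                               # this policy never attacks
    }

def dense_feedback(metrics: dict) -> str:
    # Dense feedback shows reward plus all social metrics; sparse feedback
    # would expose only metrics['reward'].
    return ", ".join(f"{name}={value:.2f}" for name, value in metrics.items())

def synthesize(iterations: int = 3) -> dict:
    feedback = "none yet"
    for _ in range(iterations):
        code = query_llm(f"Write a Python policy. Last feedback: {feedback}")
        namespace: dict = {}
        exec(code, namespace)                       # materialize the generated policy
        metrics = evaluate_selfplay(namespace['policy'])
        feedback = dense_feedback(metrics)          # fed into the next iteration
    return metrics

print(synthesize())
```

The `exec` step is also where the expressiveness-safety tension the abstract mentions becomes concrete: arbitrary generated Python is powerful but must be sandboxed against the attack classes the paper characterizes.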