Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

Abstract

We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement), comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly tradeoff between cleaning and harvesting. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code is available at https://github.com/vicgalle/llm-policies-social-dilemmas.
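The difference between the two feedback modes can be illustrated with a minimal sketch. It assumes the evaluation step yields a mean reward and a dictionary of social metrics. The function name `build_feedback`, the `mean_reward` label, and the numeric values are hypothetical placeholders for illustration, not the repository's actual interface or results from the paper.

```python
from typing import Mapping

# Metric names follow the abstract's list of social metrics.
SOCIAL_METRICS = ("efficiency", "equality", "sustainability", "peace")


def build_feedback(reward: float, metrics: Mapping[str, float], dense: bool) -> str:
    """Format the evaluation summary shown to the LLM before the next refinement.

    Hypothetical helper: sparse mode reports the scalar reward only,
    dense mode appends whichever social metrics are available.
    """
    lines = [f"mean_reward: {reward:.2f}"]
    if dense:
        for name in SOCIAL_METRICS:
            if name in metrics:
                lines.append(f"{name}: {metrics[name]:.2f}")
    return "\n".join(lines)


# Placeholder values chosen for illustration only (not experimental results).
example_metrics = {
    "efficiency": 1.8,
    "equality": 0.92,
    "sustainability": 410.0,
    "peace": 9.5,
}
sparse_text = build_feedback(12.5, example_metrics, dense=False)
dense_text = build_feedback(12.5, example_metrics, dense=True)
```

In this sketch the only thing that changes between conditions is the text appended to the refinement prompt, so any difference in the synthesized policies would come from the extra metric lines rather than from a different evaluation procedure.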
