Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

Abstract

We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering, the design of what evaluation information is shown to the LLM during refinement, by comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, and peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6 and Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly tradeoff between cleaning and harvesting. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code is available at https://github.com/vicgalle/llm-policies-social-dilemmas.
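The generate, evaluate, and refine loop and the sparse versus dense feedback conditions can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the names `build_feedback`, `synthesize`, `generate`, and `evaluate` are hypothetical, the LLM call and the self-play evaluator are passed in as plain callables, and the metric values are assumed to arrive as a dictionary of floats.

```python
from typing import Callable, Dict

# Social metrics named in the abstract; how each is computed is left to the evaluator.
SOCIAL_KEYS = ("efficiency", "equality", "sustainability", "peace")


def build_feedback(metrics: Dict[str, float], mode: str) -> str:
    """Format evaluation results for the next refinement prompt.

    'sparse' exposes only the scalar reward; 'dense' adds the social metrics.
    """
    if mode not in ("sparse", "dense"):
        raise ValueError(f"unknown feedback mode: {mode}")
    lines = [f"reward: {metrics['reward']:.2f}"]
    if mode == "dense":
        lines += [f"{key}: {metrics[key]:.2f}" for key in SOCIAL_KEYS]
    return "\n".join(lines)


def synthesize(
    generate: Callable[[str, str], str],          # hypothetical LLM call: (previous code, feedback) -> new code
    evaluate: Callable[[str], Dict[str, float]],  # hypothetical self-play evaluator: code -> metrics
    mode: str,
    iterations: int,
) -> str:
    """Run the refinement loop and return the highest-reward policy source seen."""
    best_code, best_reward = "", float("-inf")
    code, feedback = "", ""
    for _ in range(iterations):
        code = generate(code, feedback)
        metrics = evaluate(code)
        feedback = build_feedback(metrics, mode)
        if metrics["reward"] > best_reward:
            best_code, best_reward = code, metrics["reward"]
    return best_code
```

In this sketch the two experimental conditions differ only in the `mode` argument, so the policy generator and evaluator stay identical while the information shown to the LLM changes.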
