Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy

Abstract

We present the first evaluation harness that enables any out-of-the-box, local Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs or fine-tuning because of the high complexity and information density of Diplomacy's game state. Combined with the high variance between matches, these factors made Diplomacy prohibitively difficult to study. In this work, we use data-driven iteration to optimize a textual game state representation so that a 24B model can reliably complete matches without any fine-tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We run a variety of experiments across many popular LLMs and find that larger models perform best, while smaller models still play adequately. We also introduce Critical State Analysis, an experimental protocol for rapidly iterating on and analyzing key moments of a game in depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open sourced.
