Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy
August 10, 2025
作者: Alexander Duffy, Samuel J Paech, Ishana Shastri, Elizabeth Karpinski, Baptiste Alloui-Cros, Tyler Marques, Matthew Lyle Olson
cs.AI
Abstract
We present the first evaluation harness that enables any out-of-the-box, local
Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or
specialized training. Previous work required frontier LLMs or fine-tuning
because of the high complexity and information density of Diplomacy's game
state; combined with the high variance of matches, these factors made Diplomacy
prohibitively difficult to study. In this work, we use data-driven iteration to
optimize a textual game-state representation such that a 24B model can reliably
complete matches without any fine-tuning. We develop tooling to facilitate
hypothesis testing and statistical analysis, and we present case studies on
persuasion, aggressive playstyles, and performance across a range of models. We
conduct a variety of experiments across many popular LLMs, finding that larger
models perform best while smaller models still play adequately. We also
introduce Critical State Analysis: an experimental protocol for rapidly
iterating on and analyzing key moments in a game in depth. Our harness
democratizes the evaluation of strategic reasoning in LLMs by eliminating the
need for fine-tuning, and it provides insights into how these capabilities
emerge naturally from widely used LLMs. Our code is available in the supplement
and will be open-sourced.
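
The abstract's central idea is that a carefully optimized plain-text rendering of the game state is enough for an untuned local model to produce valid play. As a rough illustration of that pipeline only (not the paper's actual representation or code), here is a minimal sketch assuming an OpenAI-compatible local inference server such as vLLM or llama.cpp; `render_state`, the toy board fragment, the model name, and the endpoint URL are all hypothetical.

```python
# Illustrative sketch: flatten a toy Diplomacy state into text and query a
# local, out-of-the-box model for orders. Every name here (render_state,
# the example units, the endpoint, the model id) is an assumption for
# illustration, not the paper's actual representation or API.
import json
import urllib.request


def render_state(power: str, units: dict[str, list[str]]) -> str:
    """Flatten a toy game state into a compact textual prompt."""
    lines = [
        f"You are playing {power} in full-press Diplomacy.",
        "Units on the board:",
    ]
    for p, locs in units.items():
        lines.append(f"  {p}: {', '.join(locs)}")
    lines.append("Reply with exactly one order per unit, e.g. 'A PAR - BUR'.")
    return "\n".join(lines)


def query_local_model(
    prompt: str,
    url: str = "http://localhost:8000/v1/chat/completions",
) -> str:
    """POST the prompt to an OpenAI-compatible local server and return the reply."""
    payload = {
        "model": "local-24b",  # placeholder id; depends on the server's config
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    toy_state = {
        "FRANCE": ["A PAR", "A MAR", "F BRE"],
        "GERMANY": ["A MUN", "A BER", "F KIE"],
    }
    print(query_local_model(render_state("FRANCE", toy_state)))
```

The sketch only shows the interface shape the abstract implies: a textual state in, free-form orders out. The paper's contribution lies in how that textual representation was iteratively optimized so small models complete matches reliably, which this toy prompt does not attempt to reproduce.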