Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy

Abstract

We present the first evaluation harness that enables any out-of-the-box, local Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs or fine-tuning because of the high complexity and information density of Diplomacy's game state. Combined with the high variance of match outcomes, these factors made Diplomacy prohibitively difficult to study. In this work, we use data-driven iteration to optimize a textual game state representation so that a 24B-parameter model can reliably complete matches without any fine-tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs and find that larger models perform best, while smaller models still play adequately. We also introduce Critical State Analysis, an experimental protocol for rapidly iterating on and analyzing key moments of a game in depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open-sourced.