Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy

Abstract

We present the first evaluation harness that enables any out-of-the-box, local Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs or fine-tuning because of the high complexity and information density of Diplomacy's game state. Combined with the high variance of matches, these factors made Diplomacy prohibitively difficult to study. In this work, we use data-driven iteration to optimize a textual game state representation so that a 24B model can reliably complete matches without any fine-tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs and find that larger models perform best, while smaller models still play adequately. We also introduce Critical State Analysis, an experimental protocol for rapidly iterating on key moments in a game and analyzing them in depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open-sourced.
