MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games

Abstract

Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling. This biases win-rate estimates and makes rankings unreliable across repeated tournaments. Prompt choice makes the problem worse, because different prompts induce different effective policies. We address both instability and underperformance with MEMO (Memory-augmented MOdel context optimization), a self-play framework that optimizes inference-time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self-play trajectories and injects them as priors during later play. Exploration runs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit rare and decisive states. Across five text-based games, using 2,000 self-play games per task, MEMO raises mean win rate from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct. Run-to-run variance also drops, giving more stable rankings across prompt variations. These results suggest that the performance and robustness of LLMs in multi-agent games can be improved substantially through context optimization. MEMO achieves its largest gains in negotiation and imperfect-information games, while reinforcement learning (RL) remains more effective in perfect-information settings.
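
The abstract names three mechanisms: a persistent memory bank, uncertainty-aware prompt selection, and prioritized replay. The following is a minimal Python sketch of how such pieces might fit together, not the authors' implementation. All class, function, and parameter names (`PromptCandidate`, `MemoryBank`, `select_uncertainty_aware`, `sample_replay_state`, `k`, `alpha`, `limit`) are hypothetical. The selection rule `mu + k * sigma` is an assumed optimistic score over a Gaussian skill belief; it does not perform TrueSkill's Bayesian rating update, and the abstract does not say whether selection is optimistic or conservative. The default `mu` and `sigma` mirror TrueSkill's customary initial rating.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PromptCandidate:
    """A candidate prompt with a Gaussian skill belief (hypothetical fields)."""
    text: str
    mu: float = 25.0      # estimated skill
    sigma: float = 8.333  # uncertainty about that estimate


def select_uncertainty_aware(candidates, k=1.0):
    """Pick the candidate with the highest optimistic score mu + k * sigma.

    Assumption: an upper-bound rule, so uncertain prompts get explored.
    """
    return max(candidates, key=lambda c: c.mu + k * c.sigma)


@dataclass
class MemoryBank:
    """Persistent store of insights distilled from earlier self-play games."""
    insights: list = field(default_factory=list)

    def add(self, insight):
        if insight not in self.insights:  # skip verbatim repeats
            self.insights.append(insight)

    def as_context(self, base_prompt, limit=5):
        """Append the most recent insights to the prompt as priors."""
        recent = self.insights[-limit:]
        if not recent:
            return base_prompt
        notes = "\n".join(f"- {s}" for s in recent)
        return f"{base_prompt}\n\nInsights from earlier games:\n{notes}"


def sample_replay_state(states, priorities, rng, alpha=1.0):
    """Sample one state with probability proportional to priority ** alpha."""
    weights = [p ** alpha for p in priorities]
    return rng.choices(states, weights=weights, k=1)[0]
```

In a loop of this shape, each self-play round would select a prompt, build its context from the memory bank, optionally start from a replayed state, and then write new insights back to the bank. How insights are extracted from trajectories and how replay priorities are assigned are not specified in the abstract and are left out here.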
