MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games
March 9, 2026
Authors: Yunfei Xie, Kevin Wang, Bobby Cheng, Jianzhu Yao, Zhizhou Sha, Alexander Duffy, Yihan Xi, Hongyuan Mei, Cheston Tan, Chen Wei, Pramod Viswanath, Zhangyang Wang
cs.AI
Abstract
Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling, biasing win-rate estimates and making rankings unreliable across repeated tournaments. Prompt choice aggravates the problem, since different prompts induce different effective policies. We address both instability and underperformance with MEMO (MEmory-augmented MOdel context optimization), a self-play framework that optimizes inference-time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self-play trajectories and injects them as priors during later play. Exploration runs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill and uses prioritized replay to revisit rare and decisive states. Across five text-based games, MEMO raises the mean win rate from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct, using 2,000 self-play games per task. Run-to-run variance also drops, yielding more stable rankings across prompt variations. These results suggest that multi-agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves its largest gains in negotiation and imperfect-information games, while RL remains more effective in perfect-information settings.
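To make the coupling of retention and exploration concrete, the following Python sketch pairs a toy memory bank with TrueSkill-rated prompt selection, using the open-source `trueskill` package. Everything here is an illustrative assumption rather than the paper's implementation: `MemoryBank`, `play_game`, `ucb_pick`, the prompt labels, and the UCB-style mu + c*sigma selection rule are hypothetical stand-ins, and MEMO's structured insight extraction and prioritized replay are not shown.

```python
import random
from dataclasses import dataclass, field

import trueskill  # pip install trueskill


# --- Retention: a persistent memory bank of insights (toy version) ---
@dataclass
class MemoryBank:
    """Stores text insights distilled from self-play trajectories."""
    insights: list = field(default_factory=list)

    def add(self, insight: str) -> None:
        self.insights.append(insight)

    def as_prior(self, k: int = 5) -> str:
        """Render the k most recent insights as a context prior."""
        return "\n".join(f"- {s}" for s in self.insights[-k:])


# --- Exploration: tournament-style prompt selection via TrueSkill ---
def ucb_pick(ratings: dict, c: float = 1.0) -> str:
    """Uncertainty-aware pick: prefer prompts with high mu + c * sigma."""
    return max(ratings, key=lambda p: ratings[p].mu + c * ratings[p].sigma)


def play_game(prompt_a: str, prompt_b: str, prior: str) -> bool:
    """Placeholder for one self-play game; returns True if A wins.
    A real system would run the LLM agents with `prior` injected
    into their context."""
    return random.random() < 0.5


prompts = ["baseline", "aggressive-bidding", "cautious-bluffing"]
ratings = {p: trueskill.Rating() for p in prompts}  # mu=25, sigma~8.3
bank = MemoryBank()

for _ in range(100):  # the paper uses 2,000 self-play games per task
    a = ucb_pick(ratings)
    b = random.choice([p for p in prompts if p != a])
    if play_game(a, b, bank.as_prior()):
        # rate_1vs1 takes (winner, loser) and returns updated ratings
        ratings[a], ratings[b] = trueskill.rate_1vs1(ratings[a], ratings[b])
        bank.add(f"'{a}' beat '{b}'")  # MEMO would store a richer insight
    else:
        ratings[b], ratings[a] = trueskill.rate_1vs1(ratings[b], ratings[a])
        bank.add(f"'{b}' beat '{a}'")

best = max(ratings, key=lambda p: ratings[p].mu)
print(best, ratings[best])
```

One design point the sketch illustrates: selecting by mu + c*sigma rather than mu alone keeps high-uncertainty prompt variants in the tournament, which is one plausible way to realize the "uncertainty-aware selection" the abstract describes.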