MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games

Abstract

Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling. This biases win-rate estimates and makes rankings unreliable across repeated tournaments. Prompt choice compounds the problem, because different prompts induce different effective policies. We address both instability and underperformance with MEMO (Memory-augmented MOdel context optimization), a self-play framework that optimizes inference-time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self-play trajectories and injects them as priors during later play. Exploration runs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit rare and decisive states. Across five text-based games, MEMO raises mean win rate from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct, using 2,000 self-play games per task. Run-to-run variance also drops, giving more stable rankings across prompt variations. These results suggest that multi-agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves its largest gains in negotiation and imperfect-information games, while reinforcement learning (RL) remains more effective in perfect-information settings.
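The three mechanisms named in the abstract (a persistent memory bank, uncertainty-aware prompt selection, and prioritized replay) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class and function names are hypothetical, the optimistic score mu + k * sigma is one common way to use a TrueSkill-style (mean, uncertainty) rating and is not confirmed to be the rule MEMO uses, and no actual TrueSkill rating update is performed.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class Rating:
    """Skill belief for one prompt variant: mean and uncertainty (TrueSkill-style)."""
    mu: float = 25.0
    sigma: float = 25.0 / 3.0


def select_prompt(ratings: dict[str, Rating], k: float = 1.0) -> str:
    """Pick the prompt with the highest optimistic score mu + k * sigma.

    Assumed rule for illustration: a larger sigma earns a bonus, so
    poorly explored prompts are scheduled for more games.
    """
    return max(ratings, key=lambda name: ratings[name].mu + k * ratings[name].sigma)


@dataclass
class ReplayBuffer:
    """Prioritized replay: states are sampled in proportion to their priority."""
    states: list[str] = field(default_factory=list)
    priorities: list[float] = field(default_factory=list)

    def add(self, state: str, priority: float) -> None:
        self.states.append(state)
        self.priorities.append(priority)

    def sample(self, rng: random.Random) -> str:
        # random.choices draws one state with probability proportional to its weight.
        return rng.choices(self.states, weights=self.priorities, k=1)[0]


@dataclass
class MemoryBank:
    """Persistent store of insights that are prepended to later prompts."""
    insights: list[str] = field(default_factory=list)

    def add(self, insight: str) -> None:
        if insight not in self.insights:  # skip exact duplicates
            self.insights.append(insight)

    def as_context(self, base_prompt: str, limit: int = 3) -> str:
        # Inject only the most recent `limit` insights ahead of the base prompt.
        lines = [f"- {text}" for text in self.insights[-limit:]]
        return "\n".join(["Insights from earlier games:", *lines, "", base_prompt])
```

In this sketch, setting k to 0 reduces selection to picking the highest mean rating, while larger k shifts play toward prompts whose skill is still uncertain. A state given priority 0 in the replay buffer is never sampled, so priorities would need a small positive floor if every state should stay reachable.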
