Multi-agent cooperation through in-context co-player inference
February 18, 2026
Authors: Marissa A. Weis, Maciej Wołczyk, Rajai Nasser, Rif A. Saurous, Blaise Agüera y Arcas, João Sacramento, Alexander Meulemans
cs.AI
Abstract
Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules, or enforce a strict separation between "naive learners" updating on fast timescales and "meta-learners" observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, effectively functioning as learning algorithms on the fast intra-episode timescale. We find that the cooperative mechanism identified in prior work, where vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models, combined with co-player diversity, provides a scalable path to learning cooperative behaviors.
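The training setup described above — episodes against co-players sampled from a diverse distribution, with the agent's policy conditioning on the full within-episode history — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the game (an iterated prisoner's dilemma), the co-player pool, and the hand-coded `in_context_agent` (a stand-in for a trained sequence model's in-context best response) are all assumptions introduced for illustration.

```python
import random

# Illustrative payoff matrix for an iterated prisoner's dilemma (assumed game).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 4),
          ("D", "C"): (4, 0), ("D", "D"): (1, 1)}

# A diverse co-player distribution: each policy maps the episode history
# (a list of (agent_move, co_player_move) pairs) to the co-player's next move.
def tit_for_tat(history):
    return "C" if not history else history[-1][0]  # mirror the agent's last move

def always_defect(history):
    return "D"

def random_player(history):
    return random.choice("CD")

CO_PLAYER_POOL = [tit_for_tat, always_defect, random_player]

def play_episode(agent_policy, co_player, rounds=10):
    """One episode; the agent sees the full in-context history each round."""
    history, ret = [], 0.0
    for _ in range(rounds):
        a = agent_policy(history)  # sequence-model stand-in: conditions on history
        b = co_player(history)
        ret += PAYOFF[(a, b)][0]
        history.append((a, b))
    return ret

def in_context_agent(history):
    """Hand-coded stand-in for an in-context best response: cooperate with
    reciprocators, defect against mostly-defecting co-players."""
    if not history:
        return "C"
    defections = sum(1 for _, b in history if b == "D")
    return "D" if defections > len(history) / 2 else "C"

# Outer loop skeleton: every episode samples a fresh co-player, so adapting
# within the episode (rather than memorizing one opponent) is what pays off.
random.seed(0)
returns = [play_episode(in_context_agent, random.choice(CO_PLAYER_POOL))
           for _ in range(100)]
print(sum(returns) / len(returns))
```

In the paper's setting the agent is a sequence model trained with standard decentralized RL rather than a hand-coded rule; the sketch only shows why co-player diversity forces intra-episode adaptation, the mechanism the abstract identifies as the source of in-context best responses.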