Multi-agent cooperation through in-context co-player inference
February 18, 2026
Authors: Marissa A. Weis, Maciej Wołczyk, Rajai Nasser, Rif A. Saurous, Blaise Agüera y Arcas, João Sacramento, Alexander Meulemans
cs.AI
Abstract
Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules, or enforce a strict separation between "naive learners" updating on fast timescales and "meta-learners" observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, effectively functioning as learning algorithms on the fast intra-episode timescale. We find that the cooperative mechanism identified in prior work, where vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models combined with co-player diversity provides a scalable path to learning cooperative behaviors.
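To make the setup concrete, the following is a minimal toy sketch (not the paper's actual implementation) of the episode-generation loop described above: an agent that conditions on the full in-context interaction history is rolled out against co-players sampled from a diverse, hand-written pool. The environment (an iterated prisoner's dilemma), the co-player policies, and the `in_context_best_response` stand-in for a trained sequence model are all illustrative assumptions.

```python
import random

# Iterated prisoner's dilemma payoffs for an action pair (a, b),
# where 0 = cooperate, 1 = defect.
PAYOFF = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

# A diverse pool of fixed co-player policies. Each maps the agent's
# previous action (None on the first move) to the co-player's action.
def always_cooperate(prev_agent_action):
    return 0

def always_defect(prev_agent_action):
    return 1

def tit_for_tat(prev_agent_action):
    return 0 if prev_agent_action is None else prev_agent_action

CO_PLAYER_POOL = [always_cooperate, always_defect, tit_for_tat]

def rollout(agent_policy, co_player, steps=10):
    """Play one episode; the agent conditions on the in-context history."""
    history = []  # list of (agent_action, co_player_action) pairs
    agent_return = 0
    for _ in range(steps):
        prev_agent = history[-1][0] if history else None
        a = agent_policy(history)
        b = co_player(prev_agent)
        r_a, _ = PAYOFF[(a, b)]
        agent_return += r_a
        history.append((a, b))
    return agent_return

def in_context_best_response(history):
    """Toy stand-in for a sequence model's in-context adaptation:
    mirror the co-player's last observed action (tit-for-tat-like)."""
    if not history:
        return 0
    return history[-1][1]

def average_return_over_pool(episodes=30, seed=0):
    """Sample co-players from the diverse pool, as in decentralized
    training against a co-player distribution."""
    rng = random.Random(seed)
    returns = [
        rollout(in_context_best_response, rng.choice(CO_PLAYER_POOL))
        for _ in range(episodes)
    ]
    return sum(returns) / len(returns)
```

In the paper's setting, the agent policy would be a sequence model trained with reinforcement learning on such rollouts, so that in-context best responses emerge rather than being hand-coded as here.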