Discovering Multiagent Learning Algorithms with Large Language Models
February 18, 2026
Authors: Zun Li, John Schultz, Daniel Hennes, Marc Lanctot
cs.AI
Abstract
Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy-Space Response Oracles (PSRO) rest on solid theoretical ground, the design of their most effective variants often relies on human intuition to navigate a vast algorithmic design space. In this work, we propose the use of AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover new multiagent learning algorithms. We demonstrate the generality of this framework by evolving novel variants for two distinct paradigms of game-theoretic learning. First, in the domain of iterative regret minimization, we evolve the logic governing regret accumulation and policy derivation, discovering a new algorithm, Volatility-Adaptive Discounted (VAD-)CFR. VAD-CFR employs novel, non-intuitive mechanisms, including volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start policy accumulation schedule, to outperform state-of-the-art baselines like Discounted Predictive CFR+. Second, in the regime of population-based training algorithms, we evolve training-time and evaluation-time meta-strategy solvers for PSRO, discovering a new variant, Smoothed Hybrid Optimistic Regret (SHOR-)PSRO. SHOR-PSRO introduces a hybrid meta-solver that linearly blends Optimistic Regret Matching with a smoothed, temperature-controlled distribution over best pure strategies. By dynamically annealing this blending factor and diversity bonuses during training, the algorithm automates the transition from population diversity to rigorous equilibrium finding, yielding superior empirical convergence compared to standard static meta-solvers.
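The abstract does not give VAD-CFR's update rule. As a rough illustration only, the following Python sketch shows one way a volatility-sensitive discount could modify the Discounted CFR regret update, with a CFR+-style clamp standing in for details the abstract omits; the function name, the volatility measure, and the parameters alpha and kappa are assumptions, not the published algorithm.

```python
import numpy as np

def vad_regret_update(cum_regret, inst_regret, prev_inst_regret, t,
                      alpha=1.5, kappa=0.5):
    """Hypothetical sketch of a volatility-sensitive discounted regret update.

    Discounted CFR scales accumulated regret by t^alpha / (t^alpha + 1) at
    iteration t; here that factor is additionally damped when instantaneous
    regrets are volatile, so a turbulent history is forgotten faster. The
    volatility measure and the constants alpha, kappa are illustrative
    assumptions, not the published VAD-CFR rule.
    """
    # Volatility: relative change in instantaneous regrets since the
    # previous iteration.
    volatility = np.linalg.norm(inst_regret - prev_inst_regret) / (
        np.linalg.norm(prev_inst_regret) + 1e-12)
    # DCFR-style base discount on the carried-over cumulative regret.
    base = t ** alpha / (t ** alpha + 1)
    # Shrink the carried-over regret more aggressively under high volatility.
    discount = base / (1.0 + kappa * volatility)
    # CFR+-style clamp keeps cumulative regrets nonnegative.
    return np.maximum(cum_regret * discount + inst_regret, 0.0)
```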
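Similarly, the hybrid meta-solver in SHOR-PSRO can be pictured as a convex combination of two distributions over the current strategy population. The sketch below assumes a simple optimistic-regret-matching form (counting the most recent regret twice) and a softmax over expected payoffs; the actual blending and annealing schedules of SHOR-PSRO are not given in the abstract.

```python
import numpy as np

def shor_meta_solver(payoffs, cum_regret, last_regret, blend, tau):
    """Hypothetical sketch: blend optimistic regret matching with a
    temperature-smoothed distribution over best pure strategies.

    payoffs: each pure strategy's expected payoff against the opponent
    meta-strategy (np.ndarray); blend in [0, 1] and temperature tau > 0
    would be annealed across PSRO iterations. All details here are
    assumptions for illustration.
    """
    # Optimistic regret matching: count the most recent regret twice
    # before normalizing the positive part into a distribution.
    optimistic = np.maximum(cum_regret + last_regret, 0.0)
    total = optimistic.sum()
    orm = (optimistic / total if total > 0
           else np.full(len(payoffs), 1.0 / len(payoffs)))
    # Smoothed, temperature-controlled weighting of best pure strategies:
    # a low tau concentrates mass on the argmax, a high tau spreads it.
    logits = (payoffs - payoffs.max()) / tau
    smoothed_best = np.exp(logits) / np.exp(logits).sum()
    # Linear blend of the two components.
    return (1.0 - blend) * orm + blend * smoothed_best
```

On one plausible reading of the abstract, annealing blend from high values toward 0 over PSRO iterations would reproduce the described transition from diverse, exploratory population growth to equilibrium-focused regret matching.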