Discovering Multiagent Learning Algorithms with Large Language Models
February 18, 2026
Authors: Zun Li, John Schultz, Daniel Hennes, Marc Lanctot
cs.AI
Abstract
Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid theoretical ground, the design of their most effective variants often relies on human intuition to navigate a vast algorithmic design space. In this work, we propose the use of AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover new multiagent learning algorithms. We demonstrate the generality of this framework by evolving novel variants for two distinct paradigms of game-theoretic learning. First, in the domain of iterative regret minimization, we evolve the logic governing regret accumulation and policy derivation, discovering a new algorithm, Volatility-Adaptive Discounted (VAD-)CFR. VAD-CFR employs novel, non-intuitive mechanisms, including volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start policy accumulation schedule, to outperform state-of-the-art baselines like Discounted Predictive CFR+. Second, in the regime of population-based training algorithms, we evolve training-time and evaluation-time meta-strategy solvers for PSRO, discovering a new variant, Smoothed Hybrid Optimistic Regret (SHOR-)PSRO. SHOR-PSRO introduces a hybrid meta-solver that linearly blends Optimistic Regret Matching with a smoothed, temperature-controlled distribution over best pure strategies. By dynamically annealing this blending factor and diversity bonuses during training, the algorithm automates the transition from population diversity to rigorous equilibrium finding, yielding superior empirical convergence compared to standard static meta-solvers.
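The abstract names VAD-CFR's mechanisms without giving their formulas, so the following is only a minimal sketch of what volatility-sensitive discounting could look like inside a discounted regret-matching update. The function `vad_discount_update`, its parameters, and the volatility proxy are hypothetical illustrations, not the paper's definition of VAD-CFR.

```python
import numpy as np

def vad_discount_update(cum_regret, inst_regret, prev_inst_regret,
                        base_discount=0.9, sensitivity=1.0):
    """Hypothetical volatility-sensitive discounting step.

    Discounted CFR variants scale accumulated regrets each iteration;
    here the discount shrinks when instantaneous regrets fluctuate,
    so noisy history is forgotten faster. The exact form is an
    assumption for illustration only.
    """
    # Volatility proxy: mean change in instantaneous regret.
    volatility = np.abs(inst_regret - prev_inst_regret).mean()
    # Higher volatility -> smaller discount -> more forgetting.
    discount = base_discount / (1.0 + sensitivity * volatility)
    cum_regret = discount * cum_regret + inst_regret
    # Regret matching on the positive part of the discounted regrets.
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total > 0:
        policy = positive / total
    else:
        policy = np.full_like(positive, 1.0 / len(positive))
    return cum_regret, policy
```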
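The SHOR-PSRO blend is described more concretely: a linear mix of Optimistic Regret Matching and a temperature-smoothed distribution over best pure strategies, with the mixing weight annealed during training. The sketch below follows that description, but `shor_meta_solver`, its signature, the linear annealing schedule, and the use of mean empirical payoffs as strategy values are all assumptions, not the published meta-solver.

```python
import numpy as np

def shor_meta_solver(meta_payoffs, cum_regret, last_regret, t, horizon,
                     temperature=0.5):
    """Sketch of a SHOR-style hybrid meta-solver (details assumed).

    Linearly blends an optimistic regret-matching distribution with a
    temperature-controlled softmax over strategy payoffs, annealing from
    smoothed best responses (diversity) toward pure regret matching
    (equilibrium finding) as training proceeds.
    """
    n = len(cum_regret)
    # Optimistic regret matching: count the most recent regret twice.
    optimistic = np.maximum(cum_regret + last_regret, 0.0)
    total = optimistic.sum()
    rm_dist = optimistic / total if total > 0 else np.full(n, 1.0 / n)
    # Temperature-smoothed softmax over strategy values (here: each row
    # strategy's mean payoff in the empirical game, a simplification).
    values = meta_payoffs.mean(axis=1)
    logits = values / temperature
    soft_best = np.exp(logits - logits.max())
    soft_best /= soft_best.sum()
    # Annealed blending factor: 1 early (smoothed best responses),
    # decaying linearly to 0 (pure optimistic regret matching).
    lam = max(0.0, 1.0 - t / horizon)
    return (1.0 - lam) * rm_dist + lam * soft_best
```

Annealing the blend this way matches the transition the abstract describes: early iterations favor the smoothed pure-strategy distribution, which spreads probability across many strong strategies, while late iterations rely on regret matching's equilibrium-finding guarantees.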