Regret Minimization in Repeated Games with Adaptive Opponents

Abstract

In this paper, we study regret minimization in repeated games against adaptive opponents who can respond to the history of play. The standard metric of external regret in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce {\tt Repeated Policy Regret (RP-Regret)}, a game-theoretic metric that measures the difference between the realized and the best-in-hindsight accumulated utility when all players can respond to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing: it allows stronger comparators and less constrained opponents, while preserving the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for achieving {\tt RP-Regret} sublinear in time; these conditions concern the variation of the player's comparator strategies in the regret definition and the memory of both the comparator and the opponents' strategies. We then study additional conditions and provable algorithms for minimizing {\tt RP-Regret}, which is by definition non-convex in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work on online non-convex learning; (ii) one that minimizes a convex, linearized surrogate of {\tt RP-Regret} at each iteration; and (iii) one that directly minimizes {\tt RP-Regret} when opponents change strategies slowly. Furthermore, when all players run algorithms that minimize {\tt RP-Regret} (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.
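
To make the gap between external regret and a history-aware regret concrete, the following minimal sketch plays a Stag-Hunt-style game against a reactive opponent. It is an illustration under stated assumptions, not the definition of {\tt RP-Regret}, which the abstract does not spell out: the payoff table `UTILITY`, the memory-one opponent `tit_for_tat`, and the restriction to constant-action comparators are all hypothetical choices made here for simplicity.

```python
# Illustrative sketch only: payoffs, opponent, and comparator class are assumptions.
STAG, HARE = 0, 1

# UTILITY[a][b]: player's payoff when the player plays a and the opponent plays b.
# Hypothetical Stag-Hunt-style values.
UTILITY = [[4, 0],
           [3, 2]]


def tit_for_tat(player_history):
    """Memory-one adaptive opponent: opens with HARE, then copies the player's last action."""
    return HARE if not player_history else player_history[-1]


def play(player_actions, opponent):
    """Roll out the game; return the player's total utility and the opponent's actions."""
    total, opp_actions = 0, []
    for t, a in enumerate(player_actions):
        b = opponent(player_actions[:t])  # opponent sees only past player actions
        opp_actions.append(b)
        total += UTILITY[a][b]
    return total, opp_actions


def external_regret(player_actions, opponent):
    """Best fixed action against the opponent's *realized* actions, minus realized utility."""
    realized, opp_actions = play(player_actions, opponent)
    best_fixed = max(sum(UTILITY[a][b] for b in opp_actions) for a in (STAG, HARE))
    return best_fixed - realized


def policy_regret(player_actions, opponent):
    """Best fixed action when the opponent is allowed to *react* to it, minus realized utility."""
    realized, _ = play(player_actions, opponent)
    horizon = len(player_actions)
    best_fixed = max(play([a] * horizon, opponent)[0] for a in (STAG, HARE))
    return best_fixed - realized


always_hare = [HARE] * 100
print(external_regret(always_hare, tit_for_tat))  # 0
print(policy_regret(always_hare, tit_for_tat))    # 196
```

A player who always hunts hare has zero external regret here, because the opponent's realized actions are held fixed in the comparison. Once the opponent is allowed to react to the comparator, always hunting stag would have earned far more, so the history-aware regret grows linearly with the horizon.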