Regret Minimization against Adaptive Opponents in Repeated Games

Abstract

We study regret minimization in repeated games against adaptive opponents, who can respond to the history of play. The standard online-learning metric of external regret is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce {\tt Repeated Policy Regret (RP-Regret)}, a game-theoretic metric that measures the difference between the realized accumulated utility and the best-in-hindsight accumulated utility when all players can respond to the history of play. Compared with existing regret notions in this setting, ours is native to repeated game playing: it allows stronger comparators and less constrained opponents, while preserving the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining {\tt RP-Regret} that is sublinear in time. These conditions concern the variation of the player's comparator strategies in the regret definition and the memory of both the comparator and the opponents' strategies. We then study additional conditions and provable algorithms for minimizing {\tt RP-Regret}, which is by definition non-convex in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work on online non-convex learning; (ii) one that minimizes a convex, linearized surrogate of {\tt RP-Regret} at each iteration; and (iii) one that directly minimizes {\tt RP-Regret} when opponents change their strategies slowly. Furthermore, when all players can run algorithms that minimize {\tt RP-Regret} (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.
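The claim that external regret fails to capture adaptivity can be illustrated with a toy Stag-Hunt. The sketch below is not the paper's {\tt RP-Regret}, whose comparators may themselves depend on the history; it uses a simpler policy-regret-style counterfactual with fixed-action comparators. The payoff matrix `U`, the memory-one `opponent` rule, and all function names are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

STAG, HARE = 0, 1

# Assumed row-player payoffs U[my_action][opponent_action] for a Stag-Hunt.
U = [[4.0, 0.0],   # I hunt stag: 4 if the opponent joins, 0 otherwise
     [3.0, 2.0]]   # I hunt hare: safe but lower payoff


def opponent(my_history: Sequence[int]) -> int:
    """Assumed adaptive opponent with memory one: copies my previous action."""
    return my_history[-1] if my_history else HARE


def play(actions: Sequence[int]) -> Tuple[float, List[int]]:
    """Run the repeated game; return my total utility and the opponent's actions."""
    total, opp_actions = 0.0, []
    for t, a in enumerate(actions):
        b = opponent(actions[:t])  # the opponent reacts to my history so far
        opp_actions.append(b)
        total += U[a][b]
    return total, opp_actions


def external_regret(actions: Sequence[int]) -> float:
    """Best fixed action in hindsight, with the opponent's realized actions frozen."""
    realized, opp = play(actions)
    best_fixed = max(sum(U[a][b] for b in opp) for a in (STAG, HARE))
    return best_fixed - realized


def policy_regret(actions: Sequence[int]) -> float:
    """Best fixed action in hindsight, letting the opponent re-respond to it."""
    realized, _ = play(actions)
    horizon = len(actions)
    best_counterfactual = max(play([a] * horizon)[0] for a in (STAG, HARE))
    return best_counterfactual - realized
```

Under these assumptions, a player who always hunts hare for 10 rounds has external regret 0.0 but policy regret 16.0: the frozen opponent sequence hides the fact that switching to stag would have drawn the opponent into cooperation from the second round on.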