Regret Minimization in Repeated Games Against Adaptive Opponents

Abstract

In this paper, we study regret minimization in repeated games against adaptive opponents, that is, opponents who can respond based on the history of play. The standard online-learning metric of external regret is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce Repeated Policy Regret (RP-Regret), a game-theoretic metric that measures the difference between the realized accumulated utility and the best-in-hindsight accumulated utility when all players can respond to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing: it allows stronger comparators and places fewer constraints on opponents, while preserving the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for RP-Regret to be sublinear in time. These conditions concern the variation of the player's comparator strategies in the regret definition and the memory lengths of both the comparator and the opponents' strategies. We then study additional conditions and provable algorithms for minimizing RP-Regret, which is by definition non-convex in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work on online non-convex learning; (ii) one that minimizes a convex, linearized surrogate of RP-Regret at each iteration; and (iii) one that directly minimizes RP-Regret when opponents change their strategies slowly. Furthermore, when all players run algorithms that minimize RP-Regret (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag Hunt.
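The claim that external regret fails to capture adaptivity can be illustrated with a small simulation. The sketch below is not the paper's RP-Regret definition or any of its algorithms. It is a simplified comparison under stated assumptions: a hypothetical Stag Hunt payoff table, a hypothetical memory-1 opponent (`mirror_opponent`) that copies the player's previous action, and a comparator class restricted to constant actions. External regret holds the opponent's realized actions fixed, whereas the counterfactual variant replays the game so the opponent can react to the comparator.

```python
# Hypothetical Stag Hunt payoffs for the row player (an assumption, not from the paper).
STAG, HARE = 0, 1
PAYOFF = {(STAG, STAG): 4, (STAG, HARE): 0, (HARE, STAG): 3, (HARE, HARE): 2}


def mirror_opponent(player_history):
    """Hypothetical memory-1 adaptive opponent: copies the player's last action."""
    return player_history[-1] if player_history else HARE


def play(player_actions, opponent):
    """Simulate the repeated game; return (total utility, opponent's actions)."""
    history, opp_actions, total = [], [], 0
    for a in player_actions:
        b = opponent(history)          # opponent reacts to the history so far
        total += PAYOFF[(a, b)]
        opp_actions.append(b)
        history.append(a)
    return total, opp_actions


def external_regret(player_actions, opponent):
    """Best fixed action against the opponent's *realized* actions, held fixed."""
    realized, opp_actions = play(player_actions, opponent)
    best_fixed = max(
        sum(PAYOFF[(a, b)] for b in opp_actions) for a in (STAG, HARE)
    )
    return best_fixed - realized


def counterfactual_regret(player_actions, opponent):
    """Best fixed action when the opponent is allowed to react to the comparator."""
    realized, _ = play(player_actions, opponent)
    horizon = len(player_actions)
    best_fixed = max(play([a] * horizon, opponent)[0] for a in (STAG, HARE))
    return best_fixed - realized


T = 100
always_hare = [HARE] * T
ext = external_regret(always_hare, mirror_opponent)
cf = counterfactual_regret(always_hare, mirror_opponent)
print(f"external regret: {ext}, counterfactual regret: {cf}")
```

In this toy setting, always hunting hare looks optimal under external regret because the mirroring opponent never played Stag, yet the counterfactual comparison shows that consistently hunting stag would have induced cooperation and a much higher total utility.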