Regret Minimization in Repeated Games with Adaptive Opponents

Abstract

We study regret minimization in repeated games against adaptive opponents who can respond to the history of play. The standard external regret metric from online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce {\tt Repeated Policy Regret (RP-Regret)}, a game-theoretic metric that measures the difference between the realized accumulated utility and the best-in-hindsight accumulated utility when all players can respond to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing: it allows stronger comparators and less constrained opponents, while preserving the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining {\tt RP-Regret} that is sublinear in time; these conditions concern the variation of the player's comparator strategies in the regret definition and the memories of both the comparator and the opponents' strategies. We then study additional conditions and provable algorithms for minimizing {\tt RP-Regret}, which is by definition non-convex in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work on online non-convex learning; (ii) one that minimizes a convex, linearized surrogate of {\tt RP-Regret} at each iteration; and (iii) one that directly minimizes {\tt RP-Regret} when opponents change strategies slowly. Furthermore, when all players run algorithms that minimize {\tt RP-Regret} (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. Finally, we provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.
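The gap between external regret and a counterfactual, policy-style regret can be seen in a small toy setting. The sketch below is not the paper's definition of {\tt RP-Regret}, which allows richer comparators. It only compares against fixed actions, and everything in it is an assumption made for illustration: the Stag-Hunt payoff values, the memory-1 opponent that copies the player's previous action, and the function names.

```python
# Illustrative sketch only: payoff values and the opponent model are assumptions.
STAG, HARE = 0, 1
# UTILITY[my_action][opponent_action] for the row player (assumed Stag-Hunt values).
UTILITY = [[4, 0],
           [3, 2]]


def opponent_action(my_history):
    """Hypothetical memory-1 opponent: copies my previous action, opens with HARE."""
    return my_history[-1] if my_history else HARE


def play(my_actions):
    """Simulate the game; return (my total utility, opponent's action sequence)."""
    total, opp_seq, history = 0, [], []
    for a in my_actions:
        b = opponent_action(history)  # the opponent reacts to my past play
        total += UTILITY[a][b]
        opp_seq.append(b)
        history.append(a)
    return total, opp_seq


def external_regret(my_actions):
    """Best fixed action in hindsight, holding the opponent's realized sequence fixed."""
    realized, opp_seq = play(my_actions)
    best_fixed = max(sum(UTILITY[a][b] for b in opp_seq) for a in (STAG, HARE))
    return best_fixed - realized


def policy_style_regret(my_actions):
    """Best fixed action in hindsight, re-simulating how the opponent would have reacted."""
    realized, _ = play(my_actions)
    horizon = len(my_actions)
    best_fixed = max(play([a] * horizon)[0] for a in (STAG, HARE))
    return best_fixed - realized


if __name__ == "__main__":
    always_hare = [HARE] * 100
    print("external regret:", external_regret(always_hare))
    print("policy-style regret:", policy_style_regret(always_hare))
```

Under these assumptions, a player who always hunts hare has zero external regret, because hare is the best reply to the opponent's realized all-hare sequence. The policy-style regret grows linearly with the horizon, since always hunting stag would have induced the opponent to cooperate from the second round onward.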