MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Abstract

Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., they select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
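
To make the bandit setting mentioned in the abstract more concrete, here is a minimal Python sketch of Boltzmann exploration over a score that adds an information-gain bonus to each arm's estimated value. It is an illustration under stated assumptions, not the authors' algorithm: the Gaussian reward model, the unit weighting between value and information gain, the fixed temperature, and the names `info_gain`, `boltzmann_probs`, and `run_bandit` are all hypothetical choices made for this example. It does not reproduce or verify the regret result.

```python
import math
import random


def info_gain(post_var, noise_var):
    """Information gain (in nats) about an arm's mean from one more pull,
    assuming a Gaussian posterior and Gaussian observation noise."""
    return 0.5 * math.log(1.0 + post_var / noise_var)


def boltzmann_probs(scores, temperature):
    """Softmax over scores; subtracting the max keeps exp() stable."""
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def run_bandit(true_means, steps=2000, noise_var=1.0, prior_var=1.0,
               temperature=0.1, seed=0):
    """Run the sketch on a Gaussian bandit; returns (pull counts, regret)."""
    rng = random.Random(seed)
    k = len(true_means)
    mean = [0.0] * k         # posterior mean per arm (assumed prior mean 0)
    var = [prior_var] * k    # posterior variance per arm
    pulls = [0] * k
    regret = 0.0
    best = max(true_means)
    for _ in range(steps):
        # Score = estimated value + information-gain bonus (assumed 1:1 weighting).
        scores = [mean[a] + info_gain(var[a], noise_var) for a in range(k)]
        probs = boltzmann_probs(scores, temperature)
        arm = rng.choices(range(k), weights=probs)[0]
        reward = rng.gauss(true_means[arm], math.sqrt(noise_var))
        # Conjugate Gaussian update of the pulled arm's posterior.
        precision = 1.0 / var[arm] + 1.0 / noise_var
        mean[arm] = (mean[arm] / var[arm] + reward / noise_var) / precision
        var[arm] = 1.0 / precision
        pulls[arm] += 1
        regret += best - true_means[arm]
    return pulls, regret


if __name__ == "__main__":
    counts, total_regret = run_bandit([0.1, 0.5, 0.9])
    print("pulls per arm:", counts)
    print("cumulative regret:", total_regret)
```

In this sketch the bonus shrinks as an arm's posterior variance shrinks, so rarely pulled arms keep a higher score than their estimated value alone would give them. How the two terms are actually weighted and tuned in MaxInfoRL is not specified in the abstract.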
