Risk-Averse Reinforcement Learning with Itakura-Saito Loss

Abstract

Risk-averse reinforcement learning finds application in various high-stakes fields. Unlike classical reinforcement learning, which aims to maximize expected returns, risk-averse agents choose policies that minimize risk, occasionally sacrificing expected value. These preferences can be framed through utility theory. We focus on the specific case of the exponential utility function, for which we can derive the Bellman equations and apply various reinforcement learning algorithms with few modifications. However, these methods suffer from numerical instability because exponentials must be computed throughout the process. To address this, we introduce a numerically stable and mathematically sound loss function based on the Itakura-Saito divergence for learning state-value and action-value functions. We evaluate our proposed loss function against established alternatives, both theoretically and empirically. In the experimental section, we explore multiple financial scenarios, some with known analytical solutions, and show that our loss function outperforms the alternatives.
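The abstract does not spell out the loss, so the following is only a sketch under our own assumptions, not the authors' implementation. We assume exponential utility U(x) = -exp(-λx) with λ > 0 (risk-averse), a sampled one-step target G (for example r + V(s')), and a predicted value V. Applying the Itakura-Saito divergence D_IS(p‖q) = p/q - log(p/q) - 1 to p = exp(-λG) and q = exp(-λV) gives an expression that depends only on the difference V - G. This avoids forming exp(-λG) on its own, which underflows or overflows for large returns. Because D_IS is a Bregman divergence, the constant V that minimizes the expected loss satisfies exp(-λV) = E[exp(-λG)]. In other words, V is the certainty equivalent that the exponential-utility Bellman equation calls for. The helper names (`is_divergence`, `is_loss`, `mean_is_loss`, `certainty_equivalent`) are hypothetical.

```python
import math
from typing import Sequence


def is_divergence(p: float, q: float) -> float:
    """Itakura-Saito divergence D_IS(p || q) = p/q - log(p/q) - 1, for p, q > 0."""
    ratio = p / q
    return ratio - math.log(ratio) - 1.0


def is_loss(target: float, value: float, lam: float) -> float:
    """D_IS(exp(-lam * target) || exp(-lam * value)), written in terms of the
    difference value - target only (hypothetical helper, not the paper's code).

    With x = lam * (value - target): p/q = exp(x) and log(p/q) = x, so the
    divergence is exp(x) - 1 - x. expm1 keeps it accurate when x is near 0.
    """
    x = lam * (value - target)
    return math.expm1(x) - x


def mean_is_loss(targets: Sequence[float], value: float, lam: float) -> float:
    """Average is_loss of a single predicted value against sampled targets."""
    return sum(is_loss(g, value, lam) for g in targets) / len(targets)


def certainty_equivalent(targets: Sequence[float], lam: float) -> float:
    """-(1/lam) * log(mean(exp(-lam * g))), evaluated with log-sum-exp so
    that large returns neither overflow nor underflow."""
    a = [-lam * g for g in targets]
    m = max(a)
    log_mean_exp = m + math.log(sum(math.exp(ai - m) for ai in a)) - math.log(len(a))
    return -log_mean_exp / lam


if __name__ == "__main__":
    lam = 1.0
    # Exponent space: exp(-800) and exp(-799.5) both underflow to 0.0,
    # so is_divergence(p, q) would raise ZeroDivisionError here.
    p, q = math.exp(-lam * 800.0), math.exp(-lam * 799.5)
    print(p, q)  # 0.0 0.0
    # Difference form: only exp(-0.5) is ever evaluated.
    print(is_loss(800.0, 799.5, lam))  # approximately 0.1065

    # The value minimizing the mean loss is the certainty equivalent,
    # which lies below the plain mean of the returns when lam > 0.
    returns = [1000.0, 1010.0, 990.0, 1030.0]
    ce = certainty_equivalent(returns, lam=0.5)
    print(ce, sum(returns) / len(returns))
```

Writing the loss in terms of the difference V - G is a choice made for this sketch. The paper may parameterize the value function or the target differently.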