Understanding and Improving Hyperbolic Deep Reinforcement Learning

Abstract

The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well suited for this purpose, as they naturally capture the hierarchical and relational structure often present in complex RL environments. However, training in these spaces often runs into optimization difficulties caused by the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincaré Ball and Hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (i) stable critic training through a categorical value loss instead of regression; (ii) feature regularization that guarantees bounded norms while avoiding the curse of dimensionality from clipping; and (iii) a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ guarantees stable learning, outperforms prior hyperbolic agents, and reduces wall-clock training time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines. We release our code at https://github.com/Probabilistic-and-Interactive-ML/hyper-rl
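As a rough illustration of why embedding norms matter on the Poincaré ball, the sketch below uses two textbook formulas for a ball of curvature -c: the exponential map at the origin, which sends a Euclidean vector of norm r to a point of norm tanh(sqrt(c) r) / sqrt(c), and the distance from the origin, 2 / sqrt(c) * artanh(sqrt(c) ||x||). This is not the paper's analysis and not code from the Hyper++ repository; the function names are hypothetical and chosen only for this example. It shows that for large input norms the derivative of the exponential map shrinks toward zero while the derivative of the distance grows without bound, which is one plausible way such embeddings can make gradient-based training ill-conditioned.

```python
import math


def expmap0_norm(v_norm: float, c: float = 1.0) -> float:
    """Norm of exp_0(v) on the Poincare ball of curvature -c, given ||v||."""
    sqrt_c = math.sqrt(c)
    return math.tanh(sqrt_c * v_norm) / sqrt_c


def expmap0_grad(v_norm: float, c: float = 1.0) -> float:
    """Derivative of expmap0_norm w.r.t. ||v||: 1 - tanh^2(sqrt(c) ||v||)."""
    return 1.0 - math.tanh(math.sqrt(c) * v_norm) ** 2


def dist0(x_norm: float, c: float = 1.0) -> float:
    """Hyperbolic distance from the origin to a point of norm ||x|| < 1/sqrt(c)."""
    sqrt_c = math.sqrt(c)
    return 2.0 / sqrt_c * math.atanh(sqrt_c * x_norm)


def dist0_grad(x_norm: float, c: float = 1.0) -> float:
    """Derivative of dist0 w.r.t. ||x||: 2 / (1 - c ||x||^2)."""
    return 2.0 / (1.0 - c * x_norm ** 2)


# Illustrative Euclidean feature norms (assumed values, not from the paper).
# Much larger norms would round tanh to exactly 1.0 in float64, where the
# distance and its derivative are no longer defined.
for v_norm in (0.5, 2.0, 5.0):
    x_norm = expmap0_norm(v_norm)
    print(
        f"||v||={v_norm:4.1f}  ||exp0(v)||={x_norm:.6f}  "
        f"d exp0/d||v||={expmap0_grad(v_norm):.2e}  "
        f"d dist/d||x||={dist0_grad(x_norm):.2e}"
    )
```

Under these assumptions, keeping the Euclidean feature norm bounded before the exponential map keeps both derivatives in a moderate range, which is consistent with the motivation the abstract gives for feature regularization.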
