Understanding and Improving Hyperbolic Deep Reinforcement Learning
December 16, 2025
Authors: Timo Klein, Thomas Lang, Andrii Shkabrii, Alexander Sturm, Kevin Sidak, Lukas Miklautz, Claudia Plant, Yllka Velaj, Sebastian Tschiatschek
cs.AI
Abstract
The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well-suited for this purpose, as they naturally capture hierarchical and relational structure often present in complex RL environments. However, leveraging these spaces commonly faces optimization challenges due to the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincaré Ball and Hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (i) stable critic training through a categorical value loss instead of regression; (ii) feature regularization guaranteeing bounded norms while avoiding the curse of dimensionality from clipping; and (iii) a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ guarantees stable learning, outperforms prior hyperbolic agents, and reduces wall-clock time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines. We release our code at https://github.com/Probabilistic-and-Interactive-ML/hyper-rl.
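The abstract's claim that large-norm embeddings destabilize gradient-based training can be illustrated numerically. In the Poincaré ball of curvature -1, the conformal factor λ_x = 2 / (1 - ||x||²) diverges as an embedding approaches the unit boundary, so quantities that scale with λ_x blow up for large-norm embeddings. The sketch below (a minimal illustration, not the paper's implementation) computes this factor and the exponential map at the origin:

```python
import numpy as np

def conformal_factor(x):
    # Poincaré ball conformal factor (curvature -1):
    # lambda_x = 2 / (1 - ||x||^2), diverging as ||x|| -> 1.
    return 2.0 / (1.0 - np.dot(x, x))

def expmap0(v):
    # Exponential map at the origin of the Poincaré ball:
    # exp_0(v) = tanh(||v||) * v / ||v||, always landing inside the ball.
    n = np.linalg.norm(v)
    if n == 0.0:
        return v
    return np.tanh(n) * v / n

# The factor explodes near the boundary, hinting at unstable gradients
# for large-norm embeddings.
for r in (0.5, 0.9, 0.99, 0.999):
    x = np.array([r, 0.0])
    print(f"||x|| = {r}: lambda_x = {conformal_factor(x):.1f}")
```

Because gradient magnitudes of hyperbolic operations involve powers of this factor, keeping embedding norms bounded (as Hyper++'s feature regularization does) keeps optimization in a well-conditioned regime.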
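Component (i), a categorical value loss in place of scalar regression, can be sketched as a cross-entropy against a "two-hot" projection of the return target onto a fixed support of bins. This is a hedged illustration of the general technique only: the bin layout and exact loss used by Hyper++ are not specified in the abstract, so all constants below are hypothetical.

```python
import numpy as np

def two_hot(value, bins):
    # Project a scalar return target onto a fixed support of bins,
    # splitting probability mass between the two nearest bins.
    value = float(np.clip(value, bins[0], bins[-1]))
    idx = int(np.searchsorted(bins, value, side="right")) - 1
    idx = min(idx, len(bins) - 2)
    lo, hi = bins[idx], bins[idx + 1]
    w_hi = (value - lo) / (hi - lo)
    target = np.zeros(len(bins))
    target[idx] = 1.0 - w_hi
    target[idx + 1] = w_hi
    return target

def categorical_value_loss(logits, value_target, bins):
    # Cross-entropy between the critic's predicted bin distribution
    # and the two-hot target, replacing an L2 regression loss.
    logits = logits - logits.max()                       # for numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return -np.sum(two_hot(value_target, bins) * log_probs)

bins = np.linspace(-10.0, 10.0, 51)   # hypothetical support
loss = categorical_value_loss(np.zeros(51), 3.7, bins)
```

Cross-entropy targets of this kind keep gradient magnitudes bounded regardless of the scale of the returns, which matches the abstract's motivation of stabilizing critic training.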