Representation Over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Abstract

Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines. The source code to reproduce our experiments is publicly available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
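
The Target Decoupling idea described above can be illustrated with a minimal sketch. It is not taken from the linked repository: the function names (`gae`, `clipped_surrogate`, `target_decoupled_losses`), the use of generalized advantage estimation, and the choice of a squared-error critic loss averaged over heads are all assumptions made for illustration. The sketch only shows the decoupling itself: every critic head contributes to the value loss, while the clipped PPO surrogate for the actor uses the advantage from the largest discount factor alone.

```python
def gae(rewards, values, gamma, lam=0.95):
    """Generalized advantage estimation for one episode that ends in a terminal state.

    values[t] is the critic's estimate for step t; the value after the last
    step is taken to be 0 (assumption: no bootstrapping past the episode end).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped objective for a single sample."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def target_decoupled_losses(rewards, values_per_gamma, ratios, gammas,
                            lam=0.95, clip_eps=0.2):
    """Return (actor_loss, critic_loss) for one episode.

    values_per_gamma[i][t] is the prediction of the (hypothetical) critic
    head trained with discount gammas[i].
    """
    advs = [gae(rewards, v, g, lam) for v, g in zip(values_per_gamma, gammas)]

    # Critic side: every timescale head regresses toward its own target
    # (target = advantage + value, so the residual equals the advantage).
    critic_loss = sum(
        sum(a * a for a in adv) / len(adv) for adv in advs
    ) / len(gammas)

    # Actor side: short-horizon heads are ignored entirely; only the
    # advantage from the largest discount factor enters the policy loss.
    long_idx = max(range(len(gammas)), key=lambda i: gammas[i])
    long_adv = advs[long_idx]
    actor_loss = -sum(
        clipped_surrogate(r, a, clip_eps) for r, a in zip(ratios, long_adv)
    ) / len(long_adv)

    return actor_loss, critic_loss
```

Under these assumptions, changing the predictions of a short-horizon head alters the critic loss but leaves the actor loss untouched, which is the isolation property the abstract attributes to the Actor side.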