Representation Over Routing: Overcoming Surrogate Objective Hacking in Multi-Timescale PPO

Abstract

Temporal credit assignment has long been a central challenge in reinforcement learning. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent work has introduced multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper shows that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluation across multiple independent random seeds in the LunarLander-v2 environment shows that the proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines. The source code to reproduce our experiments is publicly available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
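The Target Decoupling idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`gae`, `decoupled_targets`), the use of generalized advantage estimation, the value of `lam`, and the choice of treating the largest discount factor as the "long-term" timescale are all assumptions made for illustration. The sketch only shows which signal goes where: every critic head receives a regression target for its own discount factor, while the actor receives the advantage from the longest horizon alone.

```python
from typing import Dict, List, Sequence, Tuple


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation for a single discount factor.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def decoupled_targets(
    rewards: Sequence[float],
    values_by_gamma: Dict[float, Sequence[float]],
    lam: float = 0.95,
) -> Tuple[Dict[float, List[float]], List[float]]:
    """Split learning signals between critic and actor (hypothetical helper).

    `values_by_gamma` maps each discount factor to the predictions of the
    critic head trained for that timescale.
    """
    advantages = {
        gamma: gae(rewards, values, gamma, lam)
        for gamma, values in values_by_gamma.items()
    }
    # Critic side: every head regresses on its own return target, so the
    # short-term heads still shape the shared representation.
    critic_targets = {
        gamma: [a + v for a, v in zip(advantages[gamma],
                                      values_by_gamma[gamma][:len(rewards)])]
        for gamma in values_by_gamma
    }
    # Actor side: short-term signals are isolated; only the advantage of
    # the longest horizon (assumed here to be the largest gamma) is used.
    long_gamma = max(values_by_gamma)
    return critic_targets, advantages[long_gamma]
```

In a PPO update built on this sketch, the critic loss would sum the regression errors over all entries of `critic_targets`, while the clipped policy loss would consume only the returned actor advantage, so no short-timescale signal reaches the policy gradient.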