Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO
May 21, 2026
Author: Jing Sun
cs.AI
Abstract
Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines. The source code to reproduce our experiments is publicly available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
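The Target Decoupling idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function names `gae` and `decoupled_targets`, the use of Generalized Advantage Estimation, the choice of "largest discount factor" as the long-term head, and the example discount factors are all assumptions made for illustration. The sketch only shows the data flow: every timescale yields a regression target for the Critic, while the Actor receives the advantage from the long-term head alone.

```python
from typing import Dict, List, Sequence, Tuple


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one episode ending in a terminal state.

    values[t] is the critic's estimate V(s_t); the value after the final
    step is taken to be 0 because the episode terminates there.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def decoupled_targets(
    rewards: Sequence[float],
    values_by_gamma: Dict[float, Sequence[float]],
    lam: float = 0.95,
) -> Tuple[Dict[float, List[float]], List[float]]:
    """Return (critic_targets, actor_advantages).

    Assumption: one critic head per discount factor. Every head gets its
    own return target (auxiliary representation learning), but only the
    head with the largest gamma supplies the advantage used by the actor.
    """
    critic_targets: Dict[float, List[float]] = {}
    advantages_by_gamma: Dict[float, List[float]] = {}
    for gamma, values in values_by_gamma.items():
        adv = gae(rewards, values, gamma, lam)
        advantages_by_gamma[gamma] = adv
        # Value target = advantage + baseline, as in standard PPO practice.
        critic_targets[gamma] = [a + v for a, v in zip(adv, values)]
    long_gamma = max(values_by_gamma)  # long-term head (illustrative choice)
    return critic_targets, advantages_by_gamma[long_gamma]


# Hypothetical three-step episode with a single delayed reward.
rewards = [0.0, 0.0, 1.0]
values = {0.5: [0.0, 0.0, 0.0], 0.9: [0.0, 0.0, 0.0]}
targets, actor_adv = decoupled_targets(rewards, values, lam=1.0)
```

In a full PPO loop under these assumptions, `targets` would feed a summed value loss over all heads, and `actor_adv` alone would enter the clipped surrogate objective, so short-term signals shape the shared representation without ever reaching the policy gradient.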