Representation Over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Abstract

Temporal credit assignment has long been a central challenge in reinforcement learning. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent work has introduced multiple discount factors into Actor-Critic architectures such as Proximal Policy Optimization (PPO) to balance short-term responses against long-term planning. This paper shows, however, that naively fusing multi-timescale signals in complex delayed-reward tasks can cause severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture. On the Critic side, we retain multi-timescale predictions as an auxiliary representation-learning signal; on the Actor side, we strictly isolate short-term signals and update the policy using long-term advantages only. Empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment show that the proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines. The source code to reproduce our experiments is publicly available at https://github.com/ben-dlwlrma/Representation-Over-Routing.
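The Target Decoupling idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names `gae` and `decoupled_targets`, the GAE parameter `lam`, and the choice to treat the head with the largest discount factor as the "long-term" head are all assumptions made for illustration. The sketch also assumes a single episode that ends in a terminal state.

```python
from typing import List, Sequence, Tuple


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimates for one terminated episode.

    values[t] is the critic's estimate V(s_t); the value after the
    final step is taken to be 0 (assumption: the episode terminates).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def decoupled_targets(rewards: Sequence[float],
                      values_per_head: Sequence[Sequence[float]],
                      gammas: Sequence[float],
                      lam: float = 0.95) -> Tuple[List[List[float]], List[float]]:
    """Hypothetical helper: split critic targets from the actor advantage.

    values_per_head[k][t] is the value predicted by critic head k,
    which is trained with discount factor gammas[k].
    """
    adv_per_head = [gae(rewards, v, g, lam)
                    for v, g in zip(values_per_head, gammas)]

    # Critic side: every head keeps its own regression target
    # (advantage + value), so all timescales shape the representation.
    critic_targets = [[a + v for a, v in zip(adv, vals)]
                      for adv, vals in zip(adv_per_head, values_per_head)]

    # Actor side: only the longest-horizon head supplies the advantage;
    # short-term heads never enter the policy loss.
    long_idx = max(range(len(gammas)), key=lambda k: gammas[k])
    actor_advantage = adv_per_head[long_idx]
    return critic_targets, actor_advantage


# Usage: a delayed reward of 1.0 at the last step, two critic heads.
rewards = [0.0, 0.0, 1.0]
gammas = [0.5, 0.99]
values_per_head = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
critic_targets, actor_advantage = decoupled_targets(
    rewards, values_per_head, gammas, lam=1.0)
```

In this toy episode the short-horizon head (discount 0.5) assigns little credit to the first step, while the actor advantage comes entirely from the discount 0.99 head. In a full PPO update, `actor_advantage` would feed the clipped surrogate loss and `critic_targets` would feed a per-head value loss; both steps are omitted here.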