Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning
May 19, 2025
Authors: Hongjoon Ahn, Heewoong Choi, Jisu Han, Taesup Moon
cs.AI
Abstract
Offline goal-conditioned reinforcement learning (GCRL) offers a practical
learning paradigm where goal-reaching policies are trained from abundant
unlabeled (reward-free) datasets without additional environment interaction.
However, offline GCRL still struggles with long-horizon tasks, even with recent
advances that employ hierarchical policy structures, such as HIQL. By
identifying the root cause of this challenge, we make two
observations: First, performance bottlenecks mainly stem from the high-level
policy's inability to generate appropriate subgoals. Second, when learning the
high-level policy in the long-horizon regime, the sign of the advantage signal
frequently becomes incorrect. Thus, we argue that improving the value function
to produce a clear advantage signal for learning the high-level policy is
essential. In this paper, we propose a simple yet effective solution:
Option-aware Temporally Abstracted value learning, dubbed OTA, which
incorporates temporal abstraction into the temporal-difference learning
process. By modifying the value update to be option-aware, the proposed
learning scheme contracts the effective horizon length, enabling better
advantage estimates even in long-horizon regimes. We experimentally show that
the high-level policy extracted using the OTA value function achieves strong
performance on complex tasks from OGBench, a recently proposed offline GCRL
benchmark, including maze navigation and visual robotic manipulation
environments.
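
The abstract does not spell out the OTA update rule, so the following is only a toy sketch of the general idea of an option-aware backup, not the authors' algorithm. It assumes a deterministic chain whose last state is the goal, a reward of -1 per backup with a value of 0 at the goal, and a hypothetical option length `n`; the function name `chain_values` and all constants are illustrative. With `option_len=1` the backup is an ordinary one-step temporal-difference backup. With `option_len=n` the backup bootstraps from the state `n` steps ahead and applies a single reward and a single discount per option, so the number of backups between the start and the goal shrinks by roughly a factor of `n`.

```python
import math
import numpy as np


def chain_values(num_states, gamma, option_len):
    """Goal-conditioned values on a deterministic chain (goal = last state).

    Assumed convention (not taken from the paper): each backup costs -1
    and V(goal) = 0. option_len=1 is a one-step TD backup; option_len=n
    bootstraps from the state n primitive steps ahead.
    """
    goal = num_states - 1
    v = np.zeros(num_states)  # V(goal) stays at 0
    # Sweep backwards so every bootstrap target is already final.
    for s in range(goal - 1, -1, -1):
        nxt = min(s + option_len, goal)  # clip the option at the goal
        v[s] = -1.0 + gamma * v[nxt]
    return v


gamma = 0.99        # hypothetical discount factor
num_states = 1001   # start state 0 is 1000 primitive steps from the goal
n = 25              # hypothetical option length

v_td = chain_values(num_states, gamma, option_len=1)
v_ota = chain_values(num_states, gamma, option_len=n)

# Effective horizon: number of backups separating the start from the goal.
backups_td = num_states - 1
backups_ota = math.ceil((num_states - 1) / n)

# Value gap between the start state and a subgoal n steps closer to the goal.
# A high-level advantage is built from gaps of this kind.
gap_td = v_td[n] - v_td[0]
gap_ota = v_ota[n] - v_ota[0]

print("backups to goal:", backups_td, "vs", backups_ota)
print("subgoal value gap:", gap_td, "vs", gap_ota)
```

In this toy setting the one-step value function is nearly flat far from the goal, so the gap between a state and a closer subgoal is small and could easily be swamped by approximation error. The option-level backup keeps that gap much larger, which is one way to picture why a shorter effective horizon might give a cleaner advantage sign.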