Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning
May 19, 2025
Authors: Hongjoon Ahn, Heewoong Choi, Jisu Han, Taesup Moon
cs.AI
Abstract
Offline goal-conditioned reinforcement learning (GCRL) offers a practical
learning paradigm where goal-reaching policies are trained from abundant
unlabeled (reward-free) datasets without additional environment interaction.
However, offline GCRL still struggles with long-horizon tasks, even with recent
advances that employ hierarchical policy structures, such as HIQL. By
identifying the root cause of this challenge, we observe the following
insights: First, performance bottlenecks mainly stem from the high-level
policy's inability to generate appropriate subgoals. Second, when learning the
high-level policy in the long-horizon regime, the sign of the advantage signal
frequently becomes incorrect. Thus, we argue that improving the value function
to produce a clear advantage signal for learning the high-level policy is
essential. In this paper, we propose a simple yet effective solution:
Option-aware Temporally Abstracted value learning, dubbed OTA, which
incorporates temporal abstraction into the temporal-difference learning
process. By modifying the value update to be option-aware, the proposed
learning scheme contracts the effective horizon length, enabling better
advantage estimates even in long-horizon regimes. We experimentally show that
the high-level policy extracted using the OTA value function achieves strong
performance on complex tasks from OGBench, a recently proposed offline GCRL
benchmark, including maze navigation and visual robotic manipulation
environments.
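The claim that an option-aware value update contracts the effective horizon can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a deterministic chain whose last state is the goal, a reward of -1 per backup until the goal is reached, and a hypothetical abstraction factor `n` that makes each backup bootstrap from the state `n` steps ahead. The function names and the reward convention are illustrative assumptions.

```python
import math


def chain_values(num_states, gamma=0.99, n=1):
    """Goal-conditioned values on a deterministic chain (toy assumption).

    The goal is the last state and has value 0. Every other state pays a
    reward of -1 and bootstraps from the state n steps ahead, clipped at
    the goal. n=1 is the ordinary one-step TD backup; n>1 stands in for
    a temporally abstracted (option-level) backup.
    """
    goal = num_states - 1
    values = [0.0] * num_states
    # Sweep backward so the bootstrapped state is always already solved.
    for s in range(goal - 1, -1, -1):
        next_s = min(s + n, goal)
        values[s] = -1.0 + gamma * values[next_s]
    return values


def effective_horizon(distance, n):
    """Number of backups separating a state from the goal."""
    return math.ceil(distance / n)


one_step = chain_values(101, n=1)    # 100 primitive steps to the goal
abstracted = chain_values(101, n=5)  # 20 option-level steps to the goal

print("backups needed:", effective_horizon(100, 1), "vs", effective_horizon(100, 5))
print("V(start):", one_step[0], "vs", abstracted[0])
```

In this sketch the start state sits 100 backups from the goal under the one-step update and 20 under the abstracted update, so its value is built from fewer bootstrapped steps. This is the sense in which the horizon is shortened here; how OTA defines options and targets on real datasets is specified in the paper, not in this toy.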