Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning

Abstract

Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm in which goal-reaching policies are trained from abundant unlabeled (reward-free) datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. Analyzing the root cause of this difficulty, we make two observations. First, the performance bottleneck mainly stems from the high-level policy's inability to generate appropriate subgoals. Second, when the high-level policy is learned in the long-horizon regime, the sign of the advantage signal is frequently incorrect. We therefore argue that improving the value function so that it produces a clear advantage signal is essential for learning the high-level policy. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. By making the value update option-aware, the proposed learning scheme shortens the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We experimentally show that the high-level policy extracted using the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments.
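
To give some intuition for why a shorter effective horizon can yield a clearer advantage signal, the toy calculation below may help. It is only an illustrative sketch and not the OTA algorithm itself: the chain-like setting, the reward of -1 per step until the goal, the discount `gamma = 0.99`, and the helper names `goal_value` and `subgoal_gap` are all assumptions introduced here. It compares the value gap between a closer and a farther subgoal when each backup spans one environment step versus `option_len` steps.

```python
import math


def goal_value(distance, gamma=0.99, option_len=1):
    """Discounted return under an assumed reward of -1 per backup step.

    With option_len = n, one backup is assumed to span n environment
    steps, so the goal is reached after ceil(distance / n) backups.
    """
    steps = math.ceil(distance / option_len)
    return -(1.0 - gamma ** steps) / (1.0 - gamma)


def subgoal_gap(distance, k, gamma=0.99, option_len=1):
    """Value gap between a subgoal k steps closer to the goal and one
    k steps farther away, both measured from a state `distance` steps out."""
    closer = goal_value(distance - k, gamma, option_len)
    farther = goal_value(distance + k, gamma, option_len)
    return closer - farther


if __name__ == "__main__":
    # Hypothetical numbers: a state 500 steps from the goal, subgoals 25 steps apart.
    for n in (1, 5, 25):
        gap = subgoal_gap(distance=500, k=25, option_len=n)
        print(f"option_len={n:2d}  value gap={gap:.4f}")
```

In this toy setting the values are bounded by the same range for every `option_len`, yet the gap between the better and the worse subgoal grows as `option_len` increases. A learned value function with a fixed amount of approximation error would therefore be less likely to flip the sign of the advantage, which is the effect the abstract attributes to contracting the effective horizon.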