Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning

May 19, 2025
Authors: Hongjoon Ahn, Heewoong Choi, Jisu Han, Taesup Moon
cs.AI

Abstract

Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm in which goal-reaching policies are trained from abundant unlabeled (reward-free) datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. Analyzing the root cause of this challenge, we make two observations. First, the performance bottleneck mainly stems from the high-level policy's inability to generate appropriate subgoals. Second, when the high-level policy is learned in the long-horizon regime, the sign of the advantage signal frequently becomes incorrect. We therefore argue that improving the value function so that it produces a clear advantage signal for learning the high-level policy is essential. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. By making the value update option-aware, the proposed learning scheme contracts the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We show experimentally that the high-level policy extracted using the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments.
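To make the idea of contracting the effective horizon concrete, the following is a minimal tabular sketch, not the authors' implementation. It assumes a deterministic chain of states leading to the goal, a reward of -1 per backup until the goal is reached, and a hypothetical helper `chain_values` whose parameter `n` is the number of environment steps treated as one temporally abstracted (option) step. With `n = 1` it reduces to an ordinary one-step TD backup.

```python
import math


def chain_values(num_states, gamma, n=1):
    """Goal-conditioned values on a chain 0 -> 1 -> ... -> goal (illustrative only).

    Assumptions (not taken from the paper): the reward is -1 for every
    backup made before the goal and the goal value is 0. With n = 1 each
    backup bootstraps from the next state; with n > 1 it bootstraps from
    the state n environment steps ahead (clipped at the goal), so one
    backup covers one n-step "option".
    """
    goal = num_states - 1
    values = [0.0] * num_states
    # Sweep backwards so the bootstrapped value is already final when used.
    for s in range(goal - 1, -1, -1):
        next_state = min(s + n, goal)
        values[s] = -1.0 + gamma * values[next_state]
    return values


def backups_to_goal(state, goal, n=1):
    """Number of backups separating `state` from the goal (effective horizon)."""
    return math.ceil((goal - state) / n)


if __name__ == "__main__":
    gamma = 0.99
    one_step = chain_values(101, gamma, n=1)
    abstracted = chain_values(101, gamma, n=5)
    print("effective horizon:", backups_to_goal(0, 100, 1), "vs", backups_to_goal(0, 100, 5))
    print("V(start), one-step backup:", one_step[0])
    print("V(start), 5-step abstraction:", abstracted[0])
```

In this toy setting the value of a state that is k backups from the goal is -(1 - gamma^k) / (1 - gamma), so the difference between two states one backup apart is gamma^(k-1). Reducing k through temporal abstraction keeps that difference larger far from the goal, which is one way to read the abstract's claim about clearer advantage signals; the actual method applies this idea inside a learned value function rather than a table.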
