Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning

Abstract

Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm in which goal-reaching policies are trained from abundant unlabeled (reward-free) datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. Investigating the root cause of this difficulty, we make two observations. First, the performance bottleneck stems mainly from the high-level policy's inability to generate appropriate subgoals. Second, when the high-level policy is learned in the long-horizon regime, the sign of the advantage signal is frequently incorrect. We therefore argue that it is essential to improve the value function so that it produces a clear advantage signal for learning the high-level policy. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. By making the value update option-aware, the proposed learning scheme contracts the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We show experimentally that the high-level policy extracted using the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments.
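As a rough illustration of what contracting the effective horizon can mean, the sketch below computes exact values on a toy deterministic chain. It is not the OTA implementation: the chain environment, the reward of -1 per backup, the jump length `n`, and the helper names `chain_values` and `effective_horizon` are all assumptions made for this example. With `n = 1` the backup is ordinary one-step temporal-difference learning; with `n > 1` each backup bootstraps from the state `n` primitive steps ahead, so fewer bootstrapped backups separate a state from the goal.

```python
import math


def chain_values(num_states: int, gamma: float, n: int) -> list[float]:
    """Exact values on a deterministic chain whose goal is the last state.

    Assumptions (illustrative only, not taken from the paper): every backup
    taken from a non-goal state earns reward -1, and one backup bootstraps
    from the state n primitive steps ahead, clipped at the goal.
    n = 1 corresponds to ordinary one-step TD.
    """
    goal = num_states - 1
    values = [0.0] * num_states  # V(goal) = 0 by convention
    # Sweep backward so the bootstrapped successor is already computed.
    for s in range(goal - 1, -1, -1):
        successor = min(s + n, goal)
        values[s] = -1.0 + gamma * values[successor]
    return values


def effective_horizon(distance: int, n: int) -> int:
    """Number of bootstrapped backups between a state and the goal."""
    return math.ceil(distance / n)


if __name__ == "__main__":
    gamma = 0.99
    one_step = chain_values(101, gamma, n=1)
    abstracted = chain_values(101, gamma, n=5)
    print("backups to goal:", effective_horizon(100, 1), "vs", effective_horizon(100, 5))
    print("V(start):", one_step[0], "vs", abstracted[0])
```

In this toy setting, a start state 100 primitive steps from the goal needs 100 backups under one-step TD but only 20 when each backup spans 5 steps. How OTA defines options and combines them with the offline value objective is specified in the paper itself, not in this sketch.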
