Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

Abstract

Unsupervised pretraining has been transformative in many supervised domains. Applying such ideas to reinforcement learning (RL), however, presents a unique challenge: fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies. Prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, but it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-relabels unlabeled trajectories using an optimistic reward model, transforming prior data into high-level, task-relevant examples. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.
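The pseudo-relabeling step described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a fixed skill horizon, uses a hypothetical `encode_skill` function as a stand-in for the pretrained VAE encoder, and substitutes a simple count-based bonus for the optimistic reward model. All function names, parameters, and the bonus formula are illustrative assumptions.

```python
import numpy as np


def encode_skill(seg_states, seg_actions, latent_dim=2):
    """Hypothetical stand-in for a pretrained VAE encoder.

    A real encoder would be a learned network; here we just take the first
    `latent_dim` entries of the averaged segment features.
    """
    feats = np.concatenate([seg_states.mean(axis=0), seg_actions.mean(axis=0)])
    return feats[:latent_dim]


def optimistic_reward(state, visit_counts, bin_size=1.0, bonus_scale=1.0):
    """Assumed optimistic reward: a count-based bonus on a discretized state.

    Rarely visited states receive a larger bonus. The reward estimate itself
    is taken to be zero here, which is a simplification.
    """
    key = tuple(np.floor(state / bin_size).astype(int))
    n = visit_counts.get(key, 0)
    return bonus_scale / np.sqrt(n + 1)


def relabel(states, actions, horizon, visit_counts):
    """Turn one unlabeled trajectory into high-level transitions.

    `states` has one more row than `actions`. Each non-overlapping segment of
    `horizon` steps becomes (start_state, skill_latent, reward, end_state).
    """
    transitions = []
    num_steps = len(actions)
    for start in range(0, num_steps - horizon + 1, horizon):
        end = start + horizon
        z = encode_skill(states[start:end], actions[start:end])
        # Sum the optimistic reward over the states reached within the segment.
        r = sum(optimistic_reward(s, visit_counts) for s in states[start + 1:end + 1])
        transitions.append((states[start], z, r, states[end]))
    return transitions


# Usage: an 8-step trajectory with a skill horizon of 4 gives two transitions.
demo_states = np.arange(18, dtype=float).reshape(9, 2)
demo_actions = np.ones((8, 2))
demo_transitions = relabel(demo_states, demo_actions, horizon=4, visit_counts={})
```

In this sketch the relabeled tuples would be added to the replay buffer of an off-policy high-level learner alongside online data. How the optimistic reward is actually estimated and updated is left to the paper and code.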

