Contrastive Example-Based Control

Abstract

While many real-world problems might benefit from reinforcement learning, these problems rarely fit into the Markov decision process (MDP) mold: interacting with the environment is often expensive, and specifying reward functions is challenging. Motivated by these challenges, prior work has developed data-driven approaches that learn entirely from samples of the transition dynamics and examples of high-return states. These methods typically learn a reward function from high-return states, use that reward function to label the transitions, and then apply an offline RL algorithm to these transitions. While these methods can achieve good results on many tasks, they can be complex, often requiring regularization and temporal difference updates. In this paper, we propose a method for offline, example-based control that learns an implicit model of multi-step transitions, rather than a reward function. We show that this implicit model can represent the Q-values for the example-based control problem. Across a range of state-based and image-based offline control tasks, our method outperforms baselines that use learned reward functions; additional experiments demonstrate improved robustness and scaling with dataset size.
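The abstract does not spell out the training objective, so the following is only a minimal sketch of one way an implicit multi-step transition model could be learned contrastively. It assumes an InfoNCE-style loss with in-batch negatives over precomputed embeddings, and a Q-like score formed by averaging exponentiated similarities over success examples. The function names `infonce_loss` and `example_score`, the dot-product critic, and the scoring rule are all illustrative assumptions, not the authors' implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def infonce_loss(sa_embeddings, future_embeddings):
    """Mean InfoNCE loss with in-batch negatives (assumed objective).

    Row i of sa_embeddings (a state-action embedding) is paired with row i
    of future_embeddings (an embedding of a state reached later in the same
    trajectory). Every other row j != i is treated as a negative.
    """
    n = len(sa_embeddings)
    total = 0.0
    for i in range(n):
        logits = [dot(sa_embeddings[i], future_embeddings[j]) for j in range(n)]
        # Numerically stable log-sum-exp over the candidate future states.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching future state as the correct class.
        total += log_norm - logits[i]
    return total / n


def example_score(sa_embedding, success_embeddings):
    """Hypothetical Q-like score: mean exp(similarity) to success examples."""
    sims = [math.exp(dot(sa_embedding, e)) for e in success_embeddings]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    sa = [[2.0, 0.0], [0.0, 2.0]]
    aligned = [[2.0, 0.0], [0.0, 2.0]]
    shuffled = [[0.0, 2.0], [2.0, 0.0]]
    print("aligned loss: ", infonce_loss(sa, aligned))
    print("shuffled loss:", infonce_loss(sa, shuffled))
```

In this toy setup, correctly paired embeddings yield a lower loss than shuffled ones. Under the stated assumptions, a policy could then prefer the action whose embedding gives the higher `example_score` against the success examples, with no separate reward function and no temporal difference update in the sketch.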
