Contrastive Example-Based Control
July 24, 2023
Authors: Kyle Hatch, Benjamin Eysenbach, Rafael Rafailov, Tianhe Yu, Ruslan Salakhutdinov, Sergey Levine, Chelsea Finn
cs.AI
Abstract
While many real-world problems might benefit from reinforcement
learning, they rarely fit into the MDP mold: interacting with the
environment is often expensive and specifying reward functions is challenging.
Motivated by these challenges, prior work has developed data-driven approaches
that learn entirely from samples from the transition dynamics and examples of
high-return states. These methods typically learn a reward function from
high-return states, use that reward function to label the transitions, and then
apply an offline RL algorithm to these transitions. While these methods can
achieve good results on many tasks, they can be complex, often requiring
regularization and temporal difference updates. In this paper, we propose a
method for offline, example-based control that learns an implicit model of
multi-step transitions, rather than a reward function. We show that this
implicit model can represent the Q-values for the example-based control
problem. Across a range of state-based and image-based offline control tasks,
our method outperforms baselines that use learned reward functions; additional
experiments demonstrate improved robustness and scaling with dataset size.
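
As a rough illustration of the idea, the sketch below (PyTorch; the class name ContrastiveCritic, the network sizes, and the helper functions are hypothetical assumptions, not the authors' implementation) shows how a contrastive critic over (state, action, future-state) triples might be trained from offline transitions and then scored against success examples to stand in for Q-values, with no learned reward function or temporal-difference update.

# Minimal sketch, assuming a contrastive critic whose logits play the role of
# Q-values for example-based control. Names and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCritic(nn.Module):
    """Scores how compatible a candidate future state is with (state, action)."""
    def __init__(self, state_dim, action_dim, hidden_dim=256, repr_dim=64):
        super().__init__()
        self.sa_encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, repr_dim))
        self.future_encoder = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, repr_dim))

    def forward(self, states, actions, future_states):
        # (B, d) representations; inner products give a (B, N) logit matrix.
        phi = self.sa_encoder(torch.cat([states, actions], dim=-1))
        psi = self.future_encoder(future_states)
        return phi @ psi.T

def contrastive_loss(critic, states, actions, future_states):
    # Positives are matched (s, a, s_future) triples drawn from the same
    # trajectory; negatives are the other future states in the batch.
    logits = critic(states, actions, future_states)                 # (B, B)
    labels = torch.arange(states.shape[0], device=states.device)    # diagonal = positives
    return F.cross_entropy(logits, labels)

def q_value_estimate(critic, states, actions, success_examples):
    # Scoring (state, action) against the provided high-return example states
    # serves as a Q-value estimate; no reward labels are required.
    logits = critic(states, actions, success_examples)   # (B, num_examples)
    return logits.mean(dim=-1)

A policy would then be extracted by maximizing these scores on dataset states, with whatever offline regularization the setting requires; details such as how future states are sampled within a trajectory are omitted from this sketch.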