Supervised Pretraining Can Learn In-Context Reinforcement Learning
June 26, 2023
Authors: Jonathan N. Lee, Annie Xie, Aldo Pacchiano, Yash Chandak, Chelsea Finn, Ofir Nachum, Emma Brunskill
cs.AI
Abstract
Large transformer models trained on diverse datasets have shown a remarkable
ability to learn in-context, achieving high few-shot performance on tasks they
were not explicitly trained to solve. In this paper, we study the in-context
learning capabilities of transformers in decision-making problems, i.e.,
reinforcement learning (RL) for bandits and Markov decision processes. To do
so, we introduce and study Decision-Pretrained Transformer (DPT), a supervised
pretraining method where the transformer predicts an optimal action given a
query state and an in-context dataset of interactions, across a diverse set of
tasks. This procedure, while simple, produces a model with several surprising
capabilities. We find that the pretrained transformer can be used to solve a
range of RL problems in-context, exhibiting both exploration online and
conservatism offline, despite not being explicitly trained to do so. The model
also generalizes beyond the pretraining distribution to new tasks and
automatically adapts its decision-making strategies to unknown structure.
Theoretically, we show DPT can be viewed as an efficient implementation of
Bayesian posterior sampling, a provably sample-efficient RL algorithm. We
further leverage this connection to provide guarantees on the regret of the
in-context algorithm yielded by DPT, and prove that it can learn faster than
algorithms used to generate the pretraining data. These results suggest a
promising yet simple path towards instilling strong in-context decision-making
abilities in transformers.
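
The pretraining objective described in the abstract (predict the optimal action for a query state, conditioned on an in-context dataset of interactions, across many tasks) can be made concrete. Below is a minimal, self-contained PyTorch sketch, not the authors' code: the task family is a K-armed bandit (so the query state is trivial), the transformer backbone is stubbed by a small mean-pooling network, and all names and hyperparameters (SimpleSeqModel, K, n_context, d_model) are illustrative assumptions.

```python
# Minimal sketch of the DPT-style supervised pretraining loop (illustrative,
# not the authors' implementation). Tasks are K-armed Gaussian bandits; the
# supervision target is each sampled task's optimal action.
import torch
import torch.nn as nn

K, n_context, d_model = 5, 20, 32  # assumed sizes, for illustration only

class SimpleSeqModel(nn.Module):
    """Stand-in for the transformer backbone: embeds the in-context
    (action, reward) pairs, pools them, and predicts action logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(K + 1, d_model)  # one-hot action + reward
        self.head = nn.Linear(d_model, K)       # logits over the K actions

    def forward(self, context):                 # context: (B, n_context, K+1)
        h = self.embed(context).mean(dim=1)     # mean-pool over the dataset
        return self.head(h)

def sample_pretraining_batch(batch_size):
    """Sample tasks (bandit mean vectors), roll out a random in-context
    dataset from each, and label each with the task's optimal action."""
    means = torch.rand(batch_size, K)                    # one task per row
    actions = torch.randint(K, (batch_size, n_context))  # behavior data
    rewards = means.gather(1, actions) + 0.1 * torch.randn(batch_size, n_context)
    context = torch.cat(
        [nn.functional.one_hot(actions, K).float(), rewards.unsqueeze(-1)],
        dim=-1,
    )
    optimal_action = means.argmax(dim=1)                 # supervision target
    return context, optimal_action

model = SimpleSeqModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    context, target = sample_pretraining_batch(64)
    loss = nn.functional.cross_entropy(model(context), target)  # predict a*
    opt.zero_grad(); loss.backward(); opt.step()
```

At deployment time, the same model is queried with the interactions collected so far as the in-context dataset; the abstract's claim is that acting on its predictions then yields exploration online and conservatism offline without any further training.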
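The theoretical claim that DPT can be viewed as an efficient implementation of Bayesian posterior sampling also has a compact reference point. The sketch below shows posterior (Thompson) sampling on a Bernoulli bandit, the provably sample-efficient algorithm the abstract refers to; the Beta(1,1) prior, arm count, and horizon are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of Bayesian posterior sampling (Thompson sampling) on a
# Bernoulli bandit -- the algorithm the paper relates DPT to. All constants
# here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.uniform(size=5)       # unknown task: 5 Bernoulli arms
alpha = np.ones(5); beta = np.ones(5)  # Beta(1,1) posterior per arm

for t in range(500):
    theta = rng.beta(alpha, beta)        # sample one task from the posterior
    a = int(theta.argmax())              # act optimally for the sampled task
    r = rng.binomial(1, true_means[a])   # observe reward from the true task
    alpha[a] += r; beta[a] += 1 - r      # conjugate posterior update

print("best arm:", true_means.argmax(), "most-pulled:", (alpha + beta).argmax())
```

Sampling a task from the posterior and acting optimally for it is exactly the structure the abstract attributes to DPT in-context: the in-context dataset plays the role of the posterior's conditioning data, and the predicted optimal action plays the role of the greedy action for the sampled task.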