Supervised Pretraining Can Learn In-Context Reinforcement Learning
June 26, 2023
Authors: Jonathan N. Lee, Annie Xie, Aldo Pacchiano, Yash Chandak, Chelsea Finn, Ofir Nachum, Emma Brunskill
cs.AI
Abstract
Large transformer models trained on diverse datasets have shown a remarkable
ability to learn in-context, achieving high few-shot performance on tasks they
were not explicitly trained to solve. In this paper, we study the in-context
learning capabilities of transformers in decision-making problems, i.e.,
reinforcement learning (RL) for bandits and Markov decision processes. To do
so, we introduce and study Decision-Pretrained Transformer (DPT), a supervised
pretraining method where the transformer predicts an optimal action given a
query state and an in-context dataset of interactions, across a diverse set of
tasks. This procedure, while simple, produces a model with several surprising
capabilities. We find that the pretrained transformer can be used to solve a
range of RL problems in-context, exhibiting both exploration online and
conservatism offline, despite not being explicitly trained to do so. The model
also generalizes beyond the pretraining distribution to new tasks and
automatically adapts its decision-making strategies to unknown structure.
Theoretically, we show DPT can be viewed as an efficient implementation of
Bayesian posterior sampling, a provably sample-efficient RL algorithm. We
further leverage this connection to provide guarantees on the regret of the
in-context algorithm yielded by DPT, and prove that it can learn faster than
algorithms used to generate the pretraining data. These results suggest a
promising yet simple path towards instilling strong in-context decision-making
abilities in transformers.
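
To make the pretraining recipe concrete, below is a minimal PyTorch-style sketch of the DPT objective: a transformer receives a query state together with an in-context dataset of interactions and is trained with cross-entropy to predict the optimal action for the underlying task. Everything here (the module name DPTSketch, the transition tokenization, the batch pipeline it presumes) is an illustrative assumption, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DPTSketch(nn.Module):
    # Maps (query state, in-context dataset) to logits over actions.
    # Each context token encodes one transition as (state, one-hot action, reward).
    def __init__(self, state_dim, num_actions, d_model=64, n_layers=2):
        super().__init__()
        self.token_proj = nn.Linear(state_dim + num_actions + 1, d_model)
        self.query_proj = nn.Linear(state_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_actions)

    def forward(self, query_state, context):
        # query_state: (B, state_dim); context: (B, T, state_dim + num_actions + 1)
        tokens = torch.cat(
            [self.token_proj(context),
             self.query_proj(query_state).unsqueeze(1)],  # query token last
            dim=1)
        h = self.backbone(tokens)
        return self.head(h[:, -1])  # logits for the optimal action at the query state

def pretrain_step(model, optimizer, batch):
    # batch comes from a hypothetical pipeline that samples a task, rolls out
    # an interaction dataset in it, and labels a query state with that task's
    # optimal action.
    query_state, context, optimal_action = batch
    logits = model(query_state, context)
    loss = F.cross_entropy(logits, optimal_action)  # supervised pretraining loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At deployment the same network is simply queried at the current state with all interactions observed so far as context; sampling from the predicted action distribution gives an exploratory online learner, while acting greedily on it gives a natural offline variant. That split is our reading of the online-exploration and offline-conservatism behaviors the abstract reports, not a quote of the paper's exact procedure.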
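
The posterior-sampling connection mentioned in the abstract can be stated compactly. With $\tau$ a task drawn from the pretraining distribution, $D$ the in-context dataset, and $a^\ast_\tau(s)$ the optimal action of task $\tau$ at state $s$, the supervised objective drives the model $M_\theta$ toward the posterior predictive distribution over optimal actions (notation ours, stated informally):

\[
M_\theta(a \mid s, D) \;\approx\; P\big(a^\ast = a \mid s, D\big) \;=\; \sum_{\tau} \mathbb{1}\{a^\ast_\tau(s) = a\}\, P(\tau \mid D).
\]

Under this idealization, sampling an action from the model is equivalent to sampling a task from the posterior $P(\tau \mid D)$ and acting optimally for it, i.e., posterior (Thompson) sampling, which is the bridge the paper uses to obtain regret guarantees.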