Supervised Pretraining Can Learn In-Context Reinforcement Learning

Abstract

Large transformer models trained on diverse datasets have shown a remarkable ability to learn in-context, achieving high few-shot performance on tasks they were not explicitly trained to solve. In this paper, we study the in-context learning capabilities of transformers in decision-making problems, i.e., reinforcement learning (RL) for bandits and Markov decision processes. To do so, we introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action given a query state and an in-context dataset of interactions, across a diverse set of tasks. This procedure, while simple, produces a model with several surprising capabilities. We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline, despite not being explicitly trained to do so. The model also generalizes beyond the pretraining distribution to new tasks and automatically adapts its decision-making strategies to unknown structure. Theoretically, we show DPT can be viewed as an efficient implementation of Bayesian posterior sampling, a provably sample-efficient RL algorithm. We further leverage this connection to provide guarantees on the regret of the in-context algorithm yielded by DPT, and prove that it can learn faster than algorithms used to generate the pretraining data. These results suggest a promising yet simple path towards instilling strong in-context decision-making abilities in transformers.
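For readers unfamiliar with Bayesian posterior sampling, the algorithm the abstract relates DPT to, the following is a minimal sketch of its classical form (Thompson sampling) on a Bernoulli bandit. It is not the DPT method itself and does not come from the paper: the function name `thompson_sampling`, the Beta(1, 1) priors, and the arm means used in the example are all assumptions chosen for illustration.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Posterior sampling on a Bernoulli bandit (illustrative sketch only).

    true_means: hypothetical success probability of each arm (unknown to the agent).
    horizon:    number of rounds to play.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    pulls = [0] * n_arms

    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior
        # (assumed Beta(1, 1) prior, i.e. uniform over [0, 1]).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Act optimally with respect to the sampled means.
        arm = max(range(n_arms), key=samples.__getitem__)

        # Observe a Bernoulli reward and update that arm's posterior counts.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


if __name__ == "__main__":
    # Hypothetical three-armed bandit; the last arm is the best one.
    print(thompson_sampling([0.1, 0.5, 0.9], horizon=2000))
```

Early on the posteriors are wide, so every arm gets sampled as the apparent best some of the time (exploration); as evidence accumulates, the samples concentrate and the best arm is chosen almost exclusively.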
