PASTA: Pretrained Action-State Transformer Agents
July 20, 2023
Authors: Raphael Boige, Yannis Flet-Berliac, Arthur Flajolet, Guillaume Richard, Thomas Pierrot
cs.AI
Abstract
Self-supervised learning has brought about a revolutionary paradigm shift in
various computing domains, including NLP, vision, and biology. Recent
approaches involve pre-training transformer models on vast amounts of unlabeled
data, serving as a starting point for efficiently solving downstream tasks. In
the realm of reinforcement learning, researchers have recently adapted these
approaches by developing models pre-trained on expert trajectories, enabling
them to address a wide range of tasks, from robotics to recommendation systems.
However, existing methods mostly rely on intricate pre-training objectives
tailored to specific downstream applications. This paper presents a
comprehensive investigation of models we refer to as Pretrained Action-State
Transformer Agents (PASTA). Our study uses a unified methodology and covers an
extensive set of general downstream tasks including behavioral cloning, offline
RL, sensor failure robustness, and dynamics change adaptation. Our goal is to
systematically compare various design choices and provide valuable insights to
practitioners for building robust models. Key highlights of our study include
tokenization at the action and state component level, using fundamental
pre-training objectives like next token prediction, training models across
diverse domains simultaneously, and using parameter-efficient fine-tuning
(PEFT). The models developed in our study contain fewer than 10 million
parameters, and applying PEFT enables fine-tuning of fewer than 10,000
parameters during downstream adaptation, allowing a broad community to use
these models and reproduce our experiments. We hope that this study will
encourage further research into the use of transformers with first-principles
design choices to represent RL trajectories and contribute to robust policy
learning.
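The component-level tokenization highlighted above can be illustrated with a minimal sketch: each scalar component of every state and action is discretized into its own token, rather than embedding a whole state or action vector as a single token. The uniform binning, value range, and flat token layout below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def tokenize_component_level(trajectory, num_bins=64, low=-1.0, high=1.0):
    """Turn a trajectory into a flat token sequence, one token per scalar
    component of each state and action (uniform binning over [low, high]).

    `trajectory` is a list of (state, action) pairs of 1-D numpy arrays.
    Note: bin edges and layout are illustrative assumptions only.
    """
    tokens = []
    for state, action in trajectory:
        for value in np.concatenate([state, action]):
            clipped = float(np.clip(value, low, high))
            # Map the clipped value to an integer bin in [0, num_bins - 1].
            bin_idx = int((clipped - low) / (high - low) * (num_bins - 1))
            tokens.append(bin_idx)
    return tokens

# Example: 2 timesteps, 3-dim state, 1-dim action -> 8 tokens total,
# each token an integer bin index a transformer can embed and predict.
traj = [
    (np.array([0.1, -0.5, 0.9]), np.array([0.0])),
    (np.array([-1.0, 0.2, 0.4]), np.array([1.0])),
]
print(tokenize_component_level(traj))
```

With tokens at this granularity, a basic next-token-prediction objective over the flattened sequence suffices for pretraining, which is the kind of first-principles design choice the study advocates.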