

PASTA: Pretrained Action-State Transformer Agents

July 20, 2023
Authors: Raphael Boige, Yannis Flet-Berliac, Arthur Flajolet, Guillaume Richard, Thomas Pierrot
cs.AI

Abstract

Self-supervised learning has brought about a revolutionary paradigm shift in various computing domains, including NLP, vision, and biology. Recent approaches involve pre-training transformer models on vast amounts of unlabeled data, serving as a starting point for efficiently solving downstream tasks. In the realm of reinforcement learning, researchers have recently adapted these approaches by developing models pre-trained on expert trajectories, enabling them to address a wide range of tasks, from robotics to recommendation systems. However, existing methods mostly rely on intricate pre-training objectives tailored to specific downstream applications. This paper presents a comprehensive investigation of models we refer to as Pretrained Action-State Transformer Agents (PASTA). Our study uses a unified methodology and covers an extensive set of general downstream tasks including behavioral cloning, offline RL, sensor failure robustness, and dynamics change adaptation. Our goal is to systematically compare various design choices and provide valuable insights to practitioners for building robust models. Key highlights of our study include tokenization at the action and state component level, using fundamental pre-training objectives like next token prediction, training models across diverse domains simultaneously, and using parameter efficient fine-tuning (PEFT). The developed models in our study contain fewer than 10 million parameters and the application of PEFT enables fine-tuning of fewer than 10,000 parameters during downstream adaptation, allowing a broad community to use these models and reproduce our experiments. We hope that this study will encourage further research into the use of transformers with first-principles design choices to represent RL trajectories and contribute to robust policy learning.
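One of the study's key design choices is tokenization at the level of individual state and action components: rather than embedding a whole state or action vector as a single token, every scalar component becomes its own token in the sequence fed to the transformer. The sketch below illustrates this idea with simple uniform binning; the function name, bin count, and value range are illustrative assumptions, not the paper's exact scheme.

```python
def tokenize_trajectory(states, actions, n_bins=64, low=-1.0, high=1.0):
    """Component-level tokenization (a sketch, not PASTA's exact scheme).

    Each scalar component of every state and action vector is discretized
    into one of `n_bins` uniform bins over [low, high], yielding one token
    per component. The resulting token sequence can then be trained with a
    basic objective such as next-token prediction.
    """
    tokens = []
    for s, a in zip(states, actions):
        # Interleave state components then action components per timestep.
        for x in list(s) + list(a):
            frac = (x - low) / (high - low)          # normalize to [0, 1]
            b = min(n_bins - 1, max(0, int(frac * n_bins)))  # clamp to valid bin
            tokens.append(b)
    return tokens
```

With a 2-dimensional state and 1-dimensional action, each timestep contributes three tokens, so a length-T trajectory becomes a sequence of 3T discrete tokens, the same shape of input a standard language-model-style transformer consumes.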