PASTA: Pretrained Action-State Transformer Agents

Abstract

Self-supervised learning has brought about a revolutionary paradigm shift in various computing domains, including NLP, vision, and biology. Recent approaches pre-train transformer models on vast amounts of unlabeled data and use them as a starting point for efficiently solving downstream tasks. In reinforcement learning, researchers have recently adapted these approaches by developing models pre-trained on expert trajectories, enabling them to address a wide range of tasks, from robotics to recommendation systems. However, existing methods mostly rely on intricate pre-training objectives tailored to specific downstream applications. This paper presents a comprehensive investigation of models we refer to as Pretrained Action-State Transformer Agents (PASTA). Our study uses a unified methodology and covers an extensive set of general downstream tasks, including behavioral cloning, offline RL, sensor failure robustness, and dynamics change adaptation. Our goal is to systematically compare various design choices and provide valuable insights to practitioners for building robust models. Key highlights of our study include tokenization at the action and state component level, the use of fundamental pre-training objectives such as next-token prediction, simultaneous training of models across diverse domains, and the use of parameter-efficient fine-tuning (PEFT). The models developed in our study contain fewer than 10 million parameters, and applying PEFT allows fewer than 10,000 parameters to be fine-tuned during downstream adaptation, so that a broad community can use these models and reproduce our experiments. We hope that this study will encourage further research into the use of transformers with first-principles design choices to represent RL trajectories and contribute to robust policy learning.
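
To make the idea of component-level tokenization with a next-token prediction objective more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that every scalar dimension of a state or action vector becomes its own token and that tokens are ordered state-then-action within each time step. The function names `tokenize_components` and `next_token_pairs` and the toy trajectory are hypothetical, introduced only for illustration.

```python
from typing import List, Sequence, Tuple

# One trajectory step is a (state, action) pair of real-valued vectors.
Trajectory = Sequence[Tuple[Sequence[float], Sequence[float]]]


def tokenize_components(trajectory: Trajectory) -> List[float]:
    """Flatten a trajectory so each state/action component is its own token.

    Assumption: within a step, state components come before action components.
    """
    tokens: List[float] = []
    for state, action in trajectory:
        tokens.extend(state)   # one token per state dimension
        tokens.extend(action)  # one token per action dimension
    return tokens


def next_token_pairs(tokens: Sequence[float]) -> List[Tuple[List[float], float]]:
    """Build (context, target) pairs: predict token i from tokens before i."""
    return [(list(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


# Hypothetical toy trajectory: 2 steps, 3-dimensional state, 2-dimensional action.
trajectory = [
    ([0.1, 0.2, 0.3], [1.0, -1.0]),
    ([0.4, 0.5, 0.6], [0.5, 0.0]),
]

tokens = tokenize_components(trajectory)
pairs = next_token_pairs(tokens)

print(len(tokens))  # 2 steps * (3 state + 2 action) components
print(pairs[0])     # first context and the token to predict from it
```

By contrast, a tokenization at the level of whole state and action vectors would produce only 4 tokens for this toy trajectory. In practice, a transformer would consume embedded versions of these tokens rather than raw floats; that step is omitted here.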

