PASTA: Pretrained Action-State Transformer Agents

Abstract

Self-supervised learning has brought about a paradigm shift in several computing domains, including NLP, vision, and biology. Recent approaches pre-train transformer models on vast amounts of unlabeled data and use them as a starting point for efficiently solving downstream tasks. In reinforcement learning, researchers have recently adapted these approaches by developing models pre-trained on expert trajectories, enabling them to address a wide range of tasks, from robotics to recommendation systems. However, existing methods mostly rely on intricate pre-training objectives tailored to specific downstream applications. This paper presents a comprehensive investigation of models we refer to as Pretrained Action-State Transformer Agents (PASTA). Our study uses a unified methodology and covers an extensive set of general downstream tasks, including behavioral cloning, offline RL, sensor failure robustness, and dynamics change adaptation. Our goal is to systematically compare various design choices and provide practitioners with insights for building robust models. Key highlights of our study include tokenization at the action and state component level, the use of fundamental pre-training objectives such as next token prediction, training models across diverse domains simultaneously, and the use of parameter-efficient fine-tuning (PEFT). The models developed in our study contain fewer than 10 million parameters, and applying PEFT means that fewer than 10,000 parameters are fine-tuned during downstream adaptation, allowing a broad community to use these models and reproduce our experiments. We hope that this study will encourage further research into the use of transformers with first-principles design choices to represent RL trajectories and contribute to robust policy learning.
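To make "tokenization at the action and state component level" and "next token prediction" more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names, the `(kind, index, value)` token layout, and the state-then-action ordering within each timestep are assumptions chosen for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

# Hypothetical token layout: (kind, component_index, value).
Token = Tuple[str, int, float]


def tokenize_components(
    states: Sequence[Sequence[float]],
    actions: Sequence[Sequence[float]],
) -> List[Token]:
    """Flatten a trajectory into one token per state/action component.

    Assumed ordering: all components of s_0, then all components of a_0,
    then s_1, a_1, and so on.
    """
    if len(states) != len(actions):
        raise ValueError("states and actions must cover the same timesteps")
    tokens: List[Token] = []
    for state, action in zip(states, actions):
        tokens.extend(("state", i, float(v)) for i, v in enumerate(state))
        tokens.extend(("action", i, float(v)) for i, v in enumerate(action))
    return tokens


def next_token_pairs(tokens: Sequence[Token]) -> List[Tuple[Token, Token]]:
    """Pair every token with the token that follows it in the sequence."""
    return list(zip(tokens[:-1], tokens[1:]))


# Toy trajectory: 2 timesteps, 3 state components, 2 action components.
example_states = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
example_actions = [[1.0, -1.0], [0.5, -0.5]]
example_tokens = tokenize_components(example_states, example_actions)
example_pairs = next_token_pairs(example_tokens)
```

With this layout, a trajectory of T timesteps with state dimension d_s and action dimension d_a yields T × (d_s + d_a) tokens, compared with 2T tokens if each full state or action vector were treated as a single token. In a real model each token would be embedded and the transformer trained to predict the second element of each pair from the preceding context; that part is omitted here.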
