Masked Trajectory Models for Prediction, Representation, and Control

Abstract

We introduce Masked Trajectory Models (MTM) as a generic abstraction for sequential decision making. MTM takes a trajectory, such as a state-action sequence, and aims to reconstruct it conditioned on a random subset of the same trajectory. By training with highly randomized masking patterns, MTM learns versatile networks that can take on different roles or capabilities simply by choosing an appropriate mask at inference time. For example, the same MTM network can be used as a forward dynamics model, an inverse dynamics model, or even an offline RL agent. Through extensive experiments on several continuous control tasks, we show that the same MTM network (i.e., the same weights) can match or outperform specialized networks trained for each of these capabilities. Additionally, we find that state representations learned by MTM can significantly accelerate the learning of traditional RL algorithms. Finally, on offline RL benchmarks, we find that MTM is competitive with specialized offline RL algorithms, even though it is a generic self-supervised learning method without any explicit RL components. Code is available at https://github.com/facebookresearch/mtm
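To make the idea of "choosing appropriate masks at inference time" concrete, the following is a minimal sketch of what such masks could look like for a state-action trajectory. It is not taken from the released code: the function names, the dictionary layout, and the convention that `True` means "visible to the network" are all assumptions made for illustration, and the network itself is omitted. The policy-style mask is only one plausible way to query an action; how the offline RL usage conditions on returns is not covered here.

```python
import random


def random_mask(horizon, keep_ratio, rng):
    """Training-style mask: each token is visible with probability keep_ratio."""
    return {
        "states": [rng.random() < keep_ratio for _ in range(horizon)],
        "actions": [rng.random() < keep_ratio for _ in range(horizon)],
    }


def forward_dynamics_mask(horizon, t):
    """Show s_0..s_t and a_0..a_t; the hidden s_{t+1} is the prediction target."""
    return {
        "states": [i <= t for i in range(horizon)],
        "actions": [i <= t for i in range(horizon)],
    }


def inverse_dynamics_mask(horizon, t):
    """Show only s_t and s_{t+1}; the hidden a_t is the prediction target."""
    return {
        "states": [i in (t, t + 1) for i in range(horizon)],
        "actions": [False] * horizon,
    }


def policy_mask(horizon, t):
    """Show s_0..s_t and a_0..a_{t-1}; the hidden a_t is the prediction target."""
    return {
        "states": [i <= t for i in range(horizon)],
        "actions": [i < t for i in range(horizon)],
    }


def hidden_tokens(mask):
    """List the (modality, timestep) pairs the model would have to reconstruct."""
    return [
        (name, i)
        for name in ("states", "actions")
        for i, visible in enumerate(mask[name])
        if not visible
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the random mask is reproducible
    print("train  :", random_mask(4, 0.5, rng))
    print("forward:", forward_dynamics_mask(4, 1))
    print("inverse:", inverse_dynamics_mask(4, 1))
    print("policy :", policy_mask(4, 2))
```

Under these assumptions, a single trained network would receive the same trajectory layout in every case, and only the mask would change between a forward-dynamics query, an inverse-dynamics query, and an action query.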
