Masked Trajectory Models for Prediction, Representation, and Control

Abstract


We introduce Masked Trajectory Models (MTM) as a generic abstraction for sequential decision making. MTM takes a trajectory, such as a state-action sequence, and aims to reconstruct the trajectory conditioned on random subsets of the same trajectory. By training with a highly randomized masking pattern, MTM learns versatile networks that can take on different roles or capabilities simply by choosing appropriate masks at inference time. For example, the same MTM network can be used as a forward dynamics model, an inverse dynamics model, or even an offline RL agent. Through extensive experiments in several continuous control tasks, we show that the same MTM network (i.e., the same weights) can match or outperform specialized networks trained for the aforementioned capabilities. Additionally, we find that state representations learned by MTM can significantly accelerate the learning speed of traditional RL algorithms. Finally, in offline RL benchmarks, we find that MTM is competitive with specialized offline RL algorithms, despite being a generic self-supervised learning method without any explicit RL components. Code is available at https://github.com/facebookresearch/mtm.
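The idea that one network changes role depending on the mask can be illustrated with a minimal sketch. It assumes a hypothetical layout in which a trajectory is a grid of state and action tokens with one column per timestep, and a mask entry of 1 marks a visible token while 0 marks a token to reconstruct. The helpers `random_training_mask` and `inference_mask` are illustrative names, not functions from the MTM repository; uniform per-token masking is a simplification of the paper's randomized scheme, and the policy mask omits any return conditioning an offline RL agent may use.

```python
import random

# Hypothetical token layout (an assumption, not the official MTM code):
# one row per modality, one column per timestep.
# Mask value 1 = token visible to the model, 0 = token to reconstruct.
MODALITIES = ("state", "action")


def random_training_mask(horizon, mask_ratio, rng):
    """Hide each token independently with probability mask_ratio.

    A simple stand-in for a "highly randomized" training mask.
    """
    return {
        m: [0 if rng.random() < mask_ratio else 1 for _ in range(horizon)]
        for m in MODALITIES
    }


def inference_mask(task, horizon, t):
    """Fixed mask that makes one trained model act as a specific predictor at step t."""
    if task in ("forward_dynamics", "inverse_dynamics") and not 0 <= t < horizon - 1:
        raise ValueError("dynamics tasks need 0 <= t < horizon - 1")
    if task == "policy" and not 0 <= t < horizon:
        raise ValueError("policy needs 0 <= t < horizon")

    state = [0] * horizon
    action = [0] * horizon
    if task == "forward_dynamics":
        # Show s_0..s_t and a_0..a_t; the reconstructed s_{t+1} is the prediction.
        for i in range(t + 1):
            state[i] = 1
            action[i] = 1
    elif task == "inverse_dynamics":
        # Show s_t and s_{t+1}; the reconstructed a_t is the prediction.
        state[t] = 1
        state[t + 1] = 1
    elif task == "policy":
        # Show s_0..s_t and a_0..a_{t-1}; the reconstructed a_t is the action to take.
        for i in range(t + 1):
            state[i] = 1
        for i in range(t):
            action[i] = 1
    else:
        raise ValueError(f"unknown task: {task}")
    return {"state": state, "action": action}


if __name__ == "__main__":
    print(inference_mask("forward_dynamics", horizon=4, t=1))
    # {'state': [1, 1, 0, 0], 'action': [1, 1, 0, 0]}
```

In this sketch, training would draw a fresh `random_training_mask` per trajectory, so the network learns to fill in arbitrary hidden tokens; at inference, choosing `"forward_dynamics"`, `"inverse_dynamics"`, or `"policy"` only changes which tokens are visible, while the network weights stay the same.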
