
Masked Trajectory Models for Prediction, Representation, and Control

May 4, 2023
Authors: Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, Aravind Rajeswaran
cs.AI

Abstract

We introduce Masked Trajectory Models (MTM) as a generic abstraction for sequential decision making. MTM takes a trajectory, such as a state-action sequence, and aims to reconstruct the trajectory conditioned on random subsets of the same trajectory. By training with a highly randomized masking pattern, MTM learns versatile networks that can take on different roles or capabilities by simply choosing appropriate masks at inference time. For example, the same MTM network can be used as a forward dynamics model, an inverse dynamics model, or even an offline RL agent. Through extensive experiments in several continuous control tasks, we show that the same MTM network -- i.e., the same weights -- can match or outperform specialized networks trained for the aforementioned capabilities. Additionally, we find that state representations learned by MTM can significantly accelerate the learning speed of traditional RL algorithms. Finally, in offline RL benchmarks, we find that MTM is competitive with specialized offline RL algorithms, despite MTM being a generic self-supervised learning method without any explicit RL components. Code is available at https://github.com/facebookresearch/mtm.
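To make the masking idea concrete, below is a minimal PyTorch sketch of a bidirectional masked trajectory model. All names here (`MTMSketch`, the mask construction, the all-token MSE loss) are illustrative assumptions, not taken from the official facebookresearch/mtm codebase; the paper's full model also handles return tokens and a more elaborate masking/objective scheme, so treat this as a simplified instance of the same idea.

```python
import torch
import torch.nn as nn

class MTMSketch(nn.Module):
    """Simplified masked trajectory model over (state, action) tokens.

    Masked tokens are replaced with a learned [MASK] embedding, and a
    bidirectional transformer reconstructs the full trajectory.
    """

    def __init__(self, state_dim, action_dim, d_model=128, n_layers=2, max_len=64):
        super().__init__()
        self.state_in = nn.Linear(state_dim, d_model)
        self.action_in = nn.Linear(action_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        # Learned positional embedding over interleaved (state, action) tokens.
        self.pos = nn.Parameter(torch.zeros(1, 2 * max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.state_out = nn.Linear(d_model, state_dim)
        self.action_out = nn.Linear(d_model, action_dim)

    def forward(self, states, actions, state_mask, action_mask):
        # states: (B, T, state_dim); actions: (B, T, action_dim)
        # *_mask: (B, T) bool, where True means the token is visible.
        s = torch.where(state_mask[..., None], self.state_in(states), self.mask_token)
        a = torch.where(action_mask[..., None], self.action_in(actions), self.mask_token)
        tokens = torch.stack([s, a], dim=2).flatten(1, 2)  # interleave: s1, a1, s2, a2, ...
        h = self.encoder(tokens + self.pos[:, : tokens.size(1)])
        # De-interleave and decode every position back to state/action space.
        return self.state_out(h[:, 0::2]), self.action_out(h[:, 1::2])

# Training step (sketch): draw a highly randomized mask per sequence and
# reconstruct. Loss is taken over all tokens here for simplicity.
B, T, state_dim, action_dim = 8, 16, 11, 3
states, actions = torch.randn(B, T, state_dim), torch.randn(B, T, action_dim)
model = MTMSketch(state_dim, action_dim)
state_mask = torch.rand(B, T) < torch.rand(B, 1)   # random visibility ratio per sequence
action_mask = torch.rand(B, T) < torch.rand(B, 1)
pred_s, pred_a = model(states, actions, state_mask, action_mask)
loss = ((pred_s - states) ** 2).mean() + ((pred_a - actions) ** 2).mean()
loss.backward()
```

At inference time, the same weights take on different roles purely through the choice of mask: revealing s_1..s_t and a_1..a_t while masking s_{t+1} queries the network as a forward dynamics model; revealing s_t and s_{t+1} while masking a_t queries it as an inverse dynamics model; and revealing states (plus, in the full model, desired returns) while masking actions yields action predictions for offline RL.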