Improving Transformer World Models for Data-Efficient RL

Abstract

We present an approach to model-based RL (MBRL) that achieves new state-of-the-art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities, such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeding human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna with warmup", which trains the policy on real and imaginary data, (b) "nearest neighbor tokenizer" on image patches, which improves the scheme used to create the inputs of the transformer world model (TWM), and (c) "block teacher forcing", which allows the TWM to reason jointly about the future tokens of the next timestep.
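
The abstract names a "nearest neighbor tokenizer" on image patches but does not describe its mechanics. The following is a minimal sketch of one plausible reading, not the authors' implementation: each flattened patch is mapped to the index of its closest codebook entry, and a patch that is farther than a threshold from every entry is added as a new entry. The class name `NearestNeighborTokenizer`, the `threshold` parameter, the squared Euclidean distance, and the codebook growth rule are all assumptions introduced for illustration.

```python
from typing import List, Sequence

Patch = Sequence[float]


def squared_distance(a: Patch, b: Patch) -> float:
    """Squared Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestNeighborTokenizer:
    """Hypothetical sketch: maps each patch to the index of its nearest code.

    Assumption (not stated in the abstract): a patch farther than
    `threshold` from every existing code is appended as a new code.
    """

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.codebook: List[List[float]] = []

    def encode_patch(self, patch: Patch) -> int:
        # Linear scan for the closest existing code.
        best_index, best_dist = -1, float("inf")
        for index, code in enumerate(self.codebook):
            dist = squared_distance(patch, code)
            if dist < best_dist:
                best_index, best_dist = index, dist
        # No code is close enough: store the patch itself as a new code.
        if best_index == -1 or best_dist > self.threshold:
            self.codebook.append(list(patch))
            return len(self.codebook) - 1
        return best_index

    def encode(self, patches: Sequence[Patch]) -> List[int]:
        """Tokenize a sequence of patches, growing the codebook as needed."""
        return [self.encode_patch(p) for p in patches]
```

In this sketch, existing codes are never modified once added, so the token assigned to a previously seen patch stays stable while the codebook grows. Whether the paper's tokenizer behaves this way is not stated in the abstract.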
