Genie: Generative Interactive Environments

Abstract

We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis, despite being trained without any ground-truth action labels or the other domain-specific requirements typically found in the world model literature. Furthermore, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
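
The frame-by-frame interaction described in the abstract can be pictured as a simple loop: tokenize a prompt frame, let the dynamics model predict the next frame's tokens from the history and a chosen latent action, then decode. The following is a toy sketch only, not Genie's implementation. Every name in it (ToyTokenizer, ToyDynamicsModel, play), the integer "tokens", the vocabulary size, the number of latent actions, and the arithmetic update rule are hypothetical stand-ins chosen so the control flow runs; the real components are learned neural networks. The sketch also assumes the user supplies a discrete latent action index directly at each step, so the latent action model itself does not appear in the loop.

```python
from typing import List

# All names and numbers below are hypothetical stand-ins, not Genie's API.
NUM_LATENT_ACTIONS = 8   # assumed size of the discrete latent action set (arbitrary here)
VOCAB_SIZE = 16          # assumed size of the toy token vocabulary (arbitrary here)

Frame = List[int]   # a toy "frame": a flat list of pixel intensities
Tokens = List[int]  # discrete tokens for one frame


class ToyTokenizer:
    """Stand-in for the video tokenizer: maps frames to discrete tokens and back."""

    def encode(self, frame: Frame) -> Tokens:
        return [pixel % VOCAB_SIZE for pixel in frame]

    def decode(self, tokens: Tokens) -> Frame:
        return list(tokens)


class ToyDynamicsModel:
    """Stand-in for the autoregressive dynamics model.

    Predicts the next frame's tokens from the token history and the
    latent action chosen for this step.
    """

    def predict_next(self, history: List[Tokens], action: int) -> Tokens:
        if not 0 <= action < NUM_LATENT_ACTIONS:
            raise ValueError(f"latent action must be in [0, {NUM_LATENT_ACTIONS})")
        last = history[-1]
        # Toy rule: shift every token by (action + 1). A real model would be learned.
        return [(tok + action + 1) % VOCAB_SIZE for tok in last]


def play(prompt_frame: Frame, actions: List[int]) -> List[Frame]:
    """Roll out one new frame per user action, starting from a prompt frame."""
    tokenizer = ToyTokenizer()
    dynamics = ToyDynamicsModel()
    history = [tokenizer.encode(prompt_frame)]
    frames = [prompt_frame]
    for action in actions:
        next_tokens = dynamics.predict_next(history, action)
        history.append(next_tokens)  # autoregressive: the prediction joins the context
        frames.append(tokenizer.decode(next_tokens))
    return frames


if __name__ == "__main__":
    rollout = play(prompt_frame=[0, 1, 2, 3], actions=[0, 2, 7])
    for step, frame in enumerate(rollout):
        print(step, frame)
```

The point of the sketch is the shape of the loop: each user action produces exactly one new frame, and each predicted frame is appended to the history that conditions the next prediction.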
