Genie: Generative Interactive Environments
February 23, 2024
Authors: Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, Tim Rocktäschel
cs.AI
Abstract
We introduce Genie, the first generative interactive environment trained in
an unsupervised manner from unlabelled Internet videos. The model can be
prompted to generate an endless variety of action-controllable virtual worlds
described through text, synthetic images, photographs, and even sketches. At
11B parameters, Genie can be considered a foundation world model. It
comprises a spatiotemporal video tokenizer, an autoregressive dynamics
model, and a simple and scalable latent action model. Genie enables users to
act in the generated environments on a frame-by-frame basis despite training
without any ground-truth action labels or other domain-specific requirements
typically found in the world model literature. Further, the resulting learned
latent action space facilitates training agents to imitate behaviors from
unseen videos, opening the path for training generalist agents of the future.
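The three components named in the abstract suggest a simple interaction loop: tokenize frames, infer a discrete latent action between consecutive frames, and roll the dynamics model forward one action at a time. The sketch below is purely illustrative, assuming a toy VQ-style nearest-neighbour latent action lookup; all names, shapes, and the random "codebook" are hypothetical and are not Genie's actual architecture.

```python
# Hypothetical sketch of the tokenizer / latent-action / dynamics loop
# described in the abstract. Not Genie's real model: components here are
# random stand-ins chosen only to make the control flow concrete.
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 8    # small discrete latent-action vocabulary (assumed)
EMBED_DIM = 16   # toy token-embedding size (assumed)

# In the real model this codebook would be learned; random here.
action_codebook = rng.normal(size=(N_ACTIONS, EMBED_DIM))

def tokenize(frame):
    """Stand-in for the spatiotemporal video tokenizer: frame -> tokens."""
    return frame.reshape(-1)[:EMBED_DIM]

def infer_latent_action(prev_tokens, next_tokens):
    """Latent action model: map the frame-to-frame change to the nearest
    codebook entry (a VQ-style lookup), with no action labels needed."""
    delta = next_tokens - prev_tokens
    dists = np.linalg.norm(action_codebook - delta, axis=1)
    return int(np.argmin(dists))

def dynamics(prev_tokens, action_id):
    """Dynamics model: predict the next tokens from tokens + latent action."""
    return prev_tokens + action_codebook[action_id]

# Frame-by-frame control: a user picks a latent action at every step.
tokens = tokenize(rng.normal(size=(4, 4)))
for action_id in [3, 1, 3]:
    tokens = dynamics(tokens, action_id)
```

Because the latent action model and the dynamics model share the same discrete action space, an agent can recover actions from unlabelled video with `infer_latent_action` and then replay them through `dynamics`, which is the imitation-from-unseen-video use the abstract points to.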