Genie: Generative Interactive Environments

Abstract

We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
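To make the frame-by-frame interaction concrete, here is a minimal toy sketch in Python. It is not Genie's implementation: the abstract names the three components but does not specify their interfaces, so the class names (`ToyTokenizer`, `ToyDynamics`), the `rollout` function, the latent action vocabulary size, and the arithmetic inside each stand-in are all hypothetical. The sketch only illustrates one plausible control flow: a prompt frame is tokenized, and each user-chosen latent action yields one new predicted frame.

```python
from typing import List, Tuple

Frame = Tuple[int, ...]   # toy "frame": a flat tuple of pixel values
Tokens = Tuple[int, ...]  # toy discrete video tokens

# Assumed size of the latent action vocabulary (not given in the abstract).
NUM_LATENT_ACTIONS = 8
# Assumed toy token vocabulary size.
VOCAB_SIZE = 16


class ToyTokenizer:
    """Hypothetical stand-in for the video tokenizer."""

    def encode(self, frame: Frame) -> Tokens:
        # Map each pixel value to a discrete token id.
        return tuple(p % VOCAB_SIZE for p in frame)

    def decode(self, tokens: Tokens) -> Frame:
        # In this toy, tokens map back to pixel values one-to-one.
        return tuple(tokens)


class ToyDynamics:
    """Hypothetical stand-in for the autoregressive dynamics model."""

    def predict_next(self, history: List[Tokens], actions: List[int]) -> Tokens:
        # A real model would condition on the whole history; this toy
        # shifts the latest tokens by the latest latent action index.
        last_tokens = history[-1]
        shift = actions[-1]
        return tuple((t + shift) % VOCAB_SIZE for t in last_tokens)


def rollout(prompt_frame: Frame, actions: List[int],
            tokenizer: ToyTokenizer, dynamics: ToyDynamics) -> List[Frame]:
    """Generate one new frame per user-chosen latent action."""
    history = [tokenizer.encode(prompt_frame)]
    taken: List[int] = []
    frames = [tokenizer.decode(history[0])]
    for action in actions:
        if not 0 <= action < NUM_LATENT_ACTIONS:
            raise ValueError(f"latent action {action} out of range")
        taken.append(action)
        next_tokens = dynamics.predict_next(history, taken)
        history.append(next_tokens)
        frames.append(tokenizer.decode(next_tokens))
    return frames


if __name__ == "__main__":
    generated = rollout((0, 1, 2), [1, 2], ToyTokenizer(), ToyDynamics())
    for step, frame in enumerate(generated):
        print(step, frame)
```

The point of the sketch is the loop shape: the user supplies a discrete latent action at every step, and the model autoregressively produces the next frame from the history so far, with no ground-truth action labels involved.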
