Yume: An Interactive World Generation Model

Abstract

Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world that can be explored and controlled with peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows the world to be explored using keyboard actions. To achieve this high-fidelity, interactive video world generation, we introduce a framework with four main components: camera motion quantization, a video generation architecture, an advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction through keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are added to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration through the synergistic optimization of adversarial distillation and caching mechanisms. We train Yume on the high-quality world exploration dataset Sekai, and it achieves remarkable results across diverse scenes and applications. All data, the codebase, and model weights are available at https://github.com/stdstu12/YUME. Yume will be updated monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/
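
The abstract names camera motion quantization but does not describe how it is done. As a rough illustration only, the sketch below shows one way a continuous camera displacement could be binned into keyboard-style labels. The function names, the 45-degree sectors, and the thresholds `MOVE_EPS` and `TURN_EPS` are all assumptions made for this example and are not taken from the Yume codebase.

```python
import math

# Hypothetical thresholds; the abstract does not specify any values.
MOVE_EPS = 0.01  # minimum displacement (arbitrary units) treated as movement
TURN_EPS = 0.5   # minimum yaw change (degrees) treated as a turn

# Eight 45-degree sectors, clockwise from "forward" (assumed labelling).
SECTORS = ["W", "W+D", "D", "S+D", "S", "S+A", "A", "W+A"]


def quantize_translation(dx: float, dz: float) -> str:
    """Map a planar camera displacement to a keyboard-style label.

    dx is the displacement to the camera's right, dz the displacement
    forward, both between two consecutive frames.
    """
    if math.hypot(dx, dz) < MOVE_EPS:
        return "none"
    # Angle from the forward axis, positive toward the right.
    angle = math.degrees(math.atan2(dx, dz))
    # Snap to the nearest 45-degree sector.
    index = int(round(angle / 45.0)) % 8
    return SECTORS[index]


def quantize_rotation(dyaw_deg: float) -> str:
    """Map a yaw change in degrees to a discrete turn label."""
    if abs(dyaw_deg) < TURN_EPS:
        return "none"
    return "right" if dyaw_deg > 0 else "left"
```

For example, `quantize_translation(0.0, 1.0)` yields the forward key `"W"`, and `quantize_translation(-1.0, 1.0)` yields the combined label `"W+A"`. A discrete vocabulary like this is one plausible way to turn noisy camera trajectories into a small set of training conditions that also matches what a user can type on a keyboard.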
