WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Abstract

World models play a crucial role in understanding and predicting the dynamics of the world, which is essential for video generation. However, existing world models are confined to specific scenarios such as gaming or driving, which limits their ability to capture the complexity of general world dynamics. We therefore introduce WorldDreamer, a pioneering world model that fosters a comprehensive understanding of general world physics and motion and significantly enhances video generation capabilities. Drawing inspiration from the success of large language models, WorldDreamer frames world modeling as an unsupervised visual sequence modeling problem. This is achieved by mapping visual inputs to discrete tokens and predicting the masked ones. During this process, we incorporate multi-modal prompts to facilitate interaction within the world model. Our experiments show that WorldDreamer excels at generating videos across different scenarios, including natural scenes and driving environments. WorldDreamer is also versatile, handling tasks such as text-to-video conversion, image-to-video synthesis, and video editing. These results underscore WorldDreamer's effectiveness in capturing dynamic elements within diverse general world environments.
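The masked-token idea described in the abstract can be illustrated with a minimal sketch. It covers only the data side of the objective: given a sequence of discrete visual tokens, hide a random subset and record what a model would be asked to predict. The token values, the `MASK_ID` sentinel, the mask ratio, and the `mask_tokens` helper are all hypothetical choices made for illustration; the abstract does not specify the tokenizer, the masking schedule, or the model architecture.

```python
import random

MASK_ID = -1  # hypothetical sentinel marking a masked position (assumption)


def mask_tokens(tokens, mask_ratio, seed=0):
    """Hide a random subset of tokens.

    Returns (masked, targets): `masked` is a copy of `tokens` with the
    chosen positions set to MASK_ID, and `targets` maps each masked
    position to the original token a model would have to predict.
    """
    rng = random.Random(seed)
    # Mask at least one token, but never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_ID
    return masked, targets


# Toy "visual tokens": made-up codebook indices standing in for a short clip.
tokens = [12, 7, 7, 93, 41, 5, 60, 18]
masked, targets = mask_tokens(tokens, mask_ratio=0.5)

# Filling the masked positions back in recovers the original sequence.
restored = [targets.get(i, masked[i]) for i in range(len(tokens))]
```

In a full system, a sequence model would be trained to predict `targets` from `masked` together with any conditioning prompts; that part is omitted here because the abstract gives no details about it.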
