WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Abstract

World models play a crucial role in understanding and predicting the dynamics of the world, which is essential for video generation. However, existing world models are confined to specific scenarios such as gaming or driving, limiting their ability to capture the complexity of dynamic environments in the general world. Therefore, we introduce WorldDreamer, a pioneering world model that fosters a comprehensive understanding of general world physics and motions, which significantly enhances the capabilities of video generation. Drawing inspiration from the success of large language models, WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge. This is achieved by mapping visual inputs to discrete tokens and predicting the masked ones. During this process, we incorporate multi-modal prompts to facilitate interaction within the world model. Our experiments show that WorldDreamer excels in generating videos across different scenarios, including natural scenes and driving environments. WorldDreamer showcases versatility in executing tasks such as text-to-video conversion, image-to-video synthesis, and video editing. These results underscore WorldDreamer's effectiveness in capturing dynamic elements within diverse general world environments.
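
The masked-token objective mentioned in the abstract (map visual inputs to discrete tokens, hide some of them, predict the hidden ones) can be illustrated with a toy sketch. This is not WorldDreamer's implementation: the codebook size, the masking ratio, the `MASK_ID` value, and the function names are all hypothetical, and a uniform distribution stands in for a real model's predictions.

```python
import math
import random

MASK_ID = -1  # hypothetical id reserved for the mask token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of discrete tokens with MASK_ID.

    Returns the corrupted sequence and the sorted masked positions.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = MASK_ID
    return corrupted, positions


def masked_cross_entropy(probs, targets, positions):
    """Average negative log-likelihood over the masked positions only.

    probs[i] is a predicted distribution over the codebook at position i.
    """
    losses = [-math.log(probs[p][targets[p]]) for p in positions]
    return sum(losses) / len(losses)


# Toy usage: 8 visual tokens drawn from a hypothetical codebook of size 4.
rng = random.Random(0)
codebook_size = 4
tokens = [rng.randrange(codebook_size) for _ in range(8)]
corrupted, positions = mask_tokens(tokens, mask_ratio=0.5, rng=rng)

# Stand-in for a model: a uniform prediction at every position.
uniform = [[1.0 / codebook_size] * codebook_size for _ in tokens]
loss = masked_cross_entropy(uniform, tokens, positions)
```

With a uniform predictor the loss equals the logarithm of the codebook size, which gives a baseline that any trained predictor should beat. Only masked positions contribute to the loss, so the unmasked tokens act purely as context.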
