WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
January 18, 2024
Authors: Xiaofeng Wang, Zheng Zhu, Guan Huang, Boyuan Wang, Xinze Chen, Jiwen Lu
cs.AI
Abstract
World models play a crucial role in understanding and predicting the dynamics
of the world, which is essential for video generation. However, existing world
models are confined to specific scenarios such as gaming or driving, limiting
their ability to capture the complexity of general world dynamic environments.
Therefore, we introduce WorldDreamer, a pioneering world model that fosters a
comprehensive understanding of general world physics and motion, significantly
enhancing the capabilities of video generation. Drawing
inspiration from the success of large language models, WorldDreamer frames
world modeling as an unsupervised visual sequence modeling challenge. This is
achieved by mapping visual inputs to discrete tokens and predicting the masked
ones. During this process, we incorporate multi-modal prompts to facilitate
interaction within the world model. Our experiments show that WorldDreamer
excels in generating videos across different scenarios, including natural
scenes and driving environments. WorldDreamer showcases versatility in
executing tasks such as text-to-video conversion, image-to-video synthesis, and
video editing. These results underscore WorldDreamer's effectiveness in
capturing dynamic elements within diverse general world environments.
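The core mechanism the abstract describes, mapping visual inputs to discrete tokens and predicting the masked ones, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token grid, vocabulary size, and mask ratio are assumptions, and the hypothetical `mask_tokens` helper stands in for the corruption step that precedes the transformer's masked-token prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a short video already tokenized into a grid of
# discrete token ids (a real system would obtain these from a VQ-style
# visual encoder). Sizes here are illustrative only.
VOCAB_SIZE = 8192
MASK_ID = VOCAB_SIZE  # reserved id marking a masked position
num_frames, tokens_per_frame = 4, 16
video_tokens = rng.integers(0, VOCAB_SIZE, size=(num_frames, tokens_per_frame))

def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of token positions with MASK_ID.

    Returns the corrupted token grid and a boolean array marking
    which positions the model must reconstruct (the training targets).
    """
    flat = tokens.reshape(-1).copy()
    n_mask = int(round(mask_ratio * flat.size))
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    target_mask = np.zeros(flat.size, dtype=bool)
    target_mask[idx] = True
    flat[idx] = MASK_ID
    return flat.reshape(tokens.shape), target_mask.reshape(tokens.shape)

corrupted, targets = mask_tokens(video_tokens, mask_ratio=0.5, rng=rng)
# A transformer (conditioned on multi-modal prompts, per the abstract)
# would consume `corrupted` and predict the ids at `targets` positions.
print(int(targets.sum()), int((corrupted == MASK_ID).sum()))  # → 32 32
```

Unmasked positions are left untouched, so the model's loss is computed only at the masked positions, which is what makes this an unsupervised visual sequence modeling objective.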