Video as the New Language for Real-World Decision Making

February 27, 2024
Authors: Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, Dale Schuurmans
cs.AI

Abstract

Both text and video data are abundant on the internet and support large-scale self-supervised learning through next-token or next-frame prediction. However, they have not been equally leveraged: language models have had significant real-world impact, whereas video generation has remained largely limited to media and entertainment. Yet video data captures important information about the physical world that is difficult to express in language. To address this gap, we discuss an under-appreciated opportunity to extend video generation to solve tasks in the real world. We observe how, akin to language, video can serve as a unified interface that absorbs internet knowledge and represents diverse tasks. Moreover, we demonstrate how, like language models, video generation can serve as planners, agents, compute engines, and environment simulators through techniques such as in-context learning, planning, and reinforcement learning. We identify major impact opportunities in domains such as robotics, self-driving, and science, supported by recent work demonstrating that such advanced capabilities in video generation are plausibly within reach. Lastly, we identify key challenges in video generation that impede progress. Addressing these challenges will enable video generation models to demonstrate unique value alongside language models in a wider array of AI applications.
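
The abstract's analogy between next-token and next-frame prediction can be made concrete with a small sketch. The following is a minimal illustration under assumptions of our own, not code from the paper: a toy PyTorch model (the hypothetical NextFramePredictor below) is trained to regress frame t+1 from frame t with a pixel-level MSE loss, so raw video supplies the supervision signal without any labels. Real systems typically operate on tokenized frames with autoregressive or diffusion objectives, but the self-supervised structure is the same.

import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    """Toy convolutional model mapping the current frame to a guess of the next frame."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, channels, kernel_size=3, padding=1),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer, clip: torch.Tensor) -> float:
    """One self-supervised step: predict frame t+1 from frame t.

    `clip` has shape (batch, time, channels, height, width); the next frame
    itself is the target, so no labels are required.
    """
    context, target = clip[:, :-1], clip[:, 1:]
    b, t, c, h, w = context.shape
    pred = model(context.reshape(b * t, c, h, w))
    loss = nn.functional.mse_loss(pred, target.reshape(b * t, c, h, w))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = NextFramePredictor()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    fake_clip = torch.rand(2, 8, 3, 64, 64)  # stand-in for real video data
    print(train_step(model, optimizer, fake_clip))

Running this on the placeholder tensor prints the per-step loss; scaling the same objective to internet-scale video and stronger backbones is the regime the abstract argues remains under-exploited.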