Video as the New Language for Real-World Decision Making

February 27, 2024
Authors: Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, Dale Schuurmans
cs.AI

Abstract

Both text and video data are abundant on the internet and support large-scale self-supervised learning through next-token or next-frame prediction. However, they have not been equally leveraged: language models have had significant real-world impact, whereas video generation has remained largely limited to media entertainment. Yet video data captures important information about the physical world that is difficult to express in language. To address this gap, we discuss an under-appreciated opportunity to extend video generation to solve tasks in the real world. We observe how, akin to language, video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks. Moreover, we demonstrate how, like language models, video generation can serve as a planner, agent, compute engine, and environment simulator through techniques such as in-context learning, planning, and reinforcement learning. We identify major impact opportunities in domains such as robotics, self-driving, and science, supported by recent work demonstrating that such advanced capabilities in video generation are plausibly within reach. Lastly, we identify key challenges in video generation that impede progress. Addressing these challenges will enable video generation models to demonstrate unique value alongside language models in a wider array of AI applications.
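To make the analogy between next-token and next-frame prediction concrete, the following is a minimal sketch of the self-supervised objective the abstract describes: a model is trained to predict frame t+1 from a stack of the k preceding frames. This is not the authors' implementation; the ConvPredictor architecture, the tensor shapes, and the random stand-in "dataset" are all illustrative assumptions.

```python
# Minimal sketch of next-frame prediction as self-supervised learning.
# All shapes, the architecture, and the random data are illustrative.
import torch
import torch.nn as nn

class ConvPredictor(nn.Module):
    """Predicts frame t+1 from a stack of the k previous frames."""
    def __init__(self, context_frames: int = 4, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(context_frames * channels, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, context_frames * channels, H, W)
        return self.net(context)

k, c, h, w = 4, 3, 64, 64
model = ConvPredictor(context_frames=k, channels=c)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(100):
    # Stand-in for real video clips: (batch, frames, channels, H, W).
    clip = torch.rand(8, k + 1, c, h, w)
    context = clip[:, :k].flatten(1, 2)  # stack k frames along channels
    target = clip[:, k]                  # the frame to predict
    loss = nn.functional.mse_loss(model(context), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Scaled up, the same objective lets a model absorb internet-scale video the way language models absorb text, which is the symmetry the paper builds on.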
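The abstract's claim that video generation can act as a planner can likewise be sketched: sample candidate future rollouts conditioned on the current observation, score them against a goal image, and recover actions from the best rollout with an inverse-dynamics model. Everything below (VideoSampler, goal_score, InverseDynamics) is a hypothetical stand-in so the sketch runs, not a real API or the paper's method.

```python
# Hypothetical sketch of "video generation as a planner":
# sample rollouts, score against a goal, extract actions.
import torch
import torch.nn as nn

class VideoSampler(nn.Module):
    """Stand-in for a conditional video generation model."""
    def __init__(self, horizon: int = 8):
        super().__init__()
        self.horizon = horizon

    def forward(self, frame: torch.Tensor, n_samples: int) -> torch.Tensor:
        # Returns n_samples candidate rollouts: (n, horizon, C, H, W).
        # A real model would decode learned dynamics; noise keeps this runnable.
        base = frame.unsqueeze(0).unsqueeze(0).expand(
            n_samples, self.horizon, *frame.shape
        )
        return base + 0.1 * torch.randn(n_samples, self.horizon, *frame.shape)

def goal_score(rollouts: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    # Higher is better: negative distance of each final frame to the goal.
    return -((rollouts[:, -1] - goal) ** 2).flatten(1).mean(dim=1)

class InverseDynamics(nn.Module):
    """Maps a pair of consecutive frames to an action vector."""
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.head = nn.Linear(2 * 3 * 64 * 64, action_dim)

    def forward(self, f0: torch.Tensor, f1: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([f0.flatten(), f1.flatten()]))

frame = torch.rand(3, 64, 64)   # current observation
goal = torch.rand(3, 64, 64)    # desired final observation
sampler, inv_dyn = VideoSampler(), InverseDynamics()

rollouts = sampler(frame, n_samples=16)
best = rollouts[goal_score(rollouts, goal).argmax()]  # pick the best plan
actions = [inv_dyn(best[t], best[t + 1]) for t in range(best.shape[0] - 1)]
```

The design choice here mirrors the paper's framing: the video model supplies imagined futures, and a separate, much smaller component grounds those futures into executable actions.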