Video as the New Language for Real-World Decision Making

Abstract

Both text and video data are abundant on the internet and support large-scale self-supervised learning through next-token or next-frame prediction. However, they have not been equally leveraged: language models have had significant real-world impact, whereas video generation has remained largely limited to media entertainment. Yet video data captures important information about the physical world that is difficult to express in language. To address this gap, we discuss an under-appreciated opportunity to extend video generation to solve tasks in the real world. We observe how, akin to language, video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks. Moreover, we demonstrate how, like language models, video generation models can serve as planners, agents, compute engines, and environment simulators through techniques such as in-context learning, planning, and reinforcement learning. We identify major impact opportunities in domains such as robotics, self-driving, and science, supported by recent work that demonstrates how such advanced capabilities in video generation are plausibly within reach. Lastly, we identify key challenges in video generation that impede progress. Addressing these challenges will enable video generation models to demonstrate unique value alongside language models in a wider array of AI applications.
