VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Abstract

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models such as large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities on video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning, and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level on Video-GoBench with just a 300-million-parameter model, without relying on the search algorithms or reward mechanisms typical of reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models on CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.