World Model Self-Distillation: Training World Models to Solve General Tasks

Abstract

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model (VLM) generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
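The data flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `DistillExample`, `build_distillation_set`, `vlm_rewards`, `propose`, `demonstrate`, and `judge` are hypothetical, images and videos are stood in for by strings, and the binary reward is an assumption, since the abstract does not specify the form of the VLM feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DistillExample:
    """One self-distillation training example for the Executor (hypothetical)."""
    image: str  # placeholder for an unlabeled scene image
    task: str   # short task prompt, the only text the Executor sees
    video: str  # placeholder for the Demonstrator's generated video


def build_distillation_set(
    images: List[str],
    propose: Callable[[str], Tuple[str, str]],  # VLM: image -> (task, solution)
    demonstrate: Callable[[str, str], str],     # Demonstrator: (image, solution) -> video
) -> List[DistillExample]:
    """Pair each image and short task with a solution-conditioned video."""
    examples = []
    for image in images:
        task, solution = propose(image)
        video = demonstrate(image, solution)
        # The detailed solution is deliberately not stored: the Executor is
        # trained to reproduce the video from the image and short task alone.
        examples.append(DistillExample(image=image, task=task, video=video))
    return examples


def vlm_rewards(
    image: str,
    task: str,
    videos: List[str],
    judge: Callable[[str, str, str], bool],  # VLM: does the video satisfy the task?
) -> List[float]:
    """Score sampled Executor videos; a 0/1 reward is assumed for illustration."""
    return [1.0 if judge(image, task, video) else 0.0 for video in videos]
```

In this sketch the judge only has to verify an outcome, while `propose` has to write out a full solution; that is the asymmetry the reinforcement learning stage relies on.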