World Model Self-Distillation: Training World Models to Solve General Tasks

Abstract

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model (VLM) generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
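The data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: every name here (`Proposal`, `make_distillation_example`, `judge_rewards`, and the `toy_*` stand-ins) is hypothetical, the models are replaced by plain callables, and the binary reward is an assumption, since the abstract does not specify the form of the VLM feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Proposal:
    """Hypothetical container for what the VLM proposes for one scene image."""
    task: str      # short task prompt: the only text the Executor sees
    solution: str  # detailed step-by-step text: seen only by the Demonstrator


def make_distillation_example(
    image: str,
    propose: Callable[[str], Proposal],
    demonstrate: Callable[[str, str], str],
) -> Tuple[Tuple[str, str], str]:
    """Build one (Executor inputs, target video) pair from an unlabeled image."""
    proposal = propose(image)
    # The Demonstrator is conditioned on the detailed solution.
    target_video = demonstrate(image, proposal.solution)
    # The Executor is conditioned only on the image and the short task prompt.
    executor_inputs = (image, proposal.task)
    return executor_inputs, target_video


def judge_rewards(
    task: str,
    videos: List[str],
    judge: Callable[[str, str], bool],
) -> List[float]:
    """Assumed binary reward: 1.0 if the judge accepts a sampled video, else 0.0."""
    return [1.0 if judge(task, video) else 0.0 for video in videos]


# Toy stand-ins so the sketch runs without any model; strings play the role of videos.
def toy_propose(image: str) -> Proposal:
    return Proposal(task="open the drawer",
                    solution="1. reach the handle 2. pull the drawer out")


def toy_demonstrate(image: str, solution: str) -> str:
    return f"video<{image}|{solution}>"


def toy_judge(task: str, video: str) -> bool:
    return "pull" in video


if __name__ == "__main__":
    inputs, target = make_distillation_example(
        "scene.png", toy_propose, toy_demonstrate)
    print(inputs)
    print(judge_rewards("open the drawer", [target, "video<blank>"], toy_judge))
```

The sketch keeps the detailed solution out of the Executor's inputs, which is the point of the distillation step: the short prompt must suffice at inference time. The reward function only needs a judge that can check a finished video against the task, which is the easier side of the asymmetry the abstract refers to.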