World Model Self-Distillation: Training World Models to Solve General Tasks

Abstract

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
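
The data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: every name here (`DistillExample`, `build_distillation_set`, `judge_rewards`, and the toy stand-ins) is a hypothetical placeholder, videos are represented as lists of strings, and the binary reward is an assumption, since the abstract does not specify the form of the VLM feedback. The sketch only shows which inputs each model sees: the Demonstrator receives the detailed solution, while the distillation example handed to the Executor keeps only the image and the short task.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names below are hypothetical placeholders, not the paper's API.
Video = List[str]  # stand-in for a sequence of frames


@dataclass
class DistillExample:
    image: str     # unlabeled scene image (here just an identifier)
    task: str      # short task prompt seen by the Executor
    target: Video  # Demonstrator output used as the distillation target


def build_distillation_set(
    images: List[str],
    propose: Callable[[str], Tuple[str, str]],  # VLM: image -> (task, detailed solution)
    demonstrator: Callable[[str, str], Video],  # video model: (image, solution) -> video
) -> List[DistillExample]:
    """Pair each (image, short task) with a video generated from the detailed solution."""
    examples = []
    for image in images:
        task, solution = propose(image)
        video = demonstrator(image, solution)  # conditioned on the long solution
        examples.append(DistillExample(image, task, video))  # the solution is dropped here
    return examples


def judge_rewards(
    task: str,
    samples: List[Video],
    judge: Callable[[str, Video], bool],  # VLM judge: does the video satisfy the task?
) -> List[float]:
    """Binary rewards for sampled Executor videos (assumed form of VLM feedback)."""
    return [1.0 if judge(task, video) else 0.0 for video in samples]


# Toy stand-ins so the sketch runs end to end.
def toy_propose(image: str) -> Tuple[str, str]:
    return "open the drawer", "grasp the handle; pull the drawer outward"


def toy_demonstrator(image: str, solution: str) -> Video:
    return [f"{image}:{step.strip()}" for step in solution.split(";")]


def toy_judge(task: str, video: Video) -> bool:
    return any("pull" in frame for frame in video)


dataset = build_distillation_set(["scene_0"], toy_propose, toy_demonstrator)
rewards = judge_rewards(
    dataset[0].task, [dataset[0].target, ["scene_0:idle"]], toy_judge
)
```

In this toy setup the judge only needs to check the outcome of a sampled video, which is the cheaper side of the asymmetry the abstract refers to; generating the step-by-step solution is left to the proposing model.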