World Model Self-Distillation: Training World Models for General Task Solving

Abstract

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities, but their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models (VLMs), or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a VLM generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator, and we distill the Demonstrator's behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the fact that judging whether a sampled video satisfies a task is easier than generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
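The data flow described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: every name here (`DistillationExample`, `build_distillation_set`, `centered_rewards`, and the `toy_*` stand-ins) is hypothetical, the models are replaced by string-returning stubs, and the mean-baseline reward centering is a generic choice, since the abstract does not specify the reinforcement learning algorithm.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence, Tuple


@dataclass
class DistillationExample:
    image: str     # unlabeled scene image (an id or path in this toy)
    task: str      # short task prompt: the only text the Executor sees
    solution: str  # detailed step-by-step text: seen only by the Demonstrator
    video: str     # Demonstrator rollout, used as the distillation target


def build_distillation_set(
    images: Sequence[str],
    propose_task: Callable[[str], Tuple[str, str]],
    demonstrator: Callable[[str, str], str],
) -> List[DistillationExample]:
    """Stage 1 (self-distillation data): the VLM proposes a task and solution,
    and the Demonstrator renders the solution into a target video."""
    examples = []
    for image in images:
        task, solution = propose_task(image)
        video = demonstrator(image, solution)
        examples.append(DistillationExample(image, task, solution, video))
    return examples


def centered_rewards(
    image: str,
    task: str,
    videos: Sequence[str],
    judge: Callable[[str, str, str], float],
) -> List[float]:
    """Stage 2 (VLM feedback): score sampled Executor videos with a judge and
    subtract the mean as a baseline (an assumed, generic choice)."""
    rewards = [judge(image, task, video) for video in videos]
    baseline = mean(rewards)
    return [reward - baseline for reward in rewards]


# Hypothetical stand-ins for the VLM, the Demonstrator, and the VLM judge.
def toy_propose_task(image: str) -> Tuple[str, str]:
    return f"tidy the {image}", f"step 1: approach the {image}; step 2: tidy it"


def toy_demonstrator(image: str, solution: str) -> str:
    return f"video({image} | {solution})"


def toy_judge(image: str, task: str, video: str) -> float:
    return 1.0 if "success" in video else 0.0


dataset = build_distillation_set(["desk"], toy_propose_task, toy_demonstrator)
advantages = centered_rewards(
    "desk", dataset[0].task, ["success clip", "failed clip"], toy_judge
)
```

In this sketch the Executor would be trained on `(image, task) -> video` pairs from `dataset`, so the detailed `solution` text never reaches it; the judge only needs to verify a finished video, which is the asymmetry the abstract relies on.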