
WorldCompass: Reinforcement Learning for Long-Horizon World Models

February 9, 2026
Authors: Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, Chunchao Guo, Zhou Zhao
cs.AI

Abstract

This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for long-horizon, interactive, video-based world models, enabling them to explore virtual worlds more accurately and consistently in response to interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-Level Rollout Strategy: we generate and evaluate multiple samples for a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: we design reward functions covering both interaction-following accuracy and visual quality, which provide direct supervision while effectively suppressing reward-hacking behaviors. 3) Efficient RL Algorithm: we employ a negative-aware fine-tuning strategy coupled with several efficiency optimizations to enhance model capability efficiently and effectively. Evaluations on the state-of-the-art open-source world model WorldPlay demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across a variety of scenarios.
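To make the three components concrete, here is a minimal sketch of how a clip-level rollout, a complementary reward, and a negative-aware update could fit together. This is an assumption-based illustration of the abstract, not the paper's implementation: `DummyWorldModel`, both scoring stubs, the reward weights, and the advantage-weighted loss are all hypothetical placeholders.

```python
import torch

class DummyWorldModel(torch.nn.Module):
    """Stand-in for an autoregressive video world model (pure assumption)."""
    def __init__(self):
        super().__init__()
        self.theta = torch.nn.Parameter(torch.zeros(4))

    def generate_clip(self, context, action):
        # A real model would decode a short video clip autoregressively.
        return torch.randn(4)

    def log_prob(self, clip, context, action):
        # Toy Gaussian log-likelihood so the update below has a gradient path.
        return -((clip - self.theta) ** 2).sum()

def interaction_following_score(clip, action):
    # Stub: a real scorer would check that the generated motion matches the
    # commanded interaction signal (e.g., camera or agent movement).
    return torch.rand(()).item()

def visual_quality_score(clip):
    # Stub: a real scorer might use an image/video quality-assessment model.
    return torch.rand(()).item()

def complementary_reward(clip, action, w_follow=0.7, w_visual=0.3):
    """Blend both terms so the policy cannot game one at the expense of the
    other (the anti-reward-hacking property the abstract describes)."""
    return (w_follow * interaction_following_score(clip, action)
            + w_visual * visual_quality_score(clip))

def clip_level_rl_step(model, context, action, optimizer, num_samples=8):
    """One clip-level RL step: sample several candidates for a SINGLE target
    clip, score each, then apply a negative-aware update in which
    below-average samples receive negative advantage and are pushed down."""
    clips = [model.generate_clip(context, action) for _ in range(num_samples)]
    rewards = torch.tensor([complementary_reward(c, action) for c in clips])
    advantages = rewards - rewards.mean()  # group-relative baseline

    loss = torch.zeros(())
    for clip, adv in zip(clips, advantages):
        loss = loss - adv * model.log_prob(clip, context, action)
    optimizer.zero_grad()
    (loss / num_samples).backward()
    optimizer.step()

model = DummyWorldModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
clip_level_rl_step(model, context=None, action="move_forward", optimizer=opt)
```

Sampling all candidates against the same context and interaction signal is what makes the per-clip reward comparison fine-grained, and the group-mean baseline means roughly half the samples carry negative advantage, which is one plausible reading of what makes the update "negative-aware."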