WorldCompass: Reinforcement Learning for Long-Horizon World Models
February 9, 2026
Authors: Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, Chunchao Guo, Zhou Zhao
cs.AI
Abstract
This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-Level Rollout Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions covering both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ a negative-aware fine-tuning strategy coupled with various efficiency optimizations to enhance model capability efficiently and effectively. Evaluations on the state-of-the-art open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.
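To make the three ingredients above concrete, here is a minimal, self-contained PyTorch sketch of how a clip-level rollout, a complementary reward, and a negative-aware update could fit together. Everything in it (the toy policy, the scoring stubs, and all hyperparameters) is an illustrative assumption for exposition, not the paper's actual implementation.

```python
"""Minimal sketch: clip-level rollout, complementary reward, negative-aware update.
All components here are illustrative stand-ins, not WorldCompass's real code."""
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyClipPolicy(nn.Module):
    """Stand-in for an autoregressive video world model: maps a context
    vector plus a discrete action to a distribution over candidate clips."""
    def __init__(self, ctx_dim=16, num_actions=4, num_clips=32):
        super().__init__()
        self.num_actions = num_actions
        self.head = nn.Linear(ctx_dim + num_actions, num_clips)

    def logits(self, context, action):
        a = F.one_hot(action, self.num_actions).float()
        return self.head(torch.cat([context, a], dim=-1))


def interaction_following_score(clip_id, action):
    # Placeholder for an interaction-following reward (e.g. does the
    # generated clip actually execute the commanded action?).
    return float(clip_id % 4 == action)


def visual_quality_score(clip_id):
    # Placeholder for a visual-quality reward (e.g. an aesthetic or
    # artifact detector); here even ids are arbitrarily "better".
    return 0.5 * float(clip_id % 2 == 0)


def train_step(policy, optimizer, context, action, num_samples=8):
    """Sample several candidate clips at the SAME target position
    (clip-level rollout), score each with the complementary reward, and
    take a negative-aware policy-gradient step: clips below the group
    baseline are actively pushed down, not merely ignored."""
    dist = torch.distributions.Categorical(logits=policy.logits(context, action))
    clips = dist.sample((num_samples,))                 # one rollout, many samples
    rewards = torch.tensor([interaction_following_score(int(c), int(action))
                            + visual_quality_score(int(c)) for c in clips])
    advantages = rewards - rewards.mean()               # group-relative baseline
    loss = -(advantages * dist.log_prob(clips)).mean()  # sign flips for negatives
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


policy = ToyClipPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
context, action = torch.zeros(16), torch.tensor(2)
for _ in range(200):
    train_step(policy, optimizer, context, action)
```

Two points the sketch is meant to highlight: all candidate clips are drawn at one target position, so in a real autoregressive model the shared context would presumably be computed once and reused across samples, which is where the rollout-efficiency gain the abstract mentions would come from; and the advantage-weighted loss makes below-baseline samples contribute a negative weight, so poor clips are explicitly suppressed rather than simply discarded.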