Captain Safari: A World Engine

November 28, 2025
Authors: Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, Junfei Xiao
cs.AI

Abstract

World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.
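
The abstract describes a dynamic local memory from which a retriever fetches pose-aligned world tokens. As a rough illustration of that idea only, the toy sketch below stores (camera position, token) pairs in a fixed-size window and returns the tokens nearest to a query position. Everything in it is an assumption made for illustration: the class name `LocalWorldMemory`, the use of 3D position alone instead of a full 6-DoF pose, the Euclidean nearest-neighbor lookup, and the string tokens. The abstract does not specify how the actual retriever or memory is implemented.

```python
import math
from collections import deque


class LocalWorldMemory:
    """Toy sliding-window memory of (camera position, token) pairs.

    Hypothetical illustration; not the paper's memory or retriever.
    """

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry once it is full,
        # which stands in for a "dynamic local" memory here.
        self.entries = deque(maxlen=capacity)

    def write(self, position, token):
        self.entries.append((tuple(position), token))

    def retrieve(self, query_position, k=2):
        """Return the tokens of the k entries closest to the query position."""
        ranked = sorted(
            self.entries,
            key=lambda entry: math.dist(entry[0], query_position),
        )
        return [token for _, token in ranked[:k]]


memory = LocalWorldMemory(capacity=3)
memory.write((0.0, 0.0, 0.0), "tok_a")
memory.write((1.0, 0.0, 0.0), "tok_b")
memory.write((5.0, 0.0, 0.0), "tok_c")
memory.write((6.0, 0.0, 0.0), "tok_d")  # capacity is 3, so tok_a is evicted

# Query at a camera position along the path; the nearest stored tokens
# would then condition generation for that frame.
print(memory.retrieve((5.4, 0.0, 0.0), k=2))  # ['tok_c', 'tok_d']
```

In a real system the query would presumably include camera orientation and the lookup would be learned rather than a fixed distance ranking. The sketch only shows the store, evict, and retrieve-by-pose pattern.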