Olaf-World: Orienting Latent Actions for Video World Modeling
February 10, 2026
Authors: Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
cs.AI
Abstract
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors the integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
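The abstract describes the objective only at a high level. Below is a minimal, hypothetical sketch of one way such a sequence-level control-effect alignment could be written, assuming per-step latent actions, per-frame features from a frozen self-supervised video encoder, and a small trainable projection head; the function name `seq_delta_repa_loss`, the tensor shapes, and the choice of cosine alignment are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a sequence-level control-effect alignment loss.
# Names and shapes are illustrative, not taken from the paper's code.
import torch
import torch.nn.functional as F

def seq_delta_repa_loss(latent_actions: torch.Tensor,
                        frozen_feats: torch.Tensor,
                        proj: torch.nn.Module) -> torch.Tensor:
    """Align the integrated latent action over a clip with the clip-level
    feature difference from a frozen self-supervised video encoder.

    latent_actions: (B, T, d_a)   per-step latent actions
    frozen_feats:   (B, T+1, d_f) frozen encoder features, one per frame
    proj:           trainable head mapping d_a -> d_f
    """
    # Integrate the latent action over the sequence (here: a simple sum,
    # one possible reading of "integrated latent action").
    integrated = latent_actions.sum(dim=1)              # (B, d_a)
    # Temporal feature difference: the observable semantic effect of the
    # unobserved actions, last-frame minus first-frame encoder features.
    effect = frozen_feats[:, -1] - frozen_feats[:, 0]   # (B, d_f)
    # Cosine alignment between the projected latent action and the effect.
    pred = proj(integrated)                              # (B, d_f)
    return 1.0 - F.cosine_similarity(pred, effect, dim=-1).mean()
```

In this reading, the frozen encoder supplies the shared reference frame the abstract argues for: the loss never needs action labels, only the feature-space displacement each clip induces.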