Olaf-World: Orienting Latent Actions for Video World Modeling

February 10, 2026
Authors: Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
cs.AI

Abstract

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors the integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
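
To make the alignment idea concrete, below is a minimal PyTorch sketch of a sequence-level control-effect alignment loss in the spirit of SeqΔ-REPA, as described in the abstract. The specific choices here (cumulative summation as the "integration" of latent actions, a small projection head, and a cosine-similarity alignment term against frozen video-encoder feature differences) are assumptions for illustration, not the authors' published implementation.

```python
# Hypothetical sketch of a SeqDelta-REPA-style alignment objective.
# Assumption: "integrated latent action" = cumulative sum of per-step latents,
# aligned (via cosine similarity) with feature differences from a frozen encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeqDeltaREPALoss(nn.Module):
    """Aligns integrated latent actions with temporal differences of frozen
    self-supervised video features (illustrative reconstruction)."""

    def __init__(self, action_dim: int, feature_dim: int):
        super().__init__()
        # Projection head mapping integrated latent actions into the frozen
        # encoder's feature space (assumed component, not from the paper).
        self.proj = nn.Sequential(
            nn.Linear(action_dim, feature_dim),
            nn.GELU(),
            nn.Linear(feature_dim, feature_dim),
        )

    def forward(self, latent_actions: torch.Tensor, frozen_feats: torch.Tensor) -> torch.Tensor:
        """
        latent_actions: (B, T-1, action_dim)  per-step latent actions
        frozen_feats:   (B, T, feature_dim)   features from a frozen video encoder
        """
        # "Integrate" latent actions over time: the prefix sum up to step t
        # summarizes the control applied since frame 0.
        integrated = torch.cumsum(latent_actions, dim=1)        # (B, T-1, A)

        # Observable control effect: feature change relative to frame 0.
        effect = frozen_feats[:, 1:] - frozen_feats[:, :1]      # (B, T-1, D)

        # Project integrated actions into feature space and maximize cosine
        # similarity with the control effect (minimize 1 - cos).
        pred = self.proj(integrated)
        cos = F.cosine_similarity(pred, effect, dim=-1)
        return (1.0 - cos).mean()


if __name__ == "__main__":
    B, T, A, D = 2, 8, 32, 768
    loss_fn = SeqDeltaREPALoss(action_dim=A, feature_dim=D)
    z = torch.randn(B, T - 1, A)   # latent actions between consecutive frames
    f = torch.randn(B, T, D)       # frozen video-encoder features (e.g. a video ViT)
    print(loss_fn(z, f).item())
```

In practice this term would be added to the latent-action model's reconstruction objective, so the latents are shaped by observable control effects shared across clips rather than by clip-specific appearance cues.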