
OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

January 20, 2026
Authors: Pengze Zhang, Yanze Wu, Mengtian Li, Xu Bai, Songtao Zhao, Fulong Ye, Chong Mou, Xinghui Li, Zhuowei Chen, Qian He, Mingyuan Gao
cs.AI

Abstract

Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos and thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify diverse video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias, which adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning, which separates the reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment, which uses multimodal semantic guidance to dynamically distinguish and handle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in both appearance transfer (identity and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose information, establishing a new paradigm for flexible, high-fidelity video generation.
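The abstract gives no implementation details, but two of the named designs have a natural attention-side reading. Below is a minimal sketch, assuming a PyTorch-style transformer layer: a hypothetical RefDecoupledAttention module in which target tokens attend both to themselves and to a separately projected reference-video stream (the reference-decoupling idea), while a learned per-task, per-head bias shifts the attention logits on the reference block (a stand-in for the task-aware positional bias). Every name here (RefDecoupledAttention, task_bias, task_id, and so on) is invented for illustration; the paper's actual architecture, including its causal-learning and multimodal-alignment components, is not reproduced.

```python
# Hypothetical sketch only: all names are invented for illustration and are
# not the paper's actual API or architecture.
import torch
import torch.nn as nn


class RefDecoupledAttention(nn.Module):
    """Target tokens attend to themselves plus a separately projected
    reference-video stream; a per-(task, head) learned bias shifts the
    logits on the reference block."""

    def __init__(self, dim: int, num_heads: int, num_tasks: int):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)      # target branch
        self.ref_kv_proj = nn.Linear(dim, 2 * dim)  # decoupled reference branch
        self.out_proj = nn.Linear(dim, dim)
        # One scalar per (task, head): > 0 up-weights reference tokens.
        self.task_bias = nn.Parameter(torch.zeros(num_tasks, num_heads))

    def forward(self, x: torch.Tensor, ref: torch.Tensor, task_id: torch.Tensor):
        # x: (B, N, C) target tokens; ref: (B, M, C) reference tokens;
        # task_id: (B,) long tensor selecting the transfer task.
        B, N, C = x.shape
        M = ref.shape[1]
        split = lambda t, L: t.view(B, L, self.h, self.d).transpose(1, 2)
        q = split(self.q_proj(x), N)                      # (B, H, N, d)
        k, v = self.kv_proj(x).chunk(2, dim=-1)           # target K/V
        rk, rv = self.ref_kv_proj(ref).chunk(2, dim=-1)   # reference K/V
        k = split(torch.cat([k, rk], dim=1), N + M)       # (B, H, N+M, d)
        v = split(torch.cat([v, rv], dim=1), N + M)
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5  # (B, H, N, N+M)
        # Task-aware bias applied to the reference columns only.
        bias = self.task_bias[task_id].view(B, self.h, 1, 1)
        pad = attn.new_zeros(B, self.h, N, N)             # no bias on target columns
        attn = (attn + torch.cat([pad, bias.expand(B, self.h, N, M)], dim=-1)).softmax(-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)
```

Under this reading, a positive bias for a given task up-weights reference tokens (plausible for appearance transfer), while a near-zero or negative bias lets the target stream dominate (plausible when only temporal cues should carry over); the actual conditioning in OmniTransfer may differ.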