SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

March 19, 2026
Authors: Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang
cs.AI

Abstract

Current instruction-guided video editing models struggle to balance precise semantic modification with faithful motion preservation. Existing approaches mitigate this tension by injecting explicit external priors (e.g., VLM features or structural conditions), but this reliance severely bottlenecks robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, Semantic Anchoring establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), so that the model internalizes temporal dynamics directly from raw videos. SAMA is optimized in two stages: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
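
To make the Motion Alignment pretext tasks concrete: each can be read as a corruption applied to a raw clip that the backbone must restore, where undoing the corruption requires modeling temporal dynamics rather than per-frame appearance. The sketch below shows one plausible form of the three corruptions on a (T, C, H, W) video tensor; the function names, cube/tube sizes, and sampling details are illustrative assumptions, not SAMA's released code.

```python
# Hypothetical sketch of the three motion-centric corruptions named in the
# abstract (cube inpainting, speed perturbation, tube shuffle). Shapes and
# hyperparameters are assumptions for illustration, not SAMA's actual code.
import torch

def cube_inpaint(video: torch.Tensor, t: int = 4, s: int = 32) -> torch.Tensor:
    """Zero out a random spatio-temporal cube; restoring it forces the
    model to propagate content and motion across the masked region."""
    T, C, H, W = video.shape
    out = video.clone()
    t0 = int(torch.randint(0, T - t + 1, (1,)))
    y0 = int(torch.randint(0, H - s + 1, (1,)))
    x0 = int(torch.randint(0, W - s + 1, (1,)))
    out[t0:t0 + t, :, y0:y0 + s, x0:x0 + s] = 0.0
    return out

def speed_perturb(video: torch.Tensor, rate: float = 2.0) -> torch.Tensor:
    """Resample frame indices so the clip plays at a different speed;
    restoring the original pace requires a notion of object velocity."""
    T = video.shape[0]
    idx = (torch.arange(T, dtype=torch.float32) * rate).long().clamp(max=T - 1)
    return video[idx]

def tube_shuffle(video: torch.Tensor, s: int = 32) -> torch.Tensor:
    """Permute one spatial tube along time, breaking local temporal order
    while leaving per-frame appearance statistics untouched."""
    T, C, H, W = video.shape
    out = video.clone()
    y0 = int(torch.randint(0, H - s + 1, (1,)))
    x0 = int(torch.randint(0, W - s + 1, (1,)))
    perm = torch.randperm(T)
    out[:, :, y0:y0 + s, x0:x0 + s] = video[perm, :, y0:y0 + s, x0:x0 + s]
    return out

# Example: corrupt a clip, then train the backbone to reconstruct the original.
clip = torch.randn(16, 3, 256, 256)  # (T, C, H, W)
corrupted = tube_shuffle(speed_perturb(cube_inpaint(clip)))
```

Each corruption destroys a different facet of temporal structure (spatio-temporal continuity, playback speed, local frame order), which is presumably why the abstract pairs them: a backbone that can undo all three has to encode motion, not just appearance.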