SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
March 19, 2026
Authors: Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang
cs.AI
Abstract
Current instruction-guided video editing models struggle to balance precise semantic modification with faithful motion preservation. Existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, but this reliance severely limits model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, Semantic Anchoring establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), allowing the model to internalize temporal dynamics directly from raw videos. SAMA is optimized in a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
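To make the three motion-centric pretext tasks concrete, the sketch below shows one plausible way the corruptions could be applied to a raw video tensor (assumed shape T×H×W×C) before the model is trained to restore the original clip. This is a minimal illustration, not the paper's released implementation: all function names, cube/tube sizes, and the exact corruption mechanics are assumptions.

```python
import numpy as np

def cube_inpainting(video, cube_t=4, cube_h=32, cube_w=32, rng=None):
    """Zero out a random spatio-temporal cube; the model must inpaint it.
    Hypothetical sketch -- the paper's actual masking scheme is not public."""
    rng = rng or np.random.default_rng()
    T, H, W, _ = video.shape
    t0 = rng.integers(0, max(T - cube_t, 0) + 1)
    h0 = rng.integers(0, max(H - cube_h, 0) + 1)
    w0 = rng.integers(0, max(W - cube_w, 0) + 1)
    corrupted = video.copy()
    corrupted[t0:t0 + cube_t, h0:h0 + cube_h, w0:w0 + cube_w] = 0.0
    return corrupted

def speed_perturbation(video, factor=2):
    """Subsample frames by `factor`, then repeat each kept frame so the
    length is unchanged, simulating an altered playback speed that the
    model must correct back to natural motion."""
    slowed = np.repeat(video[::factor], factor, axis=0)
    return slowed[: video.shape[0]]

def tube_shuffle(video, tube_h=32, tube_w=32, rng=None):
    """Permute the temporal order of frames inside one spatial 'tube',
    forcing the model to recover the correct temporal ordering."""
    rng = rng or np.random.default_rng()
    T, H, W, _ = video.shape
    h0 = rng.integers(0, max(H - tube_h, 0) + 1)
    w0 = rng.integers(0, max(W - tube_w, 0) + 1)
    corrupted = video.copy()
    perm = rng.permutation(T)
    corrupted[:, h0:h0 + tube_h, w0:w0 + tube_w] = \
        video[perm, h0:h0 + tube_h, w0:w0 + tube_w]
    return corrupted
```

In each case the training target is the uncorrupted clip, so the backbone learns temporal dynamics from raw videos alone, with no paired editing data required.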