Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation
December 12, 2025
Authors: Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, Benlin Liu
cs.AI
Abstract
Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone has, so far, failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted by an external, imperfect model. To address these challenges, we introduce an algorithm that distills structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA and LoRA finetuning, respectively. The project website can be found at https://sam2videox.github.io/.
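The abstract names the Local Gram Flow loss but does not define it. As a rough illustration only (not the paper's actual formulation), the sketch below shows one plausible way such a loss could align how local features co-move between consecutive frames, assuming teacher (SAM2) and student (CogVideoX) features have already been projected to a shared (B, T, C, H, W) layout. The function name, the window size, and the cross-frame Gram construction are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def local_gram_flow_loss(student_feats: torch.Tensor,
                         teacher_feats: torch.Tensor,
                         window: int = 4) -> torch.Tensor:
    """Hypothetical sketch of a "Local Gram Flow"-style distillation loss.

    Both inputs are assumed to be per-frame feature maps of shape
    (B, T, C, H, W), already projected to a common channel dimension.
    For each pair of consecutive frames, the spatial grid is split into
    non-overlapping `window` x `window` patches; a cross-frame Gram matrix
    is computed per patch (how channels at time t co-vary with channels at
    time t+1), and the student's Gram statistics are pushed toward the
    teacher's. This is a speculative reconstruction from the abstract,
    not the authors' implementation.
    """
    def patch_grams(feats: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = feats.shape
        # Cut each frame into non-overlapping local windows.
        # (H and W are assumed divisible by `window`; any remainder is dropped.)
        p = F.unfold(feats.reshape(b * t, c, h, w),
                     kernel_size=window, stride=window)      # (B*T, C*win*win, N)
        n = p.shape[-1]
        p = p.reshape(b, t, c, window * window, n)
        p = p.permute(0, 1, 4, 2, 3)                          # (B, T, N, C, win*win)
        p = F.normalize(p, dim=-1)                            # scale-invariant patches
        # Cross-frame Gram matrix: channel co-activation between frames t and t+1.
        g = torch.einsum('btncs,btnds->btncd', p[:, :-1], p[:, 1:])
        return g                                              # (B, T-1, N, C, C)

    return F.mse_loss(patch_grams(student_feats), patch_grams(teacher_feats))
```

Under these assumptions, the loss is indifferent to where a patch sits in the frame and only penalizes mismatches in how its local channel statistics evolve over time, which is one way to read "aligns how local features move together."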