

Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

December 12, 2025
Authors: Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, Benlin Liu
cs.AI

Abstract

Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone has, so far, failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted by an external, imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/.
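The abstract does not define the Local Gram Flow loss, but its description, aligning how local features "move together", suggests matching the frame-to-frame change of Gram matrices computed over local feature windows. The PyTorch sketch below is a minimal illustration under that assumption: the function names, the patch size, and the MSE objective are hypothetical choices, not the paper's implementation, and it assumes teacher (SAM2) and student (CogVideoX) features have already been projected to matching (B, T, C, H, W) shapes.

```python
import torch
import torch.nn.functional as F

def local_gram(feats: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Gram matrices of C-dim features within non-overlapping local windows.

    feats: (B, T, C, H, W) video features, with H and W divisible by `patch`.
    Returns (B, T, N, P, P), where N is the number of windows and
    P = patch * patch is the number of positions per window.
    """
    B, T, C, H, W = feats.shape
    x = feats.reshape(B * T, C, H, W)
    # Unfold into non-overlapping patches: (B*T, C*patch*patch, N).
    x = F.unfold(x, kernel_size=patch, stride=patch)
    N = x.shape[-1]
    # Regroup to (B*T, N, P, C) so each window holds P feature vectors.
    x = x.reshape(B * T, C, patch * patch, N).permute(0, 3, 2, 1)
    x = F.normalize(x, dim=-1)          # cosine-style Gram entries
    gram = x @ x.transpose(-1, -2)      # (B*T, N, P, P)
    return gram.reshape(B, T, N, patch * patch, patch * patch)

def local_gram_flow_loss(student: torch.Tensor, teacher: torch.Tensor,
                         patch: int = 4) -> torch.Tensor:
    """Match the frame-to-frame *change* of local Gram matrices, i.e. how
    features inside each local window co-move over time (a guess at the
    'Local Gram Flow' idea, not the authors' exact formulation)."""
    gs, gt = local_gram(student, patch), local_gram(teacher, patch)
    flow_s = gs[:, 1:] - gs[:, :-1]     # temporal Gram deltas, student
    flow_t = gt[:, 1:] - gt[:, :-1]     # temporal Gram deltas, teacher
    return F.mse_loss(flow_s, flow_t)
```

In the distillation setting the abstract describes, `teacher` would come from SAM2's tracking features and `student` from intermediate CogVideoX activations, presumably with a learned projection head reconciling channel dimensions before the loss is applied.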