FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
March 12, 2026
Authors: Quanhao Li, Zhen Xing, Rui Wang, Haidong Cao, Qi Dai, Daoguo Dong, Zuxuan Wu
cs.AI
Abstract
Trajectory-controllable video generation has made remarkable progress recently. Existing methods mainly use adapter-based architectures to achieve precise motion control along predefined trajectories. However, all of these methods rely on a multi-step denoising process, which leads to substantial time redundancy and computational overhead. Although existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying them to trajectory-controllable video generation noticeably degrades both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. We then distill the generator into a few-step version to accelerate video generation. Finally, we fine-tune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.
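The hybrid fine-tuning objective in the third stage can be illustrated with a toy sketch. The abstract does not give the exact loss forms or their weighting, so everything here is an assumption made for illustration: the diffusion term is written as a mean squared error on predicted noise, the adversarial term as a non-saturating generator loss on discriminator logits, and `adv_weight`, along with all function names, is hypothetical.

```python
import math


def diffusion_loss(pred_noise, true_noise):
    """Assumed diffusion term: mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred_noise, true_noise)) / len(true_noise)


def adversarial_loss(disc_logits):
    """Assumed adversarial term: non-saturating generator loss, mean softplus(-logit).

    Uses the numerically stable form softplus(x) = max(x, 0) + log1p(exp(-|x|)).
    """
    total = 0.0
    for z in disc_logits:
        x = -z
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(disc_logits)


def hybrid_loss(pred_noise, true_noise, disc_logits, adv_weight=0.1):
    """Weighted sum of the two terms; adv_weight is a hypothetical hyperparameter."""
    return diffusion_loss(pred_noise, true_noise) + adv_weight * adversarial_loss(disc_logits)


if __name__ == "__main__":
    # Toy values standing in for adapter-conditioned generator outputs.
    loss = hybrid_loss([0.9, 2.1], [1.0, 2.0], [0.3, -0.2])
    print(f"hybrid loss: {loss:.4f}")
```

In an actual training loop, only the adapter parameters would receive gradients from this combined loss while the distilled few-step generator stays fixed, following the staging described in the abstract.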