FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation

June 1, 2025
Authors: Ariel Shaulov, Itay Hazan, Lior Wolf, Hila Chefer
cs.AI

Abstract

Text-to-video diffusion models are notoriously limited in their ability to model temporal aspects such as motion, physics, and dynamic interactions. Existing approaches address this limitation by retraining the model or introducing external conditioning signals to enforce temporal consistency. In this work, we explore whether a meaningful temporal representation can be extracted directly from the predictions of a pre-trained model without any additional training or auxiliary inputs. We introduce FlowMo, a novel training-free guidance method that enhances motion coherence using only the model's own predictions in each diffusion step. FlowMo first derives an appearance-debiased temporal representation by measuring the distance between latents corresponding to consecutive frames. This highlights the implicit temporal structure predicted by the model. It then estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling. Extensive experiments across multiple text-to-video models demonstrate that FlowMo significantly improves motion coherence without sacrificing visual quality or prompt alignment, offering an effective plug-and-play solution for enhancing the temporal fidelity of pre-trained video diffusion models.
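To make the two steps described in the abstract concrete, below is a minimal, hypothetical sketch of how a variance-based coherence objective and a gradient-guidance step could look in PyTorch. It assumes video latents shaped (B, C, T, H, W); the names `predict_x0`, `prompt_emb`, and `guidance_scale` are illustrative placeholders and not the paper's actual implementation.

```python
import torch

def flowmo_coherence_loss(x0_pred: torch.Tensor, patch_size: int = 4) -> torch.Tensor:
    """Sketch of a FlowMo-style coherence objective (not the official code).

    x0_pred: the model's clean-latent prediction at the current diffusion
             step, shaped (B, C, T, H, W).
    """
    # Appearance-debiased temporal representation: distances between the
    # latents of consecutive frames, which suppress static appearance and
    # highlight the predicted temporal structure.
    temporal = (x0_pred[:, :, 1:] - x0_pred[:, :, :-1]).abs()  # (B, C, T-1, H, W)

    # Tile the spatial dimensions into non-overlapping patches
    # (H and W are assumed divisible by patch_size).
    B, C, Tm1, H, W = temporal.shape
    patches = temporal.reshape(
        B, C, Tm1, H // patch_size, patch_size, W // patch_size, patch_size
    )

    # Patch-wise variance across the temporal dimension; high variance is
    # read as incoherent motion, so the mean variance serves as the loss.
    variance = patches.var(dim=2)
    return variance.mean()


def flowmo_guidance_step(model, x_t, t, prompt_emb, guidance_scale=1.0):
    """One guided sampling step: nudge the noisy latent to reduce the
    coherence loss (hedged sketch; `predict_x0` is a hypothetical helper
    returning the model's clean-latent estimate)."""
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = predict_x0(model, x_t, t, prompt_emb)
    loss = flowmo_coherence_loss(x0_pred)
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - guidance_scale * grad).detach()
```

Because the objective is computed only from the model's own per-step prediction, this style of guidance requires no retraining or external conditioning signal, matching the training-free, plug-and-play framing of the abstract.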
