FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

May 20, 2026
Authors: Jangho Park, Geon Yeong Park, Gihyun Kwon, Jong Chul Ye
cs.AI

Abstract

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories. Extensions of bidirectional models are tightly coupled to specific architectures and degrade in quality as the generated length grows. Autoregressive models accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel yet simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both a manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality. It further extends to audio-video joint generation and text-to-3D Gaussian splatting (text-to-3DGS) without any fine-tuning.
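
To make the pipeline in the abstract concrete, here is a minimal NumPy sketch of its three ingredients: overlapping sliding windows, blending of per-window clean predictions, and a switch from stochastic to deterministic sampling. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the exact form of Tweedie matching, so plain averaging over overlaps stands in for it. The rectified-flow parameterization x_t = (1 - t) x0 + t eps, the function names, the `stochastic_until` threshold, and the toy velocity model are all hypothetical.

```python
import numpy as np

def window_starts(total, win, overlap):
    """Start indices of overlapping windows that together cover `total` frames."""
    stride = win - overlap
    starts = list(range(0, total - win + 1, stride))
    if starts[-1] + win < total:      # add a last window flush with the end
        starts.append(total - win)
    return starts

def blend_clean(preds, starts, total, win):
    """Average per-window clean predictions wherever windows overlap.
    Assumption: simple averaging stands in for the paper's Tweedie matching."""
    acc = np.zeros((total,) + preds[0].shape[1:])
    cnt = np.zeros((total,) + (1,) * (preds[0].ndim - 1))
    for p, s in zip(preds, starts):
        acc[s:s + win] += p
        cnt[s:s + win] += 1
    return acc / cnt

def toy_velocity(x_t, t):
    """Hypothetical stand-in for a video model whose clean target is all zeros.
    With x0 = 0, x_t = t * eps, so the velocity eps - x0 equals x_t / t."""
    return x_t / t

def sample_long(total=40, win=16, overlap=8, dim=4, steps=20,
                stochastic_until=0.6, seed=0, velocity=toy_velocity):
    rng = np.random.default_rng(seed)
    starts = window_starts(total, win, overlap)
    x = rng.standard_normal((total, dim))        # start from pure noise at t = 1
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        # Per-window clean estimate: x0_hat = x_t - t * v(x_t, t).
        preds = [x[s:s + win] - t * velocity(x[s:s + win], t) for s in starts]
        x0 = blend_clean(preds, starts, total, win)
        if t > stochastic_until:
            eps = rng.standard_normal(x.shape)   # high-noise phase: fresh noise
        else:
            eps = (x - (1 - t) * x0) / t         # low-noise phase: reuse implied noise
        x = (1 - t_next) * x0 + t_next * eps     # move to the next noise level
    return x
```

In the low-noise branch, reusing the noise implied by the current state makes the update equal to an Euler step of the ODE along the blended velocity, so that phase is deterministic. Calling `sample_long()` with the toy model returns an array of shape `(40, 4)`, that is, 40 frames produced from a 16-frame window.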