FlowLong: Inference-Time Long Video Generation via Manifold-Constrained Tweedie Matching

Abstract

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and degrade in quality over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows: the predicted clean samples from adjacent windows are blended via Tweedie matching, which enforces both the manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes the per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, after which sampling switches to a deterministic ODE to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and it further extends to joint audio-video generation and text-to-3DGS (3D Gaussian splatting) without any fine-tuning.
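The sampling loop described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the hypothetical `predict_clean(x_t, t)` callable stands in for a model's clean-sample (Tweedie) estimate, plain averaging over overlapping frames stands in for Tweedie matching, the interpolation x_t = (1 - t) x_0 + t * noise is an assumed rectified-flow convention, a single shared latent replaces the per-window trajectories, and `t_switch` is a made-up threshold separating the stochastic phase from the deterministic one.

```python
import numpy as np


def window_starts(num_frames, window, stride):
    """Start indices of overlapping windows that together cover all frames."""
    if window > num_frames:
        raise ValueError("window must not exceed num_frames")
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] != num_frames - window:
        starts.append(num_frames - window)  # extra window so the tail is covered
    return starts


def blend_clean_estimates(x_t, t, predict_clean, window, stride):
    """Average per-window clean estimates; overlap frames receive several votes.

    Plain averaging is an assumption standing in for Tweedie matching.
    """
    num_frames = x_t.shape[0]
    total = np.zeros_like(x_t)
    count = np.zeros((num_frames,) + (1,) * (x_t.ndim - 1))
    for s in window_starts(num_frames, window, stride):
        total[s:s + window] += predict_clean(x_t[s:s + window], t)
        count[s:s + window] += 1.0
    return total / count


def sample_long(predict_clean, num_frames, dim, window, stride,
                num_steps=20, t_switch=0.6, seed=0):
    """Toy long-sequence sampler: stochastic early phase, deterministic late phase."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames, dim))  # start from pure noise at t = 1
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x0 = blend_clean_estimates(x, t, predict_clean, window, stride)
        if t > t_switch:
            # High-noise phase: re-noise the blended estimate with fresh noise.
            noise = rng.standard_normal(x.shape)
        else:
            # Low-noise phase: reuse the noise implied by the current state,
            # which gives a deterministic Euler-style update.
            noise = (x - (1.0 - t) * x0) / t
        x = (1.0 - t_next) * x0 + t_next * noise
    return x


# Usage with a placeholder denoiser that always predicts an all-ones clean sample.
video = sample_long(lambda x_t, t: np.ones_like(x_t),
                    num_frames=10, dim=3, window=4, stride=3)
```

In this sketch every overlap frame is denoised once per window that contains it, so the blended estimate is where adjacent windows are made to agree. Re-noising that estimate in the early phase means all windows restart from a shared clean estimate at each high-noise step.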