

Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling

March 11, 2025
Authors: Subin Kim, Seoung Wug Oh, Jui-Hsien Wang, Joon-Young Lee, Jinwoo Shin
cs.AI

Abstract

While recent advancements in text-to-video diffusion models enable high-quality short video generation from a single prompt, generating real-world long videos in a single pass remains challenging due to limited data and high computational costs. To address this, several works propose tuning-free approaches, i.e., extending existing models for long video generation, specifically using multiple prompts to allow for dynamic and controlled content changes. However, these methods primarily focus on ensuring smooth transitions between adjacent frames, often leading to content drift and a gradual loss of semantic coherence over longer sequences. To tackle such an issue, we propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video, ensuring long-range consistency across both adjacent and distant frames. Our approach combines two complementary sampling strategies: reverse and optimization-based sampling, which ensure seamless local transitions and enforce global coherence, respectively. However, directly alternating between these samplings misaligns denoising trajectories, disrupting prompt guidance and introducing unintended content changes as they operate independently. To resolve this, SynCoS synchronizes them through a grounded timestep and a fixed baseline noise, ensuring fully coupled sampling with aligned denoising paths. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence, outperforming previous approaches both quantitatively and qualitatively.
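The abstract describes the mechanics only at a high level, so the sketch below is one hypothetical reading of it, not the paper's released implementation. Everything concrete here is an assumption: the `denoiser(frames, t, prompt)` interface, the cumulative-alpha schedule, the chunking scheme, the prompt-to-chunk mapping, and the use of simple averaging of overlapping clean estimates as a stand-in for the paper's optimization-based stage. What the sketch does preserve from the abstract is the coupling itself: every chunk is denoised at the same grounded timestep, and the fused global estimate is mapped back onto a single trajectory with one fixed baseline noise drawn once at the start.

```python
# Hypothetical sketch of synchronized coupled sampling, written only from the
# abstract. All names and update rules are illustrative assumptions.
import torch

def sample_syncos_like(denoiser, prompts, video_shape, timesteps, alphas,
                       chunk=16, stride=8, device="cpu"):
    """video_shape = (T, C, H, W); timesteps is a descending list of ints;
    alphas holds cumulative alpha-bar values indexed by timestep."""
    T = video_shape[0]
    x = torch.randn(video_shape, device=device)           # x_T for the whole video
    base_noise = torch.randn(video_shape, device=device)  # fixed baseline noise, drawn once

    # Overlapping chunk starts; force coverage of the final frames.
    starts = sorted(set(range(0, T - chunk + 1, stride)) | {T - chunk})

    for i, t in enumerate(timesteps):                     # same grounded t for all chunks
        a_t = alphas[t]
        a_prev = alphas[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)

        # Stage 1: local reverse sampling on overlapping chunks, each guided by
        # its own event prompt (smooth adjacent transitions).
        x0_sum = torch.zeros_like(x)
        count = torch.zeros(T, *([1] * (x.dim() - 1)), device=device)
        for s in starts:
            seg = x[s:s + chunk]
            prompt = prompts[min(len(prompts) - 1, s * len(prompts) // T)]
            eps = denoiser(seg, t, prompt)
            # Chunk-wise clean estimate from the standard eps-parameterization.
            x0 = (seg - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            x0_sum[s:s + chunk] += x0
            count[s:s + chunk] += 1

        # Stage 2: global fusion as a simple stand-in for the optimization-based
        # update; averaging overlapping clean estimates is the closed-form
        # minimizer of their summed squared disagreement (global coherence).
        x0_global = x0_sum / count.clamp(min=1)

        # Synchronization: re-noise the shared estimate back onto the SAME
        # denoising path using the grounded timestep and the fixed baseline noise,
        # so both stages stay coupled instead of drifting apart.
        x = a_prev.sqrt() * x0_global + (1 - a_prev).sqrt() * base_noise
    return x
```

The design point the abstract stresses is visible in the last line: because every chunk shares one timestep schedule and one `base_noise`, the local and global stages update a single trajectory rather than two independent ones, which is what (per the abstract) prevents the misaligned trajectories and prompt-guidance drift that naive alternation causes.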
