Dual-Stream Diffusion Net for Text-to-Video Generation
August 16, 2023
Authors: Binhui Liu, Xin Liu, Anbo Dai, Zhiyong Zeng, Zhen Cui, Jian Yang
cs.AI
Abstract
With the emergence of diffusion models, text-to-video generation has recently attracted increasing attention. An important bottleneck, however, is that generated videos often carry flickers and artifacts. In this work, we propose a dual-stream diffusion net (DSDN) to improve the consistency of content variations in generated videos. In particular, the two designed diffusion streams, a video content branch and a motion branch, not only run separately in their private spaces to produce personalized video content and variations, but are also kept well aligned between the content and motion domains through our designed cross-transformer interaction module, which benefits the smoothness of the generated videos. In addition, we introduce a motion decomposer and combiner to facilitate the operation on video motion. Qualitative and quantitative experiments demonstrate that our method can produce impressive, continuous videos with fewer flickers.
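To make the dual-stream idea concrete, below is a minimal PyTorch sketch of how such an architecture could be wired together: a motion decomposer/combiner based on simple frame differencing, two private denoising streams, and a bidirectional cross-attention block standing in for the cross-transformer interaction module. All class names, tensor shapes, and the differencing heuristic are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the dual-stream design described in the abstract.
# The decomposition heuristic, module names, and shapes are assumptions.
import torch
import torch.nn as nn


class MotionDecomposer(nn.Module):
    """Split a video latent into a static content part and a residual motion part."""
    def forward(self, video_latent):  # (B, T, C, H, W)
        content = video_latent.mean(dim=1, keepdim=True)  # coarse static content
        motion = video_latent - content                   # frame-wise residual motion
        return content.squeeze(1), motion


class MotionCombiner(nn.Module):
    """Recombine denoised content and motion latents into a video latent."""
    def forward(self, content, motion):  # (B, C, H, W), (B, T, C, H, W)
        return content.unsqueeze(1) + motion


class CrossTransformerInteraction(nn.Module):
    """Cross-attention in both directions to align content and motion features."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.c2m = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.m2c = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, content_tokens, motion_tokens):  # (B, Nc, D), (B, Nm, D)
        content_out, _ = self.c2m(content_tokens, motion_tokens, motion_tokens)
        motion_out, _ = self.m2c(motion_tokens, content_tokens, content_tokens)
        return content_tokens + content_out, motion_tokens + motion_out


class DualStreamDenoiser(nn.Module):
    """One denoising step: two private streams plus a shared interaction block."""
    def __init__(self, dim):
        super().__init__()
        self.content_stream = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.motion_stream = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.interaction = CrossTransformerInteraction(dim)

    def forward(self, content_tokens, motion_tokens):
        content_tokens = self.content_stream(content_tokens)  # private content update
        motion_tokens = self.motion_stream(motion_tokens)      # private motion update
        return self.interaction(content_tokens, motion_tokens) # cross-domain alignment
```

In a full system, these blocks would sit inside a text-conditioned diffusion loop, with the decomposer applied before denoising and the combiner restoring the video latent afterwards; the sketch only shows how the two streams could stay separate yet exchange information at each step.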