Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Abstract

Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net, a workhorse behind image generation, scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5 times faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. User studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/.

