
LTX-Video: Realtime Video Latent Diffusion

December 30, 2024
Authors: Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, Ofir Bibi
cs.AI

Abstract

We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32 x 32 x 8 pixels per token, enabled by relocating the patchifying operation from the transformer's input to the VAE's input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the representation of fine details. To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space. This approach preserves the ability to generate fine details without incurring the runtime cost of a separate upsampling module. Our model supports diverse use cases, including text-to-video and image-to-video generation, with both capabilities trained simultaneously. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768x512 resolution in just 2 seconds on an Nvidia H100 GPU, outperforming all existing models of similar scale. The source code and pre-trained models are publicly available, setting a new benchmark for accessible and scalable video generation.
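
To make the quoted numbers concrete, here is a minimal back-of-the-envelope sketch of where the 1:192 compression ratio comes from, assuming RGB input and 128 channels per latent token; the latter is not stated in the abstract and is inferred here only so that the arithmetic closes.

```python
# Back-of-the-envelope check of the 1:192 compression ratio quoted above.
# Assumptions (not stated in the abstract): RGB input (3 channels) and
# 128 channels per latent token; the actual model configuration may differ.

patch_h, patch_w, patch_t = 32, 32, 8   # spatiotemporal downscaling per token
rgb_channels = 3                         # assumed input channels
latent_channels = 128                    # assumed channels per latent token

pixels_per_token = patch_h * patch_w * patch_t       # 8192 pixels
input_scalars = pixels_per_token * rgb_channels      # 24576 input values
compression = input_scalars / latent_channels        # 192.0

print(f"{input_scalars} input values -> {latent_channels} latent values "
      f"per token (1:{compression:.0f} compression)")
```

This also illustrates why relocating patchification to the VAE input matters: each token already summarizes a 32 x 32 x 8 pixel volume, so the transformer's full spatiotemporal self-attention operates over far fewer tokens than it would with a conventional, less aggressive VAE.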
