StreamDiT: Real-Time Streaming Text-to-Video Generation

July 4, 2025
Authors: Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao
cs.AI

Abstract

Recently, text-to-video (T2V) generation has made great progress by scaling transformer-based diffusion models to billions of parameters, enabling the generation of high-quality videos. However, existing models typically produce only short clips offline, which restricts their use in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT is trained with flow matching augmented by a moving buffer. We design mixed training with different partitioning schemes of the buffered frames to boost both content consistency and visual quality. The StreamDiT architecture is based on adaLN DiT with varying time embedding and window attention. To put the proposed method into practice, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT, in which sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, generating video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications such as streaming generation, interactive generation, and video-to-video. Video results and more examples are available on our project website: https://cumulo-autumn.github.io/StreamDiT/
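The moving-buffer idea and the NFE count mentioned in the abstract can be illustrated with a toy bookkeeping loop. This is only a sketch under stated assumptions, not the authors' implementation: the function name `stream_chunks`, the staggered warm-up, and the rule that one function evaluation advances every chunk in the buffer by exactly one noise level are all hypothetical simplifications. No model is called; the code tracks only how many evaluations each chunk would receive before it leaves the buffer.

```python
from collections import deque


def stream_chunks(num_chunks_in_buffer, num_outputs):
    """Toy moving-buffer loop (hypothetical; no real model is evaluated).

    Returns a list of (chunk_id, evaluations_received) in emission order.
    """
    k = num_chunks_in_buffer
    # Assumed warm-up: chunk i needs i + 1 more steps, so noise levels are
    # staggered and the front of the buffer is always the cleanest chunk.
    buffer = deque(
        {"id": i, "steps_left": i + 1, "evals": 0} for i in range(k)
    )
    next_id = k
    emitted = []
    while len(emitted) < num_outputs:
        # One function evaluation is assumed to denoise every chunk in the
        # buffer by one level at once.
        for chunk in buffer:
            chunk["steps_left"] -= 1
            chunk["evals"] += 1
        # The front chunk is now fully denoised: emit it.
        done = buffer.popleft()
        assert done["steps_left"] == 0
        emitted.append((done["id"], done["evals"]))
        # Slide the buffer: append a fresh, fully noisy chunk at the back.
        buffer.append({"id": next_id, "steps_left": k, "evals": 0})
        next_id += 1
    return emitted


if __name__ == "__main__":
    for chunk_id, evals in stream_chunks(num_chunks_in_buffer=4, num_outputs=6):
        print(f"chunk {chunk_id}: {evals} evaluation(s) before emission")
```

In this toy setting, once the warm-up chunks have been flushed, every emitted chunk has passed through as many evaluations as there are chunks in the buffer, while each evaluation emits one chunk. The first few chunks receive fewer evaluations only because of the assumed warm-up.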