Helios: A Truly Real-Time Generative Model for Long Video

Abstract

We introduce Helios, the first 14B-parameter video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV caching, sparse or linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training while also eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to, or lower than, those of 1.3B video generation models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.