Helios: Real Real-Time Long Video Generation Model

Abstract

We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV caching, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to, or lower than, those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
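To make the real-time claim concrete, the following is a minimal arithmetic sketch, not part of the Helios codebase. Only the 19.5 FPS throughput comes from the abstract; the 16 FPS playback rate, the 60-second clip length, and the helper names `frames_needed` and `generation_time_s` are assumptions chosen for illustration.

```python
def frames_needed(duration_s: float, playback_fps: float) -> int:
    """Number of frames in a clip of the given duration and playback rate."""
    return round(duration_s * playback_fps)


def generation_time_s(num_frames: int, generation_fps: float) -> float:
    """Wall-clock seconds to generate num_frames at a steady generation rate."""
    return num_frames / generation_fps


GENERATION_FPS = 19.5  # throughput reported in the abstract (single H100)
PLAYBACK_FPS = 16.0    # ASSUMPTION: playback rate is not stated in the abstract
CLIP_SECONDS = 60.0    # ASSUMPTION: one concrete "minute-scale" clip length

frames = frames_needed(CLIP_SECONDS, PLAYBACK_FPS)
wall_clock = generation_time_s(frames, GENERATION_FPS)

# A factor above 1.0 means frames are produced faster than they are played back.
real_time_factor = CLIP_SECONDS / wall_clock

print(f"frames to generate: {frames}")
print(f"wall-clock time:    {wall_clock:.1f} s")
print(f"real-time factor:   {real_time_factor:.2f}x")
```

Under these assumptions the generator stays ahead of playback whenever the playback rate is below 19.5 FPS; at a higher playback rate the same throughput would no longer count as real time.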
