DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder

Abstract

We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting the model to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on NVIDIA H100 GPUs. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen
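
To give a rough sense of what 32x/64x spatial and 4x temporal compression mean for latent size, the sketch below counts latent grid positions for a 2160x3840 clip. It is only an illustration under stated assumptions: the 80-frame clip length, the 8x spatial reference point, and the plain ceil-division are hypothetical choices made here, and the function names are not from the DC-VideoGen codebase. It also ignores latent channel count, any patchification inside the diffusion model, and how the chunk-causal design treats individual frames.

```python
from math import ceil


def latent_grid(frames, height, width, spatial, temporal):
    """Return the latent grid (T, H, W) for a video of the given size.

    Assumption: each axis is simply divided by its compression factor
    and rounded up; a real autoencoder may pad or crop differently.
    """
    return (
        ceil(frames / temporal),
        ceil(height / spatial),
        ceil(width / spatial),
    )


def latent_positions(frames, height, width, spatial, temporal):
    """Return the number of spatio-temporal positions in the latent grid."""
    t, h, w = latent_grid(frames, height, width, spatial, temporal)
    return t * h * w


if __name__ == "__main__":
    # Hypothetical clip: 80 frames at the 2160x3840 resolution from the abstract.
    frames, height, width = 80, 2160, 3840
    temporal = 4  # 4x temporal compression, as stated in the abstract

    # 8x is a hypothetical reference point; 32x and 64x come from the abstract.
    reference = latent_positions(frames, height, width, 8, temporal)
    for spatial in (8, 32, 64):
        grid = latent_grid(frames, height, width, spatial, temporal)
        count = latent_positions(frames, height, width, spatial, temporal)
        print(f"{spatial:>2}x spatial: grid={grid}, positions={count}, "
              f"reduction vs 8x={reference / count:.1f}x")
```

Because the position count shrinks with the square of the spatial factor, moving from the hypothetical 8x reference to 32x or 64x cuts the grid by roughly 16x or 64x. This is a count of latent positions only, not a prediction of the latency figures reported in the abstract.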
