The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Abstract

Recent advances in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. Each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that combines the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a compression ratio of roughly 14,000×; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source model Hunyuan Video (13B) as well as commercial models such as Sora, Keling, and Hailuo. Our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.
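The abstract does not say how the roughly 14,000× compression ratio is computed. As a minimal sketch, assuming the ratio compares the bits in the raw pixel clip against the bits needed to index the discrete tokens, the arithmetic could look like the following. The function name, its parameters, and the example values are all hypothetical and are not taken from the paper.

```python
import math


def compression_ratio(frames, height, width, channels, bits_per_value,
                      num_tokens, codebook_size):
    """Ratio of raw video bits to discrete-token bits.

    Assumption (not from the paper): each token costs log2(codebook_size)
    bits, and the raw clip is stored uncompressed.
    """
    raw_bits = frames * height * width * channels * bits_per_value
    token_bits = num_tokens * math.log2(codebook_size)
    return raw_bits / token_bits


if __name__ == "__main__":
    # Hypothetical clip and tokenizer settings, chosen only for illustration.
    ratio = compression_ratio(
        frames=8, height=480, width=720, channels=3, bits_per_value=8,
        num_tokens=200, codebook_size=2048,
    )
    print(f"compression ratio: {ratio:,.0f}x")
```

Under this definition the ratio grows with clip resolution and shrinks as the token count or codebook size increases, which is why a short 1D token sequence can represent a clip at a very high ratio.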
