Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Abstract

Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
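The two mechanisms named in the abstract, the token dependency strategy (bidirectional within a frame, causal across frames) and temporal tube masking, can be illustrated with a minimal pure-Python sketch. This is not the Lumos-1 implementation: the function names, the flat token layout, the omission of text tokens, and the `mask_ratio` parameter are all assumptions made for illustration, and "tube masking" is read here in its usual sense of masking the same spatial positions in every frame.

```python
import random


def frame_causal_mask(num_frames, tokens_per_frame):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed layout: video tokens are flattened frame by frame. A token sees
    every token in its own frame (intra-frame bidirectionality) and every
    token in earlier frames (inter-frame temporal causality).
    """
    # Frame index of each flattened token, e.g. [0, 0, 1, 1] for 2 frames x 2 tokens.
    frame_of = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    return [[fq >= fk for fk in frame_of] for fq in frame_of]


def temporal_tube_mask(num_frames, tokens_per_frame, mask_ratio, rng):
    """Return mask[f][i] == True when token i of frame f is masked.

    One random spatial pattern is sampled and repeated for every frame, so
    the masked positions form "tubes" along the time axis. mask_ratio is a
    hypothetical hyperparameter, not a value taken from the paper.
    """
    num_masked = round(mask_ratio * tokens_per_frame)
    masked = set(rng.sample(range(tokens_per_frame), num_masked))
    pattern = [i in masked for i in range(tokens_per_frame)]
    return [list(pattern) for _ in range(num_frames)]


if __name__ == "__main__":
    for row in frame_causal_mask(num_frames=2, tokens_per_frame=2):
        print(["x" if allowed else "." for allowed in row])
    tube = temporal_tube_mask(3, 4, 0.5, random.Random(0))
    print(tube)
```

In the first function, tokens of a later frame can see the whole of every earlier frame, while tokens of an earlier frame cannot see later frames. In the second, repeating one spatial pattern over time means a masked position cannot simply be copied from the same location in a neighboring frame, which is one plausible reading of how tube masking counters the spatial information redundancy the abstract mentions.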
