LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Abstract

Unified video generation and editing models that can interpret interleaved multimodal inputs are a promising but challenging frontier. Existing unified frameworks predominantly rely on large models (typically 13B parameters or more) and incorporate the source video as an editing condition by concatenating its tokens with the target sequence. This concatenation doubles the sequence length, which quadruples the cost of self-attention and introduces prohibitive overhead. To address these bottlenecks, we present LoomVideo, an efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a DeepStack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing: the clean source video latent is scaled and added directly to the noised target latent. This design eliminates token concatenation, substantially reducing computational cost while retaining robust capability for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with particularly strong results in e-commerce and fashion generation scenarios. Owing to the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x inference speedup over models of similar capability, paving the way for practical and efficient video foundation models.
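
The contrast between token concatenation and additive conditioning can be illustrated with a minimal sketch. This is not the LoomVideo implementation: the function names, the toy one-scalar-per-token latents, and the fixed `scale=0.5` are all assumptions made for illustration, and the abstract does not say how the scale is chosen (it could be learned, per-channel, or timestep-dependent). The sketch only shows why concatenation doubles the sequence length and quadruples the number of attention pairs, while an elementwise add keeps the length unchanged.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs that full self-attention scores."""
    return seq_len * seq_len


def concat_condition(noised_target, clean_source):
    """Baseline: append source tokens to target tokens (length doubles)."""
    return noised_target + clean_source  # list concatenation


def scale_and_add_condition(noised_target, clean_source, scale=1.0):
    """Hypothetical additive conditioning: target + scale * source, elementwise."""
    if len(noised_target) != len(clean_source):
        raise ValueError("source and target latents must have the same length")
    return [t + scale * s for t, s in zip(noised_target, clean_source)]


# Toy latents: 4 tokens, one scalar each (real latents are tensors).
noised_target = [0.5, -1.0, 0.25, 2.0]
clean_source = [1.0, 1.0, -1.0, 0.0]

concat_tokens = concat_condition(noised_target, clean_source)
added_tokens = scale_and_add_condition(noised_target, clean_source, scale=0.5)

# Ratio of attention pairs: (2N)^2 / N^2 = 4 for any N.
cost_ratio = attention_pairs(len(concat_tokens)) / attention_pairs(len(added_tokens))
print(len(concat_tokens), len(added_tokens), cost_ratio)  # 8 4 4.0
```

The 4x figure here covers only the self-attention pair count; the end-to-end speedup reported in the abstract is a measured result and is not derived from this arithmetic.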