LoomVideo: Unified Video Generation and Editing with Multimodal Inputs

Abstract

Developing unified video generation and editing models that can interpret interleaved multimodal inputs is a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate the source video as an editing condition by concatenating its tokens with the target sequence. This concatenation doubles the sequence length, which roughly quadruples the cost of self-attention (quadratic in sequence length) and introduces prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling the clean source video latent and adding it directly to the noised target latent, this design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with particular strength in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x inference speedup over models of similar capabilities, paving the way for highly practical and efficient video foundation models.
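
The contrast between token concatenation and Scale-and-Add conditioning can be illustrated with a toy sketch. It is not the LoomVideo implementation: the function names, the list-of-lists latent layout, and the scale value of 0.5 are all assumptions made for illustration, since the abstract does not say how the scale is chosen. The sketch only counts query-key pairs in full self-attention and does not attempt to reproduce the reported 5.41x speedup.

```python
from typing import List

Latent = List[List[float]]  # toy latent: one row of channel values per token


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def concat_condition(source: Latent, noisy_target: Latent) -> Latent:
    """Token concatenation: the source tokens are appended to the sequence."""
    return noisy_target + source


def scale_and_add(source: Latent, noisy_target: Latent, scale: float) -> Latent:
    """Element-wise conditioning: noisy_target + scale * source, same length."""
    if len(source) != len(noisy_target):
        raise ValueError("source and target latents must have the same shape")
    return [
        [t + scale * s for s, t in zip(src_tok, tgt_tok)]
        for src_tok, tgt_tok in zip(source, noisy_target)
    ]


# Toy latents: 4 tokens, 2 channels each (values chosen to be exact in floats).
source = [[2.0, 4.0]] * 4
noisy_target = [[1.0, 1.0]] * 4

concatenated = concat_condition(source, noisy_target)
added = scale_and_add(source, noisy_target, scale=0.5)  # 0.5 is an assumed value

# Concatenation doubles the token count, so attention scores 4x as many pairs.
print(len(concatenated), attention_pairs(len(concatenated)))
# Scale-and-add keeps the token count of the target latent unchanged.
print(len(added), attention_pairs(len(added)))
```

In this toy setting the concatenated sequence has twice as many tokens and therefore four times as many attention pairs, while the scale-and-add result keeps the target's sequence length, which is the sense in which the conditioning adds no sequence overhead.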