LoomVideo: Unifying Multimodal Inputs in Video Generation and Editing

Abstract

Developing unified video generation and editing models that can interpret interleaved multimodal inputs is a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate the source video as an editing condition by concatenating its tokens to the input sequence. This concatenation doubles the sequence length and, because self-attention cost grows quadratically with sequence length, roughly quadruples the attention computation, introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a DeepStack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling the clean source video latent and adding it directly to the noised target latent, this design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with particularly strong results in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.
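To make the cost argument concrete, the following is a minimal sketch contrasting token concatenation with a scale-and-add style of conditioning. It is not the LoomVideo implementation: the function names, the toy latent shapes, and the fixed `scale` value are illustrative assumptions, and the abstract does not say whether the scale is learned or where in the network the addition happens. The sketch only shows that concatenation doubles the token count (and so quadruples the number of query-key pairs in full self-attention), while element-wise addition leaves the sequence length unchanged.

```python
from typing import List

Token = List[float]  # one latent token as a flat feature vector (toy representation)


def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def concat_condition(noisy_target: List[Token], source: List[Token]) -> List[Token]:
    """Condition by appending source tokens: sequence length becomes the sum of both."""
    return noisy_target + source


def scale_and_add_condition(
    noisy_target: List[Token], source: List[Token], scale: float
) -> List[Token]:
    """Condition by computing noisy_target + scale * source element-wise.

    Assumes both latents have identical shape; the sequence length is unchanged.
    The scale is a plain float here purely for illustration.
    """
    if len(noisy_target) != len(source):
        raise ValueError("source and target latents must have the same number of tokens")
    return [
        [t + scale * s for t, s in zip(target_tok, source_tok)]
        for target_tok, source_tok in zip(noisy_target, source)
    ]


if __name__ == "__main__":
    # Toy latents: 4 tokens with 2 features each (hypothetical sizes).
    noisy_target = [[1.0, 2.0] for _ in range(4)]
    source = [[0.5, 0.5] for _ in range(4)]

    concatenated = concat_condition(noisy_target, source)
    added = scale_and_add_condition(noisy_target, source, scale=2.0)

    print("concat tokens:", len(concatenated),
          "attention pairs:", attention_pair_count(len(concatenated)))
    print("scale-and-add tokens:", len(added),
          "attention pairs:", attention_pair_count(len(added)))
```

In this toy setting the concatenated sequence has 8 tokens and 64 attention pairs, against 4 tokens and 16 pairs for the added sequence, which is the fourfold gap described in the abstract. Real speedups depend on the rest of the model and are not predicted by this count alone.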