OmniWeaving: A Unified Video Generation Framework for Free-Form Composition and Reasoning

Abstract

While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives lag significantly behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to integrate diverse tasks seamlessly within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model with strong multimodal composition and reasoning-informed capabilities. Trained on a massive-scale pretraining dataset that covers diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs, and to act as an intelligent agent that infers complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves state-of-the-art performance among open-source unified models. The code and model will be made publicly available soon. Project page: https://omniweaving.github.io.