Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
October 17, 2025
Authors: Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen
cs.AI
Abstract
Instruction-based video editing promises to democratize content creation, yet
its progress is severely hampered by the scarcity of large-scale, high-quality
training data. We introduce Ditto, a holistic framework designed to tackle this
fundamental challenge. At its heart, Ditto features a novel data generation
pipeline that fuses the creative diversity of a leading image editor with an
in-context video generator, overcoming the limited scope of existing models. To
make this process viable, our framework resolves the prohibitive cost-quality
trade-off by employing an efficient, distilled model architecture augmented by
a temporal enhancer, which simultaneously reduces computational overhead and
improves temporal coherence. Finally, to achieve full scalability, this entire
pipeline is driven by an intelligent agent that crafts diverse instructions and
rigorously filters the output, ensuring quality control at scale. Using this
framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of
one million high-fidelity video editing examples. We trained our model, Editto,
on Ditto-1M with a curriculum learning strategy. The results demonstrate
superior instruction-following ability and establish a new state of the art in
instruction-based video editing.