Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
October 17, 2025
Authors: Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen
cs.AI
Abstract
Instruction-based video editing promises to democratize content creation, yet
its progress is severely hampered by the scarcity of large-scale, high-quality
training data. We introduce Ditto, a holistic framework designed to tackle this
fundamental challenge. At its heart, Ditto features a novel data generation
pipeline that fuses the creative diversity of a leading image editor with an
in-context video generator, overcoming the limited scope of existing models. To
make this process viable, our framework resolves the prohibitive cost-quality
trade-off by employing an efficient, distilled model architecture augmented by
a temporal enhancer, which simultaneously reduces computational overhead and
improves temporal coherence. Finally, to achieve full scalability, this entire
pipeline is driven by an intelligent agent that crafts diverse instructions and
rigorously filters the output, ensuring quality control at scale. Using this
framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of
one million high-fidelity video editing examples. We trained our model, Editto,
on Ditto-1M with a curriculum learning strategy. The results demonstrate
superior instruction-following ability and establish a new state-of-the-art in
instruction-based video editing.
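The abstract mentions training Editto on Ditto-1M with a curriculum learning strategy but does not spell out the schedule. As a minimal sketch of the general technique, assuming a hypothetical per-sample difficulty score and stage count (neither is specified in the paper), a curriculum can be expressed as staged exposure to progressively harder examples:

```python
# Hypothetical curriculum-learning schedule. The difficulty scores, stage
# count, and example names below are illustrative assumptions, not the
# paper's actual training recipe.

def curriculum_batches(samples, difficulty, n_stages=3):
    """Yield the training pool for each stage, easiest samples first.

    samples:    list of training examples
    difficulty: callable mapping a sample to a float (lower = easier)
    n_stages:   number of curriculum stages
    """
    ordered = sorted(samples, key=difficulty)
    stage_size = (len(ordered) + n_stages - 1) // n_stages  # ceil division
    for stage in range(n_stages):
        # Each stage keeps all previously seen (easier) data and adds the
        # next difficulty slice, so early skills are continually revisited.
        yield ordered[: (stage + 1) * stage_size]

# Toy usage: six editing tasks scored by an assumed "edit complexity".
examples = ["recolor", "style", "add-object", "remove-object", "relight", "motion"]
scores = {"recolor": 0.1, "style": 0.2, "add-object": 0.5,
          "remove-object": 0.6, "relight": 0.8, "motion": 0.9}
stages = list(curriculum_batches(examples, scores.get))
print([len(s) for s in stages])  # → [2, 4, 6]
```

Growing each stage's pool rather than replacing it is one common design choice; it trades some extra compute on easy samples for stability against forgetting.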