Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

March 2, 2026
Authors: Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
cs.AI

Abstract

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state of the art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.
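
The abstract does not spell out how the reference guidance is wired in; below is a minimal, hypothetical PyTorch sketch of the mechanism it names, in which a small set of learnable query tokens cross-attends over latent visual features of the reference image to produce guidance tokens for the editing backbone. All module names, dimensions, and the single-attention-layer design are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ReferenceGuidance(nn.Module):
    """Hypothetical module: learnable queries distill reference semantics."""

    def __init__(self, num_queries: int = 32, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens, trained to pull editing-relevant semantics
        # out of the reference image's latent features.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_latents: torch.Tensor) -> torch.Tensor:
        # ref_latents: (B, N, dim) latent visual features of the reference,
        # e.g. from a frozen vision encoder. Queries attend over them.
        q = self.queries.unsqueeze(0).expand(ref_latents.size(0), -1, -1)
        out, _ = self.attn(q, ref_latents, ref_latents)
        # The resulting (B, num_queries, dim) tokens could then be injected
        # into the video editing backbone, e.g. via its cross-attention layers.
        return self.norm(out + q)

# Example: distill 256 reference feature tokens into 32 guidance tokens.
tokens = ReferenceGuidance()(torch.randn(2, 256, 768))
print(tokens.shape)  # torch.Size([2, 32, 768])
```

If the architecture works along these lines, the appeal of the design is that a long reference feature sequence is compressed into a fixed, small token budget, making it cheap to inject reference semantics into every layer of a video backbone.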