EasyV2V: A High-quality Instruction-based Video Editing Framework
December 18, 2025
Authors: Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei
cs.AI
Abstract
While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
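To make the two core architectural ideas concrete, below is a minimal PyTorch-style sketch of sequence-concatenation conditioning and a LoRA-wrapped projection layer. It is an illustration under assumptions, not the paper's implementation: the token layout, the `LoRALinear` and `concat_condition` helpers, the rank, and the way the mask is applied are all hypothetical choices made for the example; the abstract only states that sequence concatenation plus light LoRA fine-tuning and a single mask mechanism are used.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a low-rank trainable update (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


def concat_condition(noisy_latents, source_latents, mask=None):
    """Sequence-concatenation conditioning: append the clean source-video tokens
    after the noisy target tokens along the token axis, so a pretrained
    text-to-video transformer can attend to the source without extra modules.
    Shapes: (batch, tokens, dim). The optional binary mask is a stand-in for a
    single spatiotemporal mask restricting which source tokens are editable."""
    if mask is not None:
        source_latents = source_latents * mask
    return torch.cat([noisy_latents, source_latents], dim=1)


if __name__ == "__main__":
    b, n, d = 2, 1024, 64
    noisy = torch.randn(b, n, d)             # noisy target-video latents
    source = torch.randn(b, n, d)            # latents of the input video to edit
    mask = (torch.rand(b, n, 1) > 0.5).float()

    tokens = concat_condition(noisy, source, mask)   # (2, 2048, 64)

    attn_proj = nn.Linear(d, d)              # stand-in for one frozen DiT projection
    lora_proj = LoRALinear(attn_proj, rank=8)
    out = lora_proj(tokens)
    print(tokens.shape, out.shape)
```

In such a setup, only the LoRA down/up matrices would be optimized during fine-tuning, which is consistent with the abstract's claim that a light adaptation of a pretrained text-to-video model suffices for instruction-based editing.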