EasyV2V: A High-quality Instruction-based Video Editing Framework

December 18, 2025
Authors: Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei
cs.AI

Abstract

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
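The data recipe mentions "pseudo pairs with shared affine motion": animating an (image, edited image) pair with one common motion trajectory so the resulting clips differ only by the edit. Below is a minimal sketch of that idea using torchvision's affine transform; the function name, parameter ranges, and the linear trajectory are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: lift an image edit pair into a pseudo video pair by applying the
# SAME affine trajectory to both images. Hypothetical helper, not EasyV2V code.
import torch
import torchvision.transforms.functional as TF

def make_pseudo_video_pair(src_img, edited_img, num_frames=16,
                           max_angle=10.0, max_shift=30, max_scale=0.1):
    """src_img, edited_img: tensors of shape (C, H, W). Returns two clips
    of shape (T, C, H, W) that share motion but differ only by the edit."""
    # Sample one affine trajectory endpoint, shared by both clips.
    angle = (torch.rand(1).item() * 2 - 1) * max_angle
    dx = (torch.rand(1).item() * 2 - 1) * max_shift
    dy = (torch.rand(1).item() * 2 - 1) * max_shift
    scale = 1.0 + (torch.rand(1).item() * 2 - 1) * max_scale

    src_frames, edt_frames = [], []
    for t in range(num_frames):
        a = t / (num_frames - 1)  # interpolate linearly along the trajectory
        params = dict(angle=a * angle,
                      translate=[int(a * dx), int(a * dy)],
                      scale=1.0 + a * (scale - 1.0),
                      shear=[0.0, 0.0])
        src_frames.append(TF.affine(src_img, **params))
        edt_frames.append(TF.affine(edited_img, **params))
    return torch.stack(src_frames), torch.stack(edt_frames)
```

Because both clips are warped identically, a model trained on such pairs is supervised to keep motion intact while changing only appearance.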
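On the model side, "simple sequence concatenation for conditioning with light LoRA fine-tuning" suggests joining source-video tokens and noisy target tokens into one sequence so the pretrained backbone attends across both, while training only low-rank adapters. A minimal sketch follows, assuming a DiT-style token interface; the module names, LoRA placement, and shapes are assumptions, not EasyV2V's disclosed architecture.

```python
# Sketch: sequence-concat conditioning + a hand-rolled LoRA wrapper.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def concat_condition(source_tokens, noisy_target_tokens):
    """Concatenate source-video tokens with noisy target tokens along the
    sequence axis: (B, N_src, D) + (B, N_tgt, D) -> (B, N_src + N_tgt, D).
    The backbone then conditions on the source via self-attention alone."""
    return torch.cat([source_tokens, noisy_target_tokens], dim=1)

# Usage idea: wrap attention projections in each block, e.g.
#   block.attn.to_q = LoRALinear(block.attn.to_q)
# so only the adapters (and no base weights) receive gradients.
```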
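For control, one plausible reading of the "single mask mechanism" is a lone binary tensor over (T, H, W): marking whole frames editable gives temporal control, restricting a spatial box gives spatial control, and combining both gives joint spatiotemporal control. The sketch below is a hypothetical interpretation, not the paper's implementation.

```python
# Sketch: one spatiotemporal mask covering both "when" and "where" to edit.
import torch

def make_mask(num_frames, height, width, frame_range=None, box=None):
    """Returns a (T, H, W) float mask; 1 = editable, 0 = preserved."""
    mask = torch.zeros(num_frames, height, width)
    t0, t1 = frame_range if frame_range else (0, num_frames)  # temporal extent
    if box is None:
        mask[t0:t1] = 1.0                  # temporal-only: whole frames editable
    else:
        y0, y1, x0, x1 = box
        mask[t0:t1, y0:y1, x0:x1] = 1.0    # joint spatiotemporal control
    return mask
```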