Region-Constraint In-Context Generation for Instructional Video Editing
December 19, 2025
Authors: Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei
cs.AI
Abstract
The in-context generation paradigm has recently demonstrated strong data efficiency and synthesis quality in instructional image editing. Nevertheless, adapting such in-context learning to instruction-based video editing is not trivial. Without explicitly specified editing regions, the results can suffer from inaccurate editing regions and from token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that explicitly models the constraints between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo introduces two regularization terms, i.e., latent regularization and attention regularization, applied to the one-step backward-denoised latents and the attention maps, respectively. The former enlarges the latent discrepancy between the source and target videos in the editing region while reducing it in non-editing areas, thereby emphasizing the intended modification and suppressing unexpected content generation elsewhere. The latter suppresses the attention of editing-region tokens in the target video to the tokens of the corresponding region in the source video, mitigating interference when generating novel objects in the target video. Furthermore, we build ReCo-Data, a large-scale, high-quality video editing dataset comprising 500K instruction-video pairs, to facilitate model training. Extensive experiments on four major instruction-based video editing tasks demonstrate the superiority of our approach.
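The abstract gives no formulas, so the following is only a minimal sketch of how the two regularizers might be computed, assuming binary editing-region masks, PyTorch tensors, and a margin-based discrepancy loss. All function names, tensor shapes, and the margin/weighting choices are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of the two ReCo regularizers described in the abstract.
# All names, shapes, and hyperparameters below are assumptions for
# illustration only, not taken from the paper.
import torch
import torch.nn.functional as F

def latent_regularization(z_src, z_tgt, edit_mask, margin=1.0):
    """Enlarge the latent gap inside the editing region and shrink it
    outside, computed on one-step backward-denoised latents.

    z_src, z_tgt: (B, C, T, H, W) denoised latents of source/target video.
    edit_mask:    (B, 1, T, H, W) binary mask, 1 inside the editing region.
    """
    # Per-location squared discrepancy, averaged over channels.
    diff = (z_src - z_tgt).pow(2).mean(dim=1, keepdim=True)
    # Inside the editing region: push the gap up to at least `margin`.
    loss_edit = F.relu(margin - diff)[edit_mask.bool()].mean()
    # Outside the editing region: pull the gap toward zero to suppress
    # unintended changes to non-editing content.
    loss_keep = diff[~edit_mask.bool()].mean()
    return loss_edit + loss_keep

def attention_regularization(attn, tgt_edit_idx, src_edit_idx):
    """Penalize attention flowing from editing-region tokens of the
    target video to the corresponding-region tokens of the source video.

    attn: (B, heads, N, N) attention map over the width-wise
          concatenated [source | target] token sequence.
    tgt_edit_idx, src_edit_idx: 1-D index tensors of editing-region
          token positions in the target and source halves.
    """
    # Select target editing tokens as queries, source editing tokens as
    # keys, and drive this cross-attention "leak" toward zero.
    leak = attn[:, :, tgt_edit_idx][..., src_edit_idx]
    return leak.mean()
```

Under this reading, the margin in `latent_regularization` keeps the edit term from collapsing to a trivial zero-change solution, while `attention_regularization` directly implements the abstract's suppression of editing-region tokens attending to their source-video counterparts; how the paper actually balances the two terms against the diffusion loss is not stated in the abstract.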