Region-Constraint In-Context Generation for Instructional Video Editing

December 19, 2025
Authors: Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei
cs.AI

Abstract

The in-context generation paradigm has recently demonstrated strong capability in instructional image editing, offering both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without explicitly specified editing regions, the results can suffer from inaccurate editing regions and from token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that delves into constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos along the width dimension for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, applied to the one-step backward denoised latents and the attention maps, respectively. The former increases the latent discrepancy between source and target videos inside the editing region while reducing it in non-editing areas, emphasizing the modification of the editing area and suppressing unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to their counterpart tokens in the source video, thereby mitigating interference during novel object generation in the target video. Furthermore, we construct ReCo-Data, a large-scale, high-quality video editing dataset comprising 500K instruction-video pairs to facilitate model training. Extensive experiments on four major instruction-based video editing tasks demonstrate the superiority of our approach.
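
To make the described method concrete, below is a minimal PyTorch sketch of the three ingredients the abstract names: width-wise concatenation for joint denoising, the latent regularization term, and the attention regularization term. Everything here is an illustrative assumption rather than the paper's actual implementation; the function names, the hinge margin, the mask layout, and the index-based token selection are ours.

    import torch
    import torch.nn.functional as F

    def joint_latents(z_src, z_tgt):
        # Width-wise concatenation of source and target video latents:
        # (B, C, T, H, W) each -> (B, C, T, H, 2W) for joint denoising.
        return torch.cat([z_src, z_tgt], dim=-1)

    def latent_regularization(z_src, z_tgt, mask, margin=1.0):
        # Applied to the one-step backward denoised latents.
        # mask: (B, 1, T, H, W) binary editing-region mask (1 = edited).
        # Pushes the source/target discrepancy up toward `margin` inside
        # the editing region and down toward zero outside it.
        diff = (z_tgt - z_src).pow(2)
        edit_mean = (diff * mask).sum() / mask.sum().clamp(min=1.0)
        keep_mean = (diff * (1.0 - mask)).sum() / (1.0 - mask).sum().clamp(min=1.0)
        # Hinge: stop enlarging the editing-region gap once it exceeds the margin.
        return F.relu(margin - edit_mean) + keep_mean

    def attention_regularization(attn, edit_q_idx, src_k_idx):
        # attn: (B, heads, Q, K) attention probabilities over the token
        # sequence of the concatenated video. Penalizes the mass that
        # editing-region queries of the target half place on keys of the
        # source half, discouraging copying during novel object generation.
        a = attn[..., edit_q_idx, :][..., src_k_idx]  # (B, heads, |edit|, |src|)
        return a.mean()

    # Toy usage with random latents and a square editing-region mask.
    B, C, T, H, W = 1, 4, 8, 32, 32
    z_src, z_tgt = torch.randn(B, C, T, H, W), torch.randn(B, C, T, H, W)
    mask = torch.zeros(B, 1, T, H, W)
    mask[..., 8:24, 8:24] = 1.0
    loss = latent_regularization(z_src, z_tgt, mask)

The hinge on the editing-region term is one plausible way to "increase the discrepancy" without an unbounded loss; the paper may use a different formulation.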