**EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing**
March 19, 2026
Authors: Yang Fu, Yike Zheng, Ziyun Dai, Henghui Ding
cs.AI
Abstract
Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset of diverse paired videos. Each pair consists of one video in which the target object is present with its effects and a counterpart in which the object and its effects are absent, together with the corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. It also uses an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.
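As a rough illustration of the paired-data idea described in the abstract, the sketch below shows how one video pair could supply training examples for both removal and its inverse task, insertion. It is not the authors' implementation: the names `VideoPair`, `make_task_examples`, and `affected_region`, the array layout, and the pixel-difference threshold are all assumptions made for this example.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class VideoPair:
    """One paired sample (hypothetical layout, not the actual VOR format)."""
    with_object: np.ndarray     # (T, H, W, 3) floats in [0, 1], object and effects present
    without_object: np.ndarray  # (T, H, W, 3) floats in [0, 1], object and effects absent
    object_mask: np.ndarray     # (T, H, W) bools, True on the object itself


def make_task_examples(pair: VideoPair):
    """Build a removal example and its inverse insertion example from one pair."""
    removal = {"task": "remove", "source": pair.with_object,
               "target": pair.without_object, "mask": pair.object_mask}
    insertion = {"task": "insert", "source": pair.without_object,
                 "target": pair.with_object, "mask": pair.object_mask}
    return removal, insertion


def affected_region(pair: VideoPair, threshold: float = 0.05) -> np.ndarray:
    """Assumed heuristic: pixels that differ between the two videos.

    This covers the object plus side effects such as shadows that
    fall outside the object mask.
    """
    diff = np.abs(pair.with_object - pair.without_object).max(axis=-1)
    return diff > threshold


# Tiny synthetic pair: one object pixel and one shadow pixel outside the mask.
background = np.full((1, 2, 2, 3), 0.5)
scene = background.copy()
scene[0, 0, 0] = 1.0   # the object
scene[0, 0, 1] = 0.3   # its shadow
mask = np.zeros((1, 2, 2), dtype=bool)
mask[0, 0, 0] = True

pair = VideoPair(scene, background, mask)
removal, insertion = make_task_examples(pair)
region = affected_region(pair)
effects_only = region & ~pair.object_mask  # effect pixels the mask alone misses
```

In this toy pair, `effects_only` marks the shadow pixel, which shows why supervision restricted to the object mask would miss effect regions.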