
OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models

March 23, 2025
Authors: Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, Rami Ben-Ari
cs.AI

Abstract

Omnimatte aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. We accomplish this by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. We then show that self-attention maps capture information about the object and its footprints and use them to inpaint the object's effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.
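To make the "simple latent arithmetic" step more concrete, here is a minimal sketch of how object layers could be isolated and recombined in latent space. It assumes latents already encoded by a video diffusion model's VAE (not shown); the tensor shapes and function names are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of latent arithmetic for object-layer isolation and
# recomposition, as described in the abstract. All names and shapes here
# are assumptions for illustration; real latents would come from a
# pre-trained video VAE encoder.
import torch


def extract_object_latent(video_latent: torch.Tensor,
                          background_latent: torch.Tensor) -> torch.Tensor:
    """Isolate an object layer (including its effects) by subtracting the
    object-removed background latent from the original video latent."""
    return video_latent - background_latent


def composite_object_latent(object_latent: torch.Tensor,
                            new_background_latent: torch.Tensor) -> torch.Tensor:
    """Recombine an isolated object layer with a new background latent."""
    return new_background_latent + object_latent


# Toy example with random tensors standing in for VAE latents of shape
# (frames, channels, height, width).
f, c, h, w = 8, 4, 32, 32
video_latent = torch.randn(f, c, h, w)           # original clip (object present)
background_latent = torch.randn(f, c, h, w)      # same clip after object removal
new_background_latent = torch.randn(f, c, h, w)  # latent of a different clip

object_latent = extract_object_latent(video_latent, background_latent)
composited = composite_object_latent(object_latent, new_background_latent)
print(composited.shape)  # torch.Size([8, 4, 32, 32]); decode with the VAE for frames
```

In the paper's pipeline, the background latent would come from the object-removal step (zero-shot inpainting guided by self-attention maps); the sketch above only illustrates the subsequent add/subtract composition in latent space.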
