

OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models

March 23, 2025
Authors: Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, Rami Ben-Ari
cs.AI

Abstract

Omnimatte aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. We accomplish this by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. We then show that self-attention maps capture information about the object and its footprints and use them to inpaint the object's effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.
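The "simple latent arithmetic" mentioned in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' implementation: it assumes the object layer can be approximated as the difference between the original video's latent and the object-removed background latent, and that this difference can be added onto a new background's latent to composite the object (with its effects) into a new video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent tensors with an assumed (frames, channels, height, width)
# layout; a real video diffusion model's latents would come from its VAE.
z_video = rng.standard_normal((4, 8, 16, 16))       # original video latent
z_background = rng.standard_normal((4, 8, 16, 16))  # object-removed background latent

# Isolate the object layer (object plus effects such as shadow/reflection)
# as the residual between the full video and the clean background.
z_object = z_video - z_background

# Composite the isolated object layer onto a new background's latent.
z_new_background = rng.standard_normal((4, 8, 16, 16))
z_composited = z_new_background + z_object

# Sanity check: removing the new background recovers the object layer.
assert np.allclose(z_composited - z_new_background, z_object)
```

Decoding `z_composited` through the diffusion model's decoder would then yield the new video; the actual method additionally uses self-attention maps to inpaint the object's footprint, which this sketch omits.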

