InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

December 19, 2025
Authors: Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo
cs.AI

Abstract

Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building on this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations, such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object-removal dataset into triplets of an object-removed video, an object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.
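
To make the 4D-aware mask step concrete, below is a minimal sketch in Python of the occlusion reasoning the abstract describes, assuming a 4D reconstruction already provides per-frame depth maps, intrinsics, and world-to-camera poses; the function names, the sampled object point cloud, and the `eps` depth tolerance are all illustrative assumptions, not the authors' implementation.

```python
# Sketch (not the authors' code): propagate a user-specified 3D object
# placement across frames and mark pixels where the reconstructed scene
# surface lies in front of the object, i.e. where the object is occluded.
import numpy as np

def project_points(points_w, K, w2c):
    """Project Nx3 world points with a 3x3 intrinsic matrix K and a 4x4
    world-to-camera pose; returns pixel coordinates (Nx2) and depths (N,)."""
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)
    pts_c = (w2c @ pts_h.T).T[:, :3]               # camera-space points
    z = pts_c[:, 2]
    uv = (K @ pts_c.T).T[:, :2] / np.clip(z[:, None], 1e-6, None)
    return uv, z

def object_masks(points_w, depths, Ks, w2cs, eps=0.02):
    """Per-frame visibility masks (T, H, W): True where the placed object
    is visible, False where the scene depth says it is occluded."""
    T, H, W = depths.shape
    masks = np.zeros((T, H, W), dtype=bool)
    for t in range(T):
        uv, z = project_points(points_w, Ks[t], w2cs[t])
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        keep = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        u, v, zk = u[keep], v[keep], z[keep]
        visible = zk <= depths[t, v, u] + eps      # object in front of scene
        masks[t, v[visible], u[visible]] = True
    return masks
```

Under these assumptions, the resulting per-frame masks would serve as the geometry-consistent spatial conditioning on which the extended diffusion model then synthesizes the inserted object together with its local illumination and shading changes.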