InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
December 19, 2025
Authors: Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo
cs.AI
Abstract
Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates the user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building on this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations, such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of an object-removed video, an object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming both existing research methods and commercial models.
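To make the 4D-aware mask generation step concrete, here is a minimal sketch of the underlying idea: an object placed once by the user, fixed in world coordinates, is projected through each frame's estimated camera pose and depth map, and pixels where the reconstructed scene sits closer to the camera than the object are suppressed. All names (`project_points`, `occlusion_aware_mask`, the `eps` tolerance) and the exact formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world points into pixel coordinates.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    Returns (Nx2 pixel coordinates, N camera-space depths).
    """
    cam = points_3d @ R.T + t            # world -> camera coordinates
    uv = cam @ K.T                       # pinhole projection
    z = np.clip(uv[:, 2:3], 1e-6, None)  # avoid divide-by-zero behind the camera
    return uv[:, :2] / z, cam[:, 2]

def occlusion_aware_mask(object_points, scene_depth, K, R, t, eps=0.05):
    """Rasterize a placed object's surface points into a binary mask for one
    frame, hiding pixels where the reconstructed scene surface is closer to
    the camera than the object (i.e., the object is occluded there)."""
    h, w = scene_depth.shape
    mask = np.zeros((h, w), dtype=bool)
    px, z = project_points(object_points, K, R, t)
    px = np.round(px).astype(int)
    # keep only points in front of the camera that land inside the image
    ok = (z > 0) & (px[:, 0] >= 0) & (px[:, 0] < w) \
                 & (px[:, 1] >= 0) & (px[:, 1] < h)
    px, z = px[ok], z[ok]
    # visible where the object is not behind the scene surface (with tolerance)
    visible = z <= scene_depth[px[:, 1], px[:, 0]] + eps
    mask[px[visible, 1], px[visible, 0]] = True
    return mask
```

Because the object geometry stays fixed in world space, running this per frame with that frame's estimated (K, R, t) and depth map yields masks that are temporally coherent by construction, which is the property the abstract attributes to the 4D-aware module.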
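The ROSE++ triplets can likewise be pictured as simple records paired into supervised examples: the model is conditioned on the object-removed video, a placement mask, and the reference image, and supervised against the object-present video. The field names and the `to_training_pair` helper below are hypothetical; the abstract only specifies the three components of each triplet.

```python
from dataclasses import dataclass

@dataclass
class RoseTriplet:
    """One ROSE++ example, as described in the abstract.
    Field names are illustrative, not the dataset's actual schema."""
    object_removed: str   # background-only clip with the object erased
    object_present: str   # ground-truth clip containing the object
    reference_image: str  # VLM-generated reference image of the object

def to_training_pair(t: RoseTriplet, mask):
    """Hypothetical pairing: condition on the object-removed video, the
    placement mask, and the reference image; supervise reconstruction of
    the object-present video, including its lighting and shadow effects."""
    condition = {"video": t.object_removed, "mask": mask, "ref": t.reference_image}
    target = t.object_present
    return condition, target
```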