Sketch3DVE: Sketch-Based 3D-Aware Scene Video Editing

Abstract

Recent video editing methods achieve compelling results in style transfer and appearance modification. However, editing the structural content of 3D scenes in videos remains challenging, particularly under significant viewpoint changes such as large camera rotations or zooms. The key challenges are generating novel-view content that stays consistent with the original video, preserving unedited regions, and translating sparse 2D inputs into realistic 3D video outputs. To address these issues, we propose Sketch3DVE, a sketch-based, 3D-aware video editing method that enables detailed local manipulation of videos with significant viewpoint changes. To handle sparse inputs, we employ image editing methods to generate the edited result for the first frame, which is then propagated to the remaining frames of the video. We use sketching as the interaction tool for precise geometry control, while other mask-based image editing methods are also supported. To handle viewpoint changes, we analyze and manipulate the 3D information in the video in detail. Specifically, we use a dense stereo method to estimate a point cloud and the camera parameters of the input video. We then propose a point cloud editing approach that uses depth maps to represent the 3D geometry of newly edited components and aligns them with the original 3D scene. To seamlessly merge the newly edited content with the original video while preserving the features of unedited regions, we introduce a 3D-aware mask propagation strategy and employ a video diffusion model to produce realistic edited videos. Extensive experiments demonstrate the superiority of Sketch3DVE in video editing. Homepage and code: http://geometrylearning.com/Sketch3DVE/
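
The abstract does not spell out how the 3D-aware mask propagation is computed. As a rough illustration of the underlying geometry only, the sketch below lifts the masked pixels of the first frame to 3D using a depth map and reprojects them into another camera view. It assumes a pinhole camera with intrinsics `K`, a per-pixel depth map, and 4x4 pose matrices; the function names `unproject` and `propagate_mask` are hypothetical and are not taken from the Sketch3DVE code. Occlusion handling and hole filling are omitted.

```python
import numpy as np

def unproject(depth, K, cam_to_world):
    """Lift every pixel of a depth map to a 3D point in world coordinates."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v = row index, u = column index
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T        # camera-space directions with z = 1
    pts_cam = rays * depth.reshape(-1, 1)  # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]

def propagate_mask(mask, depth, K, cam_to_world_src, world_to_cam_dst, out_shape):
    """Reproject the masked pixels of the source frame into a target view."""
    # Keep only the 3D points whose source pixel lies inside the edit mask.
    pts = unproject(depth, K, cam_to_world_src)[mask.reshape(-1)]
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam_dst.T)[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]  # drop points behind the camera
    proj = pts_cam @ K.T
    uv = np.rint(proj[:, :2] / proj[:, 2:3]).astype(int)  # nearest target pixel
    h, w = out_shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    out = np.zeros(out_shape, dtype=bool)
    out[uv[inside, 1], uv[inside, 0]] = True
    return out

if __name__ == "__main__":
    # Toy setup (assumed values): fronto-parallel plane at depth 2.
    K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
    depth = np.full((64, 64), 2.0)
    mask = np.zeros((64, 64), dtype=bool)
    mask[20:30, 20:30] = True
    world_to_cam = np.eye(4)
    world_to_cam[0, 3] = -0.2  # target camera moved 0.2 units along +x
    moved = propagate_mask(mask, depth, K, np.eye(4), world_to_cam, (64, 64))
    print(np.argwhere(moved).min(axis=0), np.argwhere(moved).max(axis=0))
```

In practice the depth map and poses would come from the dense stereo estimate mentioned above, and a forward splat like this one leaves gaps when the camera zooms in, so a real pipeline would need some form of densification before feeding the mask to the video diffusion model.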