BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
June 20, 2025
Authors: Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo
cs.AI
Abstract
We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.
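
The abstract names the two fine-tuning strategies (source masking and simulated object jittering) but not their implementation. The NumPy sketch below shows one plausible form of such a data-preparation step, assuming a single-object segmentation mask and simple pixel-space jitter; the function name prepare_training_pair and all parameters (mask_prob, jitter_prob, max_shift) are hypothetical, not from the paper.

import numpy as np

def prepare_training_pair(source_frame, target_frame, object_mask,
                          mask_prob=0.5, jitter_prob=0.5, max_shift=8,
                          rng=None):
    """Build one (source, target) training pair from two video frames.

    source_frame, target_frame: (H, W, 3) float arrays in [0, 1].
    object_mask: (H, W) bool array marking one foreground object.
    All names and default values here are illustrative guesses.
    """
    rng = rng if rng is not None else np.random.default_rng()
    src = source_frame.copy()

    # (i) Source masking: randomly blank the source background so the
    # compositor cannot copy it verbatim and learns to accept edits
    # such as background replacement.
    if rng.random() < mask_prob:
        src[~object_mask] = 0.0

    # (ii) Simulated object jittering: move the foreground object by a
    # small random offset while the background (camera) stays fixed,
    # decoupling object motion from camera motion during training.
    if rng.random() < jitter_prob:
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        background = np.where(object_mask[..., None], 0.0, src)
        foreground = np.where(object_mask[..., None], src, 0.0)
        shifted_fg = np.roll(foreground, shift=(dy, dx), axis=(0, 1))
        shifted_mask = np.roll(object_mask, shift=(dy, dx), axis=(0, 1))
        src = np.where(shifted_mask[..., None], shifted_fg, background)

    return src, target_frame

In this reading, the jittered source and unedited target form a supervised pair: the diffusion-based compositor sees an object displaced relative to a fixed background and must reproduce the coherent target, which is one way the paper's stated goal of disentangled object/camera control could be encouraged.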