

BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing

June 20, 2025
作者: Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo
cs.AI

Abstract

We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.
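The two training strategies described above can be illustrated with a toy sketch. This is not the paper's implementation — the function names, the 2D toy frame, and the translation-only jitter are illustrative assumptions; it only shows the intuition of blanking source background pixels (source masking) and perturbing object poses while the camera stays fixed (object jittering).

```python
import numpy as np

rng = np.random.default_rng(0)

def source_mask(source, obj_mask, p_bg=0.5):
    """Hypothetical sketch of 'source masking': randomly blank the
    source background so the compositor learns to take it from the
    edited (target) stream, enabling edits like background replacement."""
    out = source.copy()
    if rng.random() < p_bg:
        out[~obj_mask] = 0.0  # drop background pixels
    return out

def jitter_objects(obj_params, sigma=0.02):
    """Hypothetical sketch of 'simulated object jittering': perturb
    per-object poses (here, just translations) while the camera is
    untouched, encouraging disentangled object/camera control."""
    noise = rng.normal(0.0, sigma, size=obj_params.shape)
    return obj_params + noise

# Toy data: a 4x4 "frame" with a 2x2 object region marked by a mask.
frame = np.ones((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

masked = source_mask(frame, mask, p_bg=1.0)  # force masking for the demo
print(masked[0, 0], masked[1, 1])  # background zeroed, object preserved

poses = np.zeros((3, 3))           # three objects, xyz translations
jittered = jitter_objects(poses)   # small random pose perturbations
print(jittered.shape)
```

In the actual system these augmentations are applied to video frames when fine-tuning the diffusion-based compositor; the sketch only mirrors the data-side idea.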