BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing

Abstract

We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting visual inputs and converting them into editable 3D entities (layering), (ii) editing those entities in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene with a generative compositor (compositing). The generative compositor extends a pre-trained diffusion model to process the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, which enables flexible modifications such as background replacement; and (ii) simulated object jittering, which facilitates disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods on complex compositional scene editing tasks.
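The two training strategies named in the abstract can be pictured as augmentations applied to a (source, target) pair of video frames. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `SceneCondition`, `apply_source_masking`, `apply_object_jitter`, `make_training_pair`, and the parameters `p_mask` and `max_shift` are all hypothetical names, and the choice to jitter the source-side object pose is an assumption made for this example.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class SceneCondition:
    """Toy stand-in for one stream's conditioning (hypothetical structure)."""
    image: Optional[str]                       # placeholder for pixel data; None means masked
    object_offset: Tuple[float, float, float]  # object translation in scene units


def apply_source_masking(source: SceneCondition, p_mask: float,
                         rng: random.Random) -> SceneCondition:
    # With probability p_mask, hide the source image so a model trained on
    # such pairs cannot simply copy pixels from the source stream.
    if rng.random() < p_mask:
        return replace(source, image=None)
    return source


def apply_object_jitter(cond: SceneCondition, max_shift: float,
                        rng: random.Random) -> SceneCondition:
    # Perturb the object pose by a random offset that is independent of any
    # camera motion (assumption: a uniform shift per axis).
    dx, dy, dz = (rng.uniform(-max_shift, max_shift) for _ in range(3))
    x, y, z = cond.object_offset
    return replace(cond, object_offset=(x + dx, y + dy, z + dz))


def make_training_pair(source: SceneCondition, target: SceneCondition,
                       p_mask: float = 0.5, max_shift: float = 0.1,
                       rng: Optional[random.Random] = None):
    """Return an augmented (source, target) pair; the target is left unchanged."""
    rng = rng or random.Random()
    source = apply_object_jitter(source, max_shift, rng)
    source = apply_source_masking(source, p_mask, rng)
    return source, target
```

In this toy setup, setting `p_mask=1.0` always hides the source image, while `p_mask=0.0` together with `max_shift=0.0` returns the pair unchanged.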

