Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Abstract

Inverse graphics is a longstanding and highly underconstrained problem: the goal is to reconstruct an image as an editable 3D scene that can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing the scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors, including geometry, materials, composition, and lighting, directly in executable Blender code space. We evaluate the framework across diverse scenes using reconstruction metrics that span pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase several downstream applications enabled by the reconstructed editable Blender scenes.
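
To make the idea of staged refinement in code space concrete, the following is a minimal sketch and not the SEIG implementation. The stage order follows the scene factors listed in the abstract; everything else is an assumption made for illustration: the `propose` callback (standing in for a VLM call that returns Blender Python code for one stage), the `score` callback (standing in for rendering the program and comparing it with the input image), the `rounds` parameter, and the keep-the-best-candidate rule. The stub callbacks only return fixed strings so that the sketch runs without Blender or a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Stage order taken from the scene factors named in the abstract.
STAGES = ["geometry", "materials", "composition", "lighting"]


@dataclass
class SceneProgram:
    """Blender Python source, stored as one code fragment per stage."""
    fragments: Dict[str, str] = field(default_factory=dict)

    def source(self) -> str:
        # Concatenate the fragments in stage order to form one script.
        return "\n".join(self.fragments[s] for s in STAGES if s in self.fragments)


def staged_reconstruct(
    image,
    propose: Callable[[object, str, str, int], str],
    score: Callable[[object, str], float],
    rounds: int = 2,
) -> SceneProgram:
    """Hypothetical staged loop: refine one scene factor at a time.

    For each stage, ask `propose` for `rounds` candidate fragments and keep
    the one whose full program gets the highest `score`.
    """
    program = SceneProgram()
    for stage in STAGES:
        best_code, best_score = None, float("-inf")
        for attempt in range(rounds):
            candidate = propose(image, stage, program.source(), attempt)
            trial = SceneProgram({**program.fragments, stage: candidate})
            s = score(image, trial.source())
            if s > best_score:  # strict: ties keep the earlier candidate
                best_code, best_score = candidate, s
        program.fragments[stage] = best_code
    return program


# --- Stubs so the sketch runs without Blender or a VLM (assumptions) ---
EXAMPLE_CODE = {
    "geometry": "bpy.ops.mesh.primitive_cube_add(size=2.0)",
    "materials": "mat = bpy.data.materials.new(name='Mat')",
    "composition": "bpy.context.object.location = (0.0, 0.0, 1.0)",
    "lighting": "bpy.ops.object.light_add(type='SUN')",
}


def stub_propose(image, stage, current_source, attempt):
    # A real system would query a VLM here; this returns a fixed line.
    return f"{EXAMPLE_CODE[stage]}  # attempt {attempt}"


def stub_score(image, source):
    # A real system would render and compare with the image; this is a dummy.
    return float(len(source))


demo_program = staged_reconstruct(None, stub_propose, stub_score, rounds=2)
print(demo_program.source())
```

Keeping one fragment per stage means an earlier stage's code is frozen once accepted, so later stages only search over their own scene factor. This is one simple way to realize task decomposition; the actual framework may organize its stages differently.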