Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Abstract

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes that can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors, including geometry, materials, composition, and lighting, directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.
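To make the idea of staged refinement concrete, the following is a minimal sketch of a stage-by-stage search loop, not the SEIG implementation. The stage order follows the scene factors listed in the abstract; everything else is an assumption made for illustration: the `Program` representation (one code snippet per stage), the `propose` callable standing in for a VLM that suggests candidate Blender snippets, the `score` callable standing in for a render-and-compare metric, and the `rounds_per_stage` parameter.

```python
from typing import Callable, Dict, List

# Stage order taken from the scene factors named in the abstract.
STAGES: List[str] = ["geometry", "materials", "composition", "lighting"]

# Hypothetical representation: stage name -> Blender Python snippet (as text).
Program = Dict[str, str]


def staged_reconstruction(
    propose: Callable[[str, Program], List[str]],
    score: Callable[[Program], float],
    rounds_per_stage: int = 2,
) -> Program:
    """Refine one scene factor at a time, keeping a candidate only if it scores higher.

    `propose` and `score` are placeholders (assumptions): in practice they would
    wrap a VLM call and a render-plus-metric step, respectively.
    """
    program: Program = {}
    for stage in STAGES:
        best_score = float("-inf")
        for _ in range(rounds_per_stage):
            for snippet in propose(stage, program):
                # Earlier stages stay fixed; only the current stage is varied.
                candidate = {**program, stage: snippet}
                candidate_score = score(candidate)
                if candidate_score > best_score:
                    best_score = candidate_score
                    program[stage] = snippet
    return program
```

In this sketch each stage is optimized while earlier stages stay fixed, which is one simple way to realize the task decomposition the abstract describes; the actual framework may interleave or revisit stages differently.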