Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Abstract

Inverse graphics is a longstanding and highly underconstrained problem: recovering from an image an editable 3D scene that can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing the scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors (geometry, materials, composition, and lighting) directly in the space of executable Blender code. We evaluate the framework across diverse scenes using reconstruction metrics that span pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase several downstream applications enabled by the reconstructed, editable Blender scenes.
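To make the idea of staged refinement concrete, the following is a minimal sketch of one way such a loop could be organized. It is not the SEIG implementation. The stage order follows the abstract, but `SceneProgram`, `staged_reconstruction`, the `propose` and `score` callables, and the accept-only-if-improved rule are all hypothetical and introduced here for illustration. In a real system, `propose` would presumably query a VLM for revised Blender Python code and `score` would render that code and compare the result with the input image; here both are replaced by toy stand-ins so the sketch runs without Blender.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stage order taken from the abstract; everything else is an assumption.
STAGES = ["geometry", "materials", "composition", "lighting"]


@dataclass
class SceneProgram:
    """Scene source code accumulated across stages (hypothetical container)."""
    code: str = ""
    history: List[str] = field(default_factory=list)


def staged_reconstruction(
    image: object,
    propose: Callable[[object, str, str], str],
    score: Callable[[object, str], float],
    rounds_per_stage: int = 2,
) -> SceneProgram:
    """Refine one scene factor at a time, keeping a candidate only if it scores higher."""
    program = SceneProgram()
    for stage in STAGES:
        best_code = program.code
        best_score = score(image, best_code)
        for _ in range(rounds_per_stage):
            # Hypothetical: ask the model to revise the current program for this stage.
            candidate = propose(image, stage, best_code)
            candidate_score = score(image, candidate)
            if candidate_score > best_score:
                best_code, best_score = candidate, candidate_score
        program.code = best_code
        program.history.append(stage)
    return program


# Toy stand-ins so the loop can be exercised without a VLM or Blender.
def toy_propose(image: object, stage: str, code: str) -> str:
    return code + f"# {stage} pass\n"


def toy_score(image: object, code: str) -> float:
    return float(len(code))


result = staged_reconstruction(None, toy_propose, toy_score)
print(result.history)
```

Passing `propose` and `score` as parameters keeps the stage loop independent of any particular model or renderer, so the same control flow could in principle be reused with different back ends.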