Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Abstract

Inverse graphics is a longstanding and highly underconstrained problem: the goal is to recover, from an image, an editable 3D scene that can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing the scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors, including geometry, materials, composition, and lighting, directly in executable Blender code space. We evaluate the framework on diverse scenes using reconstruction metrics that span pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase several downstream applications enabled by the reconstructed, editable Blender scenes.
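
As a rough illustration of staged refinement in code space, the following minimal sketch passes a Blender script through one refinement step per scene factor. It is not the SEIG implementation. The names `staged_reconstruction`, `Refiner`, and `stub_refiner` are hypothetical, the stub stands in for a VLM call, and the stage order simply follows the order in which the abstract lists the scene factors.

```python
from typing import Callable, Sequence

# Scene factors named in the abstract; using them in this order is an assumption.
STAGES = ("geometry", "materials", "composition", "lighting")

# Hypothetical interface: a refiner receives the stage name and the current
# Blender script text, and returns a revised script. In practice this would
# wrap a VLM call; that part is not specified here.
Refiner = Callable[[str, str], str]


def staged_reconstruction(refine: Refiner, stages: Sequence[str] = STAGES) -> str:
    """Refine a Blender script one scene factor at a time."""
    script = "import bpy\n"
    for stage in stages:
        # Each stage sees the script produced by all earlier stages.
        script = refine(stage, script)
    return script


def stub_refiner(stage: str, script: str) -> str:
    """Placeholder for a VLM: appends a marker comment for the stage."""
    return script + f"# TODO({stage}): code proposed by the VLM goes here\n"


program = staged_reconstruction(stub_refiner)
print(program)
```

The point of the sketch is only the decomposition: each stage edits the same executable program, so later stages build on the script produced by earlier ones instead of regenerating the whole scene in one step.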