3DCodeBench: Benchmarking Agentic Procedural 3D Modeling via Code

Abstract

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets, properties that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents on procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a platform that ranks generated 3D outputs by pairwise human preference. Our extensive evaluations show that: (1) failures mostly arise from API mismatches, and outputs that do render successfully still suffer from disconnected or floating 3D geometric components; (2) test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. These findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, the evaluation protocol, and the public 3DCodeArena platform, as a foundational toolkit for exploring VLM-based procedural 3D modelers.