3DCodeBench: Benchmarking Agentic Procedural 3D Modeling via Code

Abstract

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.
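
As a rough illustration of how pairwise human preferences can be turned into a ranking, the sketch below applies a standard Elo-style update to a list of votes. The abstract does not say which rating method 3DCodeArena uses, so the choice of Elo, the `elo_rank` function, the `k` and `base` parameters, and the model names are all assumptions made for illustration only.

```python
from collections import defaultdict


def elo_rank(votes, k=32.0, base=1000.0):
    """Rank models from pairwise votes with an Elo-style update.

    votes: iterable of (winner, loser) model-name pairs (hypothetical format).
    Returns a list of (model, rating) sorted from highest to lowest rating.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability that the winner would win under the logistic Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # The less expected the win, the larger the rating transfer.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical votes: each pair means the first model's 3D output was preferred.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ranking = elo_rank(votes)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on the order in which votes arrive. A Bradley-Terry fit over all votes at once is a common order-independent alternative, though that too would be an assumption about the platform rather than something stated here.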