P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning

Abstract

Multimodal large language models (MLLMs) can write complex programs and can use such programs to perform 3D modeling, which opens a new avenue for 3D generation driven by their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: given a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, and therefore reveals whether a model recovers a design's structure, not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output for executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. Our evaluation yields three findings. First, assemblies are the hardest setting: models still fail to compose multiple parts into a coherent structure. Second, models can often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the correct number of parts. These results position P3D-Bench as a benchmark for evaluating precise parametric geometry and part-level structure in parametric 3D generation.
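To make the contrast with a mesh concrete, the following is a minimal illustrative sketch of a parametric program with explicit dimensions, part relations and two toy part-level checks. Everything in it is an assumption introduced for illustration: the `Box`, `Cylinder` and `Assembly` classes, the `make_table` helper, and the `part_count_error` and `relative_volume_error` functions are hypothetical and are not the representation or the metrics used by P3D-Bench.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Box:
    """Hypothetical parametric primitive: a box with explicit dimensions."""
    name: str
    width: float
    depth: float
    height: float

    def volume(self) -> float:
        return self.width * self.depth * self.height


@dataclass(frozen=True)
class Cylinder:
    """Hypothetical parametric primitive: a cylinder with explicit dimensions."""
    name: str
    radius: float
    height: float

    def volume(self) -> float:
        return math.pi * self.radius ** 2 * self.height


@dataclass
class Assembly:
    """Parts plus explicit part relations, stored as (part_a, relation, part_b)."""
    parts: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def total_volume(self) -> float:
        return sum(part.volume() for part in self.parts)


def make_table(n_legs: int, leg_radius: float = 2.0) -> Assembly:
    """Build a toy table: one top and n_legs cylindrical legs (illustrative only)."""
    top = Box("top", width=100.0, depth=60.0, height=4.0)
    legs = [Cylinder(f"leg_{i}", leg_radius, 70.0) for i in range(n_legs)]
    relations = [(leg.name, "attached_below", top.name) for leg in legs]
    return Assembly(parts=[top, *legs], relations=relations)


def part_count_error(pred: Assembly, ref: Assembly) -> int:
    """Toy part-level check: absolute difference in number of parts."""
    return abs(len(pred.parts) - len(ref.parts))


def relative_volume_error(pred: Assembly, ref: Assembly) -> float:
    """Toy geometric check: relative difference in total volume."""
    return abs(pred.total_volume() - ref.total_volume()) / ref.total_volume()


reference = make_table(n_legs=4)
prediction = make_table(n_legs=3)  # plausible overall shape, but one leg is missing

print("part count error:", part_count_error(prediction, reference))
print("relative volume error:", round(relative_volume_error(prediction, reference), 4))
```

In this toy case the prediction still looks like a table and its total volume differs only slightly from the reference, yet the part count is wrong. This mirrors, in a simplified way, the gap between global shape and part-level structure that the findings describe.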