P3D-Bench: Benchmarking Parametric 3D Generation and Structural Reasoning in Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) can write complex programs and use those programs for 3D modeling, which opens a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: from a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, revealing whether a model recovers a design's structure, not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output along six dimensions: executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. Our evaluation yields three findings. First, assemblies are the hardest setting: models still fail to compose multiple parts into a coherent structure. Second, models can often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the right number of parts. These results position P3D-Bench as a benchmark for evaluating precise parametric geometry and part-level structure in parametric 3D generation.