P3D-Bench: A Benchmark of Multimodal Large Language Models for Parametric 3D Generation and Structural Reasoning

Abstract

Multimodal large language models (MLLMs) can write complex programs, including programs that perform 3D modeling, which opens a new avenue for 3D generation that draws on their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: given a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, revealing whether a model recovers a design's structure and not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output for executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. The evaluation yields three findings. First, assemblies are the hardest setting: models still fail to compose multiple parts into a coherent structure. Second, models can often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the correct number of parts. These results position P3D-Bench as a benchmark for evaluating precise parametric geometry and part-level structure in parametric 3D generation.
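To make the contrast with a mesh concrete, the following is a minimal sketch of what a parametric 3D program exposes: explicit dimensions, named construction operations and part relations. It is an illustration only and not P3D-Bench's program format or scoring code; the names `Box`, `build_table`, `rests_on` and `part_count_matches`, and all default dimensions, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """One named part produced by a single construction operation (an axis-aligned box)."""
    name: str
    size: tuple[float, float, float]                      # explicit dimensions (x, y, z)
    origin: tuple[float, float, float] = (0.0, 0.0, 0.0)  # minimum corner of the box

    def volume(self) -> float:
        x, y, z = self.size
        return x * y * z

    def top_z(self) -> float:
        return self.origin[2] + self.size[2]


def build_table(top_w: float = 80.0, top_d: float = 50.0, top_t: float = 4.0,
                leg: float = 5.0, leg_h: float = 60.0) -> list[Box]:
    """Hypothetical parametric program: a table top resting on four square legs."""
    corners = [(0.0, 0.0), (top_w - leg, 0.0),
               (0.0, top_d - leg), (top_w - leg, top_d - leg)]
    legs = [Box(f"leg_{i}", (leg, leg, leg_h), (x, y, 0.0))
            for i, (x, y) in enumerate(corners)]
    # The top's position is derived from leg_h, so the part relation is explicit.
    top = Box("top", (top_w, top_d, top_t), (0.0, 0.0, leg_h))
    return legs + [top]


def rests_on(upper: Box, lower: Box, tol: float = 1e-9) -> bool:
    """Assumed part-relation check: the upper part's base sits on the lower part's top face."""
    return abs(upper.origin[2] - lower.top_z()) <= tol


def part_count_matches(predicted: list[Box], reference: list[Box]) -> bool:
    """Assumed part-level check: both assemblies contain the same number of parts."""
    return len(predicted) == len(reference)
```

Because every dimension is a named parameter, changing `leg_h` moves the top along with the legs, and a check such as `rests_on` can inspect the structure directly instead of inferring it from triangles.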