P3D-Bench: Benchmarking MLLMs on Parametric 3D Generation and Structural Reasoning

Abstract

Multimodal large language models (MLLMs) can write code that produces complex programs, and they can use such programs to do 3D modeling, which opens a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: from a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, revealing whether a model recovers a design's structure and not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output for executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. Our evaluation yields three findings. First, assemblies are the hardest setting: models still fail to compose multiple parts into a coherent structure. Second, models can often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the right number of parts. These results position P3D-Bench as a benchmark for evaluating precise parametric geometry and part-level structure in parametric 3D generation.
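To make the contrast with a mesh concrete, the following is a minimal sketch of what "explicit dimensions, construction operations and part relations" can look like in code. It is an illustration only: the class names (`Box`, `Cylinder`, `Cut`, `Mate`, `Assembly`) and the example part are hypothetical and are not the program format, CAD kernel or scoring code used by P3D-Bench.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Primitive with explicit dimensions (hypothetical, units in mm)."""
    length: float
    width: float
    height: float

    def volume(self) -> float:
        return self.length * self.width * self.height


@dataclass(frozen=True)
class Cylinder:
    """Primitive with explicit dimensions (hypothetical, units in mm)."""
    radius: float
    height: float

    def volume(self) -> float:
        return math.pi * self.radius ** 2 * self.height


@dataclass(frozen=True)
class Cut:
    """Construction operation: subtract `tool` from `base`.

    Assumption: the tool lies entirely inside the base, so the
    resulting volume is a simple difference.
    """
    base: Box
    tool: Cylinder

    def volume(self) -> float:
        return self.base.volume() - self.tool.volume()


@dataclass(frozen=True)
class Mate:
    """Part relation: `child` is placed at `offset` relative to `parent`."""
    parent: str
    child: str
    offset: tuple


@dataclass(frozen=True)
class Assembly:
    parts: dict
    mates: tuple

    def part_count(self) -> int:
        return len(self.parts)


# A 40 x 20 x 5 plate with a through-hole of radius 3, plus a peg in the hole.
plate = Cut(base=Box(40.0, 20.0, 5.0), tool=Cylinder(3.0, 5.0))
peg = Cylinder(radius=3.0, height=12.0)
assembly = Assembly(
    parts={"plate": plate, "peg": peg},
    mates=(Mate(parent="plate", child="peg", offset=(0.0, 0.0, 0.0)),),
)

print(assembly.part_count())
print(round(plate.volume(), 3))
```

Because the dimensions, the cut operation and the mate are all explicit in this sketch, properties such as part count or per-part volume can be read from the program directly, whereas a triangle mesh of the same object would expose only its surface.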