ArtLLM: Generating Articulated Assets via 3D LLM
March 1, 2026
Authors: Penghao Wang, Siyuan Xie, Hongyu Yan, Xianghui Yang, Jingwei Huang, Chunchao Guo, Jiayuan Gu
cs.AI
Abstract
Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object's point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.