SIMART: Decomposing Monolithic Meshes into Sim-Ready Articulated Assets via Multimodal Large Language Models

Abstract

High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent methods for creating articulated objects rely on multi-stage pipelines that accumulate errors across decoupled modules. Unified multimodal large language models (MLLMs) offer an alternative: a single-stage path that jointly handles static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, which limits scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces the token count by 70% relative to dense voxel tokenization, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and on in-the-wild AIGC datasets, and it enables physics-based robotic simulation.
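
As a rough illustration of why sparse tokenization shortens 3D token sequences, the sketch below counts tokens for a toy shape under two schemes: one token per voxel patch (dense) versus one token per patch that contains any occupied voxel (sparse). This is not SIMART's Sparse 3D VQ-VAE. The hollow-cube shape, the grid resolution, the patch size, and all function names are assumptions made for illustration, and the reduction it prints is specific to this toy shape rather than the 70% figure reported above.

```python
from itertools import product


def shell_voxels(res):
    """Occupied voxels of a hollow cube surface on a res^3 grid.

    Hypothetical stand-in for a voxelized mesh surface.
    """
    edge = {0, res - 1}
    return {
        (x, y, z)
        for x, y, z in product(range(res), repeat=3)
        if x in edge or y in edge or z in edge
    }


def dense_token_count(res, patch):
    """Dense scheme: one token per patch, whether or not it is occupied."""
    return (res // patch) ** 3


def sparse_token_count(occupied, patch):
    """Sparse scheme: one token per patch holding at least one occupied voxel."""
    return len({(x // patch, y // patch, z // patch) for x, y, z in occupied})


# Assumed toy settings: a 32^3 grid split into 4^3 patches.
res, patch = 32, 4
occupied = shell_voxels(res)
dense = dense_token_count(res, patch)
sparse = sparse_token_count(occupied, patch)
reduction = 1.0 - sparse / dense
print(f"dense={dense} sparse={sparse} reduction={reduction:.1%}")
```

Because a mesh surface occupies only a thin layer of the grid, the saving in this toy setting grows with resolution: empty interior and exterior patches scale cubically, while surface patches scale roughly quadratically.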