SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM
March 24, 2026
Authors: Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang
cs.AI
Abstract
High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent methods for creating articulated objects rely on multi-stage pipelines, which accumulate errors across decoupled modules. Unified multimodal large language models (MLLMs) offer a single-stage alternative that jointly handles static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% relative to dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
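To give a rough sense of why sparse tokenization shortens sequences, the sketch below compares a dense voxel grid, where every cell becomes a token, with a sparse one that keeps only occupied cells. It is a toy illustration, not SIMART's tokenizer: the spherical-shell shape, the one-token-per-voxel assumption, and all function names are hypothetical, and the reduction it computes for this toy shape is unrelated to the 70% figure reported above.

```python
def occupied_voxels(resolution):
    """Voxel indices lying on a thin spherical shell (a stand-in for a mesh surface)."""
    center = (resolution - 1) / 2.0
    radius = resolution * 0.4
    shell = set()
    for x in range(resolution):
        for y in range(resolution):
            for z in range(resolution):
                dist = ((x - center) ** 2 + (y - center) ** 2 + (z - center) ** 2) ** 0.5
                if abs(dist - radius) <= 0.5:
                    shell.add((x, y, z))
    return shell


def dense_token_count(resolution):
    """Assumption: a dense tokenizer emits one token per grid cell."""
    return resolution ** 3


def sparse_token_count(occupied):
    """Assumption: a sparse tokenizer emits one token per occupied cell only."""
    return len(occupied)


def reduction_ratio(dense, sparse):
    """Fraction of tokens removed when going from dense to sparse."""
    return 1.0 - sparse / dense


def reduced_length(dense_length, reduction):
    """Sequence length left after removing the given fraction of tokens."""
    return round(dense_length * (1.0 - reduction))


resolution = 16
shell = occupied_voxels(resolution)
dense = dense_token_count(resolution)
sparse = sparse_token_count(shell)
print(f"dense tokens: {dense}, sparse tokens: {sparse}")
print(f"toy reduction: {reduction_ratio(dense, sparse):.0%}")
```

As plain arithmetic on the reported number, a 70% reduction would turn a hypothetical 10,000-token dense sequence into `reduced_length(10000, 0.7)`, which is 3,000 tokens.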