SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Abstract

High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified multimodal large language models (MLLMs) offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% compared with dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
