SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Abstract

High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs (Multimodal Large Language Models) offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% compared with dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
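As a rough illustration of why tokenizing only occupied voxels shortens the sequence, the sketch below compares a dense grid with a sparse set of surface cells. It is not SIMART's Sparse 3D VQ-VAE: the hollow-cube shape, the grid size, and all function names are assumptions made for illustration, and the ratio it computes is a property of this toy shape, not the 70% figure reported above.

```python
from itertools import product


def surface_voxels(n):
    """Occupied cells of a hollow cube shell on an n^3 grid.

    Hypothetical stand-in for the voxelized surface of a mesh.
    """
    last = n - 1
    return {
        (x, y, z)
        for x, y, z in product(range(n), repeat=3)
        if 0 in (x, y, z) or last in (x, y, z)
    }


def dense_token_count(n):
    """A dense tokenizer emits one token per grid cell, occupied or not."""
    return n ** 3


def sparse_tokens(occupied):
    """A sparse tokenizer keeps only occupied cells, each with its coordinates.

    Sorting gives a deterministic raster order for the sequence.
    """
    return sorted(occupied)


def token_reduction(n):
    """Return (dense count, sparse count, fractional reduction) for the toy shape."""
    dense = dense_token_count(n)
    sparse = len(sparse_tokens(surface_voxels(n)))
    return dense, sparse, 1 - sparse / dense


if __name__ == "__main__":
    dense, sparse, saved = token_reduction(16)
    print(f"dense={dense} sparse={sparse} reduction={saved:.1%}")
```

Because most cells of a voxelized surface are empty, the gap between the two counts widens as grid resolution grows, which is the scaling pressure the abstract describes for complex multi-part objects.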


