WorldAct: Activating Monolithic 3D Worlds into Interactive Object-Centric Scenes

Abstract

Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.
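
To make "collision-aware manipulation" concrete, the following is a minimal sketch and not the WorldAct implementation. It assumes each decomposed object is approximated by an axis-aligned bounding box, which is our simplification of the object-level meshes; the names `Box`, `overlaps`, and `try_move` are hypothetical. An object-level edit (a translation) is accepted only if the moved object does not intersect any other object.

```python
from dataclasses import dataclass, replace

Vec3 = tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box: an assumed stand-in for an object-level mesh."""
    name: str
    lo: Vec3  # minimum corner (x, y, z)
    hi: Vec3  # maximum corner (x, y, z)


def overlaps(a: Box, b: Box) -> bool:
    # Two boxes intersect only if their extents overlap on every axis.
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(3))


def try_move(scene: list[Box], name: str, delta: Vec3) -> list[Box]:
    """Translate one object; reject the edit if it would collide with another."""
    target = next(b for b in scene if b.name == name)
    moved = replace(
        target,
        lo=tuple(l + d for l, d in zip(target.lo, delta)),
        hi=tuple(h + d for h, d in zip(target.hi, delta)),
    )
    others = [b for b in scene if b.name != name]
    if any(overlaps(moved, o) for o in others):
        return scene  # collision detected: leave the scene unchanged
    return others + [moved]


# Hypothetical two-object scene, used only for illustration.
scene = [
    Box("chair", (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    Box("table", (2.0, 0.0, 0.0), (4.0, 1.0, 1.0)),
]
```

With this scene, `try_move(scene, "chair", (1.5, 0.0, 0.0))` returns the scene unchanged because the chair would penetrate the table, while a shift of `(0.5, 0.0, 0.0)` is accepted. A monolithic asset offers no such per-object handle, which is why decomposition has to come before any edit of this kind.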