WorldAct: Activating Monolithic 3D Worlds into Interactive Object-Centric Scenes

Abstract

Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static, monolithic assets with limited editability and physical interactivity. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To address this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable, interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.