WorldAct: Activating Monolithic 3D Worlds into Interaction-Ready, Object-Centric Scenes

Abstract

Recent 3D world modeling systems built on generative scene synthesis, such as Marble, can produce coherent and explorable 3D environments, but their outputs are typically static, monolithic assets that offer limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where a generated world must be actively modified and manipulated. To address this limitation, we present WorldAct, a framework that converts static generated 3D worlds into editable, interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition and identify actionable objects, reconstructs geometrically aligned object-level meshes to support interaction, and restores the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.
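
The four stages named in the abstract (decomposition, actionable-object identification, object-level mesh reconstruction, and background inpainting) can be read as a sequential pipeline. The following is a minimal illustrative sketch of that data flow only, not the authors' implementation: every name in it (`SceneObject`, `InteractiveScene`, `activate_world`, and the four stage callables) is hypothetical, and worlds, meshes, and backgrounds are represented by plain strings as placeholders for real 3D assets.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SceneObject:
    """One candidate object found in the monolithic world (hypothetical type)."""
    name: str
    actionable: bool = False
    mesh: Optional[str] = None  # placeholder for an object-level mesh handle


@dataclass
class InteractiveScene:
    """Pipeline output: actionable objects plus the restored background."""
    objects: List[SceneObject]
    background: str


def activate_world(
    world: str,
    decompose: Callable[[str], List[SceneObject]],
    is_actionable: Callable[[SceneObject], bool],
    reconstruct_mesh: Callable[[str, SceneObject], str],
    inpaint_background: Callable[[str, List[SceneObject]], str],
) -> InteractiveScene:
    """Run the four stages in order; each stage is injected by the caller."""
    # Stage 1: split the monolithic world into candidate objects.
    candidates = decompose(world)

    # Stage 2: keep only the objects judged to be actionable.
    actionable = []
    for obj in candidates:
        if is_actionable(obj):
            obj.actionable = True
            actionable.append(obj)

    # Stage 3: attach a reconstructed mesh to each actionable object.
    for obj in actionable:
        obj.mesh = reconstruct_mesh(world, obj)

    # Stage 4: fill in the background left behind by the extracted objects.
    background = inpaint_background(world, actionable)
    return InteractiveScene(objects=actionable, background=background)
```

With stub stages, for example a `decompose` that returns a mug and a wall and an `is_actionable` that rejects the wall, the returned scene contains only the mug with its mesh placeholder, together with an inpainted-background placeholder. Passing the stages as callables is a design choice made for this sketch so that the data flow stays visible without assuming any particular segmentation, reconstruction, or inpainting model.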