WorldAct: Activating Monolithic 3D Worlds into Interaction-Ready Object-Centric Scenes

Abstract

Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments. Their outputs, however, are typically static monolithic assets with limited editability and limited support for physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To address this limitation, we present WorldAct, a framework that converts static generated 3D worlds into editable, interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.