SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

February 9, 2026
Authors: Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake
cs.AI

Abstract

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages, from architectural layout to furniture placement to small object population, each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.