SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

Abstract

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages, from architectural layout to furniture placement to small object population, each implemented as an interaction among three vision-language model (VLM) agents: a designer, a critic, and an orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves average win rates of 92% for realism and 91% for prompt faithfulness against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.
