Holodeck: Language Guided Generation of 3D Embodied AI Environments
December 14, 2023
Authors: Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark
cs.AI
Abstract
3D simulated environments play a critical role in Embodied AI, but their
creation requires expertise and extensive manual effort, restricting their
diversity and scope. To mitigate this limitation, we present Holodeck, a system
that fully automatically generates 3D environments to match a user-supplied
prompt. Holodeck can generate diverse scenes (e.g., arcades, spas, and
museums), adjust the designs for different styles, and capture the semantics of
complex queries such as "apartment for a researcher with a cat" and "office of
a professor who is a fan of Star Wars". Holodeck leverages a large language
model (GPT-4) for common sense knowledge about what the scene might look like
and uses a large collection of 3D assets from Objaverse to populate the scene
with diverse objects. To address the challenge of positioning objects
correctly, we prompt GPT-4 to generate spatial relational constraints between
objects and then optimize the layout to satisfy those constraints. Our
large-scale human evaluation shows that annotators prefer Holodeck over
manually designed procedural baselines in residential scenes and that Holodeck
can produce high-quality outputs for diverse scene types. We also demonstrate
an exciting application of Holodeck in Embodied AI, training agents to navigate
in novel scenes like music rooms and daycares without human-constructed data,
which is a significant step forward in developing general-purpose embodied
agents.
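
The abstract describes generating spatial relational constraints between objects and then optimizing the layout to satisfy them. The sketch below is only a toy illustration of that general idea, not the Holodeck implementation: the room size, the object footprints, the constraint vocabulary ("edge", "near", "far"), the distance thresholds, and the brute-force search are all assumptions made for this example. The constraints are hard-coded where a real system would obtain them from a language model.

```python
from itertools import product
from math import dist

# All values below are hypothetical and chosen only for illustration.
ROOM_W, ROOM_D = 6, 4  # room size in grid cells
SIZES = {"sofa": (3, 1), "tv_stand": (2, 1), "table": (1, 1)}  # (width, depth)

# Stand-in for constraints a language model might return.
CONSTRAINTS = [
    ("edge", "sofa"),
    ("edge", "tv_stand"),
    ("near", "table", "sofa"),
    ("far", "tv_stand", "sofa"),
]


def rect(name, pos):
    """Axis-aligned footprint (x0, y0, x1, y1) of an object placed at pos."""
    x, y = pos
    w, d = SIZES[name]
    return (x, y, x + w, y + d)


def center(r):
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def overlaps(a, b):
    """True if two footprints share interior area (touching edges is fine)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def satisfied(constraint, rects):
    kind, obj = constraint[0], constraint[1]
    r = rects[obj]
    if kind == "edge":  # object touches at least one wall
        return r[0] == 0 or r[1] == 0 or r[2] == ROOM_W or r[3] == ROOM_D
    d = dist(center(r), center(rects[constraint[2]]))
    if kind == "near":
        return d <= 1.5
    if kind == "far":
        return d >= 2.5
    raise ValueError(f"unknown constraint kind: {kind}")


def candidate_positions(name):
    """All grid positions that keep the object fully inside the room."""
    w, d = SIZES[name]
    return [(x, y) for x in range(ROOM_W - w + 1) for y in range(ROOM_D - d + 1)]


def solve():
    """Exhaustively search placements; non-overlap is hard, the rest are soft."""
    names = list(SIZES)
    best, best_score = None, -1
    for combo in product(*(candidate_positions(n) for n in names)):
        rects = {n: rect(n, p) for n, p in zip(names, combo)}
        rs = list(rects.values())
        if any(overlaps(rs[i], rs[j])
               for i in range(len(rs)) for j in range(i + 1, len(rs))):
            continue
        score = sum(satisfied(c, rects) for c in CONSTRAINTS)
        if score > best_score:
            best, best_score = dict(zip(names, combo)), score
    return best, best_score


if __name__ == "__main__":
    layout, score = solve()
    print(layout, f"{score}/{len(CONSTRAINTS)} constraints satisfied")
```

Treating non-overlap as a hard requirement and the relational constraints as soft, scored preferences means the search still returns a usable layout when the constraints cannot all be met at once. Exhaustive search is only practical at this toy scale; a larger scene would need a smarter search strategy.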