

Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions

June 9, 2023
Authors: Ian Huang, Vrishab Krishna, Omoruyi Atekha, Leonidas Guibas
cs.AI

Abstract

What constitutes the "vibe" of a particular scene? What should one find in "a busy, dirty city street", "an idyllic countryside", or "a crime scene in an abandoned living room"? The translation from abstract scene descriptions to stylized scene elements cannot be done with any generality by extant systems trained on rigid and limited indoor datasets. In this paper, we propose to leverage the knowledge captured by foundation models to accomplish this translation. We present a system that can serve as a tool to generate stylized assets for 3D scenes described by a short phrase, without the need to enumerate the objects to be found within the scene or give instructions on their appearance. Additionally, it is robust to open-world concepts in a way that traditional methods trained on limited data are not, affording more creative freedom to the 3D artist. Our system demonstrates this using a foundation model "team" composed of a large language model, a vision-language model and several image diffusion models, which communicate using an interpretable and user-editable intermediate representation, thus allowing for more versatile and controllable stylized asset generation for 3D artists. We introduce novel metrics for this task, and show through human evaluations that in 91% of the cases, our system outputs are judged more faithful to the semantics of the input scene description than the baseline, thus highlighting the potential of this approach to radically accelerate the 3D content creation process for 3D artists.
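To make the described architecture concrete, below is a minimal sketch of how such a foundation-model "team" might be wired together: a large language model expands the scene phrase into an interpretable, user-editable intermediate representation of assets, image diffusion models render a stylized image per asset, and a vision-language model scores faithfulness to the prompt. The schema, function names, and threshold here are hypothetical stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical intermediate representation (IR): a human-readable list of
# scene objects with per-object style descriptions. The paper describes an
# interpretable, user-editable IR; the exact schema here is assumed.
@dataclass
class AssetSpec:
    name: str          # e.g. "overturned armchair"
    description: str   # stylized appearance, e.g. "dusty, torn upholstery"

@dataclass
class SceneIR:
    scene_phrase: str
    assets: list[AssetSpec] = field(default_factory=list)

def hallucinate_assets(scene_phrase: str, llm) -> SceneIR:
    """Ask an LLM to enumerate objects and their styles for a scene phrase.

    `llm` is any callable mapping a prompt string to a response string.
    """
    prompt = (
        f"List the objects one would find in '{scene_phrase}', one per line "
        "as 'name: short stylized appearance description'."
    )
    ir = SceneIR(scene_phrase)
    for line in llm(prompt).splitlines():
        name, _, desc = line.partition(":")
        if desc:
            ir.assets.append(AssetSpec(name.strip(), desc.strip()))
    return ir  # the artist can inspect and edit this IR before rendering

def stylize(ir: SceneIR, diffusion, vlm, threshold: float = 0.25):
    """Render an image per asset; keep those the VLM judges faithful.

    `diffusion` maps a text prompt to an image; `vlm` maps (image, prompt)
    to an image-text alignment score. Both are assumed interfaces.
    """
    results = []
    for asset in ir.assets:
        prompt = f"{asset.name}, {asset.description}, in {ir.scene_phrase}"
        image = diffusion(prompt)       # text-to-image stage
        score = vlm(image, prompt)      # faithfulness check
        if score >= threshold:
            results.append((asset, image, score))
    return results
```

Exposing the IR between the two stages is what makes the pipeline controllable: the artist can add, remove, or restyle assets before any rendering happens, rather than re-prompting the whole system.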