
Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions

June 9, 2023
Authors: Ian Huang, Vrishab Krishna, Omoruyi Atekha, Leonidas Guibas
cs.AI

Abstract

What constitutes the "vibe" of a particular scene? What should one find in "a busy, dirty city street", "an idyllic countryside", or "a crime scene in an abandoned living room"? The translation from abstract scene descriptions to stylized scene elements cannot be done with any generality by extant systems trained on rigid and limited indoor datasets. In this paper, we propose to leverage the knowledge captured by foundation models to accomplish this translation. We present a system that can serve as a tool to generate stylized assets for 3D scenes described by a short phrase, without the need to enumerate the objects to be found within the scene or give instructions on their appearance. Additionally, it is robust to open-world concepts in a way that traditional methods trained on limited data are not, affording more creative freedom to the 3D artist. Our system demonstrates this using a foundation model "team" composed of a large language model, a vision-language model and several image diffusion models, which communicate using an interpretable and user-editable intermediate representation, thus allowing for more versatile and controllable stylized asset generation for 3D artists. We introduce novel metrics for this task, and show through human evaluations that in 91% of the cases, our system outputs are judged more faithful to the semantics of the input scene description than the baseline, thus highlighting the potential of this approach to radically accelerate the 3D content creation process for 3D artists.
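The "team" workflow described above — a language model expanding a short scene phrase into a user-editable intermediate representation, diffusion models producing stylized assets from it, and a vision-language model checking faithfulness — can be sketched roughly as follows. This is a minimal illustration only: the function names (`llm_propose_assets`, `diffusion_stylize`, `vlm_score`), the `AssetSpec` representation, and the canned outputs are all hypothetical stand-ins, not the paper's actual models or interfaces.

```python
from dataclasses import dataclass

@dataclass
class AssetSpec:
    """Hypothetical user-editable intermediate representation for one object."""
    name: str
    style_prompt: str

def llm_propose_assets(scene_description: str) -> list[AssetSpec]:
    # Stand-in for the LLM: enumerate objects implied by the scene "vibe",
    # without the user listing them. Canned output for illustration.
    canned = {
        "a busy, dirty city street": [
            AssetSpec("taxi", "dented yellow taxi, grime-streaked"),
            AssetSpec("trash can", "overflowing steel trash can"),
        ],
    }
    return canned.get(scene_description, [AssetSpec("prop", scene_description)])

def diffusion_stylize(spec: AssetSpec) -> str:
    # Stand-in for the image diffusion models: return an asset identifier
    # instead of an actual rendered, stylized asset.
    return f"asset<{spec.name}|{spec.style_prompt}>"

def vlm_score(asset: str, spec: AssetSpec) -> float:
    # Stand-in for the VLM: crude faithfulness check that the style
    # prompt survived into the generated asset.
    return 1.0 if spec.style_prompt in asset else 0.0

def generate_scene_assets(scene_description: str, threshold: float = 0.5):
    # The artist can inspect and edit `specs` before rendering: this is
    # the interpretable intermediate representation the components share.
    specs = llm_propose_assets(scene_description)
    assets = [diffusion_stylize(s) for s in specs]
    return [a for a, s in zip(assets, specs) if vlm_score(a, s) >= threshold]
```

The key design point the abstract emphasizes is that the models communicate through the editable `specs` list rather than opaque embeddings, which is what gives the artist control between the LLM stage and the rendering stage.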