Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions

Abstract

What constitutes the "vibe" of a particular scene? What should one find in "a busy, dirty city street", "an idyllic countryside", or "a crime scene in an abandoned living room"? Existing systems trained on rigid and limited indoor datasets cannot perform the translation from abstract scene descriptions to stylized scene elements with any generality. In this paper, we propose to leverage the knowledge captured by foundation models to accomplish this translation. We present a system that can serve as a tool for generating stylized assets for 3D scenes described by a short phrase, without the need to enumerate the objects in the scene or give instructions on their appearance. It is also robust to open-world concepts in a way that traditional methods trained on limited data are not, affording more creative freedom to the 3D artist. Our system achieves this with a foundation model "team" composed of a large language model, a vision-language model, and several image diffusion models, which communicate through an interpretable and user-editable intermediate representation, allowing for more versatile and controllable stylized asset generation. We introduce novel metrics for this task and show through human evaluations that in 91% of cases our system's outputs are judged more faithful to the semantics of the input scene description than the baseline, highlighting the potential of this approach to radically accelerate the 3D content creation process for 3D artists.
