Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions

Abstract

What constitutes the "vibe" of a particular scene? What should one find in "a busy, dirty city street", "an idyllic countryside", or "a crime scene in an abandoned living room"? The translation from abstract scene descriptions to stylized scene elements cannot be done with any generality by extant systems trained on rigid and limited indoor datasets. In this paper, we propose to leverage the knowledge captured by foundation models to accomplish this translation. We present a system that can serve as a tool to generate stylized assets for 3D scenes described by a short phrase, without the need to enumerate the objects to be found within the scene or give instructions on their appearance. Additionally, it is robust to open-world concepts in a way that traditional methods trained on limited data are not, affording more creative freedom to the 3D artist. Our system demonstrates this using a foundation model "team" composed of a large language model, a vision-language model, and several image diffusion models, which communicate using an interpretable and user-editable intermediate representation, thus allowing for more versatile and controllable stylized asset generation. We introduce novel metrics for this task, and show through human evaluations that in 91% of the cases, our system outputs are judged more faithful to the semantics of the input scene description than the baseline, thus highlighting the potential of this approach to radically accelerate the 3D content creation process for 3D artists.
