CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

Abstract

In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and effort. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic text or image inputs as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc.). At its core is a large-scale generative model composed of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), which extracts rich 3D priors directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module with pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra-large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D-native geometry generator with 1.5 billion parameters. For appearance generation, CLAY sets out to produce physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K-resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production-ready assets with intricate details. Even first-time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity.
