Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

Abstract

While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To address this, we present Step1X-3D, an open framework that tackles these challenges through: (1) a rigorous data curation pipeline that processes more than 5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving quality competitive with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
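To make the TSDF representation mentioned in the abstract concrete, here is a minimal sketch of a truncated signed distance function evaluated on a regular grid. It is an illustration only, not the Step1X-3D implementation: the sphere shape, the function names (`sphere_sdf`, `tsdf`, `sample_grid`), the truncation width `trunc`, the normalization to [-1, 1], and the sign convention (negative inside) are all assumptions chosen for clarity.

```python
import math


def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=0.5):
    """Signed distance to a sphere: negative inside, positive outside (assumed convention)."""
    return math.dist(p, center) - radius


def tsdf(p, sdf=sphere_sdf, trunc=0.1):
    """Scale the signed distance by the truncation width and clamp it to [-1, 1]."""
    d = sdf(p)
    return max(-1.0, min(1.0, d / trunc))


def sample_grid(resolution=5, extent=1.0):
    """Evaluate the TSDF on a regular resolution^3 grid spanning [-extent, extent]^3."""
    step = 2.0 * extent / (resolution - 1)
    coords = [-extent + i * step for i in range(resolution)]
    return {(x, y, z): tsdf((x, y, z)) for x in coords for y in coords for z in coords}


# Hypothetical toy grid; real pipelines use far higher resolutions.
grid = sample_grid()
```

Values saturate at -1 or 1 away from the surface and vary only within a narrow band around it, so the zero level set (where `tsdf` changes sign) marks the surface that a mesh extractor would recover.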
