Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

Abstract

Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
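
The sampling strategy described above can be illustrated with a minimal sketch. It assumes a fixed set of weights, which is not how the paper specifies it: the abstract states only that one control modality is chosen per example and that harder signals (e.g., skeletal pose) are favored over easier ones (e.g., point clouds). The numeric weights, the relative ordering of bounding boxes and voxels, and the names `MODALITY_WEIGHTS`, `sample_control_modality`, and `build_conditions` are all hypothetical. The progressive schedule (how the weights might change during training) is not modeled here.

```python
import random

# Hypothetical sampling weights (illustrative assumption, not from the paper).
# Harder signals get a larger share; easier ones are downweighted.
MODALITY_WEIGHTS = {
    "skeleton": 0.40,     # assumed hardest, sampled most often
    "bbox": 0.25,         # assumed intermediate
    "voxel": 0.20,        # assumed intermediate
    "point_cloud": 0.15,  # assumed easiest, sampled least often
}


def sample_control_modality(rng, weights=MODALITY_WEIGHTS):
    """Pick exactly one control modality, biased by the given weights."""
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=1)[0]


def build_conditions(example, rng, weights=MODALITY_WEIGHTS):
    """Keep the image plus the single sampled control signal.

    `example` is assumed to be a dict holding an "image" entry and one
    entry per control modality; all other control signals are dropped.
    """
    chosen = sample_control_modality(rng, weights)
    return {"image": example["image"], chosen: example[chosen]}


if __name__ == "__main__":
    rng = random.Random(0)
    example = {
        "image": "image_tensor",
        "skeleton": "skeleton_tensor",
        "bbox": "bbox_tensor",
        "voxel": "voxel_tensor",
        "point_cloud": "point_cloud_tensor",
    }
    # Each call yields the image and one control signal.
    print(sorted(build_conditions(example, rng)))
```

Because each example carries only one control signal alongside the image, a model trained this way regularly sees the other modalities absent, which is one plausible reading of how the strategy encourages graceful handling of missing inputs.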

