Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
September 25, 2025
Authors: Team Hunyuan3D, Bowen Zhang, Chunchao Guo, Haolin Liu, Hongyu Yan, Huiwen Shi, Jingwei Huang, Junlin Yu, Kunhong Li, Linus, Penghao Wang, Qingxiang Lin, Sicong Liu, Xianghui Yang, Yixuan Tang, Yunfei Zhao, Zeqiang Lai, Zhihao Liang, Zibo Zhao
cs.AI
Abstract
Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
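
The difficulty-aware sampling described above can be pictured with a short sketch. The Python snippet below is a minimal illustration, not the authors' implementation: the modality names and weights are assumptions chosen only to show the idea of drawing exactly one control modality per training example, with more probability mass on harder signals (skeletal pose) and less on easier ones (point clouds). The progressive part of the schedule (reweighting over the course of training) is omitted for brevity.

import random

# Hypothetical per-modality sampling weights; the paper does not publish
# concrete values. Harder control signals get more probability mass.
MODALITY_WEIGHTS = {
    "skeletal_pose": 0.40,  # hardest: sparse, articulated prior
    "bounding_box": 0.25,   # coarse spatial extent only
    "voxel": 0.20,          # low-resolution occupancy grid
    "point_cloud": 0.15,    # easiest: dense geometric prior
}

def sample_control_modality(rng: random.Random) -> str:
    """Draw exactly one control modality for a training example."""
    names = list(MODALITY_WEIGHTS)
    weights = [MODALITY_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

def attach_controls(examples, seed=0):
    """Keep only the sampled control signal for each example; dropping
    the other modalities also teaches the model to handle missing
    inputs gracefully at inference time."""
    rng = random.Random(seed)
    batch = []
    for ex in examples:
        modality = sample_control_modality(rng)
        batch.append({
            "image": ex["image"],        # base conditioning signal
            "modality": modality,        # which control is active
            "control": ex.get(modality), # the sampled control signal
        })
    return batch

Because every example carries at most one active control, the single cross-modal architecture sees all conditioning types through the same pathway, which is what enables the uniform handling of point clouds, voxels, bounding boxes, and skeletal poses that the abstract describes.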