

Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

September 25, 2025
作者: Team Hunyuan3D, Bowen Zhang, Chunchao Guo, Haolin Liu, Hongyu Yan, Huiwen Shi, Jingwei Huang, Junlin Yu, Kunhong Li, Linus, Penghao Wang, Qingxiang Lin, Sicong Liu, Xianghui Yang, Yixuan Tang, Yunfei Zhao, Zeqiang Lai, Zhihao Liang, Zibo Zhao
cs.AI

Abstract

Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
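The difficulty-aware sampling strategy described above can be sketched as weighted modality selection: each training example draws exactly one control modality, with harder signals (skeletal pose) upweighted and easier ones (point clouds) downweighted. The weights and modality names below are illustrative assumptions, not values from the paper:

```python
import random

# Hypothetical weights for difficulty-aware sampling: harder control
# signals are drawn more often, easier ones less often. The exact
# values are assumptions for illustration only.
MODALITY_WEIGHTS = {
    "skeletal_pose": 0.40,  # hardest signal, upweighted
    "bounding_box": 0.25,
    "voxel": 0.20,
    "point_cloud": 0.15,    # easiest signal, downweighted
}

def sample_control_modality(rng: random.Random) -> str:
    """Pick exactly one control modality for a training example."""
    modalities = list(MODALITY_WEIGHTS)
    weights = [MODALITY_WEIGHTS[m] for m in modalities]
    return rng.choices(modalities, weights=weights, k=1)[0]

# Empirically, harder modalities dominate the draws.
rng = random.Random(0)
counts = {m: 0 for m in MODALITY_WEIGHTS}
for _ in range(10_000):
    counts[sample_control_modality(rng)] += 1
```

In a real training loop the sampled modality would decide which conditioning signal (pose prior, bounding box, etc.) is fed to the unified cross-modal architecture for that example, with the remaining inputs dropped so the model learns to handle missing conditions gracefully.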