SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction
October 9, 2025
Authors: Wenyue Chen, Peng Li, Wangguandong Zheng, Chengfeng Zhao, Mengfei Li, Yaolong Zhu, Zhiyang Dou, Ronggang Wang, Yuan Liu
cs.AI
Abstract
Photorealistic 3D full-body human reconstruction from a single image is
critical for film and video game applications, yet it remains challenging
because of inherent ambiguities and severe self-occlusions. While recent approaches
leverage SMPL estimation and SMPL-conditioned image generative models to
hallucinate novel views, they suffer from inaccurate 3D priors estimated from
SMPL meshes and struggle to handle challenging human poses and to
reconstruct fine details. In this paper, we propose SyncHuman, a novel
framework that combines a 2D multiview generative model and a 3D native
generative model for the first time, enabling high-quality clothed human mesh
reconstruction from single-view images even under challenging human poses.
Multiview generative models excel at capturing fine 2D details but struggle
with structural consistency, whereas 3D native generative models generate
coarse yet structurally consistent 3D shapes. By integrating the complementary
strengths of these two approaches, we develop a more effective generation
framework. Specifically, we first jointly fine-tune the multiview generative
model and the 3D native generative model with the proposed pixel-aligned 2D-3D
synchronization attention to produce geometrically aligned 3D shapes and 2D
multiview images. To further improve details, we introduce a feature injection
mechanism that lifts fine details from 2D multiview images onto the aligned 3D
shapes, enabling accurate and high-fidelity reconstruction. Extensive
experiments demonstrate that SyncHuman achieves robust and photorealistic 3D
human reconstruction, even for images with challenging poses. Our method
outperforms baseline methods in geometric accuracy and visual fidelity,
demonstrating a promising direction for future 3D generation models.
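
The feature injection idea, lifting details from 2D multiview images onto an aligned 3D shape, can be illustrated with a generic pixel-aligned sampling sketch. This is not the SyncHuman implementation: it assumes a single front view with orthographic projection along the z axis, voxel centers normalized to [-1, 1]^3, bilinear sampling, and fusion by simple concatenation. The function names `sample_pixel_aligned` and `inject_features` are hypothetical and chosen only for this example.

```python
import numpy as np


def sample_pixel_aligned(feature_map, points):
    """Bilinearly sample 2D features at the image locations of 3D points.

    Assumption (illustrative only): one front view, orthographic projection
    along z, so a point (x, y, z) in [-1, 1]^3 lands at column ~ x and
    row ~ -y regardless of its depth z.

    feature_map: (H, W, C) array with H, W >= 2
    points:      (N, 3) array of voxel centers
    returns:     (N, C) array of sampled features
    """
    h, w, _ = feature_map.shape
    # Map normalized coordinates to continuous pixel coordinates.
    col = np.clip((points[:, 0] + 1.0) * 0.5 * (w - 1), 0, w - 1)
    row = np.clip((1.0 - points[:, 1]) * 0.5 * (h - 1), 0, h - 1)
    # Top-left corner of the 2x2 neighbourhood, kept inside the image.
    c0 = np.clip(np.floor(col).astype(int), 0, w - 2)
    r0 = np.clip(np.floor(row).astype(int), 0, h - 2)
    fc = (col - c0)[:, None]  # horizontal interpolation weight
    fr = (row - r0)[:, None]  # vertical interpolation weight
    top = feature_map[r0, c0] * (1 - fc) + feature_map[r0, c0 + 1] * fc
    bottom = feature_map[r0 + 1, c0] * (1 - fc) + feature_map[r0 + 1, c0 + 1] * fc
    return top * (1 - fr) + bottom * fr


def inject_features(voxel_feats, voxel_centers, feature_map):
    """Concatenate per-voxel 3D features with pixel-aligned 2D features.

    Concatenation is an assumed fusion rule for this sketch; a real system
    could use attention or a learned projection instead.
    """
    sampled = sample_pixel_aligned(feature_map, voxel_centers)
    return np.concatenate([voxel_feats, sampled], axis=1)
```

With several views, the same sampling would be repeated per view using that view's projection, and the per-view features aggregated before fusion; how SyncHuman does this is not specified in the abstract.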