SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction
October 9, 2025
Authors: Wenyue Chen, Peng Li, Wangguandong Zheng, Chengfeng Zhao, Mengfei Li, Yaolong Zhu, Zhiyang Dou, Ronggang Wang, Yuan Liu
cs.AI
Abstract
Photorealistic 3D full-body human reconstruction from a single image is a
critical yet challenging task for applications in films and video games due to
inherent ambiguities and severe self-occlusions. While recent approaches
leverage SMPL estimation and SMPL-conditioned image generative models to
hallucinate novel views, they suffer from inaccurate 3D priors estimated from
SMPL meshes and struggle to handle complex human poses and to
reconstruct fine details. In this paper, we propose SyncHuman, a novel
framework that combines a 2D multiview generative model with a 3D native
generative model for the first time, enabling high-quality clothed human mesh
reconstruction from single-view images even under challenging human poses.
Multiview generative models excel at capturing fine 2D details but struggle
with structural consistency, whereas 3D native generative models generate
coarse yet structurally consistent 3D shapes. By integrating the complementary
strengths of these two approaches, we develop a more effective generation
framework. Specifically, we first jointly fine-tune the multiview generative
model and the 3D native generative model with the proposed pixel-aligned 2D-3D
synchronization attention to produce geometrically aligned 3D shapes and 2D
multiview images. To further improve details, we introduce a feature injection
mechanism that lifts fine details from 2D multiview images onto the aligned 3D
shapes, enabling accurate and high-fidelity reconstruction. Extensive
experiments demonstrate that SyncHuman achieves robust and photorealistic 3D
human reconstruction, even for images with challenging poses. Our method
outperforms baseline methods in geometric accuracy and visual fidelity,
demonstrating a promising direction for future 3D generation models.
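
The abstract does not spell out how "pixel-aligned" 2D-3D correspondence is computed. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below projects 3D points into a single front-view feature map and reads back the feature stored at the matching pixel. The orthographic camera, the [-1, 1] coordinate range, nearest-neighbor sampling, and the names `project_orthographic` and `sample_pixel_aligned` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
# Feature map indexed as feature_map[row][col] -> feature vector.
FeatureMap = List[List[Sequence[float]]]


def project_orthographic(point: Point3D, height: int, width: int) -> Tuple[int, int]:
    """Map a point in the cube [-1, 1]^3 to a (row, col) pixel of a front view.

    Assumption: a hypothetical orthographic camera looking along the z axis,
    with +x pointing right and +y pointing up. Depth (z) is ignored.
    """
    x, y, _ = point
    col = round((x + 1.0) / 2.0 * (width - 1))
    row = round((1.0 - y) / 2.0 * (height - 1))
    # Clamp so points slightly outside the cube still land on a valid pixel.
    row = min(max(row, 0), height - 1)
    col = min(max(col, 0), width - 1)
    return row, col


def sample_pixel_aligned(points: Sequence[Point3D],
                         feature_map: FeatureMap) -> List[List[float]]:
    """Return, for each 3D point, the 2D feature at its projected pixel."""
    height, width = len(feature_map), len(feature_map[0])
    sampled = []
    for point in points:
        row, col = project_orthographic(point, height, width)
        sampled.append(list(feature_map[row][col]))
    return sampled
```

In a learned system the sampled 2D features would typically be fused with per-voxel 3D features (for example through attention), and several views would be combined; this sketch covers only the geometric lookup for one view.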