SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction

Abstract

Photorealistic 3D full-body human reconstruction from a single image is critical for film and video game applications, yet it remains challenging because of inherent ambiguities and severe self-occlusions. Recent approaches use SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, but they suffer from inaccurate 3D priors estimated from SMPL meshes, struggle with challenging human poses, and fail to reconstruct fine details. In this paper, we propose SyncHuman, a novel framework that, for the first time, combines a 2D multiview generative model with a 3D native generative model, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging poses. Multiview generative models excel at capturing fine 2D details but struggle with structural consistency, whereas 3D native generative models produce coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with the proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from the 2D multiview images onto the aligned 3D shapes, enabling accurate, high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photorealistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in both geometric accuracy and visual fidelity, suggesting a promising direction for future 3D generation models.
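
To make the notion of pixel alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes orthographic cameras rotated about the vertical axis and 3D points normalized to [-1, 1]^3; the function names `project_orthographic` and `gather_pixel_aligned` are hypothetical. Each 3D point is projected into every view and picks up the 2D feature at the pixel it lands on. Plain averaging across views stands in here for the learned attention and feature injection that the abstract describes.

```python
import math


def project_orthographic(point, azimuth_deg, size):
    """Project a 3D point in [-1, 1]^3 to a (row, col) pixel of a size x size view.

    Assumption: an orthographic camera rotated about the vertical (y) axis.
    """
    x, y, z = point
    a = math.radians(azimuth_deg)
    # Rotate into the camera frame; only the horizontal coordinate changes.
    x_cam = x * math.cos(a) - z * math.sin(a)
    # Map [-1, 1] to pixel indices; image rows grow downward, so y is flipped.
    col = int(round((x_cam + 1) / 2 * (size - 1)))
    row = int(round((1 - y) / 2 * (size - 1)))
    # Clamp so that points on the boundary stay inside the image.
    col = min(max(col, 0), size - 1)
    row = min(max(row, 0), size - 1)
    return row, col


def gather_pixel_aligned(points, views):
    """Average, for each 3D point, the per-view features at its projected pixel.

    views: list of (azimuth_deg, feature_map), where feature_map[row][col]
    is a list of floats (one value per channel).
    """
    gathered = []
    for p in points:
        samples = []
        for azimuth_deg, fmap in views:
            r, c = project_orthographic(p, azimuth_deg, len(fmap))
            samples.append(fmap[r][c])
        n = len(samples)
        # Channel-wise mean over views (a stand-in for learned attention).
        gathered.append([sum(ch) / n for ch in zip(*samples)])
    return gathered


# Usage: two 3x3 single-channel feature maps seen from the front and the side.
front = [[[float(r * 3 + c)] for c in range(3)] for r in range(3)]
side = [[[10.0 * (r * 3 + c)] for c in range(3)] for r in range(3)]
features = gather_pixel_aligned([(1.0, 1.0, 0.0)], [(0, front), (90, side)])
```

In a real system the nearest-pixel lookup would typically be replaced by differentiable bilinear sampling with calibrated cameras, and the averaged features would feed a 3D generator rather than being used directly.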