Single-Image 3D Human Digitization with Shape-Guided Diffusion

Abstract

We present an approach for generating a 360-degree view of a person with a consistent, high-resolution appearance from a single input image. NeRF and its variants typically require videos or images from multiple viewpoints. Most existing approaches that take monocular input either rely on ground-truth 3D scans for supervision or lack 3D consistency. While recent 3D generative models show promise for 3D-consistent human digitization, they do not generalize well to diverse clothing appearances, and their results lack photorealism. Unlike existing work, we use high-capacity 2D diffusion models pretrained for general image synthesis tasks as an appearance prior for clothed humans. To achieve better 3D consistency while retaining the input identity, we progressively synthesize multiple views of the person in the input image by inpainting missing regions with shape-guided diffusion conditioned on silhouettes and surface normals. We then fuse these synthesized multi-view images via inverse rendering to obtain a fully textured, high-resolution 3D mesh of the given person. Experiments show that our approach outperforms prior methods and achieves photorealistic 360-degree synthesis of a wide range of clothed humans with complex textures from a single image.
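The progressive view synthesis described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it only models the bookkeeping of visiting views outward from the input view and filling the regions that are newly visible and still missing. All names here (`view_schedule`, `visible_bins`, `progressive_fill`, `nearest_known_inpaint`) are hypothetical, the body surface is reduced to a ring of azimuth bins, and the shape-guided diffusion model is replaced by a stand-in that copies the nearest known value. Silhouette and surface-normal conditioning and the inverse-rendering fusion step are not modeled.

```python
from typing import Callable, List, Optional

Texture = List[Optional[str]]  # one entry per azimuth bin; None means "missing"


def view_schedule(num_views: int) -> List[float]:
    """Azimuths in degrees, ordered outward from the input view at 0 degrees."""
    step = 360.0 / num_views
    order = [0.0]
    for k in range(1, num_views // 2 + 1):
        order.append((k * step) % 360.0)
        if len(order) < num_views:
            order.append((-k * step) % 360.0)
    return order[:num_views]


def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def visible_bins(azimuth: float, num_bins: int, half_angle: float = 60.0) -> List[int]:
    """Bins whose start angle lies within half_angle of the camera azimuth.

    Assumption: a fixed +/- 60 degree visibility cone stands in for real
    visibility computed from a mesh.
    """
    bin_width = 360.0 / num_bins
    return [i for i in range(num_bins)
            if angular_distance(i * bin_width, azimuth) <= half_angle]


def nearest_known_inpaint(texture: Texture, missing: List[int]) -> Texture:
    """Stand-in for a diffusion inpainter: copy the nearest known bin on the ring."""
    n = len(texture)
    out = list(texture)
    for i in missing:
        for offset in range(1, n):
            for j in ((i - offset) % n, (i + offset) % n):
                if texture[j] is not None:
                    out[i] = texture[j]
                    break
            if out[i] is not None:
                break
    return out


def progressive_fill(texture: Texture,
                     azimuths: List[float],
                     inpaint: Callable[[Texture, List[int]], Texture],
                     half_angle: float = 60.0) -> Texture:
    """Visit views in order; at each view, inpaint only visible bins that are still missing."""
    texture = list(texture)  # do not mutate the caller's list
    for az in azimuths:
        vis = visible_bins(az, len(texture), half_angle)
        missing = [i for i in vis if texture[i] is None]
        if not missing:
            continue
        filled = inpaint(texture, missing)
        for i in missing:  # already-known bins are kept unchanged
            texture[i] = filled[i]
    return texture


if __name__ == "__main__":
    num_bins = 36
    front = set(visible_bins(0.0, num_bins))
    initial: Texture = ["input" if i in front else None for i in range(num_bins)]
    result = progressive_fill(initial, view_schedule(8), nearest_known_inpaint)
    print("schedule:", view_schedule(8))
    print("missing before:", initial.count(None), "missing after:", result.count(None))
```

Keeping already-filled bins fixed at every step is what the sketch uses to mimic retaining the input identity while new views are added. In a real system the inpainting call would be a conditioned diffusion model and visibility would come from the estimated geometry.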

