GALA: Generating Animatable Layered Assets from a Single Scan

Abstract

We present GALA, a framework that takes a single-layer clothed 3D human mesh as input and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars in any pose. Existing reconstruction approaches often treat a clothed human as a single layer of geometry and overlook the inherent compositionality of humans with hairstyles, clothing, and accessories, which limits the utility of the resulting meshes for downstream applications. Decomposing a single-layer mesh into separate layers is challenging because it requires synthesizing plausible geometry and texture for severely occluded regions. Moreover, even after a successful decomposition, the meshes are not normalized with respect to pose and body shape, which prevents coherent composition with novel identities and poses. To address these challenges, we propose to leverage the general knowledge of a pretrained 2D diffusion model as a geometry and appearance prior for humans and other assets. We first separate the input mesh using a 3D surface segmentation extracted from multi-view 2D segmentations. We then synthesize the missing geometry of the different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once the high-fidelity 3D geometry has been inpainted, we apply the same SDS loss to its texture to obtain the complete appearance, including the initially occluded regions. Through this series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space that is normalized with respect to pose and human shape, which supports effortless composition with novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach on decomposition, canonicalization, and composition tasks compared to existing solutions.
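
The abstract states that the input mesh is separated using a 3D surface segmentation extracted from multi-view 2D segmentations, but does not say how the 2D labels are lifted onto the surface. One common way to do this, shown here only as a minimal sketch and not as the authors' method, is a per-vertex majority vote over the views in which the vertex is visible. The function name `lift_labels_to_vertices` and its inputs (precomputed pixel coordinates, visibility masks, and label maps) are hypothetical and assumed to come from a renderer and a 2D segmenter that are not shown.

```python
import numpy as np


def lift_labels_to_vertices(vertex_pixels, visible, seg_maps, num_labels):
    """Majority-vote per-vertex labels from multi-view 2D segmentation maps.

    All inputs are assumed to be precomputed (hypothetical interface):
      vertex_pixels: (V, N, 2) int array, (row, col) of each of N vertices in V views.
      visible:       (V, N) bool array, True where the vertex is unoccluded in a view.
      seg_maps:      (V, H, W) int array of per-pixel labels in [0, num_labels).
    Returns an (N,) int array of labels, with -1 for vertices seen in no view.
    """
    num_views, num_vertices = visible.shape
    votes = np.zeros((num_vertices, num_labels), dtype=np.int64)

    for v in range(num_views):
        idx = np.nonzero(visible[v])[0]          # vertices visible in this view
        rows = vertex_pixels[v, idx, 0]
        cols = vertex_pixels[v, idx, 1]
        labels = seg_maps[v, rows, cols]         # 2D label under each visible vertex
        np.add.at(votes, (idx, labels), 1)       # accumulate one vote per (vertex, label)

    result = votes.argmax(axis=1)
    result[votes.sum(axis=1) == 0] = -1          # never-visible vertices stay unlabeled
    return result
```

Vertices that receive no vote correspond to occluded surface regions; in the pipeline described above, those are the regions whose geometry and texture would later have to be synthesized rather than copied from the scan.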
