One Shot, One Talk: Whole-body Talking Avatar from a Single Image

Abstract

Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.
