ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model

Abstract

Real-time interactive video-chat portraits are increasingly recognized as a future trend, particularly given the remarkable progress in text and voice chat technologies. However, existing methods focus primarily on real-time generation of head movements and struggle to produce body motions that stay synchronized with those head movements. Achieving fine-grained control over speaking style and the nuances of facial expression also remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking heads to upper-body interaction. Our approach consists of two stages. The first stage uses efficient hierarchical motion diffusion models which, driven by audio input, take both explicit and implicit motion representations into account. These models generate a diverse range of facial expressions with stylistic control while keeping head and body movements synchronized. The second stage generates portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. In addition, our approach supports efficient and continuous generation of upper-body portrait video at resolutions up to 512 × 768 and frame rates up to 30 fps on a 4090 GPU, which makes real-time interactive video chat possible. Experimental results demonstrate that our approach produces portrait videos with rich expressiveness and natural upper-body movements.
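
To make the real-time figures quoted above concrete, the following minimal sketch works out the per-frame time budget and pixel throughput implied by 30 fps at 512 × 768. It is plain arithmetic on the numbers in the abstract, not a measurement or part of the authors' method; the function names are hypothetical, and which of 512 and 768 is the width is an assumption that does not affect the result.

```python
# Figures quoted in the abstract: up to 30 fps at up to 512 x 768 resolution.
TARGET_FPS = 30
DIM_A, DIM_B = 512, 768  # orientation (width vs. height) is an assumption


def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame at the given frame rate, in ms."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def pixels_per_second(dim_a: int, dim_b: int, fps: float) -> float:
    """Number of output pixels that must be generated per second."""
    return dim_a * dim_b * fps


if __name__ == "__main__":
    budget = frame_budget_ms(TARGET_FPS)
    throughput = pixels_per_second(DIM_A, DIM_B, TARGET_FPS)
    print(f"per-frame budget: {budget:.1f} ms")
    print(f"pixel throughput: {throughput / 1e6:.1f} Mpx/s")
```

Under these figures, every stage that runs once per output frame has to finish within roughly 33 ms in total for generation to keep pace with playback.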