TextToon: Real-Time Text Toonify Head Avatar from Single Video

Abstract

We propose TextToon, a method for generating a drivable toonified avatar. Given a short monocular video sequence and a written instruction describing the avatar style, our model generates a high-fidelity toonified avatar that can be driven in real time by another video of an arbitrary identity. Existing related works rely heavily on multi-view modeling to recover geometry via texture embeddings presented in a static manner, which limits control. The multi-view video input also makes these models difficult to deploy in real-world applications. To address these issues, we adopt a conditional embedding Tri-plane to learn realistic and stylized facial representations in a Gaussian deformation field. Additionally, we extend the stylization capabilities of 3D Gaussian Splatting by introducing an adaptive pixel-translation neural network and leveraging patch-aware contrastive learning to achieve high-quality images. To push our work toward consumer applications, we develop a real-time system that runs at 48 FPS on a GPU machine and 15-18 FPS on a mobile machine. Extensive experiments demonstrate that our approach outperforms existing methods at generating toonified avatars in terms of quality and real-time animation. Please refer to our project page for more details: https://songluchuan.github.io/TextToon/.
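
The abstract names patch-aware contrastive learning but does not give its formulation. As a rough illustration only, the sketch below shows a generic InfoNCE-style patch loss of the kind commonly used in image translation: features of the same patch location in two images form a positive pair, and all other patches act as negatives. The function names, the plain-list feature format, and the temperature value are assumptions made for this example and are not taken from the TextToon implementation.

```python
import math


def normalize(vec):
    """Scale a feature vector to unit L2 length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def patch_contrastive_loss(query_feats, key_feats, temperature=0.07):
    """Generic InfoNCE loss over patches (illustrative, not the authors' code).

    query_feats[i] and key_feats[i] are assumed to describe the same patch
    location in two images; every other key serves as a negative for query i.
    """
    queries = [normalize(q) for q in query_feats]
    keys = [normalize(k) for k in key_feats]
    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity between this query and every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Numerically stable log-sum-exp over all keys.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Cross-entropy with the matching patch (index i) as the target class.
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Matching patches line up in the first call and are swapped in the second.
aligned = patch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = patch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each patch is most similar to its own counterpart and grows when the correspondence is broken, which is the property such a term relies on to keep local structure while the style changes.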