TextToon: Real-Time Text Toonify Head Avatar from Single Video

Abstract

We propose TextToon, a method for generating a drivable toonified avatar. Given a short monocular video sequence and a written instruction describing the avatar style, our model generates a high-fidelity toonified avatar that can be driven in real time by another video of an arbitrary identity. Existing related works rely heavily on multi-view modeling to recover geometry via texture embeddings and present the result in a static manner, which limits control. Their multi-view video input also makes these models difficult to deploy in real-world applications. To address these issues, we adopt a conditional embedding Tri-plane to learn realistic and stylized facial representations in a Gaussian deformation field. Additionally, we expand the stylization capabilities of 3D Gaussian Splatting by introducing an adaptive pixel-translation neural network and leveraging patch-aware contrastive learning to achieve high-quality images. To push our work toward consumer applications, we develop a real-time system that runs at 48 FPS on a GPU machine and at 15-18 FPS on a mobile device. Extensive experiments demonstrate the efficacy of our approach over existing methods in generating text-driven toonified avatars, in terms of both quality and real-time animation. Please refer to our project page for more details: https://songluchuan.github.io/TextToon/.