TextToon: Real-Time Text Toonify Head Avatar from Single Video
September 23, 2024
Authors: Luchuan Song, Lele Chen, Celong Liu, Pinxin Liu, Chenliang Xu
cs.AI
Abstract
We propose TextToon, a method to generate a drivable toonified avatar. Given
a short monocular video sequence and a written instruction about the avatar
style, our model can generate a high-fidelity toonified avatar that can be
driven in real-time by another video with arbitrary identities. Existing
related works rely heavily on multi-view modeling to recover geometry via
texture embeddings in a static manner, which limits controllability. The
multi-view video input also makes these models difficult to deploy in
real-world applications. To address these issues, we adopt a
conditional embedding Tri-plane to learn realistic and stylized facial
representations in a Gaussian deformation field. Additionally, we expand the
stylization capabilities of 3D Gaussian Splatting by introducing an adaptive
pixel-translation neural network and leveraging patch-aware contrastive
learning to achieve high-quality images. To push our work into consumer
applications, we develop a real-time system that can operate at 48 FPS on a GPU
machine and 15-18 FPS on a mobile machine. Extensive experiments demonstrate
the efficacy of our approach in generating stylized avatars, surpassing
existing methods in both quality and real-time animation. Please refer to our
project page for more details: https://songluchuan.github.io/TextToon/.
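The abstract mentions a "conditional embedding Tri-plane" for facial representation. The paper's conditioning mechanism and Gaussian deformation field are not detailed here; the following is only a minimal sketch of the underlying plain tri-plane lookup (an EG3D-style factorization: a 3D point is projected onto three axis-aligned feature planes, each plane is bilinearly sampled, and the features are aggregated). All function names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_plane(plane, u, v):
    # Bilinearly sample a (H, W, C) feature plane at continuous
    # coordinates (u, v) in [0, 1].
    H, W, _ = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[y0, x0]
            + wx * (1 - wy) * plane[y0, x1]
            + (1 - wx) * wy * plane[y1, x0]
            + wx * wy * plane[y1, x1])

def triplane_feature(plane_xy, plane_xz, plane_yz, p):
    # Project 3D point p (coords in [0, 1]^3) onto the three
    # axis-aligned planes, sample each, and sum the features.
    x, y, z = p
    return (sample_plane(plane_xy, x, y)
            + sample_plane(plane_xz, x, z)
            + sample_plane(plane_yz, y, z))
```

In the full method, the sampled feature would additionally be conditioned (e.g. on expression parameters) and decoded into the attributes of the deformable Gaussians; that stage is omitted here.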
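The "patch-aware contrastive learning" used to improve stylized image quality is not specified in the abstract. A common formulation for patch-level contrastive objectives is a PatchNCE-style InfoNCE loss, where a patch feature in one image should match the feature at the same location in the other image while contrasting against patches at other locations. The sketch below is an assumption based on that general recipe, not the paper's exact loss.

```python
import numpy as np

def patch_nce_loss(feat_a, feat_b, tau=0.07):
    # InfoNCE over spatial patches: patch i of feat_a should match
    # patch i of feat_b (the positive pair) against all other
    # patches of feat_b (the negatives).
    # feat_a, feat_b: (N, C) arrays of per-patch features.
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / tau                        # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal
```

Perfectly aligned features give a near-zero loss, while mismatched patch pairings are penalized, which encourages the stylized output to preserve local content structure.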