ChatAnything: Facetime Chat with LLM-Enhanced Personas
November 12, 2023
Authors: Yilin Zhao, Xinbin Yuan, Shanghua Gao, Zhijie Lin, Qibin Hou, Jiashi Feng, Daquan Zhou
cs.AI
Abstract
In this technical report, we target generating anthropomorphized personas for LLM-based characters in an online manner, including visual appearance, personality, and tone, using only text descriptions. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts.
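The persona-prompting step can be pictured roughly as follows. This is a minimal sketch, not the authors' actual prompts: the persona template, the `chat` callable, and the field names are illustrative placeholders for any chat-style LLM interface.

```python
from typing import Callable, List, Dict

# Hypothetical persona template; the real system prompts are carefully
# engineered by the authors and are not reproduced here.
PERSONA_SYSTEM_TEMPLATE = (
    "You are {name}, {description}. "
    "Stay in character, answer in the first person, and keep the tone {tone}."
)

def build_persona_messages(name: str, description: str, tone: str,
                           user_utterance: str) -> List[Dict[str, str]]:
    """Assemble a chat history that conditions the LLM on the persona
    via a system prompt (in-context learning, no fine-tuning)."""
    return [
        {"role": "system",
         "content": PERSONA_SYSTEM_TEMPLATE.format(
             name=name, description=description, tone=tone)},
        {"role": "user", "content": user_utterance},
    ]

def chat_as_persona(chat: Callable[[List[Dict[str, str]]], str],
                    name: str, description: str, tone: str,
                    user_utterance: str) -> str:
    """`chat` is any function mapping a message list to a reply,
    e.g. a thin wrapper around your preferred LLM endpoint."""
    return chat(build_persona_messages(name, description, tone, user_utterance))
```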
We then propose two novel concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD), for diverse voice and appearance generation. For MoV, we utilize text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the one that best matches the user-provided text description.
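A toy illustration of the MoV idea: match the user's description against short descriptions of the pre-defined TTS tones and pick the closest one. The tone catalogue and the bag-of-words similarity below are stand-ins; the paper selects the tone automatically from the text description, but the exact matching mechanism shown here is an assumption.

```python
from collections import Counter
from math import sqrt

# Illustrative catalogue of pre-defined TTS tones (names and descriptions invented).
TONE_CATALOGUE = {
    "warm_female": "gentle warm friendly calm female voice",
    "deep_male":   "deep low serious authoritative male voice",
    "cartoon_kid": "high-pitched playful energetic childlike voice",
    "elder_sage":  "slow wise old raspy storytelling voice",
}

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_tone(user_description: str) -> str:
    """Pick the pre-defined tone whose description best matches the user's
    text description (here: simple bag-of-words cosine similarity)."""
    query = Counter(user_description.lower().split())
    scores = {name: _cosine(query, Counter(desc.split()))
              for name, desc in TONE_CATALOGUE.items()}
    return max(scores, key=scores.get)

# Example: select_tone("a kind old wizard who tells stories") -> "elder_sage"
```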
For MoD, we combine recent popular text-to-image generation techniques with talking-head algorithms to streamline the process of generating talking objects. We term the whole framework ChatAnything. With it, users can animate anything with an anthropomorphic persona of their choice using just a few text inputs.
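The MoD idea can be sketched as routing the persona description to one of several pre-configured diffusion generators and then handing the resulting portrait, together with the TTS audio, to a talking-head model. Everything below is a structural sketch: the generator registry, the `TalkingHead` callable, and the keyword-based routing are placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "diffuser" here is any callable that turns a prompt into an image path.
ImageGen = Callable[[str], str]          # prompt -> portrait image path
TalkingHead = Callable[[str, str], str]  # (portrait path, audio path) -> video path

@dataclass
class DiffuserEntry:
    keywords: List[str]   # rough style cues this generator handles well
    generate: ImageGen    # e.g. a wrapped diffusion checkpoint or LoRA

def pick_diffuser(description: str, registry: Dict[str, DiffuserEntry]) -> str:
    """Route the text description to the diffuser whose style keywords
    overlap most with it (keyword matching stands in for the real routing)."""
    words = set(description.lower().split())
    return max(registry,
               key=lambda name: len(words & set(registry[name].keywords)))

def make_talking_persona(description: str, speech_audio: str,
                         registry: Dict[str, DiffuserEntry],
                         talking_head: TalkingHead) -> str:
    """End-to-end sketch: text -> portrait -> animated talking video."""
    portrait = registry[pick_diffuser(description, registry)].generate(description)
    return talking_head(portrait, speech_audio)
```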
However, we have observed that the anthropomorphic objects produced by current generative models are often undetectable by pre-trained face landmark detectors, leading to failure of face motion generation even when these faces possess human-like appearances, because such images are rarely seen during training (i.e., they are out-of-distribution (OOD) samples). To address this issue, we incorporate pixel-level guidance to infuse human face landmarks during the image generation phase.
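One plausible way to realize such pixel-level guidance is to start generation from a template human face rather than pure noise, so the output keeps a landmark-friendly facial layout; an image-to-image diffusion pass is a convenient stand-in for that idea. The checkpoint name, the strength value, and the use of `StableDiffusionImg2ImgPipeline` are assumptions for illustration, not necessarily the authors' exact scheme.

```python
# Sketch only: guide generation with a template face so the output retains a
# detectable facial structure. Requires `diffusers`, `torch`, `PIL`, and a GPU.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

def generate_landmark_friendly_portrait(prompt: str,
                                        template_face_path: str,
                                        strength: float = 0.65) -> Image.Image:
    """Blend the prompt with a template human face: lower `strength`
    preserves more of the face layout (and its landmarks), higher
    `strength` follows the text prompt more freely."""
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # illustrative checkpoint
        torch_dtype=torch.float16,
    ).to("cuda")
    template = Image.open(template_face_path).convert("RGB").resize((512, 512))
    result = pipe(prompt=prompt, image=template,
                  strength=strength, guidance_scale=7.5)
    return result.images[0]

# e.g. generate_landmark_friendly_portrait(
#     "an anthropomorphic teapot with a friendly face", "template_face.png")
```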
To benchmark this, we built an evaluation dataset. Based on it, we verify that the face landmark detection rate increases significantly from 57.0% to 92.5%, thus enabling automatic face animation driven by the generated speech content. The code and more results can be found at https://chatanything.github.io/.
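Finally, the reported detection-rate metric is straightforward to compute once a pre-trained landmark detector is fixed. The sketch below abstracts the detector behind a callable, since the specific detector used in the paper is not restated here.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional, Sequence, Tuple

Landmarks = Sequence[Tuple[float, float]]
# Any pre-trained detector wrapped so it returns landmarks, or None on failure.
Detector = Callable[[str], Optional[Landmarks]]

def landmark_detection_rate(image_paths: Iterable[str], detect: Detector) -> float:
    """Fraction of generated portraits on which the face landmark detector
    succeeds (the 57.0% -> 92.5% figures in the abstract are rates of this
    kind, measured on the authors' evaluation dataset)."""
    paths = list(image_paths)
    hits = sum(1 for p in paths if detect(p) is not None)
    return hits / len(paths) if paths else 0.0

# e.g. landmark_detection_rate(map(str, Path("eval_set").glob("*.png")), my_detector)
```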