ChatAnything: Facetime Chat with LLM-Enhanced Personas
November 12, 2023
Authors: Yilin Zhao, Xinbin Yuan, Shanghua Gao, Zhijie Lin, Qibin Hou, Jiashi Feng, Daquan Zhou
cs.AI
Abstract
In this technical report, we target generating anthropomorphized personas for LLM-based characters in an online manner, including visual appearance, personality, and tone, using only text descriptions. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts.
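The persona-prompting step can be pictured roughly as follows. This is a minimal sketch, not the authors' actual prompts: the persona template, the `chat` callable, and the field names are illustrative placeholders for any chat-style LLM interface.

```python
from typing import Callable, List, Dict

# Hypothetical persona template; the real system prompts are carefully
# engineered by the authors and are not reproduced here.
PERSONA_SYSTEM_TEMPLATE = (
    "You are {name}, {description}. "
    "Stay in character, answer in the first person, and keep the tone {tone}."
)

def build_persona_messages(name: str, description: str, tone: str,
                           user_utterance: str) -> List[Dict[str, str]]:
    """Assemble a chat history that conditions the LLM on the persona
    via a system prompt (in-context learning, no fine-tuning)."""
    return [
        {"role": "system",
         "content": PERSONA_SYSTEM_TEMPLATE.format(
             name=name, description=description, tone=tone)},
        {"role": "user", "content": user_utterance},
    ]

def chat_as_persona(chat: Callable[[List[Dict[str, str]]], str],
                    name: str, description: str, tone: str,
                    user_utterance: str) -> str:
    """`chat` is any function mapping a message list to a reply,
    e.g. a thin wrapper around your preferred LLM endpoint."""
    return chat(build_persona_messages(name, description, tone, user_utterance))
```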
We then propose two novel concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD), for diverse voice and appearance generation. For MoV, we utilize text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the one that best matches the user-provided text description.
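A toy illustration of the MoV idea: match the user's description against short descriptions of the pre-defined TTS tones and pick the closest one. The tone catalogue and the bag-of-words similarity below are stand-ins; the paper selects the tone automatically from the text description, but the exact matching mechanism shown here is an assumption.

```python
from collections import Counter
from math import sqrt

# Illustrative catalogue of pre-defined TTS tones (names and descriptions invented).
TONE_CATALOGUE = {
    "warm_female": "gentle warm friendly calm female voice",
    "deep_male":   "deep low serious authoritative male voice",
    "cartoon_kid": "high-pitched playful energetic childlike voice",
    "elder_sage":  "slow wise old raspy storytelling voice",
}

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_tone(user_description: str) -> str:
    """Pick the pre-defined tone whose description best matches the user's
    text description (here: simple bag-of-words cosine similarity)."""
    query = Counter(user_description.lower().split())
    scores = {name: _cosine(query, Counter(desc.split()))
              for name, desc in TONE_CATALOGUE.items()}
    return max(scores, key=scores.get)

# Example: select_tone("a kind old wizard who tells stories") -> "elder_sage"
```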
For MoD, we combine recent popular text-to-image generation techniques with talking-head algorithms to streamline the process of generating talking objects. We term the whole framework ChatAnything. With it, users can animate anything with an anthropomorphic persona of their choice using just a few text inputs.
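The MoD idea can be sketched as routing the persona description to one of several pre-configured diffusion generators and then handing the resulting portrait, together with the TTS audio, to a talking-head model. Everything below is a structural sketch: the generator registry, the `TalkingHead` callable, and the keyword-based routing are placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "diffuser" here is any callable that turns a prompt into an image path.
ImageGen = Callable[[str], str]          # prompt -> portrait image path
TalkingHead = Callable[[str, str], str]  # (portrait path, audio path) -> video path

@dataclass
class DiffuserEntry:
    keywords: List[str]   # rough style cues this generator handles well
    generate: ImageGen    # e.g. a wrapped diffusion checkpoint or LoRA

def pick_diffuser(description: str, registry: Dict[str, DiffuserEntry]) -> str:
    """Route the text description to the diffuser whose style keywords
    overlap most with it (keyword matching stands in for the real routing)."""
    words = set(description.lower().split())
    return max(registry,
               key=lambda name: len(words & set(registry[name].keywords)))

def make_talking_persona(description: str, speech_audio: str,
                         registry: Dict[str, DiffuserEntry],
                         talking_head: TalkingHead) -> str:
    """End-to-end sketch: text -> portrait -> animated talking video."""
    portrait = registry[pick_diffuser(description, registry)].generate(description)
    return talking_head(portrait, speech_audio)
```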
However, we have observed that the anthropomorphic objects produced by current generative models are often undetectable by pre-trained face landmark detectors, leading to failure of face motion generation even when these faces possess human-like appearances, because such images are rarely seen during training (i.e., they are out-of-distribution (OOD) samples). To address this issue, we incorporate pixel-level guidance to infuse human face landmarks during the image generation phase.
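One plausible way to realize such pixel-level guidance is to start generation from a template human face rather than pure noise, so the output keeps a landmark-friendly facial layout; an image-to-image diffusion pass is a convenient stand-in for that idea. The checkpoint name, the strength value, and the use of `StableDiffusionImg2ImgPipeline` are assumptions for illustration, not necessarily the authors' exact scheme.

```python
# Sketch only: guide generation with a template face so the output retains a
# detectable facial structure. Requires `diffusers`, `torch`, `PIL`, and a GPU.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

def generate_landmark_friendly_portrait(prompt: str,
                                        template_face_path: str,
                                        strength: float = 0.65) -> Image.Image:
    """Blend the prompt with a template human face: lower `strength`
    preserves more of the face layout (and its landmarks), higher
    `strength` follows the text prompt more freely."""
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # illustrative checkpoint
        torch_dtype=torch.float16,
    ).to("cuda")
    template = Image.open(template_face_path).convert("RGB").resize((512, 512))
    result = pipe(prompt=prompt, image=template,
                  strength=strength, guidance_scale=7.5)
    return result.images[0]

# e.g. generate_landmark_friendly_portrait(
#     "an anthropomorphic teapot with a friendly face", "template_face.png")
```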
To benchmark this, we built an evaluation dataset. Based on it, we verify that the face landmark detection rate increases significantly from 57.0% to 92.5%, thus enabling automatic face animation driven by the generated speech content. The code and more results can be found at https://chatanything.github.io/.
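Finally, the reported detection-rate metric is straightforward to compute once a pre-trained landmark detector is fixed. The sketch below abstracts the detector behind a callable, since the specific detector used in the paper is not restated here.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional, Sequence, Tuple

Landmarks = Sequence[Tuple[float, float]]
# Any pre-trained detector wrapped so it returns landmarks, or None on failure.
Detector = Callable[[str], Optional[Landmarks]]

def landmark_detection_rate(image_paths: Iterable[str], detect: Detector) -> float:
    """Fraction of generated portraits on which the face landmark detector
    succeeds (the 57.0% -> 92.5% figures in the abstract are rates of this
    kind, measured on the authors' evaluation dataset)."""
    paths = list(image_paths)
    hits = sum(1 for p in paths if detect(p) is not None)
    return hits / len(paths) if paths else 0.0

# e.g. landmark_detection_rate(map(str, Path("eval_set").glob("*.png")), my_detector)
```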