ChatAnything: Facetime Chat with LLM-Enhanced Personas
November 12, 2023
Authors: Yilin Zhao, Xinbin Yuan, Shanghua Gao, Zhijie Lin, Qibin Hou, Jiashi Feng, Daquan Zhou
cs.AI
Abstract
In this technical report, we aim to generate anthropomorphized personas for LLM-based characters in an online manner, including visual appearance, personality, and tone, from only text descriptions. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts.
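The report does not reproduce these system prompts. Below is a minimal sketch of what personality generation through a system prompt could look like; the prompt wording, the gpt-4o-mini model name, and the OpenAI client usage are assumptions for illustration, not the authors' actual setup.

```python
# Hedged sketch of in-context personality generation via a system prompt.
# The prompt text, model name, and client usage are assumptions; the paper only
# states that a set of system prompts is carefully designed.
from openai import OpenAI

SYSTEM_PROMPT = (
    "You are about to role-play an anthropomorphized object. Given the user's "
    "short description, stay in character at all times: adopt a consistent "
    "personality, speaking style, and backstory that fit the description, and "
    "answer every message in that persona."
)

def build_persona_messages(user_description: str) -> list[dict]:
    """Assemble the chat messages that condition the LLM on the persona."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Persona description: {user_description}"},
    ]

if __name__ == "__main__":
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    messages = build_persona_messages("a grumpy but caring old coffee machine")
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(reply.choices[0].message.content)
```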
We then propose two novel concepts for diverse voice and appearance generation: the mixture of voices (MoV) and the mixture of diffusers (MoD). For MoV, we use text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the one that best matches the user-provided text description. For MoD, we combine recent popular text-to-image generation techniques with talking-head algorithms to streamline the process of generating talking objects. We term the whole framework ChatAnything. With it, users can give anything an anthropomorphic persona using just a few text inputs.
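How the MoV tone matching works internally is not spelled out in the abstract. The snippet below is a minimal sketch that scores each pre-defined TTS tone by token overlap between its short textual description and the user's description; the tone names, descriptions, and the overlap score are hypothetical stand-ins for the LLM-based matching.

```python
# Hedged sketch of MoV-style tone selection: pick the pre-defined TTS tone whose
# description best matches the user's text. The tone pool and the toy Jaccard
# score are illustrative assumptions, not the paper's actual matching procedure.
from dataclasses import dataclass

@dataclass
class Tone:
    name: str         # identifier handed to the TTS backend (hypothetical)
    description: str  # short textual description used for matching

# Hypothetical pool of pre-defined tones.
TONES = [
    Tone("calm_male", "calm, deep, slow male narrator voice"),
    Tone("cheerful_kid", "high-pitched, energetic, playful child voice"),
    Tone("elder_sage", "raspy, wise, measured elderly voice"),
]

def _tokens(text: str) -> set[str]:
    return {w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")}

def select_tone(user_description: str, tones: list[Tone]) -> Tone:
    """Return the tone whose description overlaps most with the user's text."""
    query = _tokens(user_description)

    def score(tone: Tone) -> float:
        cand = _tokens(tone.description)
        return len(query & cand) / max(len(query | cand), 1)

    return max(tones, key=score)

if __name__ == "__main__":
    persona = "an ancient, wise talking oak tree with a slow, raspy voice"
    print(select_tone(persona, TONES).name)  # -> elder_sage
```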
However, we have observed that the anthropomorphic objects produced by current generative models often cannot be detected by pre-trained face landmark detectors, causing face motion generation to fail even when the faces look human-like, because such images are rarely seen during training (i.e., they are out-of-distribution samples). To address this issue, we incorporate pixel-level guidance that infuses human face landmarks into the image generation phase.
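The abstract does not detail the guidance mechanism. One plausible way to realize pixel-level face guidance is to seed an img2img diffusion run with a template human face so that the generated object keeps a facial layout that landmark detectors can find. The sketch below assumes the Hugging Face diffusers library; the model id, template image path, prompt, and strength value are illustrative choices, not the paper's settings.

```python
# Hedged sketch: bias generation toward landmark-detectable faces by seeding an
# img2img diffusion run with a template human face. This illustrates pixel-level
# guidance in general; it is not the paper's exact implementation.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Template face providing the pixel-level facial layout (hypothetical asset).
template_face = (
    Image.open("assets/template_face.png").convert("RGB").resize((512, 512))
)

result = pipe(
    prompt="an anthropomorphic teapot with a friendly human-like face, portrait",
    image=template_face,  # pixel-level prior: keeps the eyes/nose/mouth layout
    strength=0.6,         # lower strength preserves more of the face template
    guidance_scale=7.5,
).images[0]
result.save("teapot_persona.png")
```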
To benchmark this, we built an evaluation dataset. On it, we verify that the face landmark detection rate increases significantly from 57.0% to 92.5%, enabling automatic face animation driven by the generated speech content. The code and more results can be found at https://chatanything.github.io/.
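For reference, the reported detection-rate metric is straightforward to reproduce: run a pre-trained face landmark detector over the generated images and count the fraction with at least one detected face. The sketch below uses MediaPipe FaceMesh as a stand-in detector; the paper does not specify which detector it relies on, and the image directory is hypothetical.

```python
# Hedged sketch of the landmark detection-rate metric: the fraction of generated
# images in which a pre-trained landmark detector finds a face. MediaPipe
# FaceMesh is used here only as a stand-in detector.
from pathlib import Path

import cv2
import mediapipe as mp

def landmark_detection_rate(image_dir: str) -> float:
    images = sorted(Path(image_dir).glob("*.png"))
    detected = 0
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True) as face_mesh:
        for path in images:
            bgr = cv2.imread(str(path))
            if bgr is None:
                continue
            result = face_mesh.process(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
            if result.multi_face_landmarks:  # at least one face with landmarks
                detected += 1
    return detected / max(len(images), 1)

if __name__ == "__main__":
    rate = landmark_detection_rate("outputs/generated_personas")  # hypothetical
    print(f"landmark detection rate: {rate:.1%}")
```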