
ChatAnything: Facetime Chat with LLM-Enhanced Personas

November 12, 2023
Authors: Yilin Zhao, Xinbin Yuan, Shanghua Gao, Zhijie Lin, Qibin Hou, Jiashi Feng, Daquan Zhou
cs.AI

Abstract

In this technical report, we aim to generate anthropomorphized personas for LLM-based characters in an online manner, including visual appearance, personality, and tone, from text descriptions alone. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts. We then propose two novel concepts for diverse voice and appearance generation: the mixture of voices (MoV) and the mixture of diffusers (MoD). For MoV, we use text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the tone that best matches the user-provided text description. For MoD, we combine recent text-to-image generation techniques with talking head algorithms to streamline the process of generating talking objects. We term the whole framework ChatAnything. With it, users can animate anything with an anthropomorphic persona using just a few text inputs. However, we have observed that the anthropomorphic objects produced by current generative models are often undetectable by pre-trained face landmark detectors, which causes face motion generation to fail even when the faces look human-like, because such images are rarely seen during the detectors' training (i.e., they are out-of-distribution (OOD) samples). To address this issue, we incorporate pixel-level guidance to infuse human face landmarks during the image generation phase. To measure the effect, we have built an evaluation dataset. On this dataset, we verify that the face landmark detection rate increases significantly, from 57.0% to 92.5%, thus allowing automatic face animation based on generated speech content. The code and more results can be found at https://chatanything.github.io/.
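The MoV selection step (picking the pre-defined tone that best matches a text description) can be illustrated with a minimal sketch. The abstract does not say how the match is computed, so everything below is an assumption made for illustration: the voice catalog, the names `VOICES` and `select_voice`, and the token-overlap (Jaccard) score, which merely stands in for whatever matching the framework actually uses.

```python
import re

# Hypothetical voice catalog: the names and tone descriptions are
# illustrative only and do not come from the ChatAnything report.
VOICES = {
    "deep_male": "deep calm low male voice",
    "bright_female": "bright cheerful young female voice",
    "gentle_elder": "gentle slow warm elderly voice",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_voice(description, voices=VOICES):
    """Return the name of the voice whose tone description has the
    highest Jaccard overlap with the persona description.

    Jaccard overlap is an assumed stand-in for the real matching step.
    """
    query = tokenize(description)

    def score(item):
        tokens = tokenize(item[1])
        union = query | tokens
        return len(query & tokens) / len(union) if union else 0.0

    return max(voices.items(), key=score)[0]


if __name__ == "__main__":
    print(select_voice("a cheerful young girl with a bright smile"))
```

A production system would more plausibly ask an LLM or compare text embeddings rather than count shared words; the sketch only shows the shape of the step, which is to score every pre-defined tone against the description and keep the best one.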