ChatAnything: Facetime Chat with LLM-Enhanced Personas

Abstract

In this technical report, we aim to generate anthropomorphized personas for LLM-based characters in an online manner, covering visual appearance, personality, and tone, from text descriptions alone. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts. We then propose two novel concepts, the mixture of voices (MoV) and the mixture of diffusers (MoD), for diverse voice and appearance generation. For MoV, we use text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the tone that best matches the user-provided text description. For MoD, we combine recent popular text-to-image generation techniques with talking head algorithms to streamline the process of generating talking objects. We term the whole framework ChatAnything. With it, users can animate anything with an anthropomorphic persona using just a few text inputs. However, we have observed that the anthropomorphic objects produced by current generative models are often undetectable by pre-trained face landmark detectors, which causes face motion generation to fail even when these faces have a human-like appearance. The reason is that such images are rarely seen during training (i.e., they are out-of-distribution (OOD) samples). To address this issue, we incorporate pixel-level guidance to infuse human face landmarks during the image generation phase. To measure the effect, we have built an evaluation dataset. On this dataset, we verify that the face landmark detection rate increases significantly, from 57.0% to 92.5%, thus allowing automatic face animation based on the generated speech content. The code and more results can be found at https://chatanything.github.io/.
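The tone-selection step of MoV can be illustrated with a minimal sketch. The abstract does not say how the best-matching tone is chosen, so the word-overlap scoring below is only a stand-in assumption, and the tone names and descriptions in `VOICE_TONES` are hypothetical rather than taken from the ChatAnything code.

```python
import re

# Hypothetical catalogue of pre-defined TTS tones (illustrative names and
# descriptions only; not the tones shipped with ChatAnything).
VOICE_TONES = {
    "warm_female": "warm gentle soft friendly female voice",
    "deep_male": "deep calm serious mature male voice",
    "playful_child": "playful bright cheerful energetic child voice",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tone(description, tones=VOICE_TONES):
    """Return the name of the tone whose description shares the most words
    with the user-provided persona description.

    Assumption: word overlap stands in for whatever matching the real
    system performs. Ties resolve to the first tone in insertion order.
    """
    words = tokenize(description)
    return max(tones, key=lambda name: len(words & tokenize(tones[name])))


if __name__ == "__main__":
    print(select_tone("a cheerful and playful talking apple"))
    print(select_tone("a serious old wizard with a deep voice"))
```

In a fuller system the overlap score could be replaced by an LLM query or an embedding similarity without changing the surrounding interface: a description goes in and the name of one pre-defined tone comes out.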
