ChatAnything: Facetime Chat with LLM-Enhanced Personas

Abstract

In this technical report, we aim to generate anthropomorphized personas for LLM-based characters in an online manner, including visual appearance, personality, and tone, from text descriptions alone. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts. We then propose two novel concepts for diverse voice and appearance generation: the mixture of voices (MoV) and the mixture of diffusers (MoD). For MoV, we use text-to-speech (TTS) algorithms with a variety of pre-defined tones and automatically select the best-matching tone based on the user-provided text description. For MoD, we combine recent, popular text-to-image generation techniques with talking head algorithms to streamline the process of generating talking objects. We call the whole framework ChatAnything. With it, users can animate anything with any anthropomorphic persona using just a few text inputs. However, we have observed that the anthropomorphic objects produced by current generative models are often undetectable by pre-trained face landmark detectors, which causes face motion generation to fail even when the faces look human-like, because such images are rarely seen during the detectors' training (i.e., they are out-of-distribution (OOD) samples). To address this issue, we incorporate pixel-level guidance to infuse human face landmarks during the image generation phase. To benchmark the effect, we have built an evaluation dataset. On it, we verify that the face landmark detection rate increases significantly, from 57.0% to 92.5%, thus allowing automatic face animation based on generated speech content. The code and more results can be found at https://chatanything.github.io/.
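As an illustration of the MoV selection step, the following is a minimal sketch of picking the best-matching pre-defined tone for a user-provided text description. The abstract only states that the selection is automatic; the word-overlap (Jaccard) score used here is a stand-in assumption, not the method of the report, and the voice names, their descriptions, and the function `select_voice` are all hypothetical.

```python
import re

# Hypothetical catalog of pre-defined tones. The names and descriptions
# are invented for illustration and do not come from the report.
VOICES = {
    "warm_elder": "calm warm old wise gentle grandfather",
    "bright_child": "cheerful young playful energetic child",
    "deep_narrator": "deep serious dramatic strong narrator",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_voice(description, voices=VOICES):
    """Return the name of the voice whose description best overlaps
    the user description, scored by Jaccard similarity of word sets.

    Assumption: word overlap stands in for whatever matching the real
    system uses; ties resolve to the first entry in the catalog.
    """
    query = tokenize(description)

    def score(item):
        _, voice_description = item
        words = tokenize(voice_description)
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    return max(voices.items(), key=score)[0]
```

For example, `select_voice("a playful and cheerful young apple")` picks `bright_child` from this toy catalog, since it is the only entry sharing words with the description. A real system would likely need a richer matching signal than literal word overlap.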
