TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models

Abstract

We present TalkingMachines, an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions are: (1) we adapt a pretrained state-of-the-art (SOTA) image-to-video DiT into an audio-driven avatar generation model with 18 billion parameters; (2) we enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) we design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations, including: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, and (c) elimination of redundant recomputations to maximize frame-generation throughput. Demo videos are available at https://aaxwaz.github.io/TalkingMachines/
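To make the phrase "sparse causal" in contribution (2) more concrete, the following is a minimal sketch of one possible chunk-level attention mask. The abstract does not specify the actual sparsity pattern, so everything here is an assumption for illustration: the function name `sparse_causal_mask`, the sliding `window` of recent chunks, and the choice to always keep the first chunk visible as an anchor are hypothetical and not taken from the paper.

```python
def sparse_causal_mask(num_chunks: int, window: int, keep_first: bool = True) -> list[list[bool]]:
    """Build a chunk-level attention mask (illustrative assumption, not the paper's pattern).

    mask[q][k] is True when query chunk q may attend to key chunk k.
    - Causal: a chunk never attends to later chunks (k <= q).
    - Sparse: a chunk attends only to the most recent `window` chunks,
      plus (optionally) chunk 0 as a fixed anchor.
    """
    if num_chunks < 1 or window < 1:
        raise ValueError("num_chunks and window must be positive")

    mask = []
    for q in range(num_chunks):
        row = []
        for k in range(num_chunks):
            causal = k <= q                   # no attention to the future
            recent = q - k < window           # inside the sliding window
            anchor = keep_first and k == 0    # hypothetical always-visible first chunk
            row.append(causal and (recent or anchor))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the pattern: 'x' = attention allowed, '.' = masked out.
    for row in sparse_causal_mask(num_chunks=5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, the number of keys each chunk attends to is bounded by `window + 1` regardless of stream length, which is the general property that makes open-ended streaming tractable; the pattern actually used by TalkingMachines may differ.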
