

TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models

June 3, 2025
Authors: Chetwin Low, Weimin Wang
cs.AI

Abstract

In this paper, we present TalkingMachines, an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions include: (1) we adapt a pretrained SOTA image-to-video DiT into an 18-billion-parameter audio-driven avatar generation model; (2) we enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) we design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations, including (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, and (c) elimination of redundant recomputations to maximize frame-generation throughput. Demo videos are available at https://aaxwaz.github.io/TalkingMachines/.
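Optimizations (a) and (b) in contribution (3) follow a standard producer-consumer pattern: the DiT produces latent chunks on one GPU while the VAE decoder consumes the previous chunk on another, with the device-to-device copy issued on a side CUDA stream so it overlaps with the next DiT step. The sketch below is a minimal illustration of that structure, not the paper's implementation: the real 18B DiT and VAE decoder are replaced by tiny stand-in linear layers, and the chunked-latent loop is invented for demonstration.

```python
import torch

# Stand-ins for the real stages: the actual DiT (~18B params) and VAE
# decoder are not public, so tiny linear layers play their roles here.
dit = torch.nn.Linear(256, 256, device="cuda:0")          # "DiT" on GPU 0
vae_decoder = torch.nn.Linear(256, 256, device="cuda:1")  # "VAE" on GPU 1

compute0 = torch.cuda.current_stream("cuda:0")    # DiT compute stream
copy_stream = torch.cuda.Stream(device="cuda:0")  # side stream for transfers
pending = None  # (copy-done event, latent chunk already on GPU 1)

# Fake chunked latents standing in for audio-conditioned generation.
latent_chunks = [torch.randn(16, 256, device="cuda:0") for _ in range(8)]

with torch.no_grad():
    for chunk in latent_chunks:
        # 1) "Generate" the next latent chunk on GPU 0 (default stream).
        latent = dit(chunk)

        # 2) Issue the GPU0 -> GPU1 copy on the side stream so it overlaps
        #    with the next DiT step instead of serializing behind it.
        copy_stream.wait_stream(compute0)       # copy must see the new latent
        with torch.cuda.stream(copy_stream):
            latent_on_dec = latent.to("cuda:1", non_blocking=True)
            done = torch.cuda.Event()
            done.record(copy_stream)
        latent.record_stream(copy_stream)       # don't free before copy ends

        # 3) Decode the *previous* chunk on GPU 1 while GPU 0 moves on.
        if pending is not None:
            prev_done, prev_latent = pending
            torch.cuda.current_stream("cuda:1").wait_event(prev_done)
            frames = vae_decoder(prev_latent)   # would stream to the client
        pending = (done, latent_on_dec)

    # Flush the final chunk.
    prev_done, prev_latent = pending
    torch.cuda.current_stream("cuda:1").wait_event(prev_done)
    frames = vae_decoder(prev_latent)
```

With both stages pipelined this way, steady-state throughput is bounded by the slower of the two stages rather than their sum, which is the point of disaggregating the DiT and VAE decoder onto separate devices.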
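Contribution (2) distills the bidirectional teacher into a sparse causal student so that each new chunk of frames attends bidirectionally within itself but only to a bounded window of past chunks; bounding this dependence on history is what allows streaming indefinitely without error accumulation. The abstract does not spell out the exact sparsity pattern, so the sketch below shows one plausible chunk-level attention mask, with frames_per_chunk and window as purely illustrative parameters.

```python
import torch

def sparse_block_causal_mask(num_chunks: int, frames_per_chunk: int,
                             window: int = 2) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a chunked,
    sparse-causal video student. Each chunk attends bidirectionally
    within itself and causally to at most `window` preceding chunks.
    The chunking scheme and window size are assumptions for
    illustration, not the paper's exact sparsity pattern."""
    n = num_chunks * frames_per_chunk
    mask = torch.zeros(n, n, dtype=torch.bool)
    for q in range(num_chunks):
        for k in range(max(0, q - window), q + 1):
            mask[q * frames_per_chunk:(q + 1) * frames_per_chunk,
                 k * frames_per_chunk:(k + 1) * frames_per_chunk] = True
    return mask

# Example: 4 chunks of 3 frames, each attending to 1 previous chunk.
print(sparse_block_causal_mask(4, 3, window=1).int())
```

Because a chunk never attends past its window, old frames can be dropped from the KV cache as generation proceeds, keeping per-step cost constant over an unbounded stream.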