TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models

June 3, 2025
Authors: Chetwin Low, Weimin Wang
cs.AI

Abstract

In this paper, we present TalkingMachines -- an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions include: (1) we adapt a pretrained state-of-the-art image-to-video DiT into an 18-billion-parameter audio-driven avatar generation model; (2) we enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) we design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations, such as: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, and (c) elimination of redundant recomputations to maximize frame-generation throughput. Demo videos are available at https://aaxwaz.github.io/TalkingMachines/.
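Contribution (2) rests on the student model attending only to the past, so that frames can be generated autoregressively without the error accumulation a bidirectional model would incur when run in streaming mode. The abstract does not specify the sparsity pattern; the sketch below assumes frames are grouped into chunks, with each chunk attending bidirectionally to itself, to the first chunk (a long-range anchor), and to its k most recent predecessors. That chunking and sink-plus-recent pattern are illustrative assumptions, not the paper's stated design. A minimal PyTorch sketch:

import torch

def sparse_causal_mask(num_chunks: int, tokens_per_chunk: int, k_recent: int = 2) -> torch.Tensor:
    """Boolean attention mask (True = may attend). Each chunk of frames sees
    itself, the first chunk, and its k_recent predecessors; all future chunks
    are masked out, which is what makes streaming generation causal."""
    n = num_chunks * tokens_per_chunk
    mask = torch.zeros(n, n, dtype=torch.bool)
    for q in range(num_chunks):
        # Hypothetical sparsity pattern: first chunk + k most recent chunks + self.
        visible = {0, q} | set(range(max(0, q - k_recent), q))
        for kv in visible:
            mask[q * tokens_per_chunk:(q + 1) * tokens_per_chunk,
                 kv * tokens_per_chunk:(kv + 1) * tokens_per_chunk] = True
    return mask

# Example: 4 chunks of 3 tokens each; the result can be passed as attn_mask to
# torch.nn.functional.scaled_dot_product_attention (True = keep).
print(sparse_causal_mask(4, 3, k_recent=1).int())

A mask of this shape keeps per-step attention cost roughly constant as the stream grows, which is the property an "infinite video streaming" claim depends on.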
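For contribution (3), points (a) and (b) amount to a producer-consumer split: the DiT denoises latents on one GPU while the VAE decoder turns latents into frames on another, with the latent transfer issued on a side CUDA stream so it overlaps the next denoising step. The sketch below illustrates that pattern only; dit_step and vae_decode are hypothetical placeholders, and the real pipeline is certainly more involved.

import torch

dit_dev = torch.device("cuda:0")   # DiT lives here
vae_dev = torch.device("cuda:1")   # VAE decoder lives here
copy_stream = torch.cuda.Stream(device=dit_dev)  # side stream for latent transfers

def dit_step(latent):
    # Placeholder for one denoising step of the real 18B DiT.
    return latent * 0.5

def vae_decode(latent):
    # Placeholder for the real VAE decoder (latent -> RGB frame).
    return latent.tanh()

latent = torch.randn(1, 4, 64, 64, device=dit_dev)
frames = []
for _ in range(8):
    latent = dit_step(latent)  # enqueued on cuda:0's default stream
    copy_stream.wait_stream(torch.cuda.current_stream(dit_dev))  # copy only after the step finishes
    with torch.cuda.stream(copy_stream):
        latent.record_stream(copy_stream)  # keep the buffer alive while the async copy is in flight
        latent_on_vae = latent.to(vae_dev, non_blocking=True)
    # cuda:0's default stream is free here, so the next dit_step overlaps the transfer.
    torch.cuda.current_stream(vae_dev).wait_stream(copy_stream)  # decode only after the copy lands
    frames.append(vae_decode(latent_on_vae))

The explicit wait_stream calls are what keep the overlap safe: the copy waits for the denoising step, and the decoder waits for the copy, while nothing forces the DiT to wait on either.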