Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

December 4, 2025
Authors: Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi
cs.AI

Abstract

Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
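The core idea behind Timestep-forcing Pipeline Parallelism, as described above, is that each GPU is pinned to one denoising step, so consecutive audio-driven chunks occupy different pipeline stages simultaneously instead of waiting on a fully sequential denoising loop. The following is a minimal illustrative sketch of that scheduling pattern only; the stage count, the `denoise_step` placeholder, and the string-based "chunks" are all hypothetical stand-ins for the paper's actual model and GPU tensors.

```python
from collections import deque

NUM_STAGES = 5  # hypothetical: one stage per denoising timestep, mirroring the 5-GPU setup


def denoise_step(chunk, stage):
    # Placeholder for one diffusion denoising step on one GPU;
    # here we just tag the chunk with the stage it passed through.
    return f"{chunk}->s{stage}"


def pipeline_generate(chunks, num_stages=NUM_STAGES):
    """Toy timestep-forcing pipeline: stage i always applies denoising
    step i, so several chunks are in flight at once and a finished
    chunk is emitted every tick once the pipeline is full."""
    stages = [None] * num_stages  # chunk currently held by each stage
    out = []
    pending = deque(chunks)
    # Run until all chunks have drained out of the pipeline.
    while pending or any(s is not None for s in stages):
        # A chunk leaving the last stage is fully denoised and streamed out.
        if stages[-1] is not None:
            out.append(stages[-1])
        # Shift chunks forward one stage per tick, applying that stage's step.
        for i in range(num_stages - 1, 0, -1):
            stages[i] = denoise_step(stages[i - 1], i) if stages[i - 1] is not None else None
        # A new chunk enters stage 0 whenever one is waiting.
        stages[0] = denoise_step(pending.popleft(), 0) if pending else None
    return out
```

With two stages, `pipeline_generate(["a", "b"], 2)` yields `["a->s0->s1", "b->s0->s1"]`: each chunk traverses every denoising step in order, while the second chunk starts before the first has finished, which is what breaks the autoregressive bottleneck.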