

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

December 4, 2025
作者: Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi
cs.AI

Abstract

Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
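To make the throughput argument behind Timestep-forcing Pipeline Parallelism concrete, the sketch below is a minimal, hypothetical timing model (not the paper's implementation): each denoising step is assigned to its own device, so after a short warm-up the pipeline emits one finished chunk per step interval, instead of one chunk per full multi-step pass. The function names, chunk/step counts, and the 50 ms step time are illustrative assumptions only.

```python
# Hypothetical timing model for pipelining denoising steps across devices.
# Not the paper's implementation; an illustration of why step-level
# pipelining breaks the sequential-denoising bottleneck.

def sequential_schedule(num_chunks: int, num_steps: int, step_time: float) -> float:
    """Total time when every chunk runs all denoising steps on one device."""
    return num_chunks * num_steps * step_time

def pipelined_schedule(num_chunks: int, num_steps: int, step_time: float) -> float:
    """Total time with one device per denoising step: after a
    (num_steps - 1)-stage warm-up, one finished chunk emerges per step_time."""
    return (num_steps + num_chunks - 1) * step_time

if __name__ == "__main__":
    # Assumed numbers for illustration: 100 chunks, 5 denoising steps,
    # 50 ms per step per chunk.
    chunks, steps, t = 100, 5, 0.05
    seq = sequential_schedule(chunks, steps, t)
    pipe = pipelined_schedule(chunks, steps, t)
    print(f"sequential: {seq:.2f}s  pipelined: {pipe:.2f}s  "
          f"speedup ~{seq / pipe:.2f}x")
```

In steady state the speedup approaches the number of pipelined steps, which is the intuition behind dedicating one GPU per denoising step for stable, low-latency streaming.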