
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

December 29, 2025
Authors: Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu
cs.AI

Abstract

Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process prevents diffusion models from supporting real-time interaction. While existing distillation methods can make the model autoregressive and reduce the number of sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving human-AI interaction unnatural and inefficient. This paper targets real-time interactive video diffusion conditioned on multimodal context, including text, image, and audio, to bridge this gap. Observing that the leading on-policy distillation approach, Self Forcing, encounters challenges with multimodal conditioning (visual artifacts such as flickering, black frames, and overall quality degradation), we investigate an improved distillation recipe that emphasizes the quality of the condition inputs as well as the initialization and schedule of the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation, including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of full-step, bidirectional baselines of similar or larger size at 20x lower inference cost and latency. We further integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows that LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality while reducing response latency from 1-2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
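To make the abstract's pipeline concrete, the sketch below illustrates the kind of streaming loop it describes: an autoregressive, few-step distilled video diffusion model conditioned on text, image, and per-chunk audio, with an "anchor-heavy" cache that pins identity frames while evicting older context. This is a minimal illustrative sketch, not the authors' implementation; all module names, shapes, and constants (e.g., `denoise_chunk`, `CHUNK_FRAMES`, `NUM_ANCHOR_FRAMES`) are hypothetical placeholders.

```python
# Hypothetical sketch of a streaming, multimodal-conditioned, few-step
# autoregressive video diffusion loop. Names and numbers are illustrative.
import torch

CHUNK_FRAMES = 4        # latent frames denoised together per autoregressive step
NUM_DENOISE_STEPS = 4   # few-step sampling after distillation (vs. many full steps)
MAX_CACHE_FRAMES = 32   # rolling causal-attention context window
NUM_ANCHOR_FRAMES = 2   # identity "sink" frames that are never evicted


def denoise_chunk(noisy, cond, context, step):
    """Placeholder for one pass of the distilled few-step generator.
    A real model would attend causally over `context` and the multimodal `cond`."""
    return noisy * 0.5  # dummy update standing in for the learned denoiser


def generate_stream(text_emb, image_emb, audio_stream, num_chunks=8):
    cond = {"text": text_emb, "image": image_emb}
    anchors, cache, video = [], [], []

    for chunk_idx, audio_emb in zip(range(num_chunks), audio_stream):
        cond["audio"] = audio_emb                        # per-chunk audio condition
        frames = torch.randn(CHUNK_FRAMES, 16, 16)       # start each chunk from noise

        for step in range(NUM_DENOISE_STEPS):            # few denoising steps per chunk
            frames = denoise_chunk(frames, cond, anchors + cache, step)

        if chunk_idx == 0:
            anchors = list(frames[:NUM_ANCHOR_FRAMES])   # pin identity frames as sinks
        cache.extend(frames)
        cache = cache[-(MAX_CACHE_FRAMES - len(anchors)):]  # evict oldest non-anchor frames
        video.append(frames)                             # stream this chunk to the user

    return torch.cat(video)


if __name__ == "__main__":
    audio = (torch.randn(1, 128) for _ in range(8))      # stand-in audio features per chunk
    clip = generate_stream(torch.randn(1, 512), torch.randn(1, 512), audio)
    print(clip.shape)  # (num_chunks * CHUNK_FRAMES, 16, 16)
```

The key property this loop mirrors is that latency is bounded by a single chunk's few denoising steps rather than by full-clip bidirectional denoising, while the pinned anchor frames keep the avatar's identity stable across turns.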