LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
December 29, 2025
Authors: Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu
cs.AI
Abstract
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving human-AI interaction unnatural and inefficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts such as flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with an emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of full-step, bidirectional baselines of similar or larger size at 20x lower inference cost and latency. Further, we integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1-2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
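To make the abstract's pipeline concrete, the following is a minimal sketch of a few-step autoregressive rollout for a distilled video diffusion model with multimodal conditioning and a sink-preserving KV cache in the spirit of Anchor-Heavy Identity Sinks. It is an illustration only, not the paper's implementation: all names (generator, text_emb, ref_image_emb, audio_feats, CHUNK_FRAMES, DENOISE_STEPS, NUM_SINK_CHUNKS) and the generator's call signature are assumed for the example.

```python
import torch

# Hypothetical rollout loop for a distilled, autoregressive video diffusion model.
# Module and tensor names are illustrative placeholders, not the paper's actual API.

CHUNK_FRAMES = 4      # frames generated per autoregressive step
DENOISE_STEPS = 4     # few-step sampling after distillation (vs. ~50 full steps)
NUM_SINK_CHUNKS = 1   # identity-anchor chunks that are never evicted from the cache


@torch.no_grad()
def generate_stream(generator, text_emb, ref_image_emb, audio_feats,
                    num_chunks, max_cache_chunks=8):
    """Generate video chunk-by-chunk, conditioning each chunk on text, a
    reference identity image, aligned audio features, and a causal KV cache
    whose oldest (identity "sink") entries are always retained."""
    kv_cache = []       # per-chunk key/value tensors from previous chunks
    video_chunks = []

    for t in range(num_chunks):
        # Audio features aligned with the frames of the current chunk.
        audio_chunk = audio_feats[:, t * CHUNK_FRAMES:(t + 1) * CHUNK_FRAMES]

        # Start each chunk from noise and run only a few denoising steps.
        x = torch.randn_like(ref_image_emb).repeat(1, CHUNK_FRAMES, 1)
        for step in reversed(range(DENOISE_STEPS)):
            x, new_kv = generator(
                noisy_frames=x,
                step=step,
                text=text_emb,
                identity=ref_image_emb,
                audio=audio_chunk,
                kv_cache=kv_cache,   # causal attention over past chunks only
            )

        video_chunks.append(x)

        # Sliding-window cache: keep the identity-anchor chunks at the front
        # and evict the oldest non-sink chunks once the window is full.
        kv_cache.append(new_kv)
        if len(kv_cache) > max_cache_chunks:
            kv_cache = (kv_cache[:NUM_SINK_CHUNKS]
                        + kv_cache[-(max_cache_chunks - NUM_SINK_CHUNKS):])

    return torch.cat(video_chunks, dim=1)
```

Under this assumed structure, per-chunk latency stays roughly constant: each chunk costs only DENOISE_STEPS forward passes with causal attention over a bounded cache, which is what allows streaming generation to keep pace with live audio input during multi-turn interaction.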