

StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

August 11, 2025
Authors: Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length, high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason existing models fail to generate long videos lies in their audio modeling: they typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then injected directly into the diffusion model via cross-attention. Because current diffusion backbones lack any audio-related priors, this approach causes severe latent-distribution error accumulation across video clips, so the latent distribution of later segments gradually drifts away from the optimal distribution. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose an Audio Native Guidance Mechanism that further enhances audio synchronization by leveraging the diffusion model's own evolving joint audio-latent prediction as a dynamic guidance signal. To improve the smoothness of infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks demonstrate the effectiveness of StableAvatar both qualitatively and quantitatively.
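As a concrete illustration of the sliding-window idea described above, here is a minimal sketch in PyTorch, assuming per-clip latents shaped (frames, channels, height, width) and a fixed frame overlap between consecutive clips. The function names (`fuse_overlap`, `stitch_clips`) and the linear blending ramp are illustrative assumptions, not the paper's actual dynamic weighting scheme.

```python
# A minimal, hypothetical sketch (not the authors' released code) of a
# weighted sliding-window fusion over per-clip latents, as a way to
# smooth transitions when a long video is generated clip by clip.
import torch


def fuse_overlap(prev_clip: torch.Tensor, next_clip: torch.Tensor, overlap: int) -> torch.Tensor:
    """Blend two consecutive clips across their shared `overlap` frames.

    Overlapping frames are mixed with weights that ramp linearly from the
    previous clip toward the next one; a linear ramp is assumed here for
    illustration, while the paper's dynamic weighting may differ.
    """
    # Weights go 1 -> 0 for the previous clip and 0 -> 1 for the next clip.
    w = torch.linspace(1.0, 0.0, overlap, device=prev_clip.device).view(-1, 1, 1, 1)
    blended = w * prev_clip[-overlap:] + (1.0 - w) * next_clip[:overlap]
    # Non-overlapping frames are kept as-is and stitched around the blend.
    return torch.cat([prev_clip[:-overlap], blended, next_clip[overlap:]], dim=0)


def stitch_clips(clips: list[torch.Tensor], overlap: int) -> torch.Tensor:
    """Fuse an arbitrary number of clips into one long latent sequence."""
    video = clips[0]
    for clip in clips[1:]:
        video = fuse_overlap(video, clip, overlap)
    return video


# Usage: three 16-frame clips with a 4-frame overlap yield 16 + 2 * 12 = 40 frames.
clips = [torch.randn(16, 4, 32, 32) for _ in range(3)]
long_latents = stitch_clips(clips, overlap=4)
assert long_latents.shape[0] == 40
```

Blending in latent space before decoding avoids visible seams at clip boundaries; whatever weighting schedule is used, the key property is that the two weights sum to one across the overlap, as they do in this sketch.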