StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

Abstract

Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, which the authors describe as the first end-to-end video diffusion transformer that synthesizes infinite-length, high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation.

We observe that the main obstacle to long video generation in existing models is their audio modeling. These models typically rely on third-party, off-the-shelf extractors to obtain audio embeddings, which are then injected directly into the diffusion model via cross-attention. Because current diffusion backbones lack audio-related priors, this approach causes severe accumulation of latent distribution error across video clips, so the latent distribution of later segments gradually drifts away from the optimal distribution.

To address this, StableAvatar introduces a Time-step-aware Audio Adapter that prevents error accumulation through time-step-aware modulation. During inference, we propose an Audio Native Guidance Mechanism that further improves audio synchronization by using the diffusion model's own evolving joint audio-latent prediction as a dynamic guidance signal. To improve the smoothness of infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks demonstrate the effectiveness of StableAvatar both qualitatively and quantitatively.
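The abstract does not specify how the Dynamic Weighted Sliding-window Strategy computes its weights, so the following is only a minimal sketch of the general idea of fusing overlapping latent windows over time. It assumes a simple linear ramp across the overlap, which may differ from the weighting StableAvatar actually uses. The names `ramp_weights` and `fuse_windows`, and the representation of each latent frame as a flat list of floats, are hypothetical and chosen for illustration.

```python
def ramp_weights(overlap):
    """Weights rising strictly between 0 and 1 across the overlap.

    Assumption: a linear ramp. The source does not state the real weighting.
    """
    return [(i + 1) / (overlap + 1) for i in range(overlap)]


def fuse_windows(windows, overlap):
    """Fuse consecutive latent windows that share `overlap` frames.

    windows: list of windows; each window is a list of frames, and each
             frame is a list of floats (a flattened latent, for illustration).
    overlap: number of frames shared by neighbouring windows.
    """
    fused = [list(frame) for frame in windows[0]]
    weights = ramp_weights(overlap)
    for window in windows[1:]:
        # Index of the first frame in `fused` that overlaps the new window.
        start = len(fused) - overlap
        for i, w in enumerate(weights):
            old, new = fused[start + i], window[i]
            # Weighted blend: the new window gains influence frame by frame.
            fused[start + i] = [(1.0 - w) * a + w * b for a, b in zip(old, new)]
        # Frames past the overlap are appended unchanged.
        fused.extend(list(frame) for frame in window[overlap:])
    return fused


if __name__ == "__main__":
    first = [[0.0], [0.0], [0.0], [0.0]]
    second = [[1.0], [1.0], [1.0], [1.0]]
    for frame in fuse_windows([first, second], overlap=2):
        print(frame)
```

With two four-frame windows and an overlap of two frames, the result has six frames, and the two shared frames move gradually from the first window's values toward the second's instead of switching abruptly at a clip boundary.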
