Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Abstract

Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs. To the best of our knowledge, it is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
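The general idea of pipelining denoising steps across devices can be illustrated with a toy schedule. The sketch below is not the TPP implementation; it only assumes that each pipeline stage (for example, one GPU) owns one denoising step and that a new block of frames can enter the first stage on every tick. The function names, the notion of a "tick", and the block and stage counts are hypothetical, and the sketch deliberately ignores how dependencies between consecutive blocks are handled, which the abstract does not describe.

```python
from typing import List, Optional


def pipeline_schedule(num_blocks: int, num_stages: int) -> List[List[Optional[int]]]:
    """For every tick, list which block each stage works on (None = idle).

    Assumption: stage s applies denoising step s, and block b enters
    stage 0 at tick b, so it reaches stage s at tick b + s.
    """
    total_ticks = num_blocks + num_stages - 1
    schedule = []
    for tick in range(total_ticks):
        row: List[Optional[int]] = []
        for stage in range(num_stages):
            block = tick - stage
            row.append(block if 0 <= block < num_blocks else None)
        schedule.append(row)
    return schedule


def sequential_ticks(num_blocks: int, num_stages: int) -> int:
    """Ticks needed when one device runs every denoising step of every block in turn."""
    return num_blocks * num_stages


def pipelined_ticks(num_blocks: int, num_stages: int) -> int:
    """Ticks needed when the steps are spread over stages and kept busy in a pipeline."""
    return num_blocks + num_stages - 1


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    for tick, row in enumerate(pipeline_schedule(num_blocks=6, num_stages=4)):
        print(tick, row)
    print("sequential:", sequential_ticks(6, 4), "pipelined:", pipelined_ticks(6, 4))
```

Under these assumptions, once the pipeline is full a finished block leaves the last stage on every tick, so steady-state throughput is set by the time of a single denoising step rather than by the sum of all steps, while the latency of any one block is unchanged.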
