StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration
May 25, 2026
Authors: Linrui Tian, Qi Wang, Bang Zhang
cs.AI
Abstract
Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
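The decoupled control flow described in the abstract can be illustrated with a toy streaming loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `orchestrate`, `denoise_chunk`, `StreamState`, and all parameters are hypothetical stand-ins, the "orchestrator" simply slices the transcript instead of calling an LLM, and the "denoiser" emits string placeholders instead of running a DiT. It only shows how a long-horizon component could hand per-chunk audio conditions to a short-window generator that also sees a reference image, motion frames from the previous chunk, and a persistent sink chunk.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Chunk:
    index: int
    frames: List[str]  # placeholder video frames
    audio: List[str]   # placeholder audio segments


@dataclass
class StreamState:
    sink: Optional[Chunk] = None  # persistent visual anchor (here: the first chunk)
    motion_frames: List[str] = field(default_factory=list)  # tail of previous chunk
    pointer: int = 0  # number of transcript tokens consumed so far


def orchestrate(transcript: List[str], state: StreamState, tokens_per_chunk: int) -> List[str]:
    """Hypothetical stand-in for the LLM orchestrator.

    Selects the next transcript span and advances the progress pointer,
    so each chunk is conditioned only on text not yet spoken.
    """
    start = state.pointer
    end = min(start + tokens_per_chunk, len(transcript))
    state.pointer = end
    return transcript[start:end]


def denoise_chunk(index: int, audio_cond: List[str], reference: str,
                  state: StreamState, frames_per_chunk: int) -> Chunk:
    """Hypothetical stand-in for the joint audio-video denoiser.

    Records which conditions were visible: the reference image, the sink
    chunk index, and how many motion frames carried over.
    """
    anchor = state.sink.index if state.sink is not None else index
    frames = [
        f"v{index}.{i}|ref={reference}|sink={anchor}|ctx={len(state.motion_frames)}"
        for i in range(frames_per_chunk)
    ]
    audio = [f"a:{token}" for token in audio_cond]
    return Chunk(index, frames, audio)


def stream(transcript: List[str], reference: str, frames_per_chunk: int = 4,
           tokens_per_chunk: int = 2, n_motion: int = 1) -> Iterator[Chunk]:
    """Yield chunks one at a time until the whole transcript is consumed."""
    state = StreamState()
    index = 0
    while state.pointer < len(transcript):
        cond = orchestrate(transcript, state, tokens_per_chunk)
        chunk = denoise_chunk(index, cond, reference, state, frames_per_chunk)
        if state.sink is None:
            state.sink = chunk  # keep the first chunk as a fixed anchor
        state.motion_frames = chunk.frames[-n_motion:]  # short-range continuity
        yield chunk
        index += 1


if __name__ == "__main__":
    for c in stream("hello there how are you".split(), "ref.png"):
        print(c.index, c.audio, c.frames[0])
```

In this sketch the sink chunk never changes after the first step while the motion frames are refreshed every chunk, which mirrors the abstract's distinction between a persistent visual anchor and local conditioning. How StreamChar actually selects, encodes, or attends to these conditions is not specified in the abstract.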