StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

May 25, 2026
Authors: Linrui Tian, Qi Wang, Bang Zhang
cs.AI

Abstract

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
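
The control flow described in the abstract can be illustrated with a toy streaming loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `Chunk`, `StreamState`, `orchestrate`, `denoise_chunk`, and `stream` are hypothetical, the orchestrator and the joint audio-video DiT are replaced by trivial stand-ins, and the progress pointer is simplified to a word index into the transcript. It shows only how a pointer, a previous-chunk motion context, and a persistent sink chunk could be threaded through chunk-wise generation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Chunk:
    index: int                  # position of this chunk in the stream
    words: List[str]            # transcript words this chunk is meant to speak
    motion_from: Optional[int]  # index of the previous chunk used as motion context
    sink_from: int              # index of the persistent sink (anchor) chunk


@dataclass
class StreamState:
    pointer: int = 0                      # progress pointer into the transcript (assumed: word index)
    sink: Optional[Chunk] = None          # persistent visual anchor, fixed once set
    history: List[Chunk] = field(default_factory=list)


def orchestrate(transcript: List[str], state: StreamState, words_per_chunk: int) -> List[str]:
    """Stand-in for the LLM-based orchestrator: select the next transcript slice."""
    end = min(state.pointer + words_per_chunk, len(transcript))
    return transcript[state.pointer:end]


def denoise_chunk(index: int, words: List[str], state: StreamState) -> Chunk:
    """Stand-in for the joint audio-video DiT: record which context it would condition on."""
    motion_from = state.history[-1].index if state.history else None
    sink_from = state.sink.index if state.sink is not None else index
    return Chunk(index, words, motion_from, sink_from)


def stream(transcript: List[str], words_per_chunk: int = 3) -> Iterator[Chunk]:
    """Generate chunks one at a time until the pointer reaches the end of the transcript."""
    if words_per_chunk < 1:
        raise ValueError("words_per_chunk must be at least 1")
    state = StreamState()
    while state.pointer < len(transcript):
        words = orchestrate(transcript, state, words_per_chunk)
        chunk = denoise_chunk(len(state.history), words, state)
        if state.sink is None:
            state.sink = chunk          # the first chunk becomes the sink anchor
        state.history.append(chunk)
        state.pointer += len(words)     # advance the progress pointer
        yield chunk


if __name__ == "__main__":
    for c in stream("hello there this is a streaming test".split()):
        print(c)
```

In this sketch every chunk after the first conditions on the same sink chunk plus its immediate predecessor, which is one simple way to read "persistent visual anchor" alongside "motion-frame conditioning"; the paper's actual memory layout and pointer mechanism may differ.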