StreamChar: Long-Horizon Continuous Character Audio-Video Generation via Decoupled Orchestration

Abstract

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.