StreamChar: Long-Horizon Streaming Character Audio-Video Generation via Decoupled Orchestration

Abstract

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
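Two of the ideas above, the sink-chunk memory and the playback budget, can be illustrated with a minimal sketch. It is not the authors' implementation, because the abstract does not specify how the memory is organized. The sketch assumes that the first generated chunk serves as the persistent anchor and that a short rolling window of recent chunks is kept alongside it. The names `SinkChunkMemory`, `recent_size`, and `is_real_time` are hypothetical, and chunks are stood in for by plain integers.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SinkChunkMemory:
    """Hypothetical memory: the first chunk is kept permanently (the "sink"),
    and only a short window of recent chunks is retained next to it."""
    recent_size: int = 2               # assumed window length, not from the paper
    sink: object = None
    recent: deque = field(default_factory=deque)

    def push(self, chunk):
        if self.sink is None:
            self.sink = chunk          # first chunk becomes the persistent anchor
            return
        self.recent.append(chunk)
        if len(self.recent) > self.recent_size:
            self.recent.popleft()      # evict from the rolling window only

    def context(self):
        """Chunks a generator could condition on: sink first, then recent ones."""
        return ([] if self.sink is None else [self.sink]) + list(self.recent)


def is_real_time(frames_per_chunk, fps, gen_seconds):
    """A chunk meets the playback budget if generating it takes no longer
    than playing it back."""
    return gen_seconds <= frames_per_chunk / fps


memory = SinkChunkMemory(recent_size=2)
for chunk_id in range(5):              # integers stand in for generated chunks
    memory.push(chunk_id)
print(memory.context())
print(is_real_time(frames_per_chunk=25, fps=25, gen_seconds=0.8))
```

Under these assumptions the anchor chunk is never evicted, so the conditioning context stays bounded in size however long the stream runs, while still containing a fixed visual reference.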