Wan-Streamer v0.1: An End-to-End Real-Time Interactive Foundation Model

Abstract

We present Wan-Streamer, a natively streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer models language, audio, and video as both input and output within a single Transformer. The sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, and block-causal attention coordinates them for incremental streaming. Unlike cascaded interactive systems, which chain separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer relies on no external language, speech, avatar, or video-generation module: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, which reduces pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms of model-side response latency and, when combined with 350 ms of bidirectional network latency, approximately 550 ms of total interaction latency, supporting sub-second full-duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end multimodal interactive foundation model for low-latency streaming interaction.
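
As an illustration of the block-causal attention and the timing figures quoted in the abstract, the following is a minimal sketch and not the authors' implementation. The function names and the block sizes are hypothetical, chosen only for the example. It assumes the common definition of block-causal masking, in which a token may attend to every token in its own block and in all earlier blocks. It also reproduces the abstract's arithmetic: a 160 ms unit at 25 fps spans 4 video frames, and 200 ms of model-side latency plus 350 ms of network latency gives about 550 ms.

```python
def block_causal_mask(block_sizes):
    """Return mask[i][j] == True when token i may attend to token j.

    block_sizes is a hypothetical list of token counts per streaming block.
    """
    # Map every token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    # Token i sees token j when j's block is the same as, or earlier than, i's block.
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


def frames_per_unit(unit_ms, fps):
    """Number of whole video frames covered by one streaming unit."""
    return unit_ms * fps // 1000


# Two blocks of 2 and 3 tokens (illustrative sizes, not taken from the paper).
mask = block_causal_mask([2, 3])

# Figures quoted in the abstract (both latencies are approximate).
frames = frames_per_unit(160, 25)
total_latency_ms = 200 + 350
```

Within a block the mask is fully visible, which lets all tokens of one streaming unit be processed together, while the block-level causality keeps later units from influencing earlier ones.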