Wan-Streamer v0.1: An End-to-End Real-Time Interactive Foundation Model

Abstract

We present Wan-Streamer, a natively streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer models language, audio, and video as both input and output within a single Transformer: the sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, and block-causal attention coordinates incremental streaming. Unlike cascaded interactive systems that chain separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer relies on no external language, speech, avatar, or video-generation module. Perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, which reduces pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms of model-side response latency; combined with 350 ms of bidirectional network latency, the total interaction latency is approximately 550 ms, which supports sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
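
To make the block-causal attention pattern and the latency figures above concrete, the following is a minimal Python sketch, not the model's actual implementation. It assumes the common reading of "block-causal": tokens attend freely within their own streaming unit and to all earlier units, but never to later ones. The function name `block_causal_mask` and the example block sizes are hypothetical; only the 160 ms, 25 fps, 200 ms, and 350 ms figures come from the abstract.

```python
def block_causal_mask(block_sizes):
    """Build a boolean attention mask for a sequence split into blocks.

    mask[q][k] is True when query token q may attend to key token k.
    Assumption (not stated in the abstract): attention is unrestricted
    inside a block and causal across blocks.
    """
    # Assign each token the index of the block (streaming unit) it belongs to.
    block_ids = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    # A query may see a key if the key's block is the same or earlier.
    return [[q_block >= k_block for k_block in block_ids] for q_block in block_ids]


# Hypothetical layout: three streaming units holding 2, 3, and 2 tokens.
mask = block_causal_mask([2, 3, 2])

# Latency arithmetic using the figures quoted in the abstract.
FPS = 25            # video frame rate
UNIT_MS = 160       # shortest streaming unit
MODEL_MS = 200      # approximate model-side response latency
NETWORK_MS = 350    # bidirectional network latency

frames_per_unit = UNIT_MS * FPS // 1000   # video frames spanned by one unit
total_ms = MODEL_MS + NETWORK_MS          # approximate total interaction latency
```

Under these assumptions, one 160 ms unit spans four video frames at 25 fps, and the 200 ms and 350 ms components sum to the reported total of roughly 550 ms.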