Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
June 23, 2026
Authors: Lianghua Huang, Zhifan Wu, Wei Wang, Yupeng Shi, Mengyang Feng, Junjie He, Chenwei Xie, Yu Liu, Jingren Zhou, Ang Wang, Bang Zhang, Baole Ai, Chen Liang, Cheng Yu, Chongyang Zhong, Jinwei Qi, Kai Zhu, Pandeng Li, Peng Zhang, Wenyuan Zhang, Xinhua Cheng, Yitong Huang, Yun Zheng, Zoubin Bi
cs.AI
Abstract
We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer models language, audio, and video as both input and output within a single Transformer. The sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that chain separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer uses no external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
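The figures and the attention pattern named in the abstract can be made concrete with a small sketch. It is an illustration only, not the authors' implementation. The constants are the numbers quoted above; the helper names (`frames_per_unit`, `block_causal_mask`) and the example block layout are hypothetical, and the abstract does not say how tokens are actually grouped into blocks. The mask follows the common definition of block-causal attention: a token may attend to every token in its own block and in all earlier blocks, but to none in later blocks.

```python
from fractions import Fraction

# Figures quoted in the abstract.
UNIT_MS = 160             # shortest streaming unit
FPS = 25                  # video frame rate
MODEL_LATENCY_MS = 200    # approximate model-side response latency
NETWORK_LATENCY_MS = 350  # bidirectional network latency


def frames_per_unit(unit_ms: int, fps: int) -> Fraction:
    """Number of video frames covered by one streaming unit (hypothetical helper)."""
    return Fraction(unit_ms * fps, 1000)


def block_causal_mask(block_sizes: list[int]) -> list[list[bool]]:
    """Build a block-causal mask; mask[i][j] is True if token i may attend to token j.

    ASSUMPTION: block_sizes is an illustrative layout, one entry per streaming
    block, not the token schedule used by Wan-Streamer.
    """
    # Map each token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    # Attention is allowed within a block and toward earlier blocks only.
    return [[bj <= bi for bj in block_of] for bi in block_of]


if __name__ == "__main__":
    print("frames per unit:", frames_per_unit(UNIT_MS, FPS))
    print("total latency (ms):", MODEL_LATENCY_MS + NETWORK_LATENCY_MS)
    # Two blocks of 2 and 3 tokens: 'x' marks an allowed attention edge.
    for row in block_causal_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, a 160 ms unit at 25 fps spans four video frames, and the quoted 200 ms and 350 ms components sum to the stated 550 ms. Unlike a strictly token-level causal mask, the block-level mask lets tokens inside the same streaming unit see each other while still preventing any look-ahead into future units.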