Wan-Streamer v0.1: An End-to-End Real-Time Interactive Foundation Model

Abstract

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer models language, audio, and video as both input and output within a single Transformer. The sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems built from separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer requires no external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second full-duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
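
The figures quoted in the abstract, and the idea of block-causal attention, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes, purely for illustration, that each streaming unit forms one attention block, that tokens attend bidirectionally within their own block and causally to earlier blocks, and that the latency figures add linearly. The function names `frames_per_unit`, `total_latency_ms`, and `block_causal_mask` are hypothetical.

```python
# Figures taken from the abstract; the latencies are approximate there.
FPS = 25
UNIT_MS = 160
MODEL_LATENCY_MS = 200
NETWORK_LATENCY_MS = 350


def frames_per_unit(unit_ms: int = UNIT_MS, fps: int = FPS) -> int:
    """Number of video frames covered by one streaming unit."""
    return unit_ms * fps // 1000


def total_latency_ms(model_ms: int = MODEL_LATENCY_MS,
                     network_ms: int = NETWORK_LATENCY_MS) -> int:
    """Total interaction latency, assuming the two terms simply add."""
    return model_ms + network_ms


def block_causal_mask(block_sizes: list[int]) -> list[list[bool]]:
    """Build a block-causal attention mask (an assumed, generic form).

    mask[q][k] is True when query position q may attend to key position k:
    that is, when k lies in the same block as q or in an earlier block.
    """
    # Map every token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]


if __name__ == "__main__":
    print(frames_per_unit())    # frames per 160 ms unit at 25 fps
    print(total_latency_ms())   # model-side plus network latency, in ms
    for row in block_causal_mask([2, 1]):
        print(["x" if allowed else "." for allowed in row])
```

Under these assumptions a 160 ms unit at 25 fps spans four frames, and 200 ms of model-side latency plus 350 ms of network latency gives the quoted total of roughly 550 ms. In the mask, tokens from a unit that has not yet arrived are never visible to earlier positions, which is what allows decoding to proceed incrementally as units stream in.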