Wan-Streamer v0.1: End-to-End Real-Time Interactive Foundation Models

Abstract

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer models language, audio, and video as both input and output within a single Transformer: the sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, and block-causal attention coordinates them for incremental streaming. Unlike cascaded interactive systems built from separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer relies on no external language, speech, avatar, or video-generation component. Perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, which reduces pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and, when combined with 350 ms of bidirectional network latency, approximately 550 ms total interaction latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
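
To make the streaming figures and the block-causal attention pattern concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each streaming unit forms one attention block, that tokens attend to every token in their own block and in all earlier blocks, and that the block sizes used here (3 and 2 tokens) are purely illustrative; the function names `frames_per_unit` and `block_causal_mask` are hypothetical. The only numbers taken from the abstract are 160 ms, 25 fps, 200 ms, and 350 ms.

```python
from fractions import Fraction


def frames_per_unit(unit_ms: int, fps: int) -> Fraction:
    """Number of video frames covered by one streaming unit."""
    return Fraction(unit_ms * fps, 1000)


def block_causal_mask(block_sizes):
    """Boolean attention mask for a block-causal pattern.

    mask[q][k] is True when query token q may attend to key token k.
    Assumption (not from the paper): a token sees its own block in full
    plus every earlier block, and nothing from later blocks.
    """
    # Map each token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]


# Figures quoted in the abstract.
frames = frames_per_unit(160, 25)      # frames per 160 ms unit at 25 fps
total_latency_ms = 200 + 350           # model-side + bidirectional network

# Illustrative block sizes only: two streaming units of 3 and 2 tokens.
mask = block_causal_mask([3, 2])

if __name__ == "__main__":
    print(f"frames per unit: {frames}")
    print(f"total interaction latency: {total_latency_ms} ms")
    for row in mask:
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, a 160 ms unit at 25 fps spans four video frames, and the 200 ms model-side latency plus 350 ms network latency gives the quoted 550 ms total. In the printed mask, the first three rows can see only the first block, while the last two rows can see both blocks, which is what allows generation to proceed one streaming unit at a time without waiting for future input.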