X-Streamer: Unified Human World Modeling with Audiovisual Interaction

Abstract

We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into a persistent and intelligent audiovisual interaction. The Thinker module perceives and reasons over streaming user inputs, and the Actor module translates the Thinker's hidden states into synchronized multimodal streams in real time. Concretely, the Thinker builds on a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses, interleaving discrete text and audio tokens with continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attention with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.
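
The abstract does not spell out how inter- and intra-chunk attention are laid out. As a rough illustration only, the sketch below builds a block-causal mask of the kind often used in chunk-wise autoregressive models: tokens attend freely within their own chunk and to all earlier chunks, but never to later ones. The function name `chunk_causal_mask`, the fixed `tokens_per_chunk`, and the mask pattern itself are assumptions made for this example, not details confirmed by the authors.

```python
def chunk_causal_mask(num_chunks: int, tokens_per_chunk: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Hypothetical layout (an assumption, not X-Streamer's actual design):
    full attention inside a chunk, causal attention across chunks.
    """
    total = num_chunks * tokens_per_chunk
    mask = []
    for q in range(total):
        q_chunk = q // tokens_per_chunk
        # A key is visible if it lies in the query's own chunk (intra-chunk)
        # or in any earlier chunk (inter-chunk); later chunks stay hidden.
        mask.append([k // tokens_per_chunk <= q_chunk for k in range(total)])
    return mask


mask = chunk_causal_mask(num_chunks=3, tokens_per_chunk=2)
for row in mask:
    # "x" marks a visible key position, "." a masked one.
    print("".join("x" if visible else "." for visible in row))
```

In a real system each chunk would mix text, audio, and video positions, and the time-aligned positional embeddings described above would index tokens by timestamp rather than by flat position; this sketch leaves both out.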