X-Streamer: Unified Human World Modeling with Audiovisual Interaction
September 25, 2025
Authors: You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guoxian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, Linjie Luo
cs.AI
Abstract
We introduce X-Streamer, an end-to-end multimodal human world modeling
framework for building digital human agents capable of infinite interactions
across text, speech, and video within a single unified architecture. Starting
from a single portrait, X-Streamer enables real-time, open-ended video calls
driven by streaming multimodal inputs. At its core is a Thinker-Actor
dual-transformer architecture that unifies multimodal understanding and
generation, turning a static portrait into persistent and intelligent
audiovisual interactions. The Thinker module perceives and reasons over
streaming user inputs, while the Actor translates the Thinker's hidden states
into synchronized multimodal streams in real time. Concretely, the Thinker leverages
a pretrained large language-speech model, while the Actor employs a chunk-wise
autoregressive diffusion model that cross-attends to the Thinker's hidden
states to produce time-aligned multimodal responses with interleaved discrete
text and audio tokens and continuous video latents. To ensure long-horizon
stability, we design inter- and intra-chunk attentions with time-aligned
multimodal positional embeddings for fine-grained cross-modality alignment and
context retention, further reinforced by chunk-wise diffusion forcing and
global identity referencing. X-Streamer runs in real time on two A100 GPUs,
sustaining hours-long consistent video chat experiences from arbitrary
portraits and paving the way toward unified world modeling of interactive
digital humans.
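
To make the chunk-wise attention pattern and the time-aligned positions more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the mask layout, so the sketch assumes full attention inside a chunk and causal attention across chunks. The function names, the token rates (4 audio tokens and 2 video latents per second), and the one-second chunk length are all hypothetical.

```python
def chunk_causal_mask(num_chunks, chunk_len):
    """Return a square mask; mask[q][k] is True when query q may attend to key k.

    Assumption: a query sees every position in its own chunk (intra-chunk)
    and every position in earlier chunks (inter-chunk), never later chunks.
    """
    n = num_chunks * chunk_len
    mask = []
    for q in range(n):
        q_chunk = q // chunk_len
        mask.append([(k // chunk_len) <= q_chunk for k in range(n)])
    return mask


def time_aligned_positions(modality_rates, chunk_seconds, chunk_index):
    """Give each token a timestamp in seconds.

    Tokens from different modalities that cover the same instant receive
    the same position value, whatever their per-modality token rate is.
    """
    start = chunk_index * chunk_seconds
    positions = {}
    for name, rate in modality_rates.items():
        count = round(rate * chunk_seconds)
        positions[name] = [start + i / rate for i in range(count)]
    return positions


# Hypothetical setup: 3 chunks of 2 tokens each.
mask = chunk_causal_mask(num_chunks=3, chunk_len=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Hypothetical token rates for the second chunk (index 1) of one second.
positions = time_aligned_positions(
    {"audio": 4, "video": 2}, chunk_seconds=1.0, chunk_index=1
)
print(positions)
```

In this sketch the third audio token and the second video latent of the chunk share the timestamp 1.5 s, which is the kind of cross-modality alignment that a time-aligned positional embedding could then encode. A real model would feed these timestamps into a positional encoding and combine the mask with cross-attention to the Thinker's hidden states, neither of which is shown here.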