X-Streamer: Unified Human World Modeling with Audiovisual Interaction

Abstract

We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, while its hidden states are translated by the Actor into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attention with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.
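The chunk-wise attention and diffusion forcing mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the exact attention pattern or noise schedule, so everything here is an assumption made for illustration: the mask is a generic block-causal mask (full attention inside a chunk, causal attention across chunks), the noise sampling draws one independent diffusion step per chunk, and the names `chunk_ids`, `block_causal_mask`, and `sample_chunk_noise_levels` are hypothetical rather than taken from the X-Streamer code.

```python
import random


def chunk_ids(chunk_sizes):
    """Map each token position to the index of the chunk it belongs to."""
    ids = []
    for chunk_index, size in enumerate(chunk_sizes):
        ids.extend([chunk_index] * size)
    return ids


def block_causal_mask(chunk_sizes):
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed pattern (not confirmed by the paper): a token sees every token
    in its own chunk (intra-chunk) and every token in earlier chunks
    (inter-chunk), but nothing from later chunks.
    """
    ids = chunk_ids(chunk_sizes)
    n = len(ids)
    return [[ids[k] <= ids[q] for k in range(n)] for q in range(n)]


def sample_chunk_noise_levels(num_chunks, num_steps, rng):
    """Draw one independent diffusion step per chunk.

    This mirrors the general idea of diffusion forcing, where different
    parts of a sequence are trained at different noise levels. The uniform
    distribution used here is an illustrative assumption.
    """
    return [rng.randrange(num_steps) for _ in range(num_chunks)]


if __name__ == "__main__":
    # Two chunks holding 2 and 3 tokens respectively.
    for row in block_causal_mask([2, 3]):
        print("".join("1" if allowed else "0" for allowed in row))
    print(sample_chunk_noise_levels(num_chunks=2, num_steps=1000, rng=random.Random(0)))
```

In the printed mask, rows for the first chunk have ones only in the first two columns, while rows for the second chunk have ones everywhere, which is what lets a later chunk condition on all earlier context during autoregressive generation.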
